Training Infinitely Deep and Wide Transformers
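Breakthrough research: the first demonstration that infinitely deep and wide Transformers are trainable, fully resolving the training bottleneck of deep networks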
arXiv:2605.17660v1 Announce Type: cross Abstract: Transformers have become the dominant architecture in modern machine learning, yet the theoretical u…
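OpenAI trains a Transformer on pixel sequences, achieving image generation and unsupervised classification with performance competitive with top convolutional networks.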
We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can ge…
Reinterprets the Transformer attention mechanism from a Bayesian-geometric perspective, revealing its underlying probabilistic structure.
arXiv:2512.22471v5 Announce Type: replace Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously …
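A theoretical account of what Transformers can learn on noise and task-level manifolds, with new insights from approximation and generalization analysis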
arXiv:2505.03205v3 Announce Type: replace Abstract: Transformers serve as the foundational architecture for large language and video generation models…
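Reproduces the core mechanisms of GPT from scratch as a compact autoregressive language model in PyTorch; a must-read low-level paper tutorial for AI learners.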
arXiv:2605.17398v1 Announce Type: cross Abstract: This paper presents MiniGPT, a compact from-scratch implementation of GPT-style autoregressive langu…
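Uncovers the geometric structure of Bayesian inference inside large models, and how it scales from small models to production-grade LLMs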
arXiv:2512.23752v5 Announce Type: replace Abstract: Recent work has shown that small transformers trained in controlled "wind-tunnel" settings can im…
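The first to integrate infinite-head attention into hardware-aware neural architecture search, offering efficient multi-backend deployment for sub-billion-parameter language models on edge devices.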
arXiv:2605.17653v1 Announce Type: new Abstract: Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the…
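Geometry-guided hidden-state replacement reveals the best insertion point for diffusion models within language models; the DiHAL approach improves performance.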
arXiv:2605.14368v1 Announce Type: cross Abstract: Continuous diffusion language models lag behind autoregressive transformers, partly because diffusio…
A frontier method that applies Transformer temporal modeling to quantum detector data to achieve dynamic ghost imaging
arXiv:2605.10185v2 Announce Type: replace Abstract: Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating …
Examines expert routing in large models for low-resource languages, comparing MoE behavior across Transformer and hybrid Mamba architectures.
arXiv:2605.17598v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior acr…
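ArGEnT, a breakthrough Transformer model that efficiently learns solution operators for systems with complex geometries, supporting design optimization, control, and inverse problems.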
arXiv:2602.11626v2 Announce Type: replace-cross Abstract: Learning solution operators for systems with complex, varying geometries and parametric phys…
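Treats Transformer depth as discrete time, revealing how spectral geometry in the residual stream couples with network topology, and offering a new perspective on how computation propagates through large models.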
arXiv:2605.14258v1 Announce Type: cross Abstract: Large language models are remarkably capable, yet how computation propagates through their layers re…
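Shows that the Transformer residual stream is not just an optimization channel but the core of the model's representation: a two-axis view (sequence position × layer depth) reframes the design space and gives a new understanding of how self-attention and the residual path work together.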
arXiv:2603.16039v2 Announce Type: replace-cross Abstract: Recent work has made clear that the residual pathway is not mere optimization plumbing; it i…
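A category-agnostic feed-forward model that quickly reconstructs complete 3D articulated objects from only sparse, multi-state RGB images.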
arXiv:2512.14671v3 Announce Type: replace Abstract: We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward mode…
A privacy-preserving, LoRA-enhanced Transformer with multi-scale feature learning for accurate eye disease diagnosis
arXiv:2505.06982v3 Announce Type: replace Abstract: Accurate and privacy-preserving diagnosis of ophthalmic diseases remains a critical challenge in m…
MaTe: material transfer based on a diffusion Transformer, with no text guidance or reference network, for efficient and unified image processing
arXiv:2605.15660v1 Announce Type: new Abstract: Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architecture…
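Fixed-precision Transformers can describe languages exponentially more succinctly than linear temporal logic and recurrent neural networks, with a theoretical proof of their expressive power.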
arXiv:2510.19315v3 Announce Type: replace-cross Abstract: We study succinctness as a measure of the expressive power of transformers. Succinctness -- …
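ITGPT, a new generative pretraining method for irregular time series, overcomes the limitations of conventional models on multimodal data with missing values.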
arXiv:2605.16069v1 Announce Type: new Abstract: Timeseries regression models often struggle to leverage large volumes of labeled multimodal data, part…
Reveals how architectural sparsity in the FFN reshapes attention computation and affects the learning mechanisms of small Transformer models.
arXiv:2605.09403v2 Announce Type: replace-cross Abstract: Architectural choices inside the Transformer feedforward network (FFN) block do not merely a…
Proposes the Krause Attention mechanism to address the synchronization dynamics and representation collapse caused by global softmax in Transformers
arXiv:2602.11534v3 Announce Type: replace-cross Abstract: Self-attention in Transformers relies on globally normalized softmax weights, causing all to…