Strong Teacher Not Needed? On Distillation in LLM Pretraining
Counterintuitive? Weak teacher models can also distill LLMs effectively; teacher strength is not the key factor during pretraining.
arXiv:2605.23857v1 Announce Type: new Abstract: Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield b…
Proposes a minimalist optimizer design that substantially reduces memory usage in large-model pretraining; accepted at ICML 2026.
arXiv:2506.16659v3 Announce Type: replace-cross Abstract: Training large language models (LLMs) relies on adaptive optimizers such as Adam, which intr…
Recent research shows that the temporal order of data strongly affects LLM pretraining; understanding temporal bias is key.
arXiv:2605.22769v1 Announce Type: new Abstract: Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledg…
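A hidden capability of LLM pretraining: the learned data manifold transfers across modalities to time-series tasks, revealing a general representation mechanism.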
arXiv:2605.20449v1 Announce Type: new Abstract: Can language-pretrained transformers become effective time-series forecasters, and why? In this paper,…
Why SGD underperforms Adam in LLM pretraining: the key role of a large effective learning rate.
arXiv:2605.17787v1 Announce Type: new Abstract: It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptiv…
OpenAI's contrastive pretraining method for learning high-quality embeddings of text and code.
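Proposes the SMART framework, which incorporates pretrained models into high-dimensional nonparametric variable selection and provides a theoretical foundation for fine-tuning.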
arXiv:2604.12288v2 Announce Type: replace-cross Abstract: Fine-tuning is a widely used strategy for adapting pre-trained models to new tasks, yet its …
A look at pruning and distillation techniques in pretraining large MoE models; SlimQwen improves efficiency and performance.
arXiv:2605.08738v2 Announce Type: replace Abstract: Structured pruning and knowledge distillation (KD) are typical techniques for compressing large la…
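Automatically synthesizes large-scale GUI interaction trajectories from videos, addressing the scarcity of pretraining data for GUI agents and helping agents better understand real applications.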
arXiv:2605.14747v1 Announce Type: cross Abstract: Recent advances in multimodal large language models have driven growing interest in graphical user i…
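Proposes a token superposition technique that tackles the pretraining efficiency bottleneck and substantially lowers compute requirements; relevant to anyone optimizing LLM training.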
arXiv:2605.06546v2 Announce Type: replace Abstract: Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, r…
From BERT to T5: a solid hands-on comparison of NER fine-tuning, rich in technical detail.
arXiv:2605.18462v1 Announce Type: new Abstract: Named entity recognition (NER) has been one of the essential preliminary steps in modern NLP applicati…
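Explores how to draw on the language acquisition device, using pre-pretraining on synthetic languages to improve the data efficiency of large models and offering a new direction for AI development.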
arXiv:2605.16758v1 Announce Type: new Abstract: Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PP…
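A step forward for robot foundation models: general pose pretraining markedly improves the generalization of vision-language-action policies; accepted at RSS 2026.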
arXiv:2602.19710v2 Announce Type: replace-cross Abstract: Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low trai…
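LLM pretraining is shifting from compute-bound to data-bound; this paper explores how to generate pretraining tokens from organic data to get past the scaling bottleneck.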
arXiv:2605.17849v1 Announce Type: cross Abstract: LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (orga…
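Builds TEDBench, a large-scale non-redundant benchmark for protein fold classification, overcoming the scale bottleneck and supporting functional analysis of biological macromolecules.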
arXiv:2605.18552v1 Announce Type: new Abstract: Classifying protein topology is essential for deciphering biological function, but progress is held ba…
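Examines scaling laws for ECG models: increasing model size does not always improve performance, challenging experience from natural language processing.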
arXiv:2605.17276v1 Announce Type: new Abstract: While scaling laws have established a fundamental framework for foundation models in natural language …
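Recent research quantifies uncertainty in the pretraining corpus to dynamically optimize retrieval-augmented generation strategies and improve generation quality.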
arXiv:2512.19134v2 Announce Type: replace Abstract: Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to…
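Replaces the traditional VAE with a pretrained vision encoder; a systematic study of design choices reveals three simplifying improvements.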
arXiv:2605.18324v1 Announce Type: cross Abstract: Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this p…
Extends a 10-second ECG foundation model to longer time windows and studies the generalization of temporal models.
arXiv:2605.16975v1 Announce Type: new Abstract: Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments, hav…
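Breaks with convention: uses pretrained large models to overcome the 1-bit quantization bottleneck, saving storage while preserving accuracy.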
arXiv:2508.06974v2 Announce Type: replace Abstract: 1-bit LLM quantization offers significant advantages in reducing storage and computational costs. …