Strong Teacher Not Needed? On Distillation in LLM Pretraining
Counterintuitive? Weak teacher models can also distill LLMs effectively; teacher strength is not the key factor during pretraining.
arXiv:2605.23857v1 Announce Type: new Abstract: Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield b…
Proposes a minimalist optimizer design that substantially reduces the memory footprint of large-model pretraining; accepted at ICML 2026.
arXiv:2506.16659v3 Announce Type: replace-cross Abstract: Training large language models (LLMs) relies on adaptive optimizers such as Adam, which intr…
Uncovering why SGD underperforms Adam in LLM pretraining: the key role of a large effective learning rate.
arXiv:2605.17787v1 Announce Type: new Abstract: It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptiv…
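LLM pretraining is shifting from compute-bound to data-bound; this paper explores how to generate pretraining tokens from organic data to break through the scaling bottleneck.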
arXiv:2605.17849v1 Announce Type: cross Abstract: LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (orga…
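Proposes using model merging to decouple data-mixture search from training, efficiently scaling data-mixing strategies for LLM pretraining.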
arXiv:2602.00747v2 Announce Type: replace-cross Abstract: Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-trai…
A new method uses orthogonal growth of MoE to substantially cut LLM pretraining costs and escape the sunk-cost trap.
arXiv:2510.08008v2 Announce Type: replace Abstract: As the computational demands for pre-training Large Language Models (LLMs) continue to surge, the …
Common Corpus, the largest ethical dataset, is released, providing high-quality, compliant data for LLM pretraining.
arXiv:2506.01732v3 Announce Type: replace Abstract: Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and d…