Niu Ge's Picks · Past Three Months

1

📝 Deep Tech · arXiv Machine Learning · 2026-07-09

GIFT: Geometry-Informed Low-precision Gradient Communication for LLM Pretraining

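Proposes GIFT, which uses gradient geometry to enable low-precision communication, substantially reducing communication overhead in LLM pretraining without sacrificing model accuracy.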

arXiv:2607.07494v1 Announce Type: cross Abstract: Gradient communication is a primary scaling bottleneck in large language model (LLM) pretraining. Co…

large-model training, gradient compression, low-precision communication, geometric information, distributed training

2

🤖 AI·Large Models · arXiv AI · 2026-07-07

Full-Stack FP4: Stable LLM Pretraining with Quantized Projections, Optimizers, and Attention

Achieves full-stack FP4 quantized pretraining for the first time, easing the speed and memory bottlenecks of LLM training.

arXiv:2607.04422v1 Announce Type: cross Abstract: Recent NVFP4 pretraining methods mainly target transformer linear layers, leaving optimizer states, …

FP4 quantization, LLM pretraining, quantized projections, quantized optimizers, quantized attention

3

📝 Deep Tech · arXiv Machine Learning · 2026-06-24

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

Uses multi-objective reinforcement learning to schedule data, making LLM pretraining more efficient.

arXiv:2606.24133v1 Announce Type: new Abstract: The composition of training data, governed by the diversity of sources and their mixing strategy, is a…

LLM pretraining, data scheduling, multi-objective reinforcement learning, training efficiency

4

🤖 AI·Large Models · Hacker News LLM · 2026-06-17

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Common Corpus, the largest collection of ethical data to date, is released, offering a compliant open-source option for LLM pretraining.

Article URL: https://openreview.net/pdf?id=0wSlFpMsGb Comments URL: https://news.ycombinator.com/item?id=48567122 Points: 2 # Comments: 0

common cor, ethical data, LLM pretraining, open-source datasets, AI data governance

5

📝 Deep Tech · arXiv AI · 2026-06-16

AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining

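Uses reinforcement learning to dynamically optimize the pretraining data mixture; the Actor-Critic framework helps large models learn more efficiently.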

arXiv:2505.23878v2 Announce Type: replace-cross Abstract: Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mix…

AC-ODM, data mixing, LLM pretraining, sample efficiency, reinforcement learning

6

📝 Deep Tech · arXiv AI · 2026-06-11

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

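When standard probing accuracy saturates, a "fragility" metric is introduced as a complementary measure, offering a new perspective for LLM pretraining analysis.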

arXiv:2606.11375v1 Announce Type: cross Abstract: Standard linear probing declares a property "encoded" when a classifier on hidden states achieves hi…

LLM pretraining analysis, linear probing, fragility metric, hidden states, property encoding

7

📝 Deep Tech · arXiv Machine Learning · 2026-06-10

AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping

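An adaptive gradient clipping method, accepted at ICML, that effectively improves LLM pretraining stability; a new advance in AI training optimization.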

arXiv:2502.11034v3 Announce Type: replace Abstract: Loss spikes remain a persistent obstacle in large-scale language model pretraining. While previous…

AdaGC, adaptive gradient clipping, LLM pretraining, pretraining stability, ICML 2026

8

📝 Deep Tech · arXiv AI · 2026-06-10

Unifying Local Communications and Local Updates for LLM Pretraining

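This paper proposes unifying local communication and local update strategies to improve communication efficiency in LLM pretraining, a new perspective on distributed training.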

arXiv:2606.11081v1 Announce Type: cross Abstract: Communication-efficient pre-training of LLMs is increasingly important as training draws on compute …

LLM pretraining, communication efficiency, distributed training, local updates, bandwidth optimization

9

📝 Deep Tech · arXiv AI · 2026-06-05

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

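The polynomial preconditioning layer (PC Layer) reshapes the singular value spectrum of weight matrices to stabilize LLM pretraining and improve training quality and convergence efficiency.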

arXiv:2606.06470v1 Announce Type: cross Abstract: We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner tha…

polynomial preconditioning, weight parameterization, LLM pretraining, singular value spectrum, training stability

10

🤖 AI·Large Models · arXiv AI · 2026-05-27

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

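Operational experience with LLM pretraining in a real production environment: an empirical analysis of a 504-GPU cluster, from failure detection to recovery.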

arXiv:2605.09370v2 Announce Type: replace-cross Abstract: Large-scale AI training is now fundamentally a distributed systems problem, and hardware fai…

LLM pretraining, GPU clusters, hardware failures, distributed systems, operational analysis

11

🤖 AI·Large Models · arXiv NLP · 2026-05-26

NITP: Next Implicit Token Prediction for LLM Pre-training

Proposes NITP, a new implicit token prediction method that rethinks the LLM pretraining paradigm; accepted at ICML 2026.

arXiv:2605.24956v1 Announce Type: new Abstract: Standard next-token prediction (NTP) supervises language models solely through discrete labels in the …

LLM pretraining, implicit token prediction, ICML 2026, language models, self-supervised learning

12

📝 Deep Tech · arXiv AI · 2026-05-25

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

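A new paper proposes ReCoVer, a system that uses fault-tolerant collectives and versatile workloads to make LLM pretraining more resilient and reduce losses from training interruptions.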

arXiv:2605.11215v2 Announce Type: replace-cross Abstract: Pre-training large language models on massive GPU clusters has made hardware faults routine …

LLM pretraining, fault-tolerant systems, distributed training, collective communication, elastic training

13

🤖 AI·Large Models · arXiv Machine Learning · 2026-05-25

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Counterintuitive? Weak teacher models can also distill LLMs effectively; teacher strength is not the key factor during pretraining.

arXiv:2605.23857v1 Announce Type: new Abstract: Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield b…

large language models, knowledge distillation, pretraining, model compression, weak-to-weak distillation

14

📝 Deep Tech · arXiv AI · 2026-05-23

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

Proposes a minimalist optimizer design that greatly reduces memory usage in LLM pretraining; accepted at ICML 2026.

arXiv:2506.16659v3 Announce Type: replace-cross Abstract: Training large language models (LLMs) relies on adaptive optimizers such as Adam, which intr…

LLM pretraining, memory optimization, optimizer design, minimalist architecture, ICML 2026

15

📝 Deep Tech · arXiv Machine Learning · 2026-05-20

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

Uncovers why SGD trails Adam in LLM pretraining: the key role of large effective learning rates.

arXiv:2605.17787v1 Announce Type: new Abstract: It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptiv…

LLM pretraining, Adam optimizer, SGD gap, effective learning rate, deep learning

16

🤖 AI·Large Models · arXiv Machine Learning · 2026-05-20

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

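LLM pretraining is shifting from compute-bound to data-bound; this paper explores how to generate pretraining tokens from organic data to get past the scaling bottleneck.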

arXiv:2605.17849v1 Announce Type: cross Abstract: LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (orga…

LLM pretraining, data bottleneck, organic data, token generation, scaling laws

17

📝 Deep Tech · arXiv AI · 2026-05-19

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Proposes decoupling data-mixture search from training via model merging, efficiently scaling data mixing strategies for LLM pretraining.

arXiv:2602.00747v2 Announce Type: replace-cross Abstract: Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-trai…

LLM pretraining, data mixing, model merging, decoupled search

18

📝 Deep Tech · arXiv Machine Learning · 2026-05-19

Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts

A new method uses orthogonal growth of MoE to substantially cut LLM pretraining costs and escape the sunk-cost trap.

arXiv:2510.08008v2 Announce Type: replace Abstract: As the computational demands for pre-training Large Language Models (LLMs) continue to surge, the …

LLM pretraining, MoE, orthogonal growth, efficiency gains, compute cost

19

🤖 AI·Large Models · arXiv NLP · 2026-05-19

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Common Corpus, the largest ethical dataset, is released, providing high-quality, compliant data for LLM pretraining.

arXiv:2506.01732v3 Announce Type: replace Abstract: Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and d…

LLM, datasets, pretraining, ethics, commoncorp
