Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates
Uncovering the root cause of why SGD underperforms Adam in LLM pre-training: the key role of large effective learning rates.
arXiv:2605.17787v1 Announce Type: new Abstract: It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptiv…