Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
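LLM pretraining is shifting from a compute-bound to a data-bound regime; this paper explores how to generate pretraining tokens from organic data to get past the scaling bottleneck.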
arXiv:2605.17849v1 Announce Type: cross Abstract: LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (orga…