TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition
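A new advance in LLM reasoning training: training-time decomposition tackles zero-reward problems, letting models learn from failed trajectories!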
arXiv:2606.09883v1 Announce Type: cross Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by pos…