Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
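Focuses on a hidden problem in long chain-of-thought training, traces that reach the correct answer but are still harmful, and proposes a diagnostic method to help you avoid model-training pitfalls.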
arXiv:2605.29288v1 Announce Type: new Abstract: Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet …