A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
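Can an LLM improve its reasoning ability by training only on its own sampled responses, with no external reward? The SePT method achieves this kind of self-evolution!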
arXiv:2510.18814v3. Abstract: Can language models improve their reasoning performance without external rewards, using only…