Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
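Beyond correctness: harmonizing process and outcome rewards through reinforcement learning, offering a new perspective on model training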
arXiv:2509.03403v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves final-answer accuracy on reasoning …