rePIRL: Learn PRM with Inverse RL for LLM Reasoning
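Uses inverse reinforcement learning to automatically learn a process reward model from reasoning trajectories, improving the complex reasoning ability of large language models.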
arXiv:2602.07832v2 Announce Type: replace Abstract: Process rewards have been widely used in deep reinforcement learning to improve training efficienc…