How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
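New research examines the limitations of GRPO as an on-policy method; Mu-GRPO uses off-policy training to make LLM reinforcement learning more efficient and reduce compute cost.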
arXiv:2605.17570v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement le…