Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning
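Non-uniform token-level trust region optimization overcomes conventional limitations to improve the training stability of reinforcement learning for large language models.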
arXiv:2606.10968v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasonin…