Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
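Proposes a dual-token constraint method that stabilizes knowledge and improves reasoning, addressing the problem of uniform token optimization in RLVR.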
arXiv:2507.15778v2 (announce type: replace). Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method…