CATPO: Critique-Augmented Tree Policy Optimization
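CATPO applies critique-augmented tree policy optimization to significantly improve how efficiently dense rewards are obtained in large language model reasoning.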
arXiv:2606.08346v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving t…