LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
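Proposes the LambdaPO framework, which uses a lambda operator to optimize the policy of reasoning language models, significantly improving logical reasoning ability.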
arXiv:2605.19416v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning ali…