Value-Gradient Hypothesis of RL for LLMs
Uses the value-gradient hypothesis to explain why PPO and GRPO work, offering a new theoretical framework for LLM post-training.
arXiv:2605.21654v1 Announce Type: cross Abstract: Reinforcement learning substantially improves pretrained language models, but it remains understudie…
A survey that systematically reviews reinforcement learning in LLM post-training, covering frontier methods such as RLHF, DPO, and RLVR.
arXiv:2407.16216v4 Announce Type: replace Abstract: Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still pr…
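A local-LLM-driven tool that automatically completes submission checklists for academic manuscripts, making submission more efficient.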
arXiv:2605.16377v1 Announce Type: cross Abstract: Transparent and standardized reporting is essential for reproducible scientific research, yet adhere…
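An in-depth technical paper that reveals the essential differences between DPO and PPO and challenges the conventional "supervised learning vs. reinforcement learning" view.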
arXiv:2512.00778v2 Announce Type: replace Abstract: Preference optimization (PO) is indispensable for large language models (LLMs), with methods such …
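OpenAI releases PPO, a new reinforcement learning algorithm that is simple to tune, performs strongly, and has become the default algorithm.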
We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of…
A single human demonstration teaches an AI agent to score 74,500 on Montezuma's Revenge, setting a new high-score record.
We’ve trained an agent to achieve a high score of 74,500 on Montezuma’s Revenge from a single human demonstration, better than any previously publishe…