General Preference Reinforcement Learning
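Submitted to NeurIPS 2026. Proposes a general preference reinforcement learning method that provides a more solid theoretical foundation for RLHF and related areas.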
arXiv:2605.18721v1 Announce Type: new Abstract: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Onl…