TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
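Proposes TokenRatio, a method that achieves principled token-level preference optimization via ratio matching, moving beyond the sequence-level limitation of DPO to align language models more precisely.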
arXiv:2605.12288v2 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language m…
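For context on the sequence-level limitation mentioned in the summary, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not TokenRatio itself, whose formulation is not given in the truncated abstract. The function names, the plain-list inputs, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def sequence_logprob(token_logprobs):
    """Collapse per-token log-probabilities into one sequence-level score."""
    return sum(token_logprobs)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a list of per-token log-probabilities (hypothetical
    inputs for illustration). Tokens are summed before the comparison, so
    the loss only sees one scalar per response: this is the sequence-level
    granularity that token-level methods aim to refine.
    """
    chosen_logratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_logratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is equal to softplus(-margin)
    return softplus(-margin)
```

When the policy matches the reference model, the margin is zero and the loss equals log 2. Raising the policy's log-probability on the chosen response lowers the loss, regardless of which tokens account for the gain.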