Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
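A new method uses the DPO implicit reward gap to measure sample difficulty, automatically selecting high-quality preference data and improving model training efficiency.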
arXiv:2508.04149v2 Announce Type: replace-cross Abstract: Aligning large language models (LLMs) with human preferences is a critical challenge in AI r…
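As a rough illustration of the idea in the summary, the sketch below scores each preference pair by the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and takes the gap between the chosen and rejected responses. The names `PreferencePair`, `implicit_reward_gap`, `select_hardest`, and `keep_fraction` are hypothetical, and treating a smaller gap as "harder" and keeping a fixed fraction are assumptions made for this example, not details confirmed by the truncated abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    # Summed token log-probabilities of each response, assumed to be
    # precomputed under the policy model and the frozen reference model.
    policy_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_chosen: float
    ref_logp_rejected: float


def implicit_reward_gap(pair: PreferencePair, beta: float = 0.1) -> float:
    """DPO implicit reward of the chosen response minus that of the rejected one."""
    reward_chosen = beta * (pair.policy_logp_chosen - pair.ref_logp_chosen)
    reward_rejected = beta * (pair.policy_logp_rejected - pair.ref_logp_rejected)
    return reward_chosen - reward_rejected


def select_hardest(
    pairs: List[PreferencePair], keep_fraction: float = 0.5, beta: float = 0.1
) -> List[PreferencePair]:
    """Keep the fraction of pairs with the smallest gap.

    Assumption for this sketch: a small gap means the model barely
    separates chosen from rejected, so the pair is treated as difficult.
    """
    ranked = sorted(pairs, key=lambda p: implicit_reward_gap(p, beta))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

In practice the log-probabilities would come from forward passes of the two models over each response; here they are plain floats so the selection logic can be read on its own.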