RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
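Proposes RLBFF, which uses binary flexible feedback to bridge human preferences and verifiable rewards, improving the efficiency of large-model alignment.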
arXiv:2509.21319v3 Announce Type: replace-cross Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable…