Beyond RLHF: A Unified Theoretical Framework of Alignment
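A unified theoretical framework for alignment that goes beyond RLHF: it abstractly formalizes multiple alignment algorithms, reveals the intrinsic connections among them, and offers a new perspective on AI safety.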
arXiv:2506.01523v2 Announce Type: replace Abstract: Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm f…