DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding
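A new alignment method for large models: risk-constrained decoding that accounts for disagreement in human preferences and improves model robustness.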
arXiv:2603.08145v2 Announce Type: replace Abstract: Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective,…