RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
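Proposes RUBRIC-ARROW, a method that uses alternating pointwise rubric reward modeling to improve LLM post-training performance in non-verifiable domains.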
arXiv:2605.29156v1 Announce Type: new Abstract: Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute s…