1
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
提出稀疏到稠密奖励原则,四阶段后训练流程更高效利用稀缺标注数据,为LLM推理优化提供新范式。
arXiv:2605.12483v2 Announce Type: replace-cross Abstract: When labeled verifiable training data is scarce, each checked example should be used where i…