Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs
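A new method combines reinforcement learning with supervised fine-tuning (SFT) through logit averaging, significantly improving the performance and stability of LLM post-training.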
arXiv:2605.20555v1 Announce Type: new Abstract: We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a tr…
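
To make the core idea concrete, here is a minimal sketch of averaging the logits of two policies at a single token position. It is an illustration under stated assumptions, not the paper's implementation: the abstract is truncated, so the function names (`softmax`, `average_logits`), the `weight` parameter, and the toy three-token vocabulary are all hypothetical, and the way the averaged distribution is used during training is not shown.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(ref_logits, policy_logits, weight=0.5):
    """Blend two logit vectors elementwise.

    `weight` is a hypothetical knob for this sketch: 0.5 gives a plain
    average, 0.0 returns the reference logits, 1.0 the policy logits.
    """
    if len(ref_logits) != len(policy_logits):
        raise ValueError("logit vectors must have the same length")
    return [(1.0 - weight) * r + weight * p
            for r, p in zip(ref_logits, policy_logits)]


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
ref_logits = [2.0, 0.5, -1.0]     # frozen reference policy, e.g. SFT
policy_logits = [0.0, 1.5, 0.5]   # policy being trained

mixed_logits = average_logits(ref_logits, policy_logits)
mixed_probs = softmax(mixed_logits)
```

One property worth noting: with equal weights, the softmax of the averaged logits equals the normalized geometric mean of the two policies' token distributions, so a token receives high probability only when neither policy assigns it a very low one.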