牛哥精选 · Last three months

📝 Deep Tech · arXiv Machine Learning · 2026-06-30

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

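A paper that questions the presumed advantage of online imitation learning in LLM post-training, with an in-depth analysis of the key roles played by non-realizability and horizon.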

arXiv:2606.30445v1 Announce Type: new Abstract: Online imitation learning (IL), particularly on-policy distillation, has emerged as a strong LLM post-…

llm · online imitation learning · post-training · supervised fine-tuning · realizability

🤖 AI & Large Models · arXiv AI · 2026-05-29

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

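Examines how SFT can be optimized for its own objective versus for preparing a model for reinforcement learning, pointing to a key shift in large-model training strategy.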

arXiv:2602.01058v2 Announce Type: replace-cross Abstract: Post-training of reasoning LLMs is a holistic process that typically consists of an offline …

sft · supervised fine-tuning · reinforcement learning · large-model training · pre-training strategy

🤖 AI & Large Models · arXiv Machine Learning · 2026-05-29

On-Policy Replay for Continual Supervised Fine-Tuning

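Proposes On-Policy Replay to address catastrophic forgetting in continual supervised fine-tuning, offering a new approach to efficient incremental learning for LLMs.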

arXiv:2605.29495v1 Announce Type: new Abstract: Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs…

on-policy · continual supervised fine-tuning · catastrophic forgetting · large-model fine-tuning · continual learning

📝 Deep Tech · arXiv AI · 2026-05-28

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

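Research from ICLR 2026: reinforcement learning makes reasoning more compact, while supervised fine-tuning makes it more expansive, revealing a fundamental difference between the two paradigms in large-model reasoning.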

arXiv:2509.21128v2 Announce Type: replace Abstract: Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable …

reinforcement learning · supervised fine-tuning · reasoning LLMs · iclr2026 · comparative study

📝 Deep Tech · arXiv Machine Learning · 2026-05-21

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

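A new method that combines reinforcement learning with supervised fine-tuning through logit averaging, reportedly improving both the performance and the stability of LLM post-training.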

arXiv:2605.20555v1 Announce Type: new Abstract: We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a tr…

large language models · reinforcement learning · supervised fine-tuning · logit aver · post-training

🤖 AI & Large Models · arXiv Machine Learning · 2026-05-20

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

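A paper on LLM distillation that unifies the SFT, DAgger, offline RL, and OPD perspectives, decoupling KL from trajectories to offer a new theoretical framework for model optimization.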

arXiv:2605.16826v1 Announce Type: new Abstract: Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood…

LLM distillation · KL divergence · supervised fine-tuning · reinforcement learning · trajectory optimization
