What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
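Must multi-turn dialogue agents settle for one-size-fits-all distillation? This paper offers a smart selection strategy for when to distill and what to distill.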
arXiv:2605.19447v1. Abstract: Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignme…