牛哥精选 (Niu Ge's Picks) · Past three months

📝 Deep Tech · arXiv AI · 2026-07-13

AlphaZero in Sparsely Rewarded Games: Limits and Auxiliary Supervision

DeepMind's AlphaZero shows limited performance in sparse-reward games; the paper proposes auxiliary supervision to improve it.

arXiv:2607.08984v1 Announce Type: cross Abstract: AlphaZero has demonstrated that a neural-guided Monte Carlo Tree Search can achieve superhuman perfo…

Tags: AlphaZero, sparse rewards, auxiliary supervision, reinforcement learning, game AI

📝 Deep Tech · arXiv Machine Learning · 2026-06-16

ExpRL: Exploratory RL for LLM Mid-Training

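Uses exploratory reinforcement learning to improve LLM mid-training, addressing insufficient base-model coverage and strengthening reasoning ability.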

arXiv:2606.17024v1 Announce Type: new Abstract: Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but …

Tags: reinforcement learning, LLM reasoning, mid-training, exploratory methods, sparse rewards

🤖 AI & LLMs · arXiv Machine Learning · 2026-06-08

Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

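Has an LLM guide reinforcement learning policies in sparse-reward environments, using uncertainty estimation to make decisions more reliable; code is available for reproduction.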

arXiv:2606.06673v1 Announce Type: new Abstract: Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning…

Tags: uncertainty-aware, large language models, policy shaping, sparse rewards, reinforcement learning

📝 Deep Tech · arXiv AI · 2026-05-20

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

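Proposes a sparse-to-dense reward principle: a four-stage post-training pipeline that uses scarce labeled data more efficiently, offering a new paradigm for optimizing LLM reasoning.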

arXiv:2605.12483v2 Announce Type: replace-cross Abstract: When labeled verifiable training data is scarce, each checked example should be used where i…

Tags: LLM post-training, reinforcement learning, reward design, GRPO, knowledge distillation

🔧 Developer Tools · arXiv AI · 2026-05-19

Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

A quick look at a semi-supervised approach to sparse rewards in reinforcement learning, from recent arXiv research.

arXiv:2501.19128v5 Announce Type: replace-cross Abstract: In many real-world scenarios, reward signals for agents are exceedingly sparse, making it cha…

Tags: reinforcement learning, sparse rewards, semi-supervised learning, reward shaping, machine learning
