Niu Ge's Picks (牛哥精选) · Last three months

📝 Deep Tech · arXiv · Machine Learning · 2026-06-09

Cheap Reward Hacking Detection

Low-cost detection of reward hacking, offered as a new approach to AI safety and alignment.

arXiv:2606.08893v1 Announce Type: new Abstract: A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where em…

reward hacking · detection methods · AI safety · reinforcement learning · low cost

📝 Deep Tech · arXiv · NLP · 2026-05-28

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

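Tackles the reward-hacking risk in LLM agents at its root with a new constraint-optimization method, aiming to make autonomous agents safer and more reliable.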

arXiv:2605.27375v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous intera…

LLM · constraint optimization · reward hacking · autonomous agents · safety

🤖 AI & Large Models · Hacker News · LLM · 2026-05-25

My LLM optimization loop reward-hacked its own benchmark (and other lessons) [pdf]

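A first-hand account of an LLM optimization loop that reward-hacked its own benchmark, along with the lessons it offers for model training.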

Article URL: https://github.com/CodeReclaimers/bishop-loop-experiment-3/blob/main/paper/paper.pdf Comments URL: https://news.ycombinator.com/item?id=4…

LLM · reward hacking · benchmarks · model optimization · training lessons

📝 Deep Tech · arXiv · Machine Learning · 2026-05-20

EvilGenie: A Reward Hacking Benchmark

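A benchmark for reward hacking in programming settings, used to assess how readily large models game their rewards and the alignment risk that poses.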

arXiv:2511.21654v2 Announce Type: replace Abstract: We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems…

reward hacking · benchmarks · AI safety · alignment · large models
