Niu Ge's Picks · Past Six Months

🤖 AI · Large Models arXiv AI 2026-05-19

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

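A sobering finding: in LLM agent evaluation, task success rate and commitment integrity often pull in opposite directions. Using human-calibrated side-query probes, NeuroState-Bench exposes a blind spot in conventional outcome-oriented evaluation: the "best model" you rely on may not consistently honor its commitments.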

arXiv:2605.01847v3 Announce Type: replace Abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitmen…

neurostate llm agent commitment benchmark human-cali

📅 Date