AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
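A magical-manor-style evaluation that tests whether LLM agents can reason on their own and correct their actions in unfamiliar tool scenarios.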
arXiv:2605.07926v2. Abstract: As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability…