ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
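Moving beyond the limits of static evaluation: using executable interactive benchmarks to dynamically test the real capabilities of command-line agents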
arXiv:2605.14133v1. Abstract: Interactive agent benchmarks face a tension between scalable construction and realistic workflow evalu…