Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
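The first benchmark to evaluate AI agents on reproducing particle physics experimental analyses. It tests long-horizon reasoning and scientific tool use, filling a gap left by existing benchmarks that lack real scientific complexity.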
arXiv:2605.13950v1 (cross-listed). Abstract: Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but exis…