MathAtlas: A Benchmark for Autoformalization in the Wild
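The first large-scale benchmark for autoformalization of graduate-level mathematics; its 52k samples fill the gap in research-level mathematics.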
arXiv:2605.14061v1 Announce Type: new Abstract: Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, whi…
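VLRS-Bench is the first vision-language benchmark designed specifically for complex reasoning in remote sensing. It builds 2,000 high-difficulty question-answer pairs across three dimensions (cognition, decision, and prediction), reveals significant bottlenecks in existing MLLMs on remote sensing reasoning, and points to key directions for multimodal AI in remote sensing applications.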
arXiv:2602.07045v2 Announce Type: replace-cross Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoni…
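A thought-provoking finding: task success rate and commitment integrity often diverge in LLM agent evaluation. Using human-calibrated side-query probes, NeuroState-Bench exposes a blind spot in traditional outcome-oriented evaluation: the "best model" you trust may not consistently honor its commitments.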
arXiv:2605.01847v3 Announce Type: replace Abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitmen…
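The first benchmark to evaluate AI agents on particle physics experiment reproduction tasks, testing long-horizon reasoning and scientific tool use and filling the gap left by existing benchmarks that lack real scientific complexity.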
arXiv:2605.13950v1 Announce Type: cross Abstract: Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but exis…