牛哥精选 (Niu Ge's Picks) · Last three months


📝 Deep Tech arXiv NLP 2026-06-04

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

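A new paper proposes the Agent Planning Benchmark for diagnosing and evaluating the planning capabilities of LLM agents, filling a gap in this area of evaluation.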

arXiv:2606.04874v1 Announce Type: new Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason …

agent plan benchmark llm agents diagnostic planning capability

💰 Business & Tech TechCrunch 2026-06-04

Benchmark raises its first-ever growth fund as part of $2B capital raise

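Benchmark, a top-tier Silicon Valley venture firm, is setting up its first growth fund as part of a $2 billion capital raise, a major shift in its investment strategy.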

The legendary firm abandons its more than 20-year tradition of keeping its funds to about $425 million.

benchmark venture capital growth fund Silicon Valley tech funding

🤖 AI·LLMs arXiv Machine Learning 2026-06-02

OmniEEG-Bench: A Standardized Evaluation Benchmark for EEG Foundation Models

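The first standardized evaluation benchmark for EEG foundation models, spanning multiple tasks and datasets and pushing EEG AI research toward more consistent standards.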

arXiv:2606.00815v1 Announce Type: new Abstract: Electroencephalography (EEG) supports a variety of brain-computer interface (BCI) tasks ranging from b…

eeg foundation benchmark evaluation standardiz

📝 Deep Tech arXiv AI 2026-05-29

Benchmarking at the Edge of Comprehension

An academic paper on benchmarking at the edge of comprehension, offering a new perspective on AI evaluation.

arXiv:2602.14307v3 Announce Type: replace Abstract: As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they a…

benchmarking comprehension boundary nlp AI evaluation paper

🤖 AI·LLMs arXiv NLP 2026-05-27

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

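How do memory architectures affect Text-to-SQL performance in multi-turn conversations? A benchmark and empirical study reveals key design differences.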

arXiv:2605.26394v1 Announce Type: new Abstract: Multi-turn Text-to-SQL is central to enterprise analytics yet remains predominantly evaluated in singl…

text-to-sq multi-turn dialogue memory architecture benchmarking empirical study

🤖 AI·LLMs arXiv AI 2026-05-19

MathAtlas: A Benchmark for Autoformalization in the Wild

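The first large-scale benchmark for autoformalization of graduate-level mathematics, with 52k samples that fill the gap in research-level mathematics.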

arXiv:2605.14061v1 Announce Type: new Abstract: Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, whi…

autoformal benchmark graduate m mathatlas ai for mat

🤖 AI·LLMs arXiv AI 2026-05-19

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

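VLRS-Bench is the first vision-language benchmark designed specifically for complex reasoning in remote sensing. It builds 2,000 high-difficulty question-answer pairs along three dimensions (cognition, decision, and prediction), exposes significant bottlenecks in existing MLLMs on remote sensing reasoning, and points to key directions for multimodal AI in remote sensing applications.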

arXiv:2602.07045v2 Announce Type: replace-cross Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoni…

vlrs-bench remote sen vision-lan reasoning benchmark

🤖 AI·LLMs arXiv AI 2026-05-19

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

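A thought-provoking finding: task success rate and commitment integrity often pull in opposite directions in LLM agent evaluation. Using human-calibrated side-query probes, NeuroState-Bench exposes a blind spot in conventional outcome-oriented evaluation. The "best model" you trust may not consistently honor its commitments.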

arXiv:2605.01847v3 Announce Type: replace Abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitmen…

neurostate llm agent commitment benchmark human-cali

📝 Deep Tech arXiv AI 2026-05-19

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

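The first benchmark to evaluate AI agents on reproducing particle physics analyses, testing long-horizon reasoning and scientific tool use and filling the gap left by existing benchmarks that lack real scientific complexity.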

arXiv:2605.13950v1 Announce Type: cross Abstract: Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but exis…

ai agents benchmark reproducti particle p lhc
