Niu Ge's Picks · Three Months

1

🤖 AI · Large Models arXiv NLP 2026-07-08

EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

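EgoDyn-Bench is a new benchmark that specifically tests ego-motion understanding in vision foundation models for autonomous driving, diagnosing weaknesses to improve safety.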

arXiv:2604.22851v2 Announce Type: replace-cross Abstract: While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving…

autonomous driving · foundation models · ego-motion understanding · visual perception · evaluation benchmark

2

🤖 AI · Large Models arXiv AI 2026-07-07

K9-Bench: Evaluating Multimodal LLMs on Canine-Centric Videos

The first benchmark for evaluating multimodal LLMs on dog videos, testing how well models really understand our canine friends.

arXiv:2607.02680v1 Announce Type: cross Abstract: MLLMs have shown strong zero-shot capabilities across diverse inputs such as across images, video, a…

k9-bench · multimodal large models · evaluation benchmark · canine videos · AI understanding

3

📝 Deep Tech arXiv Computer Vision 2026-07-07

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

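A new benchmark for evaluating visual knowledge understanding in multimodal large models, revealing gaps in AI's grasp of physical and social common sense.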

arXiv:2511.20272v2 Announce Type: replace Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they ofte…

multimodal large models · visual knowledge · evaluation benchmark · commonsense understanding · mllm

4

📝 Deep Tech arXiv AI 2026-06-19

Benchmarking Agentic Review Systems

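A comparative evaluation of open-source and proprietary agentic review systems, revealing new challenges for peer review in the era of AI-assisted research.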

arXiv:2606.19749v1 Announce Type: new Abstract: A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review s…

peer review · agentic systems · benchmarking · AI-assisted research · open-source systems

5

📝 Deep Tech arXiv NLP 2026-06-18

TW-LegalBench: Measuring Taiwanese Legal Understanding

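The first LLM evaluation benchmark dedicated to Taiwan's legal system, filling a gap in legal AI's understanding of region-specific law.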

arXiv:2606.18699v1 Announce Type: new Abstract: Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their perfor…

Taiwanese law · legal benchmark · large language models · legal reasoning · evaluation

6

🤖 AI · Large Models arXiv AI 2026-06-12

WildIFEval: Instruction Following in the Wild

The new WildIFEval benchmark moves instruction-following evaluation out of the lab and into the real world.

arXiv:2503.06573v3 Announce Type: replace-cross Abstract: Recent LLMs have shown remarkable success in following user instructions, yet handling instr…

wildifeval · instruction following · evaluation benchmark · large language models · natural settings

7

🤖 AI · Large Models arXiv NLP 2026-06-11

AI Coding Agents Can Reproduce Social Science Findings

Explores whether AI coding agents can systematically reproduce social science findings; this new study fills an evaluation gap.

arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provide…

AI coding agents · social science · reproducibility research · evaluation benchmark · systematic evaluation

8

🤖 AI · Large Models arXiv AI 2026-06-10

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

Reveals where SOTA models fall short when evaluating real human reasoning; the new RealMath-Eval dataset challenges LLMs' abilities as judges.

arXiv:2606.10254v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-schoo…

LLM evaluation · mathematical reasoning · human reasoning · evaluation benchmark

9

🤖 AI · Large Models arXiv AI 2026-06-03

Evaluating Relational Reasoning in LLMs with REL

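The new REL benchmark systematically evaluates relational reasoning in large models, revealing the shortcomings and potential of current LLMs.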

arXiv:2604.12176v2 Announce Type: replace Abstract: Relational reasoning is the ability to infer relations that jointly bind multiple entities, attrib…

llm · relational reasoning · rel · evaluation benchmark · reasoning ability

10

📝 Deep Tech arXiv AI 2026-06-02

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

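The new AgentProcessBench benchmark diagnoses the process quality of each step taken by tool-using agents, extending the assessment of agent reliability from outcomes to process.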

arXiv:2603.14465v2 Announce Type: replace Abstract: While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in lon…

agentproce · tool-using agents · step-level process quality · agent evaluation benchmark · process supervision

11

🤖 AI · Large Models arXiv NLP 2026-06-02

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

A new benchmark for evaluating LLM agents under complex task dependencies and human-aligned user simulation.

arXiv:2606.01815v1 Announce Type: new Abstract: Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect use…

llm agent · task dependencies · user simulation · evaluation benchmark · human-AI alignment

12

🤖 AI · Large Models arXiv AI 2026-06-02

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

The ForeSci benchmark is the first to systematically evaluate LLM agents on forward-looking judgment in AI research.

arXiv:2606.00644v1 Announce Type: new Abstract: AI research often requires decisions before future evidence exists: which bottleneck to attack, which …

LLM agents · forward-looking research judgment · benchmarking · foresci · AI research evaluation

13

📝 Deep Tech arXiv NLP 2026-06-01

ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

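The paper proposes the ValueGround benchmark, which evaluates how well multimodal large models understand visual values across cultural contexts, revealing shortcomings in current models' cultural adaptability.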

arXiv:2604.06484v3 Announce Type: replace Abstract: Cultural values are expressed not only through language but also through visual scenes and everyda…

multimodal large models · cultural values · visual understanding · evaluation benchmark · mllms

14

🤖 AI · Large Models arXiv AI 2026-05-28

ChildEval: When large language models meet children's personalities

Large models meet children's personalities: ChildEval, the first evaluation benchmark targeting children's personalities, has arrived.

arXiv:2605.27805v1 Announce Type: cross Abstract: While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remai…

large models · children's personalities · evaluation benchmark · childeval · personalized AI

15

📝 Deep Tech arXiv NLP 2026-05-27

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

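New research: can current code agents handle complex tasks beyond a single repository? Experiments reveal the limits of their capabilities and the challenges they face.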

arXiv:2603.03194v2 Announce Type: replace Abstract: Current code-agent benchmarks primarily evaluate localized issue resolution within a single target…

code agents · cross-repository bug fixing · large models · software development automation · capability evaluation benchmark

16

🤖 AI · Large Models arXiv AI 2026-05-26

Stop Comparing LLM Agents Without Disclosing the Harness

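Argues that LLM agent comparisons must disclose the evaluation harness, otherwise the comparison is meaningless; it targets a pain point in current research.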

arXiv:2605.23950v1 Announce Type: new Abstract: This position paper argues that, for long-horizon tasks evaluated across models with comparable fronti…

llm agent · evaluation harness · reproducibility · transparency · fair comparison

17

🤖 AI · Large Models arXiv NLP 2026-05-25

Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

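An evaluation of LLMs on open-domain natural language-driven molecule generation; the new benchmark reveals challenges in text-to-molecule alignment.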

arXiv:2412.14642v4 Announce Type: replace Abstract: Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-drive…

LLM evaluation · molecule generation · natural language-driven · open-domain · scientific discovery

18

🤖 AI · Large Models arXiv Machine Learning 2026-05-23

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

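Can multimodal large models handle complex route planning? This paper builds the MapTab benchmark to systematically evaluate MLLMs on multi-criteria route planning in heterogeneous graphs.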

arXiv:2602.18600v3 Announce Type: replace Abstract: Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artific…

mllms · multi-criteria route planning · heterogeneous graphs · route planning · evaluation benchmark

19

📝 Deep Tech arXiv NLP 2026-05-21

MemGym: a Long-Horizon Memory Environment for LLM Agents

MemGym, a long-horizon memory test environment designed for LLM agents, fills a gap in benchmarks for long-horizon tasks.

arXiv:2605.20833v1 Announce Type: new Abstract: Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory ben…

memgym · llm agent · long-term memory · evaluation benchmark · memory environment

20

🤖 AI · Large Models arXiv Machine Learning 2026-05-20

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

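MirrorBench, a new benchmark released at KDD 2026, redefines how human-likeness in conversational agents is evaluated and advances human-computer interaction research.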

arXiv:2601.08118v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly used as human simulators, both for evaluating …

mirrorbenc · conversational agents · human-computer interaction · human-likeness · evaluation benchmark
