Niu Ge's Picks · Last six months

1

🔓 Open source · Hacker News AI · 2026-07-12

I built a free tool to evaluate AI agent outputs (human labels and LLM judges)

A free, open-source tool for evaluating the quality of AI agent outputs along two dimensions: human labels and LLM judges.

Article URL: https://github.com/AntoineF23/verdict Comments URL: https://news.ycombinator.com/item?id=48875259 Points: 1 # Comments: 0

AI evaluation · LLM judging · human labeling · agent outputs · open-source tools

2

📝 In-depth technical · arXiv Machine Learning · 2026-07-07

Auditing the Audit: Five Failure Modes in Benchmark-Validity Audits

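Can the audit of a benchmark itself go wrong? This paper identifies five typical failure modes in benchmark-validity audits, a key caution for anyone building more reliable AI evaluation.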

arXiv:2607.02586v1 Announce Type: new Abstract: Governance frameworks ask AI providers and auditors for documented evaluation evidence, and perturbati…

benchmarks · audit failure modes · AI evaluation · reliability

3

🤖 AI & large models · arXiv NLP · 2026-07-07

A multilingual hallucination benchmark: MultiWikiQHalluA

A new benchmark for evaluating LLM hallucination, testing factuality and reliability across multiple languages.

arXiv:2605.02504v2 Announce Type: replace Abstract: Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to l…

multilingual · hallucination · benchmarks · AI evaluation · large models · papers

4

🤖 AI & large models · arXiv AI · 2026-07-02

Measuring the Gap Between Human and LLM Research Ideas

A paper examining the gap between human and LLM research ideas, offering a new angle on evaluating AI's research potential.

arXiv:2607.01233v1 Announce Type: cross Abstract: LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge indiv…

LLM · research ideas · human-vs-AI comparison · research ability · AI evaluation

5

🤖 AI & large models · arXiv AI · 2026-07-02

LLMs in the Real World: Evaluating "AI" in Emergency Contexts

A call to action on the urgency of evaluating LLMs in real-world emergency contexts, and one that deserves the industry's attention.

arXiv:2607.00019v1 Announce Type: cross Abstract: This paper offers a call to action. We urge our colleagues in the research community to play a great…

LLM · emergency contexts · AI evaluation · real world · call to action

6

🤖 AI & large models · IT之家 (ITHome) · 2026-07-02

OpenAI launches the GeneBench-Pro benchmark for evaluating AI models' computational biology capabilities

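OpenAI has launched GeneBench-Pro, a benchmark that tests how useful AI is in real biological research settings, including whether it can analyze messy data and support decisions.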

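IT之家 (ITHome), July 1: OpenAI announced the GeneBench-Pro benchmark, intended mainly to evaluate AI models' real research ability on computational biology tasks. It focuses on a model's analytical judgment and choice of methods when faced with messy data, and on whether its findings are sound enough to support follow-up decisions. Reportedly, compared with traditional benchmarks, which typically focus on "whether the model has memorized knowledge" or "whether it can follow fixed…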

benchmark launch · model evaluation · computational biology capabilities

7

📝 In-depth technical · arXiv Machine Learning · 2026-06-30

A Neuroimaging Simulation Framework for Developing and Evaluating Causal AI

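A neuroimaging-based simulation framework that offers a new approach to developing and evaluating causal AI; the code is open source.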

arXiv:2606.28684v1 Announce Type: cross Abstract: Causally linking disease-related factors to image-derived biomarkers provides a powerful pathway to …

neuroimaging · causal AI · simulation framework · AI evaluation · biomedical informatics

8

🤖 AI & large models · arXiv Computer Vision · 2026-06-30

The Human Creativity Benchmark

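A new paper proposes a human creativity benchmark, giving a quantitative yardstick for evaluating AI and human creativity and advancing research on creative AI.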

arXiv:2606.30561v1 Announce Type: cross Abstract: Modern AI evaluation frameworks treat evaluator disagreement as noise to be resolved. In creative do…

creativity benchmark · human creativity · AI evaluation · research papers

9

🤖 AI & large models · arXiv AI · 2026-06-29

Psychometric Comparability of LLM-Based Digital Twins

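Studies the psychometric comparability of LLM-based digital twins, providing a scientific evaluation framework for AI personality simulation.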

arXiv:2601.14264v2 Announce Type: replace-cross Abstract: Large language models (LLMs) act as digital twins for human respondents, yet their psychomet…

LLM · digital twins · psychometrics · AI evaluation

10

🤖 AI & large models · arXiv AI · 2026-06-25

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

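A systematic evaluation of bias mitigation strategies for LLM-as-a-Judge pipelines, showing how effective the different methods are and offering key guidance for building fair AI evaluation.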

arXiv:2604.23178v2 Announce Type: replace Abstract: LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM jud…

llm-as-a-j · bias mitigation · systematic evaluation · fairness · AI evaluation

11

📝 In-depth technical · arXiv AI · 2026-06-23

Is Our Benchmark Enough? An Analysis of Continual Learning for MLLMs

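Are continual learning benchmarks for multimodal large language models good enough? This ICML 2026 Workshop paper takes a close look at the shortcomings of current evaluation.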

arXiv:2606.20961v1 Announce Type: cross Abstract: Continual adaptation is essential for multimodal large language models (MLLMs) deployed across evolv…

MLLMs · continual learning · benchmark evaluation · multimodal large language models · continual adaptation

12

🤖 AI & large models · Hacker News LLM · 2026-06-23

How good a detective is an AI? A Sherlock Holmes board game as an LLM-agent eval

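A Sherlock Holmes board game used to test AI reasoning: the Claude model draws level with the detective, and the results come with a twist.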

Article URL: https://alexweil.github.io/sherlock-agent-eval/ Comments URL: https://news.ycombinator.com/item?id=48644467 Points: 3 # Comments: 0

AI evaluation · LLM agents · reasoning ability · Sherlock Holmes board game

13

📝 In-depth technical · arXiv AI · 2026-06-23

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

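Are multimodal LLMs biased when they act as judges too? This paper exposes calibration failure under cultural ambiguity as a key weakness.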

arXiv:2606.20676v1 Announce Type: cross Abstract: MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is …

MLLM · model calibration · cultural ambiguity · AI evaluation · multimodal

14

📝 In-depth technical · arXiv AI · 2026-06-23

Litmus: Zero-Label, Code-Driven Metric Specification for Evaluating AI Systems

Zero-label and code-driven: Litmus automates the specification of AI evaluation metrics, with no need for traditional human labeling.

arXiv:2606.23403v1 Announce Type: new Abstract: As agentic LLM systems move from prototypes to deployment across increasingly diverse domains, evaluat…

Litmus · zero-label · code-driven · AI evaluation · metric specification

15

📝 In-depth technical · arXiv AI · 2026-06-23

The AI Evaluability Gap: The Missing Layer for Managing Risk and Sustaining Value

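A conceptual framework that identifies the missing layer in AI risk management, the evaluability gap, and proposes viewing evaluation as evidence sufficiency.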

arXiv:2606.21015v1 Announce Type: new Abstract: Organizations deploying AI face two fundamental governance challenges: managing AI risk and sustaining…

AI evaluation · risk management · conceptual framework · evidence sufficiency · evaluability gap

16

📝 In-depth technical · arXiv AI · 2026-06-19

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

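Shows how evaluator bias in multi-agent LLM systems spreads through the network like a virus, undermining the reliability of AI collaboration.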

arXiv:2606.20493v1 Announce Type: cross Abstract: When large language models serve as evaluators in multi-agent systems, their systematic evaluation b…

multi-agent LLM · evaluator bias · bias propagation · AI evaluators · systematic bias

17

📝 In-depth technical · arXiv NLP · 2026-06-12

GENIE: A Fine-Grained Measure for Novelty

A paper proposing GENIE, a fine-grained measure of novelty, offering a new angle on evaluating AI-generated content.

arXiv:2606.12790v1 Announce Type: new Abstract: Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. …

novelty measurement · GENIE · fine-grained methods · AI evaluation · papers

18

🤖 AI & large models · arXiv AI · 2026-06-10

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

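LLMs' self-reported psychological traits diverge sharply from their actual behavior: evidence across 25 models confirms a self-report-behavior gap, a mismatch between what they say and what they do.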

arXiv:2606.09843v1 Announce Type: cross Abstract: Large language models (LLMs) produce stable self-reports on personality inventories, but these self-…

LLM · psychometrics · behavior gap · self-report · AI evaluation

19

🤖 AI & large models · arXiv Computer Vision · 2026-06-09

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

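The first agentic AI benchmark aimed at real field work, with 27 pages of detailed task design and evaluation, giving a new yardstick for agentic AI capabilities in complex settings.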

arXiv:2505.19662v4 Announce Type: replace-cross Abstract: This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field …

fieldworka · agentic AI · benchmarks · field work · ICPR 2026

20

🤖 AI & large models · arXiv Computer Vision · 2026-06-08

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

WorldBench, a multimodal reasoning benchmark that uses visual diversity to challenge the overall capabilities of large AI models.

arXiv:2606.06538v1 Announce Type: new Abstract: In real-world applications, models are expected to perform reliably across diverse settings. Yet, many…

multimodal reasoning · benchmarks · visual diversity · AI evaluation · large models
