Niu Ge's Picks (牛哥精选) · This Week

1

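🤖 AI & LLMs · QbitAI (量子位) · 2026-05-25

Claude passes fewer than 4% of tasks: SaaS-Bench shreds the Computer-Use fantasy of "fully automated office work"

Claude Opus 4.7 fully passed only 3.8% of 106 real-world office tasks; this harsh evaluation shreds the fantasy of fully automated AI office work.

UniPat AI released the SaaS-Bench evaluation: for Claude and other mainstream LLMs, the best full-pass rate on real office tasks was only 3.8%, so fully automated AI office work is still far from practical.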


2

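🚀 Product Watch · TechCrunch · 2026-05-23

We tried Google’s AI glasses and they’re almost there

Google's AI glasses prototype has been tested by outsiders; the experience is close to mature and marks a step forward for smart wearables.

Google demoed prototype Android XR glasses that overlay Gemini-powered translation, navigation, and other information directly into your field of view…

AI glasses, Google, prototype review, wearables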

3

🤖 AI & LLMs · arXiv NLP · 2026-05-22

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Proposes the Agentic CLEAR framework, which automates multi-level evaluation of LLM agents to improve evaluation efficiency and objectivity.

arXiv:2605.22608v1 Announce Type: new Abstract: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with d…

LLM agents, evaluation automation, multi-level evaluation, AI safety, model evaluation

4

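☁️ Cloud Services · ITHome (IT之家) · 2026-05-21

[ITHome Review Lab] UGREEN NAS Private Cloud DXP4800 GT in depth: four bays with dual 10GbE, a spec sheet that really is “GT”

A four-bay NAS with dual 10GbE ports and ECC memory: UGREEN maxes out the spec sheet on this “GT” model, which suits high-performance local storage and video production studios.

In recent years, rapid gains in mobile device performance have come with a sharp growth in personal data. A phone with 64GB of storage used to be enough; today even 512GB often feels short. This trend is pushing more and more users to look at and switch to “NAS” (network-attached storage), that is, a personal private cloud. The core of a personal private cloud is that it is private. Unlike public cloud services that rely on remote servers…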


5

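🎨 Design Tools · ITHome (IT之家) · 2026-05-21

Spy shots of the production BMW Speedtop shooting brake surface; global run of 70 units already sold out

Spy shots show the production version of BMW's Speedtop shooting brake in the metal; all 70 units of the global limited run have already been sold.

ITHome, May 21: In two days it will be one year since BMW unveiled the Speedtop. Based on the BMW M8, this two-door shooting brake debuted at the 2025 Concorso d'Eleganza Villa d'Este and is BMW's next limited-run bespoke model after the Skytop of the year before last. A true shooting-brake coupe, the striking custom luxury car has now been spotted road-testing at the Nürburgring…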


6

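📝 Deep Tech · Hacker News Show · 2026-05-21

Show HN: Llama CPU Benchmarks

TurboQuant claims an 8x speedup, but measured end to end on CPU it is 2.2x slower, and Qwen's accuracy drops by 17 percentage points; don't be fooled by synthetic numbers.

Article URL: https://deemwar-products.github.io/llama-cpu-benchmarks/ Comments URL: https://news.ycombinator.com/item?id=48212222 Points: 1 # Comments…

Llama, CPU benchmarks, model quantization, TurboQuant, performance evaluation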

7

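🤖 AI & LLMs · arXiv AI · 2026-05-21

BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction

The first LLM benchmark designed specifically for engineering construction; its physics-aligned interactive tests reveal what LLMs can actually build.

arXiv:2510.16559v5 Announce Type: replace Abstract: Engineering construction automation aims to transform natural language specifications into physica…

BuildArena, physics-aligned, interactive benchmark, engineering construction, LLM evaluation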

8

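🤖 AI & LLMs · arXiv AI · 2026-05-21

BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

Releases the first benchmark dedicated to evaluating LLM-integrated knowledge graph generation, filling a key gap in the field.

arXiv:2605.19518v1 Announce Type: new Abstract: Generating Knowledge Graphs (KGs) remains one of the most time-consuming and labor-intensive tasks for…

LLMs, knowledge graphs, benchmarks, LLM evaluation, knowledge construction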

9

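🤖 AI & LLMs · Ruan Yifeng's Blog (阮一峰网络日志) · 2026-05-21

Tech Enthusiast Weekly (Issue 397): Wealth Is Concentrating in AI

A hands-on test of four major AI models estimating the carbohydrates in food found striking errors: the worst case swung by 429 g, and the models could not even match the values printed on the packaging.

A weekly record of tech content worth sharing, published every Friday. ...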

AI model evaluation, food recognition

10

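🤖 AI & LLMs · 36Kr (36氪) · 2026-05-21

Artificial Analysis: Qwen 3.7 is the top Chinese model and ranks in the global top five

Qwen 3.7 took first place among Chinese models in the Artificial Analysis ranking and entered the global top five, another sign of recognition for Chinese AI.

36Kr has learned that on May 21 the third-party firm Artificial Analysis published its latest global LLM leaderboard. Alibaba's newly released flagship model Qwen3.7-Max scored 56.6, close to the strongest GPT, Claude, and Gemini models, ranking fifth worldwide and first among Chinese models. Qwen3.7-Max is reportedly about to go live on Alibaba Cloud Bailian (百炼) as a public API service.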

Qwen 3.7

11

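🤖 AI & LLMs · arXiv Machine Learning · 2026-05-20

OpenCompass: A Universal Evaluation Platform for Large Language Models

OpenCompass is an open-source LLM evaluation platform whose unified benchmarking helps with model selection and comparison.

arXiv:2605.19276v1 Announce Type: cross Abstract: In recent years, the field of artificial intelligence has undergone a paradigm shift from task-speci…

LLM evaluation, open-source platform, general-purpose evaluation, benchmarks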

12

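📝 Deep Tech · Meituan Tech Team (美团技术团队) · 2026-05-20

LARYBench released: defining an ImageNet for embodied action representations, and the first measure of generalizable representations learned from human video

Defines, for the first time, an ImageNet-style benchmark for embodied action representations, and shows that human video data can drive generalizable robot learning.

LARYBench (Latent Action Representation Yielding Benchmark) is a systematic evaluation benchmark that guides the learning of general latent action representations from large-scale visual data. Experiments show that on both action generalization and control precision, general-purpose vision models significantly outperform action-expert models designed specifically for embodied AI. Embodied action representations…

LARYBench, embodied AI, action representations, human video, generalizable learning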

13

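📝 Deep Tech · Meituan Tech Team (美团技术团队) · 2026-05-20

Managing AI Coding with an Agent-Evaluation Mindset: Lessons from an AI-Driven Refactor of 310,000 Lines of Code

A hands-on AI refactor of 310,000 lines of code, where the key was using an agent-evaluation mindset to align humans and AI; three lessons address the main pain points of managing AI coding.

When more than 90% of code is generated by AI, what decides where the system goes is not who writes faster but the ability to constrain the AI. Without unified conventions, AI only multiplies the chaos. Drawing on a 310,000-line refactoring effort, this article shares how we manage AI coding with an agent-evaluation mindset: through tech-debt triage, building Rules, a refactoring SOP, and a Pre-PR mechanism, turning the refactor…

AI coding, code refactoring, tech debt, agent evaluation, human-AI alignment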

14

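📝 Deep Tech · arXiv Machine Learning · 2026-05-20

The Evaluation Game: Beyond Static LLM Benchmarking

Moves beyond static benchmarks to explore a new paradigm for LLM evaluation, bringing in gamification ideas to break the limits of traditional testing.

arXiv:2605.19377v1 Announce Type: new Abstract: As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered …

LLM evaluation, benchmarks, dynamic evaluation, gamification, AI evaluation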

15

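📝 Deep Tech · arXiv Machine Learning · 2026-05-20

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Uses cognitive embeddings to select evaluation subsets efficiently, sharply cutting the cost of LLM evaluation while preserving predictive accuracy.

arXiv:2510.26384v2 Announce Type: replace-cross Abstract: The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks …

LLM evaluation, subset selection, compute efficiency, cognitive embeddings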

16

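📝 Deep Tech · arXiv Machine Learning · 2026-05-20

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Uses evidence-calibrated query clustering to pin down the boundaries of LLM capabilities, offering a new methodology for LLM evaluation.

arXiv:2605.17110v1 Announce Type: cross Abstract: Query clustering organizes queries into groups that reflect shared latent capability demands, enabli…

LLMs, query clustering, capability evaluation, evidence calibration, LLM evaluation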

17

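📝 Deep Tech · arXiv AI · 2026-05-20

PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

A new benchmark for evaluating visual object removal that tackles the problem of perceptual coherence and tracks human judgment more closely than existing metrics.

arXiv:2605.14534v1 Announce Type: cross Abstract: Evaluating object removal in images and videos remains challenging because the task is inherently on…

perceptual coherence, visual media, object removal, evaluation benchmark, computer vision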

18

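🤖 AI & LLMs · arXiv AI · 2026-05-20

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Breaks through the limits of static evaluation by using executable interactive benchmarks to test dynamically what command-line agents can really do.

arXiv:2605.14133v1 Announce Type: new Abstract: Interactive agent benchmarks face a tension between scalable construction and realistic workflow evalu…

ClawForge, command-line agents, interactive benchmarks, executable benchmarks, AI evaluation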

19

🚀 Product Watch · Wired · 2026-05-20

Foreo Discount Codes and Deals: Up to 50% Off

Up to 50% off Foreo beauty devices, plus an early look at new LED masks and hair brushes.

Save on Foreo favorites, including LUNA cleansing brushes, BEAR microcurrent devices, and masks and accessories to level up your daily skincare routin…

Foreo, discounts, beauty devices, LED masks, hair brushes

20

🚀 Product Watch · The Verge · 2026-05-20

Mixtape is a musical portrait of teenage life

A game about teenage life set to music; beneath its quiet surface lies a sense of nostalgia and the authenticity of youth.

Playing Mixtape is like playing a video game version of a high school movie. Kids banter about the meaning of life and the theme songs that would play…

games, teens, music, nostalgia, review
