牛哥精选 (Niu Ge's Picks) · All

1

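🤖 AI & Large Models · IT之家 · 2026-07-15

Kingsoft Office CEO Zhang Qingyuan: the model capability gap is narrowing; internal development switched entirely to domestic models this June

Domestic models are catching up: Kingsoft Office has moved all of its internal development onto them, marking a turning point for the industry.

IT之家, July 15: At the Kingsoft Office 2026 AI Productivity Conference held today, Kingsoft Office CEO Zhang Qingyuan shared his views on where the large-model industry is heading. He said that large models are becoming less scarce, that the capability gap between models is steadily narrowing, and that no one can build an absolute moat. Zhang said: “Over the past three years, the industry at one point worried that OpenAI would…

Kingsoft Office · Zhang Qingyuan · model capability gap · narrowing · this year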

2

📝 Deep Tech · arXiv NLP · 2026-07-14

Are LLMs ready for HardChoices?

Can large models handle "HardChoices"? This Konvens 2026 paper offers a forward-looking evaluation.

arXiv:2607.11471v1 Announce Type: new Abstract: A lot of research attention has been devoted to checking whether large language models (LLMs) are poli…

llm · hardchoice · evaluation · computational linguistics · large-model capabilities

3

🤖 AI & Large Models · Hacker News Ask · 2026-07-11

Ask HN: What was the last task where only a frontier model could do it?

Measuring the gap between frontier and open models in practice: which tasks can only Opus/GPT handle?

ive been seeing a recurring claim that open (weight) models 6 months behind the frontier are good enough for the majority of ‘work’. if you've had a c…

ai model comparison · frontier models · open models · task scenarios · model capabilities

4

🤖 AI & Large Models · arXiv NLP · 2026-06-30

The Effect of Scripts and Formats on LLM Numeracy

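An empirical study shows that scripts and formats significantly affect LLM numerical reasoning, offering a new angle on improving large models' math performance.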

arXiv:2601.15251v2 Announce Type: replace Abstract: Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling hu…

llm · numerical reasoning · scripts · formats · large-model capabilities

5

🤖 AI & Large Models · Hacker News Ask · 2026-06-24

Ask HN: Best prompt to show that AI isn't ready to take over

A simple request for a list of cars exposes the weak spots in current AI models' commonsense reasoning.

Claude: please give me a list of famous cars named after fish, like the Plymouth Barracuda for example. Comments URL: https://news.ycombinator.com/ite…

ai limitations · commonsense reasoning · prompt testing · hackernews · limits of large-model capabilities

6

🤖 AI & Large Models · Wired · 2026-06-17

‘Dangerous’ AI Models Are Coming No Matter What

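Despite warnings of danger, Anthropic has still released AI models that can find and exploit vulnerabilities, underscoring their double-edged nature.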

The US government crackdown on Anthropic’s Claude Fable 5 and Mythos 5 hides a glaring truth: AI models with advanced hacking capabilities will soon b…

anthropic · mythos · security vulnerabilities · dual use · ai risk

7

📝 Deep Tech · arXiv NLP · 2026-06-12

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

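Proposes a new LLM evaluation paradigm based on low-rank factors that moves beyond the limits of conventional benchmark scores to reveal models' actual capabilities.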

arXiv:2507.20208v2 Announce Type: replace Abstract: Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchm…

large language models · evaluation · low-rank factors · benchmarks · model capabilities · new paradigm

8

📝 Deep Tech · arXiv Machine Learning · 2026-06-09

Scaffold Effects on GAIA: A Controlled Comparison

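A controlled experiment shows how scaffolds confound models' actual capabilities on the GAIA benchmark, offering a key correction for agent evaluation.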

arXiv:2606.08529v1 Announce Type: cross Abstract: Published agent capability scores conflate what a model can do with what its scaffold lets it do, an…

gaia · scaffold · agent evaluation · benchmarks · model capabilities

9

🤖 AI & Large Models · Vercel Blog · 2026-06-09

Claude Fable 5 now available on AI Gateway

Anthropic's new model Claude Fable 5 is live on Vercel AI Gateway, with built-in safety classifiers to guard against high-risk uses.

Claude Fable 5 from Anthropic is now available on AI Gateway. A Mythos-class model, Fable 5 is a notable step up over prior Claude models on long-run…

claude fab · ai gateway · safety classifiers · model capabilities · misuse risk

10

🤖 AI & Large Models · arXiv AI · 2026-06-02

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

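A new framework exploring how LLMs can internalize tool reasoning, dropping the reliance on external documentation and making tool mastery more efficient.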

arXiv:2604.10788v2 Announce Type: replace-cross Abstract: Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Lang…

large language models · tool-integrated reasoning · internalized reasoning · external tools · model capability improvement

11

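🤖 AI & Large Models · 量子位 · 2026-05-25

Claude passes fewer than 4%: SaaS-Bench shreds the Computer-Use fantasy of "fully automated office work"

Claude Opus 4.7 fully passed only 3.8% of 106 real office tasks; this brutal evaluation shreds the fantasy of fully automated AI office work.

UniPat AI has released the SaaS-Bench evaluation: among Claude and other mainstream large models, the highest full-pass rate on real office tasks is only 3.8%, so fully automated AI office work remains far from practical.

pass rate below · shreds · fully automated office work · fantasy · claude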

12

🤖 AI & Large Models · arXiv NLP · 2026-05-20

Exploring Lightweight Large Language Models for Court View Generation

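Lightweight large models show promise in legal AI; this paper systematically explores how models with fewer than 2B parameters perform on court view generation.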

arXiv:2605.16770v1 Announce Type: new Abstract: Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), i…

lightweight large models · legal ai · court view generation · case facts · model capabilities
