牛哥精选 · All

1

📝 Deep Tech arXiv AI 2026-06-25

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

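This paper identifies "cliff tokens" in LLM reasoning: a single token can be enough to make a mathematical reasoning trace fail, pointing to a root cause of model fragility.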

arXiv:2606.25524v1 Announce Type: new Abstract: Large language models (LLMs) reach high accuracy in mathematical reasoning, but individual traces on t…

LLM · mathematical reasoning · single-token failure triggers · reasoning errors

2

🤖 AI · Large Models arXiv AI 2026-06-24

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

A new paper tests LLMs' mathematical reasoning with unseen random-variable questions to gauge how well the models actually reason.

arXiv:2501.11790v5 Announce Type: replace-cross Abstract: Recent studies have raised significant concerns regarding the reliability of current mathema…

mathematical reasoning · large models · benchmark · random variables

3

🤖 AI · Large Models arXiv NLP 2026-06-18

LLM Parameters for Math Across Languages: Shared or Separate?

Does multilingual mathematical reasoning rely on shared or separate LLM parameters? This ACL 2026 study offers experimental answers.

arXiv:2606.18453v1 Announce Type: new Abstract: Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning per…

LLM · multilingual · mathematical reasoning · parameter sharing · ACL 2026

4

🤖 AI · Large Models arXiv AI 2026-06-10

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

Examines where SOTA models fall short when judging real human reasoning; the new RealMath-Eval dataset challenges LLMs' ability to act as judges.

arXiv:2606.10254v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-schoo…

LLM evaluation · mathematical reasoning · human reasoning · evaluation benchmark

5

📝 Deep Tech arXiv AI 2026-06-08

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

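Behind DeepSeek-R1's "Aha moments": genuine reasoning or sophisticated imitation? This paper dissects the differences between human and LLM mathematical reasoning through a comprehensive empirical comparison.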

arXiv:2606.07410v1 Announce Type: cross Abstract: The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised t…

deepseek-r · mathematical reasoning · Aha moments · human reasoning · LLM reasoning

6

🤖 AI · Large Models arXiv AI 2026-06-04

Arithmetic Pedagogy for Language Models

Can human mathematics pedagogy teach language models to do arithmetic? The GASING method suggests a new approach to training.

arXiv:2606.05106v1 Announce Type: cross Abstract: We investigate whether methods of human mathematics pedagogy can guide the training of language mode…

arithmetic reasoning · language models · pedagogy · GASING · AI training

7

🤖 AI · Large Models arXiv AI 2026-06-03

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

The first curriculum-grounded benchmark for evaluating how well LLMs reason as mathematical research assistants in graph theory.

arXiv:2606.03144v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, …

large language model evaluation · mathematical reasoning · graph theory · benchmark · AI assistants

8

🤖 AI · Large Models arXiv AI 2026-06-02

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

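eMoT, a new memory-evolution mechanism, uses symbolic anchoring and memory corrosion to achieve a 17.6% accuracy improvement on mathematical reasoning tasks and reaches 100% on Game of 24.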

arXiv:2606.02054v1 Announce Type: new Abstract: While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their…

eMoT · memory-of- · symbolic anchoring · memory corrosion · mathematical reasoning

9

🤖 AI · Large Models arXiv Machine Learning 2026-06-02

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

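With no additional training, off-the-shelf LLMs can score mathematical reasoning processes, performing on par with dedicated process reward models (PRMs).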

arXiv:2606.01682v1 Announce Type: cross Abstract: Selecting the best response from multiple small-model samples using a stronger scorer is a simple in…

LLM · process scoring · mathematical reasoning · PRM · training-free

10

🤖 AI · Large Models arXiv AI 2026-06-01

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

A new advance in LLM mathematical reasoning: HERMES works towards reasoning that is both efficient and verifiable.

arXiv:2511.18760v2 Announce Type: replace Abstract: Informal mathematics has been central to modern large language model (LLM) reasoning, offering fle…

large language models · mathematical reasoning · verifiability · efficiency optimization · artificial intelligence

11

📝 Deep Tech arXiv AI 2026-05-29

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

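A paraphrase-quality audit exposed semantic flaws in a mathematical reasoning benchmark: removing just 3.1% of erroneous items is enough to flip the rankings, revealing the blind spot of single-model evaluation.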

arXiv:2605.29001v1 Announce Type: cross Abstract: A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in…

semantic invariance · mathematical reasoning · benchmark audit · model ranking · GPT-4o

12

📝 Deep Tech arXiv AI 2026-05-29

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

A new method steers small language models towards dense math reasoning, narrowing the gap with larger models.

arXiv:2605.29247v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smal…

small language models · mathematical reasoning · dense reasoning · model steering · black-box optimization

13

🤖 AI · Large Models arXiv NLP 2026-05-28

FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

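A new training framework that speeds up training for LLM mathematical reasoning while significantly improving self-verification, targeting a known bottleneck in LLM reasoning.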

arXiv:2605.28389v1 Announce Type: new Abstract: While large language models have made significant progress in mathematical reasoning, they remain unre…

LLM · mathematical reasoning · self-verification · training acceleration · reasoning optimization

14

📝 Deep Tech arXiv AI 2026-05-26

Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning

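Repairing LLM mathematical reasoning carries a hidden asymmetric risk; this work uses a "guarded repair" mechanism to ensure that replacing an answer is safer than keeping it.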

arXiv:2605.24613v1 Announce Type: cross Abstract: Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect rea…

LLM · mathematical reasoning · post-hoc repair · asymmetric risk · selective replacement

15

📝 Deep Tech arXiv AI 2026-05-25

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

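A neuro-symbolic framework performs autoformalization via operator trees; this ICML 2026 paper proposes a decompose-structure-repair pipeline that connects mathematical reasoning with AI.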

arXiv:2604.19000v2 Announce Type: replace-cross Abstract: Statement autoformalization acts as a critical bridge between human mathematics and formal m…

autoformalization · neuro-symbolic · operator trees · ICML · mathematical reasoning

16

🤖 AI · Large Models arXiv AI 2026-05-23

Advancing Mathematics Research with AI-Driven Formal Proof Search

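LLMs have made major progress in mathematical reasoning, but how can their unreliability be addressed? This paper uses 130 IMO competition problems to validate the potential of AI-driven formal proof search.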

arXiv:2605.22763v1 Announce Type: new Abstract: Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability lim…

large language models · mathematical reasoning · formal proof · Lean · scientific research

17

🤖 AI · Large Models arXiv NLP 2026-05-22

Robust Reasoning Benchmark

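AIME 2024 math problems are put through 13 kinds of text perturbation to test the robustness of LLM reasoning, revealing a dependence on formatting.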

arXiv:2604.08571v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchma…

large language models · reasoning robustness · benchmark · text perturbation · AIME 2024

18

🤖 AI · Large Models arXiv Machine Learning 2026-05-21

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

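Explains why self-distillation can degrade LLMs' mathematical reasoning, identifying the suppression of key exploratory steps as the cause.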

arXiv:2603.24472v3 Announce Type: replace-cross Abstract: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improvi…

self-distillation · LLM reasoning · mathematical reasoning · reasoning degradation · intermediate-step suppression

19

🤖 AI · Large Models Hacker News AI 2026-05-21

OpenAI solves a 1946 Erdős problem

OpenAI uses AI to crack a mathematical problem dating from 1946, a new milestone for LLM reasoning.

Article URL: https://twitter.com/openai/status/2057176201782075690 Comments URL: https://news.ycombinator.com/item?id=48215185 Points: 3 # Comments: 0

OpenAI · Erdős problem · mathematical problem · AI progress · theoretical mathematics

20

🤖 AI · Large Models arXiv NLP 2026-05-20

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

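Soohak, a mathematician-curated benchmark, targets the research-level mathematical reasoning of LLMs and probes the upper limits of their capabilities.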

arXiv:2605.09063v2 Announce Type: replace Abstract: Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the commun…

benchmark · LLM evaluation · mathematical reasoning · mathematician-curated · advanced mathematics
