Robust Reasoning Benchmark
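AIME 2024 math problems are subjected to 13 types of text perturbation to test the reasoning robustness of large models, exposing a weakness: dependence on formatting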
arXiv:2604.08571v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchma…
OpenAI uses AI to crack a math problem dating from 1946, marking a new breakthrough in large-model reasoning ability
Article URL: https://twitter.com/openai/status/2057176201782075690 Comments URL: https://news.ycombinator.com/item?id=48215185
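Soohak, a benchmark curated by mathematicians, targets research-level mathematical reasoning in LLMs and probes the limits of the highest-order thinking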
arXiv:2605.09063v2 Announce Type: replace Abstract: Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the commun…
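By decoupling tool invocation from execution, this work proposes an implicit hierarchical GRPO framework that significantly improves the efficiency and generalization of tool integration in mathematical reasoning.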
arXiv:2605.18500v1 Announce Type: new Abstract: Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning ca…
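Combining Lean with theoretical computer science enables scalable generation of paired formal-informal theorem-proving challenges, supporting research on AI mathematical reasoning.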
arXiv:2508.15878v2 Announce Type: replace-cross Abstract: Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoni…
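To address the limitations of static benchmarks, the MathArena platform provides dynamic, scalable evaluation for measuring the mathematical ability of LLMs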
arXiv:2605.00674v2 Announce Type: replace Abstract: Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but sta…