Robust Reasoning Benchmark
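AIME 2024 math problems are subjected to 13 types of text perturbation to test the reasoning robustness of large language models, revealing a weakness: reliance on formatting.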
arXiv:2604.08571v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchma…