Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
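To address the limitations of static benchmarks, this work introduces the MathArena platform, which uses dynamic, scalable evaluation to help measure the mathematical capabilities of LLMs.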
arXiv:2605.00674v2 Announce Type: replace Abstract: Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but sta…