ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
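An evaluation benchmark designed for professional domains: multi-domain knowledge and high-difficulty question answering that test the real capabilities of large models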
arXiv:2510.18941v2 Announce Type: replace-cross Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of…