FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games
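The new FALSIFYBENCH benchmark uses rule discovery games to precisely evaluate the inductive reasoning ability of large models, filling a gap in the evaluation of logical reasoning.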
arXiv:2606.04751v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet w…