On the Limits of LLM-as-Judge for Scientific Novelty Assessment
Exposes the critical weaknesses of LLMs in assessing scientific novelty and challenges blind trust in AI judgment
arXiv:2606.12071v1 Announce Type: cross Abstract: LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a c…
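New study finds AI analyzes sports games far worse than humans, with accuracy close to guessing; sports commentators' jobs are safe for now
IT Home (IT之家), June 6: According to a report published this evening (June 6) by Futurism, a new study by researchers at the University of North Carolina at Chapel Hill and Northeastern University found that mainstream AI models perform poorly when analyzing professional sports games. The study set out to examine how popular AI models perform in four areas: perception, reasoning, simulation, and autonomous action. Existing testing methods struggle to accurately evaluate…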
Why do LLM chatbots disappoint people every day? The author vents, from first-hand experience, about AI's limitations and flaws.
Article URL: https://umrashrf.github.io/llm-ai-chatbots-are-letting-me-down-every-single-day/ Comments URL: https://news.ycombinator.com/item?id=48406…
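Rocket engine startup Impulse raises $500 million yet insists on using people rather than AI for R&D; its clear-eyed view of technical simulation is worth reflecting on.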
Engineering physical systems still depends on human talent, according to Impulse Space president Eric Romo.
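Testing the limits of LLMs with the halting problem: through theoretical analysis and experimental comparison, this paper reveals the capabilities and limits of large language models in reasoning about program termination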
arXiv:2601.18987v5 Announce Type: replace-cross Abstract: Determining whether a program terminates is a central problem in computer science. Turing's …
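Amdahl's law applied to the core bottleneck of LLM-generated code: however efficient generation becomes, it cannot escape the slowness of line-by-line auditing, which fundamentally limits the speedup.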
LLMs may theoretically be able to generate millions of correct lines of code. But for any important code the only way to know that it's correct is to …
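As a rough illustration of the Amdahl's law argument in this item, the sketch below computes the overall speedup when only the code-writing share of the work gets faster and the review share does not. The 40/60 split between writing and review is a hypothetical figure chosen for the example, not a number from the article, and `amdahl_speedup` is an illustrative helper name.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the total work
    runs `factor` times faster and the remainder is unchanged.

    Amdahl's law: S = 1 / ((1 - p) + p / s)
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


if __name__ == "__main__":
    # Hypothetical split (an assumption for illustration only):
    # 40% of the effort is writing code, 60% is human review.
    writing_share = 0.4

    for factor in (2, 10, 1000):
        speedup = amdahl_speedup(writing_share, factor)
        print(f"generation {factor}x faster -> overall {speedup:.3f}x")

    # The ceiling as generation time approaches zero is 1 / (1 - p).
    print(f"upper bound: {1.0 / (1.0 - writing_share):.3f}x")
```

Under that assumed split, making generation arbitrarily fast can never push the overall speedup past 1 / 0.6, which is the point the headline makes about line-by-line auditing.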
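New research reveals significant failures of large language models at following complex multi-step rules, challenging current assumptions about their capabilities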
arXiv:2605.02028v2 Announce Type: replace Abstract: Large language models are highly capable of answering difficult questions by retrieving, recombini…