Niu Ge's Picks · This Month

📝 Deep Tech arXiv AI 2026-05-25

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

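Reveals a blind spot in which long-context LLMs fail at reasoning because of positional bias, challenging how existing benchmarks evaluate them.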

arXiv:2605.23170v1 Announce Type: cross Abstract: Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULE…

long-context LLM, positional bias, reasoning benchmarks, blind spots, evaluation methods

📝 Deep Tech arXiv NLP 2026-05-21

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

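Dissects a taxonomy of AI agent memory structures and the limitations of current systems, offering empirical analysis for building more capable agents.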

arXiv:2602.19320v2 Announce Type: replace Abstract: Agentic memory systems enable large language model (LLM) agents to maintain state across long inte…

agentic me, memory taxonomy, AI agent, evaluation methods, system limitations

🤖 AI·LLMs arXiv NLP 2026-05-20

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

The new MixRea benchmark evaluates how LLMs perform on explicit and implicit reasoning, exposing weak spots in their reasoning ability.

arXiv:2605.20128v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by…

large language models, reasoning benchmarks, explicit reasoning, implicit reasoning, evaluation methods

🚀 Product Watch Hacker News LLM 2026-05-20

Agentic evals or LLM as a judge? considering cost, time and quality

Agentic evals vs. LLM-as-a-judge: how should cost, time, and quality be traded off? This discussion takes up the question.

Comments URL: https://news.ycombinator.com/item?id=48144995 Points: 1 # Comments: 0

agentic evals, LLM-as-a-judge, cost, time, quality

📝 Deep Tech OpenAI Official Blog 2026-05-19

Why language models hallucinate

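OpenAI's latest research traces the root causes of language model hallucination and argues that better evaluations can improve AI reliability and safety.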

OpenAI’s new research explains why language models hallucinate. The findings show how improved evaluations can enhance AI reliability, honesty, and sa…

language model hallucination, AI reliability, OpenAI research, evaluation methods

🤖 AI·LLMs arXiv Machine Learning 2026-05-19

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

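The new LPDS method tests LLM robustness by changing entities while preserving the underlying logic, probing whether models break when surface details change.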

arXiv:2605.15393v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversigh…

LLM robustness, evaluation methods, LPDS, logic-preserving difficulty scaling

🤖 AI·LLMs arXiv AI 2026-05-19

Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

Existing LLM memory evaluations rely on final accuracy, which can mask key failure modes; this paper proposes a new perspective.

arXiv:2605.15384v1 Announce Type: cross Abstract: Memory plays a central role in enabling large language models (LLMs) to operate over sequential task…

LLM memory evaluation, sequential tasks, aggregate metrics, new evaluation methods
