The Evaluation Trap: Benchmark Design as Theoretical Commitment
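AI benchmarks embed theoretical assumptions and narrow the definition of progress; beware the evaluation trap that reshapes what counts as capability.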
arXiv:2605.14167v1 Announce Type: new Abstract: Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. W…
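IIQ, the first composite metric for the depth and impact of AI integration in organizations, goes beyond simple access counts and offers a new quantitative approach to evaluating AI adoption.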
arXiv:2605.14455v1 Announce Type: new Abstract: The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which A…
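LeanSearch v2 proposes global premise retrieval, finding all the lemmas a Lean 4 theorem needs in one pass and moving past the limits of existing single-step or semantic-matching approaches.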
arXiv:2605.13137v2 Announce Type: replace-cross Abstract: Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whos…
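Tailwind UI has been fully upgraded to Tailwind Plus, keeping the one-time lifetime purchase model, with plans to add exclusive features such as Tailwind Play accounts.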
We just shipped a huge rebrand project, turning what was previously known as Tailwind UI into Tailwind Plus. Tailwind Plus is all the same high-qualit…
Brings image tokenization closer to text semantics, proposing a new method to improve fusion in multimodal large language models.
arXiv:2605.17954v1 Announce Type: cross Abstract: Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a …
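A new training paradigm for multimodal large models: stage-aware sparsity dynamically removes redundancy, greatly improving efficiency while preserving performance.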
arXiv:2509.18150v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variet…
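Brings the theory of prior knowledge from sublinear graph algorithms to LLM test-time optimization, opening a new path to efficiency gains.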
arXiv:2510.16609v3 Announce Type: replace Abstract: Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depen…
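Uses iterative reward-guided post-training so that tabular language models can also improve themselves and keep raising performance.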
arXiv:2604.18966v2 Announce Type: replace Abstract: Tabular language models can generate synthetic tables by modeling rows as token sequences, but the…
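A neuro-symbolic framework that automatically translates first-order logic into natural-language statements, advancing semantic parsing and theorem verification.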
arXiv:2605.18155v1 Announce Type: new Abstract: Translating formal language into natural language is a foundational challenge in NLP, driving various …
A paper exploring code as the driving framework for agents, offering new ideas and a theoretical foundation for AI agent development.
arXiv:2605.18747v1 Announce Type: new Abstract: Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generat…
A survey of trustworthiness challenges in RAG systems, covering key dimensions such as factuality, robustness, and fairness.
arXiv:2409.10102v2 Announce Type: replace-cross Abstract: Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the develo…
Online HD map construction with 2D Gaussians: the new method GSMap balances speed and accuracy.
arXiv:2605.09619v2 Announce Type: replace Abstract: Accurate High-Definition (HD) map construction is critical for autonomous driving, yet existing me…
GPT-3 and Codex gain edit and insert capabilities; no longer limited to completion, they allow more flexible interaction with text.
We’ve released new versions of GPT-3 and Codex which can edit or insert content into existing text, rather than just completing existing text.
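OpenAI analysis shows that the compute needed to train a neural network to the same ImageNet performance has halved every 16 months since 2012, outpacing Moore's Law; reaching AlexNet-level performance now takes 44 times less compute.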
We’re releasing an analysis showing that since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet classificat…
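58 co-authors at 30 organizations jointly release a report proposing 10 mechanisms to make claims about the safety, fairness, and privacy of AI systems more verifiable.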
We’ve contributed to a multi-stakeholder report by 58 co-authors at 30 organizations, including the Centre for the Future of Intelligence, Mila, Schwa…
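OpenAI shares its experience fine-tuning GPT-2 (774M parameters) with human feedback, finding that the model learned to copy from the source text to satisfy labeler preferences, a counterintuitive effect in preference alignment.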
We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external…
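OpenAI publishes a policy research paper proposing four strategies to foster cooperation on AI safety and address collective action problems under competitive pressure.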
We’ve written a policy research paper identifying four strategies that can be used today to improve the likelihood of long-term industry cooperation o…
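OpenAI argues in a paper that long-term AI safety research urgently needs social scientists to address uncertainty about human psychology, bias, and rationality, and calls for closer collaboration between ML and the social sciences.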
We’ve written a paper arguing that long-term AI safety research needs social scientists to ensure AI alignment algorithms succeed when actual humans a…
OpenAI proposes debate as an AI safety training technique: agents argue against each other and a human judges who wins.
We’re proposing an AI safety technique which trains agents to debate topics with one another, using a human to judge who wins.
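OpenAI releases a new benchmark for generalization in reinforcement learning, aimed at agents that adapt quickly to complex environments.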