Brain-LLM Alignment Tracks Training Data, Not Typology
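Cross-lingual validation finds that alignment between the brain's language network and LLMs is driven mainly by training data rather than by typological differences between languages.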
arXiv:2605.23032v1 Announce Type: cross Abstract: Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomi…
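New research provably protects fine-tuned large models against training data theft while preserving model utility, delivering both privacy and practicality.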
arXiv:2602.00688v2 Announce Type: replace Abstract: Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as trainin…
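Why do LLMs consistently estimate task durations longer than they actually take? The cause may be human estimation bias in the training data, a question that has sparked lively discussion online.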
I have my suspicion: they estimate how long people would have taken to implement some feature, because they were trained on such data. I consistently …
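The paper proposes ACC, a method that compiles agent trajectories for efficient long-context training, offering a new approach to long-text processing in large models.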
arXiv:2605.21850v1 Announce Type: new Abstract: Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, …
Collect training data by playing microgames, making AI learning as fun as playing a game.
As an AI you must gather training data by playing microgames
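JD.com draws on its million employees to collect ten million hours of household-chore data, opening a new era for embodied-intelligence training.
IT之家 (ITHome), May 20: JD.com announced that the country's first embodied-intelligence data collection community has officially begun operating in Suqian. This is another major step in JD.com's embodied-intelligence data infrastructure since it announced in March this year that it would build the world's largest embodied data collection center, and it brings JD.com closer to its goal of accumulating more than 10 million hours of real-world human video data within two years. According to the announcement, the embodied data collection community is located in…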
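The open-source project Stera upgrades an ordinary iPhone into a research-grade spatial data capture system and open-sources a 10M-frame dataset, providing high-quality training data for embodied-AI world models.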
We are releasing Project Stera - an open source, end-to-end pipeline that turns a commodity iPhone into a research-grade capture system for embodied A…
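A CC0 training dataset designed for EU AI Act compliance, with document-level provenance and IP indemnification, to help you avoid heavy fines ahead of August 2026.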
Hey, I always had problems finding CC0 data of that quality. So I wanted to share that I generated and gathered it and published it for free. All of…
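Builds a weighted independent set from a similarity graph to balance sample quality and diversity, offering a new framework for efficient data selection.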
arXiv:2605.15691v1 Announce Type: new Abstract: Data selection seeks to identify a compact yet informative subset from large-scale training corpora, b…
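Don't block AI crawlers the way sites once blocked search engines. Understanding the three types of AI bot traffic is what lets AI become your growth flywheel rather than a missed opportunity.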
AI bot traffic is growing across the web. We track this in real time, and the data reveals three types of AI-driven crawlers that often work independ…
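Without knowing the downstream task's specific data, the training data mixture can be optimized dynamically from feedback alone. The DUET algorithm combines influence functions with Bayesian optimization, carries a theoretical guarantee of convergence to the optimal mixture ratio, and opens a new paradigm for LLM data selection.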
arXiv:2502.00270v3 Announce Type: replace-cross Abstract: The performance of an LLM depends heavily on the relevance of its training data to the downs…