牛哥精选 (Niu Ge's Picks) · Past three months

1

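🤖 AI · LLMs ITHome (IT之家) 2026-07-15

Publishers and authors file class action against Google, alleging it misappropriated copyrighted content to train Gemini AI

A group of publishers and authors is suing Google, alleging it trained Gemini AI on copyrighted works without authorization. Similar cases have led to very large payouts, and the fight between copyright holders and AI is escalating again.

ITHome, July 15: A group of publishers and authors has filed a class action against Google, alleging that the tech giant used their copyrighted works without authorization to train its AI platform, Gemini. The plaintiffs include Hachette Livre, Cengage Group, Elsevier, author Scott Turow, and S.C.R.I.B.E. …

publishers and authors, class action against Google, alleged misuse of copyrighted content for training, Google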

2

🤖 AI · LLMs Hacker News AI 2026-07-10

News outlets urge a judge to sanction OpenAI in a high-stakes AI copyright fight

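News outlets are jointly asking a judge to sanction OpenAI, which they accuse of misleading them for an extended period over searches for copyrighted material. The high-stakes copyright fight has implications for the whole AI industry.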

Article URL: https://apnews.com/article/openai-new-york-times-ai-copyright-lawsuit-7ce19c7a25aad60d4c94556d36e96cc9 Comments URL: https://news.ycombin…

OpenAI, copyright lawsuit, AI training data, news media, legal sanctions

3

🤖 AI · LLMs TechCrunch 2026-07-09

Why this CEO thinks video games make better training data than the internet

Is game data better than the internet for training AI? This CEO makes a contrarian case.

When it comes to achieving artificial general intelligence (AGI), large language models just don’t have what it takes. Models like ChatGPT a…

AI training data, video games, CEO views, machine learning, data quality

4

🤖 AI · LLMs arXiv Machine Learning 2026-07-02

Prototype Language Models

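Proposes prototype language models, targeting the difficulty of tracing which training data influences LLM outputs; may improve model interpretability.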

arXiv:2607.00510v1 Announce Type: new Abstract: Knowing which training examples drive outputs is fundamental to auditing, correcting, and understandin…

language models, prototype methods, interpretability, training data attribution, model auditing

5

📝 Deep Tech arXiv NLP 2026-06-30

Labeling Training Data for Entity Matching Using Large Language Models

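Studies using large language models to automatically label training data for entity matching, exploring ways to improve labeling efficiency and accuracy.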

arXiv:2606.28823v1 Announce Type: new Abstract: Recent large language models (LLMs) achieve strong performance on entity matching without requiring ta…

entity matching, training data labeling, large language models, data annotation, LLM

6

📝 Deep Tech arXiv AI 2026-06-26

Benchmarking Open-Weight Foundation Models for Global AI Technical Governance

Open-weight models perform unevenly on AI governance tasks, exposing how training data bias affects global representativeness.

arXiv:2606.26099v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in artificial intelligence (AI) governance an…

AI governance, large language models, training data bias, open-weight models, benchmarking

7

🤖 AI Tools Hacker News AI 2026-06-23

People training new AI models admit they just get chatbots to do it

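People who train new AI models admit they have chatbots do the work, raising concerns about AI-generated text contaminating training data and about dataset quality.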

Article URL: https://www.newscientist.com/article/2531050-people-training-new-ai-models-admit-they-just-get-chatbots-to-do-it/ Comments URL: https://n…

AI detection, content authenticity, data quality, training data, anti-cheating

8

🤖 AI · LLMs arXiv NLP 2026-06-19

Characterizing Narrative Content in Web-scale LLM Pretraining Data

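Examines the characteristics of narrative text in web-scale LLM pretraining data, with implications for how models handle story structure and for improving training data quality.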

arXiv:2606.19468v1 Announce Type: new Abstract: The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though …

pretraining data, narrative content, large language models, data characteristics, web scale

9

🚀 Product Watch Product Hunt 2026-06-19

Are you in the Weights?

Check whether LLMs have permanently memorized your name; a playful little AI tool.

Find out if you live forever in the brain of the LLMs Discussion | Link

large models, training data, weights, lookup tool, fun AI

10

🤖 AI · LLMs Hacker News Ask 2026-06-19

Ask HN: Opus and regression with patterns not included in training data

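A user reports that Claude Opus, apparently limited by its training data, fails to recognize novel code patterns and gets stuck in a refactoring loop.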

I'm having a terrible problem with Claude Opus constantly clobbering some of my codebase... because I'm doing some things that are novel and aren't incld…

Claude Opus, training data limitations, code refactoring, novel patterns, AI behavior issues

11

💰 Business Tech TechCrunch 2026-06-18

Collecting robot training data is dirty, unglamorous work. Some AI labs are already paying XDOF to do it.

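AI labs are paying to have robot training data collected, revealing the dirty work and the industry race behind physical-world intelligence.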

If physical AI is going to match the accomplishments of LLMs, there's a data problem that needs to be solved.

robot training data, data collection, AI labs, physical-world manipulation, robotics projects

12

📝 Deep Tech arXiv Machine Learning 2026-06-17

Loss Landscape Poisoning: Targeted Extraction of Unseen Training Data from LLMs

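Exposes an LLM data privacy vulnerability: training data poisoning enables targeted extraction of unseen samples, a new AI security threat.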

arXiv:2606.17110v1 Announce Type: cross Abstract: Large Language Models are increasingly trained on proprietary or sensitive data, from private health…

LLM privacy, data poisoning attacks, training data extraction, AI security, deep tech

13

📝 Deep Tech arXiv NLP 2026-06-11

Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models

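Focuses on detecting sensitive personal information in Japanese pretraining corpora for LLMs, offering a new approach to privacy in model training.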

arXiv:2606.12114v1 Announce Type: new Abstract: Sensitive personal information can appear in large-scale pre-training corpora for large language model…

Japanese corpora, sensitive information detection, pretraining data, large language models, privacy and security

14

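📄 Docs & Manuals ITHome (IT之家) 2026-06-11

Sued for allegedly using YouTube songs without permission to train its Lyria music AI, Google denies it

Independent musicians have filed a class action against Google, alleging it used YouTube songs without permission to train Lyria AI; the company flatly denies this and is seeking dismissal.

ITHome, June 11: According to The Verge, if you have ever uploaded a song to YouTube, Google will almost certainly use your video to train its music AI model Lyria, though for now it flatly denies this. ITHome notes that a group of independent musicians has sued Google, alleging that the company used songs they uploaded to YouTube without permission, …

lawsuit over unauthorized use of songs for training, music, Google denies, Google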

15

🤖 AI · LLMs arXiv NLP 2026-06-10

Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

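On why LLMs give dishonest answers: the paper proposes retrieving pretraining data rather than relying on parametric knowledge alone, aiming for more honest outputs.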

arXiv:2601.21218v2 Announce Type: replace Abstract: Large language models (LLMs) are highly capable of answering questions, but they are often unaware…

large language models, retrieval augmentation, honesty, pretraining data, knowledge hallucination

16

🤖 AI · LLMs arXiv NLP 2026-06-09

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

Proposes a masked, corpus-level method for detecting pretraining data in black-box LLMs, supporting assessment of LLM data leakage.

arXiv:2606.07996v1 Announce Type: new Abstract: Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pret…

pretraining data detection, black-box LLMs, data leakage, masking methods, corpus-level detection

17

🤖 AI · LLMs arXiv AI 2026-06-08

Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

Presents LoRA-MINT, a method for auditing the training data of domain-adapted LLMs, aimed at tracing data provenance and privacy risks.

arXiv:2606.06946v1 Announce Type: cross Abstract: We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large…

LoRA, training data auditing, domain adaptation, large language models, data privacy

18

🤖 AI · LLMs Hacker News LLM 2026-06-07

SourceHut Disrupted by LLM Crawlers

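LLM training's demand for data is driving botnet-style crawler traffic, and SourceHut suffered a service disruption: a real-world case of AI's appetite for data.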

Article URL: https://status.sr.ht/issues/2026-06-06-llms-again/ Comments URL: https://news.ycombinator.com/item?id=48433196 Points: 7 # Comments: 2

SourceHut, LLM crawlers, botnets, service disruption, AI training data

19

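🤖 AI · LLMs ITHome (IT之家) 2026-06-06

Training data for Microsoft's MAI series of AI models comes to light; the "commercially licensed only" claim does not fully hold up

Microsoft's MAI models were reportedly trained in part on unlicensed open-web data, contradicting the company's claim of using only commercially licensed data.

ITHome, June 6: Tech outlet The Decoder published a post yesterday (June 5) reporting that Microsoft's newly released MAI series of AI models was trained in part on unlicensed open-web data, which is inconsistent with its earlier claim of "using only enterprise-grade, clean, and commercially licensed data." ITHome previously reported that, when promoting the MAI series, Microsoft claimed "完…

Microsoft, model series, training data revealed, commercially licensed only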

20

📝 Deep Tech arXiv AI 2026-06-05

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

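A new angle on training data leakage risk in LLMs: a propensity-aware evaluation that asks whether models tend to leak memorized data, not just whether they can.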

arXiv:2606.06286v1 Announce Type: cross Abstract: Large language models can reproduce training data, but existing memorization evaluations mostly meas…

LLM, training data leakage, memorization, propensity evaluation, privacy and security
