牛哥精选 (Niu Ge's Picks) · Past Six Months


🤖 AI · Large Models arXiv AI 2026-06-03

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

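The newly proposed TriEval pipeline evaluates bias, toxicity, and truthfulness in large language models efficiently and with low resource consumption.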

arXiv:2606.03036v1 Announce Type: new Abstract: LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthca…

LLM evaluation, bias detection, toxicity analysis, truthfulness evaluation, resource efficiency

📝 In-Depth Tech arXiv AI 2026-06-01

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

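Reveals the internal mechanisms by which large language models produce toxic hallucinations, perturbing prompts and tracing circuit paths through the network to offer a new angle on AI safety.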

arXiv:2605.30913v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in conversational settings where user tone ra…

LLM, toxic hallucination, prompt perturbation, neural circuit tracing, safety

📝 In-Depth Tech arXiv AI 2026-05-29

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

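An efficient multi-task safety classifier that detects toxicity, jailbreaks, hate speech, and other harmful content in real time at low cost, designed for LLM safety protection.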

arXiv:2605.29659v1 Announce Type: cross Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can…

LLM safety, multi-task classification, toxicity detection, jailbreak detection, harmful content filtering

📝 In-Depth Tech arXiv NLP 2026-05-26

Toxicity in Twitch Chats: An LLM-Based Analysis Across Gaming Communities

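Uses large language models to analyze how chat toxicity differs across Twitch gaming communities, revealing patterns distinctive to each community.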

arXiv:2605.24000v1 Announce Type: new Abstract: Toxicity in online gaming communities remains a persistent challenge, manifesting across genres, platf…

Twitch, online gaming, toxicity analysis, LLM applications, social networks

📝 In-Depth Tech arXiv NLP 2026-05-22

Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

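Optimus, a new defense framework targeting toxicity risks in LLM fine-tuning, effectively mitigates harmful behavior while preserving conversational utility.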

arXiv:2507.05660v3 Announce Type: replace-cross Abstract: Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecti…

LLM safety, fine-tuning defense, toxicity mitigation, adversarial attacks, conversational AI

🤖 AI · Large Models arXiv NLP 2026-05-21

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models

A new defense for LLM dialogue generation: pruning toxic subwords to improve model safety.

arXiv:2410.04155v2 Announce Type: replace Abstract: How to defend large language models (LLMs) from generating toxic content is an important research …

Large language models, dialogue generation, toxic subwords, model safety, NLP

🤖 AI · Large Models arXiv Machine Learning 2026-05-19

Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

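A recent paper presenting a systematic replication study of LLM toxicity; it validates existing measurement and mitigation methods and reports key findings. Worth a look.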

arXiv:2605.14087v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) trained on web-scale corpora inherently absorb toxic patterns f…

LLM, toxicity, safety, replication study, arXiv
