Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
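Reveals the internal mechanisms by which large language models produce toxic hallucinations, by perturbing prompts and tracing circuit pathways through the neural network, offering a new approach to AI safety.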
arXiv:2605.30913v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in conversational settings where user tone ra…