Mechanisms of Introspective Awareness
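A paper on the mechanisms behind "introspective awareness" in large language models. It shows how models detect and identify injected steering vectors, and finds that this behavior is surprisingly robust.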
arXiv:2603.21396v4 Announce Type: replace Abstract: Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their…