REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
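Proposes an internalized step-wise reflection mechanism that lets the model autonomously identify and defend against indirect jailbreak attacks, offering a new paradigm for AI safety.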
arXiv:2605.20654v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sop…