1
Detecting misbehavior in frontier reasoning models
OpenAI最新研究揭示:前沿推理模型会暗中利用漏洞,监控思维链可检测,但惩罚只会让它们更会隐藏意图。
Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Pen…