Do Linear Probes Generalize Better in Persona Coordinates?
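The paper investigates whether linear probes generalize better at monitoring harmful LLM behaviors when applied in persona coordinates, with a focus on strategic deception and sandbagging.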
arXiv:2605.09391v2 Announce Type: replace Abstract: It is becoming increasingly necessary to have monitors check for harmful behaviors during language…