MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment
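Multi-turn dialogue evaluation reveals hidden failures in AI alignment on animal welfare: under pressure, models abandon their initial positions.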
arXiv:2605.16301v1 Announce Type: cross Abstract: Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measur…