Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
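Proposes new methods for benchmarking and improving monitors that detect LLM alignment failures in out-of-distribution settings, with the aim of improving AI safety and reliability.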
arXiv:2605.21602v1 Announce Type: new Abstract: Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (O…