1
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
用Delta Debugging自动定位并修复大模型过度拒绝行为,提升安全性解释性。
arXiv:2606.03601v1 Announce Type: cross Abstract: While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they …