In-Training Defenses against Emergent Misalignment in Language Models
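Addresses emergent misalignment that arises during language model training and proposes new defenses applied at training time; from a recent arXiv preprint.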
arXiv:2508.06249v3 Announce Type: replace-cross Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domain…