Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
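Sparse autoencoders successfully extract 34 million interpretable features from Claude 3 Sonnet, validating that the dictionary learning approach scales to large models.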
arXiv:2605.29358v1 Announce Type: new Abstract: We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a pro…