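The paper argues that sparse autoencoders are not as poor a tool for controlling LLMs as previously thought, and that the failures stemmed from a mismatch between how features were labelled and what they actually cause inside the model. The authors propose replacing vague labels with a supervised pipeline that verifies whether a feature's activity really tracks labels in data, so that the chosen features carry causal weight. For example, forcing the "alcohol" feature up shifts the model's output toward alcohol-related topics. The paper also finds that extremely high sparsity is not necessary. Compared with prompt engineering, prompts are stronger (the model is trained to obey them), while feature control is more like turning a dial directly on the machine.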
The paper argues that sparse autoencoders may not be bad steering tools after all, and much of the earlier failure may have come from choosing and naming the wrong features.
On this account, earlier work made sparse autoencoders look weak because it labelled features in ways that may not match what those features actually cause inside the model.
A sparse autoencoder is a small helper model that breaks an LLM's hidden activity into many possible "features," such as a topic, style, or concept.
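To make the "breaks hidden activity into features" idea concrete, here is a minimal sketch of the encode and decode steps of a sparse autoencoder. It assumes a common ReLU formulation; the function names, weight names, and shapes are illustrative and are not taken from the paper.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """Map a hidden vector h (d_model,) to non-negative feature activations (n_features,).

    Assumption: a plain ReLU encoder; real SAEs may use other activations.
    """
    return np.maximum(0.0, h @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Rebuild an approximation of the hidden vector from feature activations.

    Each row of W_dec is one feature's direction in the model's hidden space.
    """
    return f @ W_dec + b_dec

def make_toy_sae(d_model, n_features, seed=0):
    """Random, untrained weights, used only to show the shapes involved (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(size=(d_model, n_features))
    b_enc = np.zeros(n_features)
    W_dec = rng.normal(size=(n_features, d_model))
    b_dec = np.zeros(d_model)
    return W_enc, b_enc, W_dec, b_dec
```

In a trained autoencoder the weights would be fit so that the reconstruction stays close to the original hidden vector while only a few features are active at once; the toy weights here are untrained.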
So a sparse autoencoder finds directions inside a model, but an unnamed direction is not yet a usable control knob.
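Once a direction has a trustworthy name, using it as a knob can be as simple as nudging the hidden state along it. The sketch below assumes additive steering with a unit-normalized direction; `steer` and `strength` are hypothetical names, and the paper's actual intervention may differ.

```python
import numpy as np

def steer(h, direction, strength):
    """Push hidden vector h along a feature direction by a fixed amount.

    Assumption: the direction is normalized to unit length so that
    `strength` is the distance moved along that direction.
    """
    direction = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return np.asarray(h, dtype=float) + strength * (direction / norm)
```

With this formulation, a positive `strength` raises the hidden state's projection onto the feature direction by exactly that amount, and a negative value lowers it.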
The authors replace vague or inherited labels with a supervised pipeline that asks whether one feature's activity reliably tracks a real label in data.
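One way to picture the "does this feature track a real label" check is to score every feature against a binary label and keep the best one. The sketch below uses AUROC as the score; that metric, and the names `auroc` and `best_feature_for_label`, are assumptions for illustration, since the summary does not say which statistic the authors use.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive example scores above a random negative one.

    Ties count as half. 1.0 means perfect separation, 0.5 means chance.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def best_feature_for_label(acts, labels):
    """Return (feature index, AUROC) for the feature that best tracks the label.

    acts: array of shape (n_examples, n_features) of feature activations.
    labels: array of shape (n_examples,) of 0/1 labels.
    """
    acts = np.asarray(acts, dtype=float)
    scores = [auroc(acts[:, j], labels) for j in range(acts.shape[1])]
    j = int(np.argmax(scores))
    return j, scores[j]
```

A feature selected this way gets its name from measured agreement with labelled data rather than from an inherited description, which is the contrast the passage draws.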