Rohan Paul@rohanpaul_ai

2026-06-11 12:00

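AI summary

The paper argues that sparse autoencoders are not as poor at steering LLMs as previously thought, and that the earlier failures came from a mismatch between how features were labelled and what they actually cause inside the model. The authors replace vague labels with a supervised pipeline that checks whether a feature's activity really tracks a label in data, so that the feature carries causal weight. For example, forcing the "alcohol" feature upward can shift the model's output toward alcohol. The paper also finds that very high sparsity is not necessary. Compared with feature steering, prompting is stronger (the model was trained to obey prompts), while feature control is more like turning the machine's dials directly.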

The paper argues that sparse autoencoders may not be bad steering tools after all, and that much of the earlier failure may have come from choosing and naming the wrong features.

The problem is that earlier work made sparse autoencoders look weak because their features were labelled in a way that may not match what those features actually cause inside the model.

A sparse autoencoder is a small helper model that breaks an LLM's hidden activity into many possible "features," such as a topic, style, or concept.
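To make the "helper model" idea concrete, here is a minimal sketch of a sparse autoencoder's forward pass in plain Python. It assumes the common ReLU-encoder, linear-decoder form; the post does not say which architecture the paper uses, and every weight below is a made-up toy value.

```python
# Toy sparse autoencoder (SAE) forward pass in plain Python.
# All weights are made-up illustrative values, not taken from the paper.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_encode(hidden, w_enc, b_enc):
    """Map a hidden-state vector to non-negative feature activations."""
    pre = [p + b for p, b in zip(matvec(w_enc, hidden), b_enc)]
    return relu(pre)

def sae_decode(features, w_dec, b_dec):
    """Reconstruct the hidden state as a weighted sum of feature directions."""
    out = list(b_dec)
    for act, direction in zip(features, w_dec):
        for i in range(len(out)):
            out[i] += act * direction[i]
    return out

# 2-dimensional hidden state, 3 candidate features (hypothetical sizes).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]  # one direction per feature
B_DEC = [0.0, 0.0]

hidden = [2.0, 0.5]
features = sae_encode(hidden, W_ENC, B_ENC)
reconstruction = sae_decode(features, W_DEC, B_DEC)
```

Each row of `W_DEC` is one feature's direction in the hidden space; the third feature stays at zero for this input, which is the sparsity the name refers to.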

So a sparse autoencoder finds directions inside a model, but an unnamed direction is not yet a usable control knob.

The authors replace vague or inherited labels with a supervised pipeline that asks whether one feature's activity reliably tracks a real label in data.
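One way to picture "does this feature's activity track a real label" is to score every feature against labelled examples and keep the best one. The sketch below uses AUROC as the score; that choice is an assumption for illustration, since the post does not say which metric the paper's pipeline uses, and the data is hypothetical.

```python
# Score how well each SAE feature's activation tracks a binary data label.
# AUROC is an assumed scoring choice; the post does not name the paper's metric.

def auroc(scores, labels):
    """Probability that a random positive example outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def best_feature_for_label(activations, labels):
    """activations[i][j] is feature j's activation on example i."""
    scored = []
    for j in range(len(activations[0])):
        column = [row[j] for row in activations]
        scored.append((auroc(column, labels), j))
    return max(scored)  # (score, feature index)

# Hypothetical data: 1 = the text mentions alcohol, 0 = it does not.
labels = [1, 1, 0, 0]
activations = [
    [0.9, 0.2],
    [0.7, 0.8],
    [0.1, 0.6],
    [0.0, 0.1],
]
score, feature_index = best_feature_for_label(activations, labels)
```

Here feature 0 separates the two classes perfectly while feature 1 does not, so feature 0 is the one that would earn the "alcohol" name under this scoring.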

The mechanism: if a feature fires on "alcohol" and forcing that feature upward makes the model talk about alcohol, the label is no longer just descriptive; it has causal weight.
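"Forcing a feature upward" is usually done by adding a scaled copy of the feature's direction to the model's hidden state. The sketch below shows only that arithmetic on a toy vector; the direction, the hidden state, and the strength are hypothetical values, and in a real model the edit would be applied inside a layer during the forward pass.

```python
# Steer a hidden state by adding a scaled feature direction.
# The vectors and the strength value are hypothetical, chosen for illustration.

def steer(hidden, direction, strength):
    """Return hidden + strength * direction, leaving the input unchanged."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Dot product: how strongly the hidden state points along the direction."""
    return sum(h * d for h, d in zip(hidden, direction))

alcohol_direction = [0.0, 1.0, 0.0]  # assumed unit-length feature direction
hidden = [0.5, 0.25, -1.0]

steered = steer(hidden, alcohol_direction, strength=4.0)
before = projection(hidden, alcohol_direction)
after = projection(steered, alcohol_direction)
```

The causal test described above is then whether text generated from the steered state actually moves toward alcohol, which this toy example cannot show on its own.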

The paper also finds that very high sparsity may not be necessary, meaning the feature does not need to be extremely rare to be useful for steering.

Note that both prompting and feature steering are ways to push an LLM toward a desired behavior.

Prompting remains stronger because the model was trained to obey prompts, while feature steering is more like pressing directly on the machinery and hoping the rest stays intact. Prompting says "write about alcohol" in the input; feature steering instead turns up the model's internal "alcohol-related" feature and sees whether the output changes in that direction.

----

Link - arxiv.org/abs/2605.31183

Title: "Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines"

