Anthropic Proposes Turn-Averaged Sparse Autoencoders (Turn-Averaged SAE)

2026-07-01 06:04

AI Summary

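Anthropic averages the residual stream over all tokens in each conversation turn and trains an SAE on the result, greatly reducing the number of feature activations that need to be interpreted. The experiments use Qwen-2.5-7B-Instruct and the LMSYS-Chat-1M dataset. Turn-averaged features capture more of the high-level characteristics of model behavior (such as incorrect answers), whereas per-token SAE features concentrate on details such as numerical reasoning. Judged by Sonnet 4.6, the turn-averaged SAE scores 74% on discrimination (uniquely identifying a target turn among 10 turns), below the per-token SAE's 95%, but is preferred 77% of the time on coverage (completely describing a turn). The method extrapolates to turns 150× longer than the average turn length seen during training.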

Original text

Transformer Circuits Thread

Circuits Updates - June 2026

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

Anthropic Fellows Program: Turn-Averaged Sparse Autoencoders

Kevin Der, Harish Kamath, Ben Thompson; edited by Nick Turner

Sparse autoencoders (SAEs) have become a valuable tool for characterizing important safety behaviors in our models. Typically, we discover model transcripts that exhibit concerning behaviors and identify important active features within those transcripts to investigate. These features often describe more general model propensities that we should be aware of, such as emotions like distress or evaluation awareness. After identifying these features, we can use them in downstream applications (for example, as probes to monitor other transcripts, or to build attribution graphs that paint a more complete picture of any given model behavior).

However, identifying important features has two practical problems:

- SAEs operate on activations for a single token position. Even short transcripts can produce thousands to millions of feature activations to interpret.
- Many SAE features are "boring": they focus on safety-irrelevant parts of the model's computation, like syntax or the specific tokens used to articulate a decision.
We've tried workarounds with varying degrees of success (for example, in the Opus 4.6 system card, we used contrastive prompts to isolate groups of relevant features), but studying transcripts with SAE feature activations continues to be unwieldy.

As part of the Anthropic Fellows Program, we experimented with Turn-Averaged SAEs. The concept is simple: we average the residual stream of all tokens in a single Human or Assistant turn, and train an SAE to reconstruct that representation. For a given turn, the turn-averaged SAE will surface L0 active features, whereas a per-token SAE will surface n_tokens × L0 features. Since turns can extend for hundreds to thousands of tokens, turn-averaged SAEs substantially decrease the number of feature activations to interpret. This idea was motivated by prior work which showed that averaged residual streams are useful for identifying abstract model representations, such as persona vectors.

We find that turn-averaged features capture more of the high-level characteristics of a transcript than per-token features. For example, consider the following turn, where the Assistant answers a simple numerical puzzle incorrectly:

User: What is the highest number below 100 which does not contain 9?
Assistant: The highest number below 100 that does not contain the digit 9 is 95.

We trained both a per-token SAE and a turn-averaged SAE on the middle layer of Qwen-2.5-7B-Instruct across the LMSYS-Chat-1M dataset, and studied this prompt with their activations. The highest activating features from the per-token SAE concentrate on numerical reasoning (e.g. arithmetic statements, digits, numbering systems), whereas the highest activating turn-averaged SAE feature is directly related to incorrect answers in number puzzles. This improvement in feature quality extrapolates to turns 150× longer than the average length of turns seen during training.

To validate this method, we compare a turn-averaged SAE head-to-head against a per-token SAE.
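The turn-averaging step and the activation counts described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy 2-dimensional "residual" vectors, and the values of L0 and n_tokens are all assumptions made for the example, and no SAE training is shown.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def turn_average(residuals: Sequence[Vector], turn_ids: Sequence[int]) -> Dict[int, Vector]:
    """Average per-token residual vectors within each turn.

    residuals[i] is the (hypothetical) residual-stream vector at token i,
    and turn_ids[i] says which Human/Assistant turn that token belongs to.
    """
    if len(residuals) != len(turn_ids):
        raise ValueError("residuals and turn_ids must have the same length")
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for vec, turn in zip(residuals, turn_ids):
        if turn not in sums:
            sums[turn] = [0.0] * len(vec)
            counts[turn] = 0
        for i, x in enumerate(vec):
            sums[turn][i] += x
        counts[turn] += 1
    # One mean vector per turn; this is what a turn-averaged SAE would reconstruct.
    return {t: [s / counts[t] for s in sums[t]] for t in sums}


def activations_to_interpret(n_tokens: int, l0: int, turn_averaged: bool) -> int:
    """Feature activations surfaced for one turn: L0 versus n_tokens * L0."""
    return l0 if turn_averaged else n_tokens * l0


# Toy example: three tokens, the first two in turn 0 and the last in turn 1.
toy_residuals = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
toy_turn_ids = [0, 0, 1]
averaged = turn_average(toy_residuals, toy_turn_ids)

# Assumed numbers for illustration only: a 500-token turn and L0 = 50.
per_token_count = activations_to_interpret(n_tokens=500, l0=50, turn_averaged=False)
turn_avg_count = activations_to_interpret(n_tokens=500, l0=50, turn_averaged=True)
print(averaged, per_token_count, turn_avg_count)
```

With these assumed numbers, the per-token SAE would surface 25,000 activations for the turn and the turn-averaged SAE 50, which is the reduction the post refers to.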
We ask Sonnet 4.6 to judge a set of turns with two criteria:

- Discrimination: how well each set of features can be used to uniquely identify a given turn out of a set of 10 random turns.
- Coverage: how often the judge prefers turn-averaged features over per-token features to completely describe a given turn.

Turn-averaged features perform reasonably well on the discrimination metric (74%), but worse than per-token features (95%), which often contain specific phrases or tokens present in the original turn. However, turn-averaged features are preferred to per-token features 77% of the time on the coverage metric.

Our full paper contains more details and experiments, including:

- A nested SAE architecture that combines turn-averaged and per-token features in a single model.
- Applying turn-averaged SAEs to attribution graphs: a case study tracing an interpretable circuit through a 14-turn conversation from the LMSYS dataset.
- Completeness/replacement metrics and intervention experiments validating that attribution graphs from nested SAEs are equal to or better than graphs constructed with per-token SAEs at measuring causal influence.
- A contrastive prompt pipeline that identifies turn-averaged features corresponding to safety-relevant personas and other behavioral traits.

We're generally excited to use turn-averaged SAEs and other techniques that make auditing model behaviors simpler for both human and automated analysis.
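For readers who want to see how the two judge criteria above reduce to numbers, here is a small scoring sketch. The post does not describe the judge prompt or output format, so the data layout (one picked index per trial, one preference label per turn), the function names, and the label strings are assumptions made for illustration.

```python
from typing import Sequence

# With 10 candidate turns per trial, random guessing would score about 10%.
CHANCE_LEVEL = 1 / 10


def discrimination_accuracy(judge_picks: Sequence[int], true_indices: Sequence[int]) -> float:
    """Fraction of trials where the judge picked the correct turn out of the candidates."""
    if len(judge_picks) != len(true_indices) or not judge_picks:
        raise ValueError("need equally sized, non-empty pick and target lists")
    hits = sum(pick == target for pick, target in zip(judge_picks, true_indices))
    return hits / len(judge_picks)


def coverage_preference(preferences: Sequence[str], label: str = "turn_averaged") -> float:
    """Fraction of turns where the judge preferred the feature set named by `label`."""
    if not preferences:
        raise ValueError("need at least one preference judgment")
    return sum(choice == label for choice in preferences) / len(preferences)


# Hypothetical judge outputs for four trials (not data from the post).
picks = [3, 1, 4, 1]
targets = [3, 1, 4, 2]
prefs = ["turn_averaged", "per_token", "turn_averaged", "turn_averaged"]

print(discrimination_accuracy(picks, targets))  # 0.75
print(coverage_preference(prefs))               # 0.75
```

Under this reading, the reported 74% and 95% are discrimination accuracies against a chance level of roughly 10%, while the 77% is a head-to-head preference rate, so the two metrics are not on the same scale.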

Anthropic: Transformer Circuits (interpretability research)

Read the original: transformer-circuits.pub
