# April 2024 Mechanistic Interpretability Research Updates and Team Hiring Plans

- Source: Anthropic: Transformer Circuits (interpretability research)
- Published: 2024-04-15 08:00
- AIHOT score: 76
- AIHOT tag: Featured
- AIHOT link: https://aihot.virxact.com/items/cmoegbh730070slxx0lm696ln
- Original link: https://transformer-circuits.pub/2024/april-update/index.html

## Why It Was Selected

Interpretability research reveals the internal mechanisms of AI and helps build safer, more reliable AI products.

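## AI Summary

Anthropic's interpretability team shares its April 2024 research progress and hiring plans. The team currently has 17 people and expects to keep expanding substantially through 2024 and 2025, with a focus on hiring managers, research scientists, and research engineers. On the research side, the team explores scaling laws for dictionary learning, analyzes how compute allocation relates to sparse autoencoder (SAE) training results, and uses a concrete case study to show how a large hyperparameter sweep is used to find optimal configurations. The team stresses that these results are preliminary, similar to an informal exchange at a lab meeting.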

## Full Text

Transformer Circuits Thread

Circuits Updates - April 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

New Posts

- Open Roles In Interpretability
- Scaling Laws for Dictionary Learning
- Update on how we train SAEs
- How Strongly do Dictionary Learning Features Influence Model Behavior?
- Interpretability Architectures Project
- Caloric and the Utility of Incorrect Theories
- Open Problem: Attribution Dictionary Learning
- Isolating Circuits Paths of Different Lengths
- Research By Other Groups

### Open Roles in Mechanistic Interpretability

Chris Olah, Shan Carter, Adam Jermyn, Josh Batson, Tom Henighan

Mechanistic interpretability is a small field, although growing quite quickly. We estimate there are perhaps 50 full time positions focused on this topic. The Anthropic interpretability team is now 17 people, so we represent a significant fraction of these positions. As a result, we felt that providing some visibility into our hiring plans might be valuable for people considering careers in this space. (However, please note that while these are our present expectations, they are subject to change.)

Over the course of 2023 we hired 10 people. We’ve continued hiring in 2024, and expect to continue growing the team substantially, both this year and into 2025. We expect this to involve a few different roles:

- Managers - We see this as the most important role that we’re hiring for right now.
  - Our growth is likely to be bottlenecked on management capacity, and finding the right fit for the team could make a huge difference to our long-term success.
  - Filling this role has been challenging because we’re looking for someone with experience in a research or engineering environment, who is excited about and experienced with people and project management, and who is enthusiastic about our research agenda and mission.
- Research Scientists - We're looking for strong scientists, not necessarily experienced machine learning researchers. Most of our team comes from other backgrounds (astrophysics, condensed matter, mathematics, neuroscience). We do want to see evidence of engagement with mechanistic interpretability, as well as sufficient coding ability to implement and run ambitious experiments.
- Research Engineers - We're looking for strong software engineers. Experience with machine learning is a plus, but not necessary – we're excited to consider strong software engineers who want to grow into research. Our team has a track record of several such people (e.g. Nelson Elhage, Tristan Hume) joining and quickly growing to perform state-of-the-art research. Comfort with linear algebra, multivariate calculus, and basic engagement with machine learning and mechanistic interpretability is a strong plus if you don't have an ML background.

A few notes:

Internally, we don't really distinguish between research engineers and scientists. All members of our team do both research and engineering. However, people are often stronger at one than the other, and we try to hire a mixture of these strengths for team balance.

If you are interested in our new interpretability architectures project (see below), please apply to the research scientist or research engineer role and mention this in your application.

If you’re excited about our work and think you might be a fit for one of these roles, please apply!

### Scaling Laws for Dictionary Learning

Jack Lindsey, Tom Conerly, Adly Templeton, Jonathan Marcus, Tom Henighan

Training sparse autoencoders (SAEs) for dictionary learning on larger models can be computationally intensive. It is important to understand (1) the extent to which using additional compute improves dictionary learning results, and (2) how that compute should be allocated. Here we analyze these questions in depth. As a case study, we consider SAEs trained on the residual stream following the third layer of a four-layer transformer.

Though we lack a gold-standard method of assessing the quality of a dictionary learning run, we have found that the loss function used during training – a weighted combination of reconstruction mean-squared error (MSE) and an L1 penalty on feature activations – is a useful proxy. Unless otherwise indicated, we use $\mathrm{MSE} + 5\,\mathrm{L1}$ as our loss function during training and subsequent analysis. However, we find qualitatively similar results when we use other linear combinations of MSE and L1, or when we track linear combinations of MSE and the L0 “norm” of feature activations. We note that losses with different L1 coefficients are not comparable – ultimately, we select the L1 coefficients that produce the most useful features for downstream interpretability analyses.

Choosing a loss function of interest allows us to treat dictionary learning as a standard machine learning problem, to which we can apply the “scaling laws” framework for hyperparameter optimization (see e.g. Kaplan et al. 2020, Hoffmann et al. 2022). In an SAE, compute usage primarily depends on two key hyperparameters: the number of features being learned, and the number of steps used to train the autoencoder. The compute (in FLOPS) scales with the product of these parameters, if the input dimension and other hyperparameters are held constant. We conducted a thorough sweep over these parameters, fixing the values of other hyperparameters (learning rate, batch size, optimization protocol, etc.).

We are especially interested in keeping track of the compute-optimal values of the loss function and parameters of interest; that is, the lowest loss that can be achieved using a given number of FLOPS, and the number of training steps / features that achieve this minimum.

We have made the following observations:

- Over the ranges we tested, loss functions decrease approximately according to a power law with respect to FLOPS, given the compute-optimal choice of training steps and number of features.
- As the compute budget increases, the optimal allocations of FLOPS to training steps and number of features both scale approximately as power laws.
- In general, the optimal number of features appears to scale somewhat more quickly than the optimal number of training steps. For instance, as the number of features increases, the corresponding compute-optimal number of training steps scales as a function of the number of features with an exponent between 0.5 and 1. The specific parameters of the scaling trends vary between different loss functions.
- Emphasizing the sparsity penalty more in the loss function (i.e. increasing the L1 coefficient) leads to a larger number of training steps being compute-optimal. This suggests that reconstruction loss is optimized more quickly than sparsity over the course of SAE training, which we have observed empirically.
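As a rough illustration of the analysis behind these observations, the sketch below extracts a compute-optimal frontier from a sweep and fits power-law exponents in log-log space. Everything in it is an assumption made for illustration: the loss surface `synthetic_loss` and its exponents are invented, compute is taken to be proportional to features times steps, and the helper names are hypothetical. It is not the team's code or data.

```python
import numpy as np

def synthetic_loss(features, steps, a=0.3, b=0.6):
    # Hypothetical loss surface, NOT data from the experiments described above.
    return features ** -a + steps ** -b

def compute_optimal_frontier(features, steps, loss):
    """For each compute budget, return the run with the lowest loss."""
    compute = features * steps  # proportional to FLOPS when other settings are fixed
    budgets = np.unique(compute)
    best = np.array([
        np.flatnonzero(compute == c)[np.argmin(loss[compute == c])]
        for c in budgets
    ])
    return budgets, features[best], steps[best], loss[best]

def power_law_exponent(x, y):
    """Slope of log(y) against log(x), i.e. b in y = a * x**b."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope

# Sweep: for each budget 2**24 .. 2**36, try feature counts 2**8 .. 2**28.
log2_budget, log2_features = np.meshgrid(
    np.arange(24, 37), np.arange(8, 29), indexing="ij"
)
features = 2.0 ** log2_features.ravel()
steps = 2.0 ** (log2_budget - log2_features).ravel()
keep = steps >= 1.0  # drop allocations with less than one training step
features, steps = features[keep], steps[keep]
loss = synthetic_loss(features, steps)

budgets, best_features, best_steps, best_loss = compute_optimal_frontier(
    features, steps, loss
)
loss_exponent = power_law_exponent(budgets, best_loss)
feature_exponent = power_law_exponent(budgets, best_features)
step_exponent = power_law_exponent(budgets, best_steps)
print(loss_exponent, feature_exponent, step_exponent)
```

Because steps equal budget divided by features on this grid, the fitted feature and step exponents sum to one. The fitted lines can then be extrapolated to pick an allocation for a larger budget, subject to the caveats about extrapolation discussed in this section.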

In these experiments, we also investigate the parameters needed to optimize for L0-based loss functions (i.e. linear combinations of MSE and L0 “norm” of feature activations). Since these cannot be directly optimized with gradient descent, we instead sweep over a range of L1 penalty coefficients during training and select the value that minimizes the L0-based loss function. We find that minimizing L0-based loss functions requires a greater number of training steps for a given number of features, compared to minimizing L1-based loss functions. Though we have not precisely characterized the source of this difference, we suspect it arises because additional training steps allow the SAE to fully zero out small feature activations.
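The selection step for L0-based losses can be sketched as follows. The three runs reuse the 1L MLP rows from the results table later in this update purely as sample numbers (with normalized MSE standing in for MSE), and the weight `l0_weight` is a hypothetical value chosen for illustration, not one reported by the authors.

```python
def select_l1_coefficient(runs, l0_weight):
    """Return the run whose L0-based loss (mse + l0_weight * l0) is lowest.

    Each run was trained with a different L1 coefficient; the L0-based loss
    is only used for selection, never for gradient descent.
    """
    return min(runs, key=lambda run: run["mse"] + l0_weight * run["l0"])

# Sample numbers: one trained SAE per L1 coefficient.
runs = [
    {"l1_coeff": 2, "mse": 0.03054, "l0": 99.62386},
    {"l1_coeff": 5, "mse": 0.06609, "l0": 38.68729},
    {"l1_coeff": 10, "mse": 0.13120, "l0": 20.06870},
]

best = select_l1_coefficient(runs, l0_weight=1e-3)  # hypothetical weight
print(best["l1_coeff"])
```

With a smaller weight on L0 (for example `1e-4`), the same procedure would favor the run with the lowest MSE instead, which shows how the selected L1 coefficient depends on the L0-based loss being targeted.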

The details of these trends are likely to vary depending on the underlying model, the layer of the model being probed, and other optimization details. Optimizing other hyperparameters (such as learning rate) jointly with training steps and number of features may influence the scaling trends. However, we expect many of these qualitative trends to be broadly applicable. We suggest that conducting similar analyses will be useful to other groups working with SAEs, particularly as computational cost increases. Extrapolating trends inferred from smaller experiments enables more informed choices of hyperparameters for resource-intensive dictionary learning runs. We are also careful to note that qualitative inspection of SAE features remains important, as the relationship between SAE loss and qualitative usefulness of SAE features is imperfect and may break down at sufficient scale.

### Update on how we train SAEs

Tom Conerly, Adly Templeton, Trenton Bricken, Jonathan Marcus, Tom Henighan

We’ve made improvements to how we train SAEs since Towards Monosemanticity, with the goal of lowering the SAE loss. While the new setup is a significant improvement over what we published in Towards Monosemanticity, we believe there are further improvements to be made. We haven’t ablated every decision, so it’s likely some simplifications could be made. This work was explicitly focused on lowering loss and didn’t grapple with loss not being the ultimate objective we care about. Here’s a summary of our current SAE training setup:

Let $n$ be the input and output dimension and $m$ be the autoencoder hidden layer dimension. Let $s$ be the size of the dataset. Given encoder weights $W_e \in \mathbb{R}^{m \times n}$, decoder weights $W_d \in \mathbb{R}^{n \times m}$, and biases $\mathbf{b}_e \in \mathbb{R}^{m}$, $\mathbf{b}_d \in \mathbb{R}^{n}$, the operations and loss function over a dataset $X \in \mathbb{R}^{s \times n}$ are:

$$\begin{aligned}\mathbf{f}(\mathbf{x}) &= \text{ReLU}( W_e \mathbf{x}+\mathbf{b}_e ) \\ \hat{\mathbf{x}} &= W_d \mathbf{f}(\mathbf{x})+\mathbf{b}_d \\ \mathcal{L} &= \frac{1}{|X|} \sum_{\mathbf{x}\in X} \|\mathbf{x}-\hat{\mathbf{x}}\|_2^2 + \lambda\sum_i |\mathbf{f}_i(\mathbf{x})| \, \|W_{d,i}\|_2 \end{aligned}$$

Note that the columns of $W_d$ have an unconstrained L2 norm (in Towards Monosemanticity they were constrained to norm one) and the sparsity penalty (second term) has been changed to include the L2 norm of the columns of $W_d$. We believe this was the most important change we made from Towards Monosemanticity.

$\mathbf{b}_e$ and $\mathbf{b}_d$ are initialized to all zeros. The elements of $W_d$ are initialized such that the columns point in random directions and have a fixed L2 norm between 0.05 and 1 (set in an unprincipled way based on $n$ and $m$; 0.1 is likely reasonable in most cases). $W_e$ is initialized to $W_d^T$.

The rows of the dataset $X$ are shuffled. The dataset is scaled by a single constant such that $\mathbb{E}_{\mathbf{x} \in X}[\|\mathbf{x}\|_2] = \sqrt{n}$. The goal of this change is for the same value of $\lambda$ to mean the same thing across datasets generated by transformers of different sizes.
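A minimal NumPy sketch of the initialization and dataset scaling just described, under stated assumptions: random column directions are drawn by normalizing Gaussian samples, the default decoder column norm is the 0.1 mentioned above, and the function names `init_sae` and `scale_dataset` are hypothetical rather than taken from the team's code.

```python
import numpy as np

def init_sae(n, m, decoder_norm=0.1, seed=0):
    """Initialize SAE parameters for input dimension n and m features."""
    rng = np.random.default_rng(seed)
    # Normalized Gaussian columns point in uniformly random directions.
    W_d = rng.standard_normal((n, m))
    W_d *= decoder_norm / np.linalg.norm(W_d, axis=0, keepdims=True)
    W_e = W_d.T.copy()   # encoder starts as the decoder transpose
    b_e = np.zeros(m)    # biases start at zero
    b_d = np.zeros(n)
    return W_e, W_d, b_e, b_d

def scale_dataset(X):
    """Scale X by one constant so the mean row L2 norm equals sqrt(n)."""
    n = X.shape[1]
    scale = np.sqrt(n) / np.linalg.norm(X, axis=1).mean()
    return X * scale, scale

# Usage on made-up activations.
X = np.random.default_rng(1).standard_normal((100, 8)) * 3.0
X_scaled, scale = scale_dataset(X)
W_e, W_d, b_e, b_d = init_sae(n=8, m=32)
print(np.linalg.norm(X_scaled, axis=1).mean(), np.linalg.norm(W_d, axis=0)[:3])
```

Keeping the returned `scale` lets the same constant be applied to held-out activations later.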

During training we use the Adam optimizer with beta1=0.9, beta2=0.999 and no weight decay. Our learning rate varies based on scaling laws, but 5e-5 is a reasonable default. The learning rate is decayed linearly to zero over the last 20% of training. We vary training steps based on scaling laws, but 200k is a reasonable default. We use batch size 2048 or 4096, which we believe to be under the critical batch size. The gradient norm is clipped to 1 (using clip_grad_norm). We vary $\lambda$ during training: it is initially 0 and increases linearly to its final value over the first 5% of training steps. A reasonable default for $\lambda$ is 5.
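The two schedules in this paragraph can be written as plain functions of the training step. This is a sketch assuming zero-based step indices and the default values quoted above; the function names are hypothetical, and in a real training loop the returned values would be passed to the optimizer and the loss.

```python
def learning_rate(step, total_steps, peak_lr=5e-5, decay_frac=0.2):
    """Constant learning rate, then linear decay to zero over the last decay_frac of training."""
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / decay_steps

def sparsity_coefficient(step, total_steps, final_lambda=5.0, warmup_frac=0.05):
    """Lambda starts at 0 and rises linearly to final_lambda over the first warmup_frac of training."""
    warmup_steps = round(total_steps * warmup_frac)
    if step >= warmup_steps:
        return final_lambda
    return final_lambda * step / warmup_steps

total = 200_000
for step in (0, 5_000, 10_000, 160_000, 180_000, 200_000):
    print(step, learning_rate(step, total), sparsity_coefficient(step, total))
```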

We do not use resampling or ghost grads because less than 1% of our features are dead at the end of training (dead means not activating for 10 million samples). We don’t do any fine tuning after training.

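Conceptually a feature’s activation is now $\mathbf{f}_i \|W_{d,i}\|_2$ instead of $\mathbf{f}_i$. To simplify our analysis code we construct a model which makes identical predictions but has an L2 norm of 1 on the columns of $W_d$. We do this by setting $W_e' = W_e \|W_d\|_2$, $b_e' = b_e \|W_d\|_2$, $W_d' = \frac{W_d}{\|W_d\|_2}$ and $b_d'=b_d$, where $\|W_d\|_2$ denotes the column-wise L2 norms of $W_d$ (each encoder row and bias entry is scaled by the norm of the matching decoder column).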
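The forward pass, the loss, and the norm-folding step can be checked numerically with a small NumPy sketch. It assumes the batch average applies to both the reconstruction and sparsity terms (the written formula leaves this slightly ambiguous), uses random made-up parameters, and the function names are hypothetical.

```python
import numpy as np

def sae_forward(X, W_e, W_d, b_e, b_d):
    """Rows of X are activations; returns feature activations F and reconstructions."""
    F = np.maximum(X @ W_e.T + b_e, 0.0)  # ReLU(W_e x + b_e), shape (s, m)
    X_hat = F @ W_d.T + b_d               # W_d f + b_d, shape (s, n)
    return F, X_hat

def sae_loss(X, W_e, W_d, b_e, b_d, lam):
    """Squared error plus an L1 penalty weighted by decoder column norms."""
    F, X_hat = sae_forward(X, W_e, W_d, b_e, b_d)
    col_norms = np.linalg.norm(W_d, axis=0)
    mse = np.sum((X - X_hat) ** 2, axis=1)
    sparsity = np.sum(np.abs(F) * col_norms, axis=1)
    return np.mean(mse + lam * sparsity)  # assumption: both terms averaged over the batch

def fold_decoder_norms(W_e, W_d, b_e, b_d):
    """Equivalent SAE whose decoder columns have unit L2 norm."""
    col_norms = np.linalg.norm(W_d, axis=0)  # one norm per feature
    return W_e * col_norms[:, None], W_d / col_norms, b_e * col_norms, b_d

rng = np.random.default_rng(0)
s, n, m = 32, 8, 24
X = rng.standard_normal((s, n))
W_d = rng.standard_normal((n, m)) * 0.3
W_e = W_d.T.copy()
b_e = rng.standard_normal(m) * 0.1
b_d = rng.standard_normal(n) * 0.1

F, X_hat = sae_forward(X, W_e, W_d, b_e, b_d)
W_e2, W_d2, b_e2, b_d2 = fold_decoder_norms(W_e, W_d, b_e, b_d)
F2, X_hat2 = sae_forward(X, W_e2, W_d2, b_e2, b_d2)
loss_before = sae_loss(X, W_e, W_d, b_e, b_d, lam=5.0)
loss_after = sae_loss(X, W_e2, W_d2, b_e2, b_d2, lam=5.0)
print(np.abs(X_hat - X_hat2).max(), loss_before, loss_after)
```

The folded model reconstructs the same outputs because ReLU commutes with multiplication by a positive constant, and the loss is unchanged because the penalty already weighted each activation by its decoder column norm.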

#### Potential areas for improvement

- Our initialization likely needs improvement. As we increase $m$ the reconstruction loss at initialization increases. This may cause problems for sufficiently large $m$. Potentially with improved initialization we could remove gradient clipping.
- We haven’t seen improvements in loss from resampling or ghost grads, but it’s possible resampling “low value” features would improve loss.
- It’s plausible some sort of post training (for example Addressing Feature Suppression in SAEs) would be helpful.
- Reducing shrinkage (the tendency of the L1 penalty to bias feature activations toward zero) is another area for improvement.
- There are likely other areas for improvement we don’t know about.

#### Results

Given a fixed dataset $X$, the loss consistently decreases as we increase $m$. We’ve been able to increase $m$ to single digit millions without issues. This holds across a variety of transformer sizes, for both MLP activations and the residual stream. Our setup from Towards Monosemanticity would frequently have higher loss, many dead features, or many nearly identical features when run with large values of $m$.

We make changes to our training setup by looking at loss across a variety of values of $\lambda$, $m$, transformer sizes, and MLP or residual stream runs. We’re generally excited by a change that consistently decreases loss by at least 1%, or a change with roughly equal loss that simplifies our training setup. With our setup, comparing runs on (L0, MSE) or (L0, % of MLP loss recovered) requires care because L0 can be unstable. For example, we’ve had cases where training twice as long with half the number of features leads to a <1% change in MSE and L1 but a 30% decrease in L0.

Here are some results from small models. All runs have 131,072 features, 200k train steps, batch size 4096. Note that L1 of f depends on our specific normalization of activations.

| Type of Run | Lambda | L0(f) | L1(f) | Normalized MSE | Frac Cross Entropy Loss Recovered |
|---|---|---|---|---|---|
| 1L MLP | 2 | 99.62386 | 17.22560 | 0.03054 | 0.98305 |
| 1L MLP | 5 | 38.68729 | 11.59591 | 0.06609 | 0.96398 |
| 1L MLP | 10 | 20.06870 | 7.12194 | 0.13120 | 0.91426 |
| 4L MLP (layer 2) | 2 | 264.02930 | 95.03488 | 0.06824 | 0.96824 |
| 4L MLP (layer 2) | 5 | 69.92758 | 56.92384 | 0.12546 | 0.92904 |
| 4L MLP (layer 2) | 10 | 26.48456 | 39.42661 | 0.18485 | 0.88438 |
| 4L Residual Stream (layer 2) | 2 | 81.58595 | 30.37323 | 0.09543 | 0.9572 |
| 4L Residual Stream (layer 2) | 5 | 33.23121 | 19.12259 | 0.16295 | 0.90443 |
| 4L Residual Stream (layer 2) | 10 | 8.71466 | 12.53889 | 0.25455 | 0.83883 |

### How Strongly do Dictionary Learning Features Influence Model Behavior?

Jack Lindsey

Features uncovered by sparse autoencoders are optimized to reconstruct model activity while remaining sparsely active. Our team and others have observed that these features often appear to encode specific, interpretable concepts. However, a potential concern about using these features for an interpretability agenda is that, despite their semantic significance to humans, these features may not capture the abstractions that the model uses for its computation. We have conducted preliminary experiments that suggest that models do in fact “listen” to feature values significantly more than would be expected by chance.

Our experiment works as follows: We train a sparse autoencoder (SAE) on the residual stream following the third layer of a trained four-layer transformer. For each SAE feature, we take a representative sample of datapoints for which that feature has a nonzero activation, scale the value of that activation by a factor of either 0 (“feature ablation”) or 2 (“feature doubling”), and propagate the updated value through the model according to the same procedure as in Towards Monosemanticity. We compute the average increase in the model’s loss following this procedure.

Our goal is to determine whether the loss increase from rescaling feature activation values is especially high compared to other model perturbations with similar statistics. If so, it would provide evidence that feature directions exert “special” influence on downstream computation in the model. To this end, we compare feature rescaling to several controls:

Apply a random perturbation to the residual stream activity at the same layer, matching the magnitude of the random perturbation to the magnitude of the perturbation induced by the feature rescaling (“random perturbation, model activations” in the figure).

This control is meant to test whether feature ablations are more significant than random perturbations. Arguably, this is a weak baseline, as the variance of model activations is likely not isotropic, and thus some dimensions of residual stream activity may be less consequential to model behavior. SAE feature vectors are trained to reconstruct model activity, and as a result probably concentrate in more important dimensions. Thus, as a stronger baseline, we tried the following:

Apply a random perturbation of the same magnitude as the feature rescaling on the feature activations vector (“random perturbation, feature activations” in the figure), and propagate that perturbation through to the model according to the same procedure used for the feature rescalings.
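The feature rescaling and the two magnitude-matched random controls can be sketched as perturbations of a single activation vector. This is an illustration under assumptions: the rescaling is applied by adding the change in one feature's decoder contribution to the original activation (so the SAE error term is left untouched), the parameters are random made-up values, and the function names are hypothetical. Measuring the resulting change in model loss would require running the rest of the model, which is omitted here.

```python
import numpy as np

def rescale_feature(x, f, W_d, i, scale):
    """Activation after multiplying feature i's activation by `scale`.

    scale=0 is a feature ablation, scale=2 is a feature doubling.
    """
    return x + (scale - 1.0) * f[i] * W_d[:, i]

def random_activation_perturbation(x, x_perturbed, rng):
    """Random direction in activation space with the same L2 norm as x_perturbed - x."""
    direction = rng.standard_normal(x.shape)
    direction /= np.linalg.norm(direction)
    return x + np.linalg.norm(x_perturbed - x) * direction

def random_feature_perturbation(x, f, W_d, i, scale, rng):
    """Random direction in feature space, matched to the rescaling's size there,
    then mapped into activation space through the decoder."""
    magnitude = abs(scale - 1.0) * f[i]
    direction = rng.standard_normal(f.shape)
    direction /= np.linalg.norm(direction)
    return x + W_d @ (magnitude * direction)

rng = np.random.default_rng(0)
n, m = 8, 16
W_d = rng.standard_normal((n, m))
f = np.zeros(m)
f[[2, 5, 11]] = [1.5, 0.3, 2.0]  # made-up sparse feature activations
x = rng.standard_normal(n)
i = 5

x_ablate = rescale_feature(x, f, W_d, i, scale=0.0)
x_double = rescale_feature(x, f, W_d, i, scale=2.0)
x_random = random_activation_perturbation(x, x_ablate, rng)
x_random_feat = random_feature_perturbation(x, f, W_d, i, 0.0, rng)
print(np.linalg.norm(x_ablate - x), np.linalg.norm(x_random - x))
```

Under this formulation, ablation and doubling move the activation by equal and opposite amounts, which is what makes their effects on model loss directly comparable.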

These experiments revealed several interesting findings:

We found that feature ablations have significantly greater impact on model performance than either of the controls.

Interestingly, ablating the feature activation has a substantially greater effect on model performance than amplifying the feature by the same amount, suggesting that the influence of features on model outputs may (partially) saturate at higher feature activations.

We also compared the effect of feature perturbations to other, more structured forms of perturbations. In all cases we match the magnitude of the perturbation in model activation space to be equal to that of the corresponding feature ablation.

Controlling for perturbation magnitude, applying perturbations in a direction antiparallel to the SAE reconstructions (“dampen feature activations” in the figure) – equivalent to “spreading out” a feature ablation across all features, in proportion to their activity – produces similar effects as single-feature ablations.

Controlling for magnitude, perturbations antiparallel to the model activity (“dampen model activations” in the figure) have less impact than feature dampening. Note that dampening model activations is different than dampening feature activations, as model activations include two components – the bias term of the SAE, and the error vector left unexplained by the SAE – that are not affected by feature dampening. This result indicates that the feature dampening effect is not explainable simply due to its effect on activation norm. In fact, in later layers, the effect of dampening model activations almost vanishes.

Consistent with the results of Gurnee (discussed in detail elsewhere in this update), magnitude-controlled perturbations along the reconstruction error direction -(x - SAE(x)) (“perturb along residual” in the figure) are also substantially more impactful than random perturbations, though less impactful than feature ablations. While not conclusive, these results suggest that Gurnee’s findings are consistent with a model in which SAE reconstruction errors lie primarily along feature directions, as this would explain their greater-than-random impact on model outputs.

The outsized impact of feature ablations is more pronounced for residual stream features than MLP layer features. This suggests that residual stream and MLP features may play different functional roles in the network, though our understanding of this result is limited.

In this figure, results are averaged over contexts, tokens, and features, and error bars indicate standard error of the mean over features.

These results are preliminary, but generally support the idea that feature directions uncovered by SAEs are high-leverage “levers” for influencing model outputs.

### Interpretability Architectures Project

Chris Olah, Adam Jermyn

From time to time, we've noticed aspects of the transformer architecture that make interpretability work more difficult. For instance, layernorm makes circuit analysis and attribution more difficult, and we've invested significant effort in trying to get rid of it over the years. Similarly, SoLU was an attempt at making models more interpretable, although we ultimately believe it wasn't the right approach to that specific problem.

We believe it's possible that investing in model architecture now may save a lot of interpretability effort in the future. For this reason, we’re starting an experimental working group to explore more interpretable architectures. This working group will investigate architectural decisions that might make interpretability easier, and will collaborate with the Pretraining team to support their implementation. For now, this working group will be smaller than the main interpretability teams (dictionary learning, attention, and circuits). This working group will be embedded in both Interpretability and Pretraining, and members will sometimes contribute to projects on both of these broader teams.

If you’re interested in this new working group, please apply to join our team and indicate interest in working on interpretable architectures (see above).

### Caloric and the Utility of Incorrect Theories

Tom Henighan

“Caloric theory” is an outdated theory of heat which can be summarized as follows:

There is a massless, self-repelling substance called “caloric” which increases the temperature of whatever matter it inhabits.

It’s easy to scoff at this theory and regard it as silly with the benefit of hindsight. But in fact, this theory could explain many, if not most, thermodynamic measurements that were available at the time. Consider the phenomenon of heat flow as a concrete example. Under caloric theory, a hot (high temperature) object contains a lot of caloric. That caloric is self-repelling, but cannot spread out more because it’s constrained by the boundaries of the hot object. If we put this hot object into contact with a cold object, the caloric spreads into the cold object. The hot object cools and the cold object heats as caloric flows from the former to the latter.

I now invite the reader to put themselves in the shoes of an 18th century scientist and design an experiment which can disprove caloric theory. Further, imagine how much more challenging this would be if you didn’t know about the kinetic theory of heat, which eventually supplanted caloric theory. This exercise gave me personally a lot of empathy for the calorists of old.

What I find most interesting about caloric theory is that although it was wrong, it yielded insights which we still hold true today.

One notable example is the Carnot cycle. Quoting Wikipedia:

> Sadi Carnot, who reasoned purely on the basis of the caloric theory, developed his principle of the Carnot cycle, which still forms the basis of heat engine theory. Carnot's analysis of energy flow in steam engines (1824) marks the beginning of ideas which led thirty years later to the recognition of the second law of thermodynamics.

In other words, the road to the heat engine theory and eventually the second law of thermodynamics was paved, in part, by an incorrect theory of heat. Another success of caloric theory was a correction to Newton’s calculation of the speed of sound in air, which held for nearly a century afterward.

#### Implications for Interpretability

I think there are many lessons we as interpretability researchers can learn from the history of caloric theory. Our initial theories will probably be wrong, and we should be willing to change our theories in the face of experimental evidence. Designing experiments which demonstrate that those theories are wrong will be a central challenge for us. But the more subtle point that I want to emphasize is that wrong theories can still provide real utility. Even if we think the superposition hypothesis will be disproven in the future, which it may very well be, using it is not a fool’s errand. There is still hope that it will be “correct enough” to illuminate practical safety wins and even scientific understanding which outlive the superposition hypothesis itself.

### Open Problem: Attribution Dictionary Learning

Chris Olah, Adly Templeton, Trenton Bricken, Adam Jermyn

Ordinary dictionary learning only considers activations. It ignores gradients and weights. It seems like we should be able to make it much more efficient if it took these into account.

More fundamentally, it seems like features have a dual nature. Looking backwards towards the input, they are "representations". Looking forwards towards the output, they are "actions". Both of these should be sparse – that is, they should sparsely represent the activations produced by the input, and also sparsely affect the gradients influencing the output. Ultimately, it seems like they should be a kind of conjunction of these two kinds of sparsity.

#### A Concrete Proposal

One operationalization of this might be to change the dictionary learning to ask that the linear attribution be sparse.

Consider a dictionary learning problem $x \simeq x' = Dy$ where $x$ is the dense activations, $y$ the sparse feature activations, and $D$ the dictionary. A traditional SAE loss might be:

$$L_{SAE} = \underbrace{\|x - x'\|^2}_{\text{Reconstruction error}} + \underbrace{\lambda\|y\|_1}_{\text{Activation sparsity penalty}}$$

Recall that the linear attribution vector $A_x = x \odot \nabla_xL_{LLM}$ gives an approximation of how each element of $x$ affects the LLM loss. An analogous attribution vector can be computed for $y$: $A_y = y \odot \nabla_yL_{LLM}$. Note that the gradient of the LLM loss with respect to $y$ is easily computed if we save the gradient of the loss with respect to the input activations, since $\nabla_yL_{LLM} = D^T\nabla_xL_{LLM}$ where $D$ is the dictionary.
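The chain-rule identity for the feature gradient can be checked numerically. The sketch below is only an illustration: the quadratic `llm_loss` is a stand-in for the real LLM loss, the gradient is evaluated at the reconstruction $Dy$ so that the identity holds exactly (an assumption made so that finite differences can verify it), and all values are random.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 10
D = rng.standard_normal((n, m))               # dictionary (decoder)
y = np.maximum(rng.standard_normal(m), 0.0)   # made-up sparse-ish feature activations
target = rng.standard_normal(n)

def llm_loss(x):
    # Stand-in for the LLM loss as a function of the activations x.
    return 0.5 * np.sum((x - target) ** 2)

def llm_loss_grad(x):
    # Analytic gradient of the stand-in loss with respect to x.
    return x - target

x_prime = D @ y
grad_x = llm_loss_grad(x_prime)   # the saved gradient with respect to activations
grad_y = D.T @ grad_x             # chain rule: gradient with respect to features
A_y = y * grad_y                  # linear attribution of each feature

# Central finite differences with respect to each feature activation.
eps = 1e-6
grad_y_fd = np.array([
    (llm_loss(D @ (y + eps * e)) - llm_loss(D @ (y - eps * e))) / (2 * eps)
    for e in np.eye(m)
])
print(np.abs(grad_y - grad_y_fd).max())
```

A side effect of the identity is that the feature attributions sum to the attribution of the reconstruction, since $y \cdot D^T\nabla_x L = (Dy) \cdot \nabla_x L$.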

We can then add terms to the SAE loss to encourage this to be sparse and to fully explain the attribution:

$$\begin{aligned} L_{SAE} &= \|x - x'\|^2 + \lambda\|y\|_1 + \alpha\|A_y\|_1 + \beta\left(\sum A_{x-x'} \right) \\ &= \underbrace{\|x - x'\|^2}_{\text{Reconstruction error}} + \underbrace{\lambda\|y\|_1}_{\text{Activation sparsity penalty}} + \underbrace{\alpha\|y \odot \nabla_yL_{LLM}\|_1}_{\text{Attribution sparsity penalty}} + \underbrace{\beta|(x\!-\!x')\cdot \nabla_xL_{LLM}|}_{\text{Unexplained attribution penalty}} \end{aligned}$$

This directly optimizes the sparsity of the attribution vector we recently used to study feature circuits in Using Features For Easy Circuit Identification (see also attribution in Marks et al. on circuits of features, Kramár et al. on attribution patching, and Olah et al. on attribution to neurons in vision models).
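For a single activation vector, the four terms of the proposed loss can be computed as in the sketch below. It assumes the gradient `grad_x` of the LLM loss with respect to the activations has already been saved, takes the feature activations `y` as given rather than producing them with an encoder, and uses a hypothetical function name and made-up numbers.

```python
import numpy as np

def attribution_sae_loss(x, y, D, grad_x, lam, alpha, beta):
    """Loss for one datapoint: reconstruction, activation sparsity,
    attribution sparsity, and unexplained attribution."""
    x_prime = D @ y                 # reconstruction
    grad_y = D.T @ grad_x           # gradient with respect to feature activations
    reconstruction = np.sum((x - x_prime) ** 2)
    activation_sparsity = lam * np.sum(np.abs(y))
    attribution_sparsity = alpha * np.sum(np.abs(y * grad_y))
    unexplained_attribution = beta * np.abs((x - x_prime) @ grad_x)
    return (reconstruction + activation_sparsity
            + attribution_sparsity + unexplained_attribution)

# Tiny made-up example with an identity dictionary.
D = np.eye(3)
y = np.array([1.0, 0.0, 2.0])
grad_x = np.array([0.5, -1.0, 0.25])
x = np.array([2.0, 0.0, 2.0])  # differs from D @ y in the first coordinate
example_loss = attribution_sae_loss(x, y, D, grad_x, lam=1.0, alpha=2.0, beta=3.0)
print(example_loss)
```

Setting `alpha` and `beta` to zero recovers the traditional SAE loss, which makes it straightforward to compare features learned with and without the attribution terms.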

We briefly investigated the features produced by this loss in a one-layer transformer. At first glance, they seemed about as good as our normal features in that context. But we don't consider this at all dispositive. We plan to revisit this at some point in the future, but it may not be for a few months, and could be an interesting subject for someone else to investigate in the interim.

### Isolating Circuits Paths of Different Lengths

Chris Olah

In A Mathematical Framework for Transformer Circuits, we briefly described an algorithm for studying paths of length at most k:

However, we didn't really explain why this algorithm works, and it was easy to miss. We think there are some cases where this algorithm can be interesting, and this update provides a more intuitive explanation.

Let's picture what happens as we apply this algorithm:

It's easy to see that on the first iteration, we isolate paths of length 0 (i.e. we only use the direct path on the residual stream). But also note that we're saving the paths of length 1.

On the next iteration, we use the paths of length 1, and save the paths of length 2. And so on.
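The iteration described above can be mimicked in a toy setting. The sketch below is not the algorithm from A Mathematical Framework for Transformer Circuits: it assumes a purely linear residual network in which each layer reads the residual stream and adds a fixed linear map of it (roughly what remains once attention patterns and nonlinearities are frozen), and the function name and matrices are hypothetical. On each iteration, every layer reads only the outputs that earlier layers saved on the previous iteration.

```python
import numpy as np

def contributions_by_path_length(x, layers, max_len):
    """Return c where c[k] is the part of the final residual stream
    that flows through exactly k layers."""
    contributions = [x.copy()]  # length 0: the direct path
    prev_outputs = None         # per-layer outputs saved on the previous iteration
    for k in range(1, max_len + 1):
        new_outputs = []
        for l, A in enumerate(layers):
            if k == 1:
                layer_input = x  # length-1 paths read only the direct path
            else:
                # Read what earlier layers saved: paths of length k - 1.
                layer_input = sum(prev_outputs[:l], np.zeros_like(x))
            new_outputs.append(A @ layer_input)
        contributions.append(sum(new_outputs, np.zeros_like(x)))
        prev_outputs = new_outputs
    return contributions

rng = np.random.default_rng(0)
d, n_layers = 5, 4
layers = [rng.standard_normal((d, d)) * 0.3 for _ in range(n_layers)]
x = rng.standard_normal(d)

# Ordinary forward pass of the toy model: each layer adds A @ residual.
residual = x.copy()
for A in layers:
    residual = residual + A @ residual

contribs = contributions_by_path_length(x, layers, max_len=n_layers)
at_most_two = sum(contribs[:3])  # paths of length at most 2
print(np.abs(sum(contribs) - residual).max())
```

In this toy model, summing the contributions over all path lengths recovers the ordinary forward pass, and truncating the sum after $k$ terms keeps only paths of length at most $k$.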