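# Mechanistic Anomaly Detection Research Update

- Source: EleutherAI: Blog
- Published: 2024-08-06 00:00
- AIHOT link: https://aihot.virxact.com/items/cmnxbjhcv003usln0yt8sspe2
- Original link: https://blog.eleuther.ai/mad_research_update

## AI summary

This is an interim progress report on a mechanistic anomaly detection research project, summarising ongoing work in the area. The report is at an interim stage and focuses on the technical approach of identifying anomalous behaviour by understanding a model's internal mechanisms, but it does not yet disclose specific technical breakthroughs, experimental data or performance metrics. A later, complete version will provide more detailed methodology and empirical results.

## Article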


In December 2023, the Eleuther team published Eliciting Latent Knowledge from Quirky Language Models. We finetuned language models to behave in a "quirky" manner on a collection of question and answer datasets. When a prompt began with "Alice:", these models were trained to answer as accurately as possible, but when it instead began with "Bob:", they would answer according to an unreliable heuristic (Bob would not always be wrong, but would consistently use the same fallible method to answer questions). One problem we investigated was detecting, in an unsupervised manner, when the model was behaving in an "Alice"-like way and when it was behaving in a "Bob"-like way. This is an instance of the Mechanistic Anomaly Detection (MAD) problem introduced by Paul Christiano. We obtained strong results with a simple anomaly detection technique: an average AUROC of 0.95 across all quirky datasets, and close to 0.99 for those datasets where the models learned to be strongly quirky.

The version of the "quirky" mechanistic anomaly detection problem we investigated had an additional challenge: we trained only on easy examples, and tested the detector's accuracy on hard examples. We did this to measure how well detectors work under a distribution shift.

Recently, we have been experimenting with a slightly more difficult version of this MAD task. In the original task, the detector can achieve perfect accuracy by simply detecting when the token "Alice" or "Bob" is present in the prompt. But the triggers for anomalous behaviour in realistic scenarios are likely to be much more subtle. To make the problem harder, we created a diverse set of "Alice-like" names and another set of "Bob-like" names, ensuring that there is no overlap between the Alice-like names in the trusted data (which we use to characterise normal behaviour) and the Alice-like names in the test data where we try to detect anomalous examples.

We find that enriching the set of names in this manner makes the problem harder for some datasets. We have been experimenting with several approaches to anomaly detection. We've tested a collection of different detectors to detect anomalies based on model activations. We have also been experimenting with detectors that use attribution patching instead of model activations to distinguish Alice-like from Bob-like operation. We have found:

- For some datasets, hard examples yielded a significantly different distribution of activations from easy examples, so we were successful in inducing a distribution shift between training and test examples.
- Adding additional names makes the problem harder.
- We can achieve very high classification accuracy on some datasets if we have access to test set statistics (AUROC >0.999) and moderately high accuracy without them (AUROC >0.95).
- For some datasets, high accuracy depends on hyperparameters that may be difficult to tune, and on others we do not achieve high accuracy at all.
- Achieving uniformly high classification accuracy is difficult.
- We compared methods based on model activations and attribution patching. Activation based methods offered the highest classification accuracy, though some methods based on attribution patching were less sensitive to layer choice.
- Combining activations and effects estimated from attribution patching did not significantly improve detection accuracy.
- Visualisations suggest that attribution patching may benefit from more sophisticated scoring methods.
- Activation based anomaly detection can also detect adversarial examples in non-robust vision models but not in robust models.
- Attribution based anomaly detection can detect adversarial examples in robust image models at a rate slightly above chance.

### Experimental setup

We fine-tuned Mistral 7B v0.1 on 12 quirky datasets with LoRA, early stopping on the overall validation loss computed on both Alice and Bob examples. The maximum number of epochs for each dataset is listed in the table below. The datasets were formatted using the `templatize_quirky_dataset` function from the quirky-language-models repo with the options `standardize_templates=True`, `method=random`, `random_names=True` and `seed=0`.

| Dataset | Max Epochs | Notes |
|---|---|---|
| capitals | 120 | |
| hemisphere | 15 | |
| population | 30 | |
| sciq | 30 | Alternate version provided by Eric Jenner with additional examples and no examples missing support |
| sentiment | 60 | |
| nli | 60 | |
| authors | 60 | |
| addition | 15 | |
| subtraction | 15 | |
| multiplication | 15 | |
| modularaddition | 30 | |
| squaring | 15 | |

We also fine-tuned Llama 3 and 3.1 8B on these datasets, and we are planning to use these models as held-out test models for our anomaly detection methods.

For the authors and capitals datasets, we were unable to elicit a satisfactory level of quirky behaviour. Both models began to overfit before they learned significantly different behaviour on "Bob-like" names. We therefore exclude them from our analysis.

Each dataset was partitioned four ways:

- Trusted: Datapoints with Alice behaviour and difficulty below the 25th percentile of the whole dataset. One of the first four "Alice names" was used in the prompts for these datapoints.
- Alice Untrusted: A dataset with identical parameters to trusted but different examples.
- Bob Untrusted: Datapoints with Bob behaviour and difficulty below the 25th percentile of the whole dataset. One of the first four "Bob names" was used in the prompts.
- Test: Datapoints with Alice or Bob behaviour above the 75th percentile difficulty. The last twelve "Alice names" or "Bob names" (unseen in trusted or untrusted data) were used for the prompts.

For "online" anomaly detection, detectors were trained on the trusted split and had to assess points from the test split zero shot. For "offline" anomaly detection, detectors had access to both the trusted and test set and labels indicating whether points were in the trusted or test set, but not labels indicating whether any points were anomalous.
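The four-way partition described above can be sketched roughly as follows. This is an illustrative sketch, not the code used for the experiments: the function name `partition_quirky`, the `name_idx` encoding (position of the prompt name in its persona's name list, assumed here to have 4 + 12 = 16 entries) and the even split of easy Alice examples between the trusted and Alice-untrusted sets are all assumptions.

```python
import numpy as np

def partition_quirky(difficulty, is_bob, name_idx, n_seen_names=4, seed=0):
    """Return index arrays for the trusted / untrusted / test splits.

    difficulty: per-example difficulty score (higher = harder)
    is_bob:     True where the example uses Bob behaviour
    name_idx:   index of the prompt name within its persona's name list
                (hypothetical encoding; the first `n_seen_names` are "seen")
    """
    difficulty = np.asarray(difficulty, dtype=float)
    is_bob = np.asarray(is_bob, dtype=bool)
    name_idx = np.asarray(name_idx)

    low, high = np.percentile(difficulty, [25, 75])
    easy = difficulty < low
    hard = difficulty > high
    seen = name_idx < n_seen_names

    # Easy Alice examples are shared between "trusted" and "Alice untrusted".
    # The 50/50 split is an assumption made for this sketch.
    easy_alice = np.flatnonzero(easy & ~is_bob & seen)
    np.random.default_rng(seed).shuffle(easy_alice)
    half = len(easy_alice) // 2

    return {
        "trusted": np.sort(easy_alice[:half]),
        "alice_untrusted": np.sort(easy_alice[half:]),
        "bob_untrusted": np.flatnonzero(easy & is_bob & seen),
        # Hard examples of either persona, using only unseen names.
        "test": np.flatnonzero(hard & ~seen),
    }

# Synthetic stand-in data, purely for demonstration.
rng = np.random.default_rng(0)
n = 2000
difficulty = rng.normal(size=n)
is_bob = rng.random(n) < 0.5
name_idx = rng.integers(0, 16, size=n)
splits = partition_quirky(difficulty, is_bob, name_idx)
```

An online detector would be fit on `splits["trusted"]` only, while an offline detector would additionally see the unlabeled `splits["test"]` examples.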

Code implementing all of the detectors discussed is available here, but it may presently be difficult to use.

### Additional names make the problem harder

In Eliciting Latent Knowledge, we had a two-step process for mechanistic anomaly detection:

- We trained a collection of probes on the activations of each layer to predict the model output given only Alice's easy examples.
- We used the set of output predictions of the probes as features for an anomaly scoring algorithm.

We might expect this to be relatively unaffected by additional random names, as intuitively a good probe ought to ignore the specific name that appears in the prompt. However, in practice we find significantly worse performance on the randomised name dataset, as shown by Figure 1.

Figure 1: Comparison of MAD AUC with randomised and nonrandomised names

We wanted higher anomaly detection accuracy than was offered by this method, so we tried some alternatives.

### Methods

There are two different settings where we could imagine employing MAD:

- An "online" setting, where we want to detect anomalous behaviour as soon as it occurs and take appropriate action.
- An "offline" setting, where we are able to collect many examples of untrusted behaviour and use the collection to identify anomalies.

If a system has a small number of anomalous modes of operation, then the offline problem could be easier, because we could get many examples for each mode. Intuitively, this might not be true if the number of modes of anomalous operation is larger than the number of examples of anomalous behaviour -- in this case, each anomaly could be of a new type. In our quirky datasets, Bob has a single unusual answering strategy, which plausibly corresponds to a single anomalous mode of operation, so the offline problem may be easier.

We tried anomaly detection methods suited to each setting. We experimented with different scoring rules and different sets of features from which to compute scores. The feature sets we tried included:

- Activations: We take "activations" from the residual stream at various internal layers of the model.
  - MLP activations: For a subset of scores, we experimented with taking activations from the MLP outputs rather than the residual stream.
- Attribution: We measure the approximate effect of ablating the model's activations on the final difference in logits between "Yes" and "No" tokens using attribution patching. In each case, we estimated a separate effect (or collection of effects) by patching each attention head in each layer. We experimented with several different ablations:
  - Mean: We replace the activations of each of the attention heads in a given layer with the mean activation of that attention head.
  - Principal components: We replace the activations of each of the attention heads with the activations given by subtracting the difference between the activation's projection on each of the top 10 principal components and the mean of activations projected on the same principal component.
  - Grad norm: We take the norm of the gradient for each attention head in a layer as a collection of features.
- Probe shift: Using attribution patching, we take the difference between the estimated effect of an intervention on the network's output, and the estimated effect on the output of a linear probe trained to predict the network's output on an intermediate layer. All of the above ablations are also applicable to this method. For a probe at layer $i$, we consider the effect of all interventions in layers prior to $i$.
- Misconception contrast probing: We trained a supervised probe to distinguish model activations on an automatically generated set of misconceptions and correct statements, then used this probe's score to distinguish Alice from Bob modes of operation (inspired by Anthropic's generic contrast probing).
- Activation + attribution: We concatenate the activations and approximate effects computed from attribution patching.
- Confidence probing via iterative prompt rephrasing: We examined how sensitive the model results were to few shot rephrasing. Specifically, we automatically generated rephrasings of the form "{Question} one answer is No. {Repeat question}", and measured how much the model's answers shifted with rephrasings like this. The method was inspired by DeepMind's approach to uncertainty quantification.
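To make the mean-ablation attribution features concrete, here is a minimal sketch of the first-order approximation that attribution patching relies on: the effect of replacing head `h`'s activation with its mean is estimated as `(mean[h] - act[h]) · grad[h]`, using a single gradient evaluation. The "model" here is a toy differentiable function (`logit_diff`) with an analytic gradient, standing in for the real "Yes" minus "No" logit difference; all names are hypothetical and nothing here is taken from the project's code.

```python
import numpy as np

def logit_diff(heads, w):
    """Toy stand-in for the model's 'Yes' minus 'No' logit difference."""
    return np.tanh(np.sum(heads * w))

def logit_diff_grad(heads, w):
    """Analytic gradient of logit_diff with respect to each head activation."""
    return (1.0 - np.tanh(np.sum(heads * w)) ** 2) * w

def mean_ablation_effects(heads, head_means, w):
    """First-order (attribution patching) estimate of each mean ablation.

    effect[h] ~= (mean[h] - act[h]) . d logit_diff / d act[h]
    """
    grad = logit_diff_grad(heads, w)
    return np.sum((head_means - heads) * grad, axis=-1)

def exact_mean_ablation_effects(heads, head_means, w):
    """Ground truth: actually patch one head at a time and re-run the model."""
    base = logit_diff(heads, w)
    effects = np.empty(heads.shape[0])
    for h in range(heads.shape[0]):
        patched = heads.copy()
        patched[h] = head_means[h]
        effects[h] = logit_diff(patched, w) - base
    return effects

rng = np.random.default_rng(0)
n_heads, d_head = 8, 16
w = rng.normal(size=(n_heads, d_head)) * 0.05
head_means = rng.normal(size=(n_heads, d_head)) * 0.1
heads = head_means + rng.normal(size=(n_heads, d_head)) * 0.1

approx = mean_ablation_effects(heads, head_means, w)      # one feature per head
exact = exact_mean_ablation_effects(heads, head_means, w)
# "Grad norm" features: the gradient norm for each head.
grad_norm = np.linalg.norm(logit_diff_grad(heads, w), axis=-1)
```

The per-head vector `approx` plays the role of the feature vector that an anomaly score is then computed on. The approximation is only accurate when the patch is small relative to the curvature of the model, which holds in this toy setup but is not guaranteed for a real network.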

More details on attribution patching are given further down. We also tried combining attribution patching and activations in two ways:

- Merge: We concatenate the features before computing anomaly scores.
- Ensemble: We compute anomaly scores for each set of features and then average the scores.
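The difference between the two combination strategies can be sketched as below. The scoring function used here is a plain squared Mahalanobis distance, chosen only for illustration, and the ensemble averages raw scores without any normalisation; whether the original experiments normalised scores before averaging is not stated, so treat that as an assumption.

```python
import numpy as np

def mahalanobis_sq(train, test):
    """Squared Mahalanobis distance of each test row from the train mean."""
    mu = train.mean(axis=0)
    precision = np.linalg.pinv(np.cov(train, rowvar=False))
    diff = test - mu
    return np.einsum("ij,jk,ik->i", diff, precision, diff)

def merge_scores(train_a, test_a, train_b, test_b):
    """Merge: concatenate the two feature sets, then score once."""
    train = np.concatenate([train_a, train_b], axis=1)
    test = np.concatenate([test_a, test_b], axis=1)
    return mahalanobis_sq(train, test)

def ensemble_scores(train_a, test_a, train_b, test_b):
    """Ensemble: score each feature set separately, then average."""
    return 0.5 * (mahalanobis_sq(train_a, test_a) + mahalanobis_sq(train_b, test_b))

# Synthetic "activation" (a) and "attribution" (b) features for demonstration.
rng = np.random.default_rng(0)
train_a, test_a = rng.normal(size=(400, 6)), rng.normal(size=(50, 6))
train_b, test_b = rng.normal(size=(400, 3)), rng.normal(size=(50, 3))

merged = merge_scores(train_a, test_a, train_b, test_b)
ensembled = ensemble_scores(train_a, test_a, train_b, test_b)
```

Merging lets the score use correlations between the two feature sets, while the ensemble treats them independently.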

Finally, we experimented with cross layer concatenation of effects estimated from attribution patching:

- Concatenate: Features from each layer are concatenated before the score is calculated.

Given these features, we experimented with a number of different scoring methods, some of them "online" (they do not require batch statistics for the test set) and some "offline" (they do). The online scores we experimented with were:

- Raw average: Relevant to contrast probing and prompt rephrasing: we simply average the confidence the probe assigns to the example being misleading.
- Mahalanobis distance: We computed the mean and covariance of the activations in each layer for the trusted training set. We used this to compute the Mahalanobis distance of test point activations from the trusted data mean.
  - Mahalanobis-PCA: We also tried a variant of the Mahalanobis distance that computed the distance only along the top principal components.
- Local outlier factor: We saved the activations in each layer for the trusted training set, and computed a local outlier factor score for the activations of each test point.
- Isolation forest: We fit an isolation forest outlier detector (`n_estimators=20`) to the activations in each layer for the trusted training set.
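As an illustration of the Mahalanobis and Mahalanobis-PCA scores, here is a minimal NumPy sketch. The class name `MahalanobisDetector`, the eigenvalue cutoff used to handle rank-deficient covariances and the choice of three components in the PCA variant are assumptions made for this example, not details from the experiments.

```python
import numpy as np

class MahalanobisDetector:
    """Squared Mahalanobis distance from the trusted mean.

    With n_components set, the distance is computed only along the top
    principal components of the trusted data (the "Mahalanobis-PCA" variant).
    """

    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, trusted):
        self.mean_ = trusted.mean(axis=0)
        cov = np.cov(trusted, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]          # largest variance first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        # Drop numerically-zero directions (assumed cutoff).
        keep = eigvals > 1e-10 * eigvals[0]
        if self.n_components is not None:
            keep[self.n_components:] = False
        self.eigvals_ = eigvals[keep]
        self.eigvecs_ = eigvecs[:, keep]
        return self

    def score(self, x):
        proj = (x - self.mean_) @ self.eigvecs_
        return np.sum(proj ** 2 / self.eigvals_, axis=1)

# Synthetic demonstration: anomalous points are shifted away from the mean.
rng = np.random.default_rng(0)
trusted = rng.normal(size=(500, 10))
normal = rng.normal(size=(200, 10))
anomalous = rng.normal(size=(200, 10)) + 1.5

full = MahalanobisDetector().fit(trusted)
pca = MahalanobisDetector(n_components=3).fit(trusted)
normal_scores, anomalous_scores = full.score(normal), full.score(anomalous)
```

Because the detector is fit on trusted data only and scores each point independently, it needs no statistics from the test set, which is what makes it an online score.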

For the offline setting, we experimented with:

- Quantum entropy: Based on the SPECTRE paper, we calculated a "quantum entropy" score that interpolates between the Mahalanobis distance and the distance along the largest principal component of the test data covariance matrix, which we calculate after the test data has been whitened using the covariance matrix of the trusted data (credit to Eric Jenner for this method).
- Likelihood ratio: We calculated the ratio of the likelihood of the test data under a Gaussian fit to the trusted data only to the likelihood of the test data under a Gaussian fit to the test data. Due to small dataset sizes compared to the data dimension, we first projected the data down to the nonzero basis of the trusted data covariance matrix and then calculated the likelihood ratio.
- Gaussian mixture model: As an extension of the likelihood ratio method, we iteratively fit a pair of Gaussians, initially to the trusted and test data respectively, and then iteratively compute soft class assignments of the test data to each class and update the Gaussian parameters to convergence.
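The quantum entropy score can be sketched as follows, based on our reading of the description above and of the SPECTRE formulation, where the weighting matrix is $Q_\alpha = \exp\big(\alpha(\tilde\Sigma - I)/(\lVert\tilde\Sigma\rVert_2 - 1)\big)$ and the score is $z^\top Q_\alpha z / \mathrm{tr}(Q_\alpha)$ for whitened points $z$. Several details are assumptions rather than confirmed facts about the implementation used: the default `alpha=4.0`, taking the second moment of the whitened test data about the trusted mean, and the fallback to a plain Mahalanobis weighting when the top eigenvalue does not exceed 1.

```python
import numpy as np

def quantum_entropy_scores(trusted, test, alpha=4.0):
    """Quantum-entropy-style anomaly score for every row of `test`."""
    mu = trusted.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(trusted, rowvar=False))
    keep = vals > 1e-10 * vals.max()               # nonzero trusted directions
    whitener = vecs[:, keep] / np.sqrt(vals[keep])
    z = (test - mu) @ whitener                     # whitened test data

    # Second moment of the whitened test data (about the trusted mean).
    test_cov = z.T @ z / len(z)
    t_vals, t_vecs = np.linalg.eigh(test_cov)
    top = t_vals.max()

    if top > 1.0 + 1e-8:
        exponent = alpha * (t_vals - 1.0) / (top - 1.0)
    else:
        exponent = np.zeros_like(t_vals)           # assumed fallback: Mahalanobis
    q_vals = np.exp(exponent)
    q_vals /= q_vals.sum()                         # divide by trace(Q)

    proj = z @ t_vecs                              # coordinates in Q's eigenbasis
    return np.sum(proj ** 2 * q_vals, axis=1)

# Synthetic demonstration: half of the test set is shifted along one axis.
rng = np.random.default_rng(0)
trusted = rng.normal(size=(1000, 8))
normal = rng.normal(size=(300, 8))
anomalous = rng.normal(size=(300, 8))
anomalous[:, 0] += 4.0
test = np.concatenate([normal, anomalous])

scores = quantum_entropy_scores(trusted, test)
normal_scores, anomalous_scores = scores[:300], scores[300:]
```

With `alpha=0` every direction is weighted equally and the score reduces to a rescaled squared Mahalanobis distance; as `alpha` grows, the weight concentrates on the direction in which the test data deviates most from the trusted data. The score needs the whole test batch, which is why it is an offline method.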

In addition, we visualised activations and estimated effects per layer using UMAP and PCA plots for the top four principal components.

### Results

Among online scores, the highest average AUROC (0.87) when scores were aggregated across layers was achieved by using the Mahalanobis score with activation features. Among offline scores, the highest average AUROC (0.94) was achieved by using the quantum entropy score together with activation features.
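For readers less familiar with the metric, AUROC is the probability that a randomly chosen anomalous example receives a higher anomaly score than a randomly chosen normal one, with ties counted as one half. A minimal sketch of that definition is below; the helper name `auroc` is ours, and the pairwise comparison is quadratic in the number of examples, so it is meant for illustration rather than large evaluations.

```python
import numpy as np

def auroc(scores, is_anomalous):
    """AUROC of anomaly scores via the pairwise (Mann-Whitney) definition."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_anomalous, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

example = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 3 of 4 pairs ordered correctly
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation of Alice-like from Bob-like examples.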

Averages mask substantial structure in scores:

- All methods struggled to consistently distinguish Alice from Bob on the `sciq` and `sentiment` datasets.
- Activation based methods performed well on all other datasets except `nli`.
- Where activation based methods performed well, they typically performed very well. Activations-quantum entropy achieved an AUROC of ~1 on the remaining datasets, while the lowest AUROC for activations-Mahalanobis distance on the remaining datasets was 0.988, on `hemisphere`.
- Attribution based methods typically achieved AUROC of around 0.9 on "easy" datasets, but performed notably better than activation based methods on `nli`.
- Probe-shift based methods were particularly insensitive to the choice of layer.

The `sciq` dataset exhibited extreme label imbalance in the training and test sets. Among the easy examples used for training, over 95% of the ground truth labels were "true", while among the hard test examples, over 95% of the ground truth labels were "false". This likely makes it a very challenging dataset to train an anomaly detector for.

We did not have success with contrast probing.

#### Online detectors

| features / score | aggregated_mean_auc_roc | aggregated_mean_auc_roc_agree | aggregated_mean_auc_roc_disagree |
|---|---|---|---|
| activations lof | 0.846 | 0.841 | 0.807 |
| activations mahalanobis | 0.871 | 0.866 | 0.856 |
| activations pca mahalanobis | 0.796 | 0.799 | 0.800 |
| attribution isoforest mean | 0.756 | 0.782 | 0.780 |
| attribution lof mean | 0.849 | 0.841 | 0.841 |
| attribution lof pcs | 0.852 | 0.846 | 0.839 |
| attribution mahalanobis concat mean | 0.755 | 0.762 | 0.777 |
| attribution mahalanobis mean | 0.805 | 0.797 | 0.798 |
| attribution mahalanobis pcs | 0.661 | 0.646 | 0.687 |
| attribution activations mahalanobis mean | 0.850 | 0.853 | 0.869 |
| attribution activations mahalanobis mean ensemble | 0.843 | 0.839 | 0.837 |
| iterative rephrase | 0.506 | 0.482 | 0.478 |
| misconception contrast | 0.642 | 0.629 | 0.639 |
| probe lof concat mean | 0.840 | 0.857 | 0.810 |
| probe lof concat pcs | 0.809 | 0.820 | 0.788 |
| probe lof mean | 0.853 | 0.855 | 0.845 |
| probe mahalanobis concat mean | 0.837 | 0.845 | 0.821 |
| probe mahalanobis mean | 0.848 | 0.847 | 0.842 |
| probe activations mahalanobis mean ensemble | 0.854 | 0.854 | 0.847 |

Figure 2: Performance of different online detectors where scores are aggregated across layers

Figure 3: Performance of different online detectors by layer

#### Offline detectors

| features / score | aggregated_mean_auc_roc | aggregated_mean_auc_roc_agree | aggregated_mean_auc_roc_disagree |
|---|---|---|---|
| activations que | 0.936 | 0.926 | 0.964 |
| attribution que concat mean | 0.791 | 0.785 | 0.837 |
| attribution que grad norm | 0.688 | 0.713 | 0.714 |
| attribution que pcs | 0.840 | 0.837 | 0.856 |
| probe que concat mean | 0.864 | 0.886 | 0.858 |
| probe que mean | 0.860 | 0.878 | 0.854 |
| probe que pcs | 0.860 | 0.866 | 0.872 |

Figure 4: Performance of different offline detectors where scores are aggregated across layers

Figure 5: Performance of different offline detectors by layer

### Adversarial image detection

We conducted an exploratory experiment applying the same detectors to the task of detecting adversarial images for image classifiers. We focused on the Mahalanobis distance, because it worked well in comparison to other detectors on the quirky models task and because it is quite simple. We found that an activations-Mahalanobis detector was able to detect adversarial images with an AUROC of around 0.8 (layer-dependent) in a standard image classifier. It could not perform above chance for a robust classifier, however. Using attribution patching to approximate mean ablation along top principal components, on the other hand, had highly layer-dependent accuracy for a standard classifier. Interestingly, exactly the same approach performed marginally better than chance, in a relatively layer-independent manner, for the robust classifier we tested. We did not test how robust the detectors themselves were to adversarial attack.

Figure 6: Performance of different adversarial image detectors by layer and image model. Standard is a non-robust classifier, while Carmon2019Unlabeled is a robust model

### Visualising features

In addition to testing various anomaly detectors, we also visualised per-layer activations and attribution patching based effect estimates using UMAP and principal component plots. For "easy" dataset feature combinations (such as activations on the `population` dataset), we saw clear separation between normal and anomalous points among the top principal components in middle to late layers. We often (though not always) saw similar cluster separations in both principal component and UMAP plots.
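The principal component side of these visualisations amounts to projecting each layer's features onto their top four principal components and plotting pairs of coordinates coloured by class. A minimal sketch of the projection step is below; the helper name `top_pcs` and the synthetic two-cluster data are ours, and the UMAP step is omitted because it needs an additional library.

```python
import numpy as np

def top_pcs(features, n_components=4):
    """Project features onto their top principal components.

    Returns the projected coordinates and the fraction of variance
    explained by each retained component.
    """
    centered = features - features.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = singular_values[:n_components] ** 2 / np.sum(singular_values ** 2)
    return coords, explained

# Synthetic stand-in for one layer's activations: two shifted clusters.
rng = np.random.default_rng(0)
normal = rng.normal(size=(300, 32))
anomalous = rng.normal(size=(300, 32))
anomalous[:, :4] += 3.0
features = np.concatenate([normal, anomalous])
is_anomalous = np.arange(600) >= 300

coords, explained = top_pcs(features)
# Gap between the class means along the first principal component.
pc1_gap = abs(coords[is_anomalous, 0].mean() - coords[~is_anomalous, 0].mean())
```

Plotting `coords[:, 0]` against `coords[:, 1]`, coloured by `is_anomalous`, gives the kind of scatter plot described in this section.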

#### Population

At layer 1, activations show a strong clustering pattern in both PCA and UMAP plots that does not correspond to any of the classes we have distinguished (true/false, trusted/test, normal/anomalous, agree/disagree).

Figure 7: PCA plot of population activations, layer 1

Figure 8: UMAP plot of population activations, layer 1

In the layer 16 activations, the PCA plots show clear separation of anomalous and normal examples in the direction of the top principal component. In the direction of the second principal component, there is substantial separation between the trusted data used to train the detector and the normal test examples, showing that there is a distribution shift between the easy and hard examples. The population dataset consists of questions like `input: "Does {character} think that {city} has a population greater than {population}?"`, and the difficulty metric is `-log(population)`, so it is possible that the second principal component tracks log population.

While the PCA plots for layer 16 activations show different directions that seem to correspond to normal/anomalous, easy/hard and true/false, the UMAP plot separates all of these classes at once.

Figure 9: PCA plot of population activations, layer 16

Figure 10: UMAP plot of population activations, layer 16

By layer 28, clusters visible at layer 16 appear to have separated further. We also see the emergence of further structure that does not correspond to any of the classes we distinguish by colour.

Figure 11: PCA plot of population activations, layer 28

Figure 12: UMAP plot of population activations, layer 28

For estimated mean ablation effects at layer 1, we see that anomalous examples are far more variable along the top principal component than normal examples, though they do not separate into different classes along any of the top principal components.

The UMAP plot, on the other hand, clearly distinguishes normal from anomalous examples. The anomalous examples have an interesting "double spike" shape. This might reflect the fact that interventions in the first layer might ablate the "Bob" label, and perhaps the spiky shapes reflect the sizes of the errors Bob made (we haven't checked this). Given this structure, it is surprising that the local outlier factor score with attribution based features does not perform better for the population dataset in layer 1, given that both local outlier factor and UMAP depend on the neighbourhoods of points.

Figure 13: PCA plot of population mean ablations, layer 1

Figure 14: UMAP plot of population mean ablations, layer 1

While clustering by the classes we have coloured is evident in layer 16 mean ablation effect estimates, it is far less defined than the clustering evident in layer 16 activations.

Figure 15: PCA plot of population mean ablations, layer 16

Figure 16: UMAP plot of population mean ablations, layer 16

By layer 28, clusters visible at layer 16 appear to have separated further. We also see the emergence of further structure that does not correspond to any of the classes we distinguish by colour.

Figure 17: PCA plot of population mean ablations, layer 28

Figure 18: UMAP plot of population mean ablations, layer 28

The probe shift method preserves the structure visible in layer 4 of the mean ablation UMAP plot throughout every layer of the network. This may be attributable to the possibility that there is an intervention in layer 1 that ablates the "Bob" label, and this has a significant effect on the accuracy of probes throughout the network.

Figure 19: PCA plot of population probe shift features, layer 4

Figure 20: UMAP plot of population probe shift features, layer 4

Figure 21: PCA plot of population probe shift features, layer 16

Figure 22: UMAP plot of population probe shift features, layer 16

Figure 23: PCA plot of population probe shift features, layer 28

Figure 24: UMAP plot of population probe shift features, layer 28

#### Sentiment

While `population` is a relatively easy dataset, `sentiment` is more challenging. Attribution based methods performed worse than chance on it, while activation based detectors could not reliably distinguish normal from anomalous examples.

Like the activations for `population`, layer 1 activations for `sentiment` form clusters unrelated to our labelling. We do not see much separation between trusted and test examples.