# Circuits Updates - September 2025

- Source: Anthropic: Transformer Circuits (interpretability research)
- Published: 2025-09-15 08:00
- AIHOT score: 73
- AIHOT tag: Featured
- AIHOT link: https://aihot.virxact.com/items/cmoegbh73006fslxxt136v8n1
- Original link: https://transformer-circuits.pub/2025/september-update/index.html

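## Why it was featured

Reveals a mechanism by which language models deepen their understanding as context accumulates, supporting progress in interpretability research.

## AI summary

In its monthly update, Anthropic's interpretability team shares a new finding about cross-lingual representations in large language models. The similarity of active features across languages, measured by intersection-over-union (IoU), rises as the text sample gets longer. Comparing the first and last sentences of paired English/French paragraphs, the team found that the last sentence has a markedly higher IoU than the first, while for unrelated text the first sentences overlap more than the last sentences. This suggests that the model builds a richer cross-lingual understanding over longer contexts, rather than the IoU score being dominated by spurious activations. The findings support the view that models deepen their semantic representations as context accumulates.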

## Full text

Transformer Circuits Thread

Circuits Updates - September 2025

In these monthly updates we report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

New Posts

Features & In-context learning

Adam Jermyn, Wes Gurnee; edited by Joshua Batson

In "On the Biology of a Large Language Model", we studied how similar text translated into different languages is represented in language models. We found that:

- Larger models represent different languages more similarly than smaller ones;
- The similarity is greatest near the middle layers of the model;
- More closely related languages are represented more similarly than more distantly related ones.

Recently, we revisited those results and found a curious phenomenon: the similarity in active features (as measured by intersection-over-union, IoU) increases with increasing sample length.
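To make the metric concrete, here is a minimal sketch of an IoU over active features. The post does not specify how "active" is defined, so the any-token rule, the `threshold` parameter, and the function names `active_features` and `feature_iou` are assumptions for illustration, and the activation values are made up.

```python
import numpy as np


def active_features(activations, threshold=0.0):
    """Indices of features that exceed `threshold` on at least one token.

    `activations` is a (num_tokens, num_features) array. The any-token rule
    and the threshold are assumptions made for this sketch.
    """
    acts = np.asarray(activations, dtype=float)
    fired = (acts > threshold).any(axis=0)
    return set(np.flatnonzero(fired).tolist())


def feature_iou(features_a, features_b):
    """Intersection-over-union of two sets of active feature indices."""
    union = features_a | features_b
    if not union:
        return 0.0  # convention chosen here: two empty sets have zero overlap
    return len(features_a & features_b) / len(union)


# Toy activations: 2 tokens x 4 features per language (made-up numbers).
english = [[0.0, 1.2, 0.0, 2.0],
           [0.0, 0.0, 0.0, 0.7]]
french = [[0.0, 0.5, 1.1, 0.0],
          [0.0, 0.0, 0.0, 0.0]]

iou = feature_iou(active_features(english), active_features(french))
print(f"IoU = {iou:.3f}")
```

In this toy case the English sample activates features 1 and 3 and the French sample activates features 1 and 2, so one feature is shared out of three in the union.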

The upward trend with sample length suggests that either:

1. The IoU score is being dominated by spurious activations (e.g. where a feature fires with small activations because the model has low confidence, or where the feature fires because of imperfections in our dictionaries) or else activations unrelated to the meaning of the sentences (e.g. features relating to position-in-text, which would be present regardless of any overlap in meaning).

2. The model takes some time in the context to build up a representation and builds up a fuller representation across languages in longer contexts. This is almost certainly true in the first few tokens, but would have to continue occurring on longer horizons to explain the observed effect.

On priors we suspected explanation (2) because so many of the original findings aligned with our intuitions about models and language. But, just to be sure, we decided to investigate further.

We calculated the IoU score for the first and last sentences in paired English/French paragraphs. If (1) is occurring these should be similar, since sentence length does not, on average, vary across the context, whereas if (2) is occurring the last sentence should show a higher IoU score than the first sentence.

The differences between these are shown below. The distribution is skewed strongly to the right, meaning that the final sentence has a higher IoU score than the first. This is what explanation (2) predicts and is contrary to (1). We think this is evidence in favor of the theory that the model just has a richer understanding later in the context.
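The first-versus-last comparison can be sketched as follows. This is an illustration under stated assumptions, not the team's code: the dictionary keys (`en_first`, `fr_first`, `en_last`, `fr_last`), the helper names, and the feature sets are hypothetical, and the difference is taken as last minus first so that positive values mean the final sentence overlaps more.

```python
from statistics import mean


def iou(a, b):
    """Intersection-over-union of two sets of active feature indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def last_minus_first(paragraphs):
    """Per-paragraph IoU difference: last sentence minus first sentence.

    Each paragraph is a dict of four feature-index sets; the key names
    are hypothetical and chosen only for this sketch.
    """
    return [
        iou(p["en_last"], p["fr_last"]) - iou(p["en_first"], p["fr_first"])
        for p in paragraphs
    ]


def summarize(diffs):
    """Mean difference and the share of paragraphs with a positive one."""
    return {
        "mean": mean(diffs),
        "frac_positive": sum(d > 0 for d in diffs) / len(diffs),
    }


# Made-up feature sets for two paired English/French paragraphs.
paragraphs = [
    {"en_first": {1, 2}, "fr_first": {2, 3},
     "en_last": {4, 5, 6}, "fr_last": {4, 5, 7}},
    {"en_first": {1}, "fr_first": {1, 9},
     "en_last": {3, 8}, "fr_last": {3, 8}},
]

diffs = last_minus_first(paragraphs)
summary = summarize(diffs)
print(diffs, summary)
```

Under explanation (1) the differences would be expected to centre on zero. Under explanation (2) they would sit mostly above zero, which is the pattern the text reports.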

We also studied a baseline for this experiment where we take unrelated samples from the two languages. That is, we compare unrelated first sentences from English with unrelated first sentences from French, and likewise for last sentences. If (1) is occurring these should again be similar, whereas if (2) is occurring we should expect more overlap between unrelated first sentences than unrelated last sentences.

The distribution, shown below, skews to the left, meaning there's more overlap between unrelated first sentences than between unrelated last sentences. We think this is again consistent with a story where the scaling with context length is about the model developing a richer understanding.
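One way such a baseline could be constructed is sketched below. The post does not say how unrelated samples were chosen, so pairing English item i with French item i+1 is an assumption made here only to guarantee that no pair is a translation. The function names and feature sets are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two sets of active feature indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mismatched_pairs(english, french):
    """Pair English item i with French item i+1 (wrapping around).

    The rotation is an assumed scheme for this sketch; it ensures no
    English sentence is compared with its own translation.
    """
    n = len(english)
    if n < 2 or n != len(french):
        raise ValueError("need two equally sized lists with at least 2 items")
    return [(english[i], french[(i + 1) % n]) for i in range(n)]


def baseline_differences(en_first, fr_first, en_last, fr_last):
    """Last-minus-first IoU differences computed on unrelated pairs."""
    first = [iou(a, b) for a, b in mismatched_pairs(en_first, fr_first)]
    last = [iou(a, b) for a, b in mismatched_pairs(en_last, fr_last)]
    return [l - f for f, l in zip(first, last)]


# Made-up feature sets: first sentences share a generic feature (index 1),
# last sentences are specific to their own paragraph.
en_first = [{1, 2}, {1, 3}]
fr_first = [{1, 2}, {1, 3}]
en_last = [{10, 11}, {20, 21}]
fr_last = [{10, 11}, {20, 21}]

diffs = baseline_differences(en_first, fr_first, en_last, fr_last)
print(diffs)
```

With these toy inputs the differences are negative, since the unrelated first sentences share one generic feature while the unrelated last sentences share none. The numbers only illustrate the sign convention and are not results from the post.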