In-context Learning and Induction Heads

2022-03-08 00:00

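AI Summary

The research finds that "induction heads" in Transformer language models may be the main mechanism behind their in-context learning ability. An induction head is an internal circuit that recognizes and copies sequence patterns: it searches the sequence for an earlier occurrence of the current token and predicts the same token that followed it. Early in training, models go through a "phase change" during which induction heads form rapidly and in-context learning ability improves markedly. Six complementary lines of evidence, including architectural perturbation and direct ablation, suggest that this circuit not only exists in small models but may also be the core mechanism of in-context learning in large models. This mechanistic account offers a new route to understanding models' internal computation and to addressing safety problems systematically.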

Original text · untranslated

As Transformer generative models continue to scale and gain increasing real-world use, addressing their associated safety problems becomes increasingly important. Mechanistic interpretability, the attempt to reverse engineer the detailed computations performed by the model, offers one possible avenue for addressing these safety issues. If we can understand the internal structures that cause Transformer models to produce the outputs they do, then we may be able to address current safety problems more systematically, and to anticipate safety problems in future, more powerful models. Note that mechanistic interpretability is a subset of the broader field of interpretability, which encompasses many different methods for explaining the outputs of a neural network. Mechanistic interpretability is distinguished by a specific focus on trying to systematically characterize the internal circuitry of a neural net.

In the past, mechanistic interpretability has largely focused on CNN vision models, but recently, we presented some very preliminary progress on mechanistic interpretability for Transformer language models. Specifically, in our prior work we developed a mathematical framework for decomposing the operations of transformers, which allowed us to make sense of small (1- and 2-layer attention-only) models and give a near-complete account of how they function. Perhaps the most interesting finding was the induction head, a circuit whose function is to look back over the sequence for previous instances of the current token (call it A), find the token that came after it last time (call it B), and then predict that the same completion will occur again (e.g. forming the sequence [A][B] … [A] → [B]). In other words, induction heads "complete the pattern" by copying and completing sequences that have occurred before. Mechanically, induction heads in our models are implemented by a circuit of two attention heads: the first head is a "previous token head" which copies information from the previous token into the next token, while the second head (the actual "induction head") uses that information to find tokens preceded by the present token. For 2-layer attention-only models, we were able to show precisely that induction heads implement this pattern copying behavior and appear to be the primary source of in-context learning. (Note that induction heads don't occur in 1-layer models, because they require a composition of attention heads in different layers.)
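The [A][B] … [A] → [B] behavior described above can be illustrated with a small sketch. This is only a symbolic analogy for the two-step circuit, not the attention computation itself and not code from the original work; the function names `previous_token_info` and `induction_predict` are hypothetical, and exact token equality stands in for what a real head would do with learned query/key matching.

```python
from typing import List, Optional, Sequence


def previous_token_info(tokens: Sequence[str]) -> List[Optional[str]]:
    """Analogue of the "previous token head": each position records
    the token that came immediately before it (None at position 0)."""
    return [None] + list(tokens[:-1])


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Analogue of the "induction head": find the most recent position
    whose preceding token equals the current (last) token, and copy the
    token found at that position as the prediction."""
    if not tokens:
        return None
    current = tokens[-1]
    prev = previous_token_info(tokens)
    # Scan backwards so the most recent match wins (an assumption of
    # this sketch; a real head spreads attention over candidates).
    for pos in range(len(tokens) - 1, 0, -1):
        if prev[pos] == current:
            return tokens[pos]
    return None  # no earlier occurrence of the current token


if __name__ == "__main__":
    # [A][B] ... [A] -> [B]
    print(induction_predict(["A", "B", "C", "A"]))
```

The split into two functions mirrors the composition the text describes: the second step can only match on "tokens preceded by the present token" because the first step has already written previous-token information into each position.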

Anthropic: Transformer Circuits (interpretability research)

Read the original: transformer-circuits.pub
