Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
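Source: arxiv.org
This paper proposes that the standard basis of Transformer hidden states already constitutes a training-free, general-purpose feature basis. Each dimension encodes semantics in its sign (+/-1) and confidence in its magnitude, and can act as an independent binary register. The claim is validated on seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST): sign patterns alone preserve 60-93% top-5 next-token accuracy; a single-token cache (one forward pass, no context, no labels) detects 175 categories by sign agreement at AUC 0.97-0.99, and a trained probe adds only 0.018 AUC. The features can be manipulated causally: flipping signs during the live forward pass suppresses the corresponding concept. The same structure holds for self-supervised vision (9/12 ImageNet superclasses), supervised vision (11/12), and audio (50/50 ESC-50 categories), reflecting a general property of Transformer training.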
We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
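The reading rule described in the abstract (a feature is a subset of dimensions with a consistent sign pattern, scored by counting sign agreements) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names `build_sign_feature`, `sign_agreement`, and `flip_feature`, the `min_consistency` threshold, and the toy vectors are hypothetical, and the abstract does not specify how the paper selects dimensions. The flip helper only edits a standalone vector; the paper applies the flip during a live forward pass.

```python
def sign(x):
    """Map a value to +1 or -1 (zero is treated as +1 in this sketch)."""
    return 1 if x >= 0 else -1


def build_sign_feature(states, min_consistency=0.9):
    """Select dimensions whose sign is consistent across example hidden states.

    `states` is a list of equal-length vectors for one concept.
    Returns a dict {dimension_index: expected_sign}.
    The threshold is a hypothetical choice, not taken from the paper.
    """
    n = len(states)
    feature = {}
    for d in range(len(states[0])):
        frac_pos = sum(1 for h in states if h[d] >= 0) / n
        if frac_pos >= min_consistency:
            feature[d] = 1
        elif 1 - frac_pos >= min_consistency:
            feature[d] = -1
    return feature


def sign_agreement(h, feature):
    """Fraction of the feature's dimensions where h carries the expected sign."""
    if not feature:
        return 0.0
    hits = sum(1 for d, s in feature.items() if sign(h[d]) == s)
    return hits / len(feature)


def flip_feature(h, feature):
    """Negate the feature's dimensions that currently agree, keeping magnitudes."""
    out = list(h)
    for d, s in feature.items():
        if sign(out[d]) == s:
            out[d] = -out[d]
    return out


# Toy hidden states for one hypothetical concept (4 dimensions).
concept_states = [
    [0.9, -1.2, 0.3, 2.0],
    [1.1, -0.4, -0.5, 1.0],
    [0.2, -2.0, 0.7, 0.1],
]
feature = build_sign_feature(concept_states, min_consistency=1.0)

query = [0.5, -0.5, 0.2, 0.5]
print("feature:", feature)
print("agreement before flip:", sign_agreement(query, feature))
print("agreement after flip:", sign_agreement(flip_feature(query, feature), feature))
```

The score uses only signs, so no rotation or probe is fitted; magnitudes are ignored when reading and preserved when flipping, which loosely mirrors the magnitude-matched intervention the abstract mentions.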