Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

2026-06-17 08:00

AI Summary

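This paper argues that the standard basis of Transformer hidden states already serves as a training-free, general-purpose feature basis. Each dimension encodes semantics in its sign (+/-1) and confidence in its magnitude, so it can act as an independent binary register. The claim is validated on seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Sign patterns alone preserve 60-93% top-5 next-token accuracy. From a single-token cache (one forward pass, no context, no labels), sign agreement detects 175 categories at AUC 0.97-0.99, and a trained probe adds only 0.018 AUC. The features can be manipulated causally: flipping their signs during the live forward pass suppresses the corresponding concept. The same structure appears in self-supervised vision (9/12 ImageNet superclasses), supervised vision (11/12), and audio (50/50 ESC-50 categories), which suggests it is a general property of Transformer training.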

Original abstract · untranslated

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
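
The abstract describes a feature as a subset of dimensions with a consistent sign pattern, read by counting sign agreements, and an intervention that flips those signs while keeping magnitudes. The following is a minimal sketch of that idea on synthetic vectors, not the authors' implementation. The function names, the `min_consistency` threshold, and the toy data generator are all assumptions made for illustration; real use would replace the synthetic vectors with cached hidden states from a model.

```python
import random

def sign(x):
    """Map a real activation to +1 or -1 (zero counts as +1 here)."""
    return 1 if x >= 0 else -1

def sign_pattern(vectors, min_consistency=0.9):
    """Return {dim: sign} for dimensions whose sign agrees across examples.

    min_consistency is a hypothetical threshold, not a value from the paper.
    """
    n = len(vectors)
    pattern = {}
    for j in range(len(vectors[0])):
        frac_pos = sum(1 for v in vectors if v[j] >= 0) / n
        if frac_pos >= min_consistency:
            pattern[j] = 1
        elif frac_pos <= 1 - min_consistency:
            pattern[j] = -1
    return pattern

def agreement_score(vector, pattern):
    """Fraction of the feature's dimensions whose sign matches the pattern."""
    if not pattern:
        return 0.0
    hits = sum(1 for j, s in pattern.items() if sign(vector[j]) == s)
    return hits / len(pattern)

def flip_feature(vector, pattern):
    """Negate only the feature's dimensions, leaving magnitudes unchanged."""
    return [-x if j in pattern else x for j, x in enumerate(vector)]

# Synthetic stand-in for cached hidden states (assumption: 64 dims, of which
# the first 16 carry a fixed sign pattern for the category of interest).
rng = random.Random(0)
DIM, FEATURE_DIMS = 64, 16
true_signs = [rng.choice((-1, 1)) for _ in range(FEATURE_DIMS)]

def sample(in_category):
    v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    if in_category:
        for j, s in enumerate(true_signs):
            v[j] = s * rng.uniform(0.1, 2.0)  # fixed sign, random magnitude
    return v

category = [sample(True) for _ in range(200)]
others = [sample(False) for _ in range(200)]

pattern = sign_pattern(category)
cat_scores = [agreement_score(v, pattern) for v in category]
other_scores = [agreement_score(v, pattern) for v in others]
flipped_score = agreement_score(flip_feature(category[0], pattern), pattern)

print("feature dims:", sorted(pattern))
print("mean score, category:", sum(cat_scores) / len(cat_scores))
print("mean score, others:", sum(other_scores) / len(other_scores))
print("score after sign flip:", flipped_score)
```

Nothing here is learned: the pattern is extracted by counting and the score is a per-dimension comparison, which mirrors the "no learned rotation" claim. The flip only shows the vector-level operation; whether it suppresses a concept depends on applying it inside a model's forward pass, which this sketch does not do.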

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
