Feature Stability in Sparse Autoencoders: Unstable Features and Reproducible Subspaces
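Source: arxiv.org
Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their usefulness depends on whether features are reproducible across training runs. The authors quantify this with feature stability, the probability that a given feature reappears in an independently trained SAE. Large-scale experiments show that stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak individual impact and are dominated by low-frequency surface-form triggers, both in their activation statistics and in their automatic explanations. Geometrically, unstable features concentrate in reproducible lower-rank subspaces, indicating that seed dependence reflects basis ambiguity within a shared region of activation space rather than pure noise. Pooling unique features across seeds yields more stable SAEs while preserving explained variance.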
Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
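The two notions in the abstract, per-feature stability and subspace-level reproducibility, can be made concrete with a small sketch. The abstract does not say how similarity is measured, so everything here is an assumption made for illustration: features are compared by cosine similarity of their decoder directions, a match requires a similarity of at least 0.8, and subspace agreement is scored by the mean squared cosine of the principal angles. The function names `feature_stability` and `subspace_overlap` are hypothetical and not taken from the paper.

```python
import numpy as np


def unit_rows(w):
    """Scale each row (one feature direction) to unit L2 norm."""
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def feature_stability(reference, others, threshold=0.8):
    """Fraction of other runs that contain a match for each reference feature.

    reference: (n_features, d) decoder directions of one SAE.
    others: list of (m_features, d) arrays from independently seeded SAEs.
    threshold: assumed cosine-similarity cutoff for calling two features similar.
    """
    ref = unit_rows(reference)
    hits = np.zeros(ref.shape[0])
    for other in others:
        # Best cosine similarity of each reference feature to any feature in this run.
        best = (ref @ unit_rows(other).T).max(axis=1)
        hits += best >= threshold
    return hits / len(others)


def subspace_overlap(a, b):
    """Mean squared cosine of the principal angles between two row spaces.

    a, b: (k, d) arrays whose rows are linearly independent, with k <= d.
    Returns 1.0 for identical subspaces and 0.0 for orthogonal ones.
    """
    qa, _ = np.linalg.qr(a.T)  # orthonormal basis of span(rows of a)
    qb, _ = np.linalg.qr(b.T)  # orthonormal basis of span(rows of b)
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(cosines ** 2))
```

As a toy case of basis ambiguity, take a reference dictionary `np.eye(4)` and a second run that keeps the first two directions but rotates the last two by 45 degrees within their shared plane. `feature_stability` then scores the first two features 1.0 and the last two 0.0, while `subspace_overlap` on the last two rows of each dictionary is 1.0 up to floating-point error: the individual features do not reappear, but the subspace they span does.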