# ICALens：无需训练字典即可解读语言模型表示

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-10 08:00
- AIHOT 分数：66
- AIHOT 链接：https://aihot.virxact.com/items/cmq8ullyj05xsslldrvbbb3z8
- 原文链接：https://arxiv.org/abs/2606.11722

## AI 摘要

ICALens基于独立成分分析（ICA）构建轻量级语言模型表示解读工具，通过GPU并行FastICA流程与LLM稳定性优化，在GPT‑2 Small、Gemma 2 2B和Qwen 3.5 2B Base上高效恢复紧凑、可解释的方向，无需逐层梯度训练字典。在SAEBench上，ICA在稀疏探测任务中与公开SAE性能相当，并在中小预算目标探针扰动中优于SAE。结果表明ICA应被视为解读语言模型表示的高效互补首选透镜。

## 正文

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.