# Where Authorship Signals Emerge in Encoder Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-19 08:00
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpdu9wre071cslk10s7i722h
- Original link: https://arxiv.org/abs/2605.19908

## AI Summary

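The study finds that authorship attribution models fine-tuned with the same pretrained encoder, data, and loss function can differ in performance by up to four-fold solely because of their scoring mechanism. Using mechanistic interpretability tools, it traces the source of this gap: stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer of every model, so the gap does not stem from differences in representation quality. Causal intervention experiments show that the scorer determines at which layers the encoder consolidates authorship signal: mean pooling forces consolidation in the early to mid layers, while late interaction defers it to later layers. This difference arises from the distinct gradient structure of each scorer.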

## Main Text

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by the early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.
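
To make the two scoring mechanisms concrete, here is a minimal NumPy sketch that contrasts them on token-level embeddings. It is an illustration under assumptions, not the paper's implementation: late interaction is assumed to be a ColBERT-style MaxSim averaged over the tokens of the first text, similarities are assumed to be cosine, and the function names and random "embeddings" are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def mean_pool_score(tokens_a, tokens_b):
    """Cosine similarity between the mean-pooled vectors of two texts.

    tokens_a, tokens_b: arrays of shape (num_tokens, hidden_dim).
    """
    a = l2_normalize(tokens_a.mean(axis=0))
    b = l2_normalize(tokens_b.mean(axis=0))
    return float(a @ b)


def late_interaction_score(tokens_a, tokens_b):
    """Assumed MaxSim variant: match each token of A to its most similar
    token of B (cosine), then average those maxima over A's tokens."""
    a = l2_normalize(tokens_a)
    b = l2_normalize(tokens_b)
    sim = a @ b.T  # shape (len_a, len_b): all token-to-token similarities
    return float(sim.max(axis=1).mean())


# Hypothetical token embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
doc_a = rng.normal(size=(5, 8))  # 5 tokens, hidden size 8
doc_b = rng.normal(size=(7, 8))  # 7 tokens, hidden size 8

print("mean pooling    :", mean_pool_score(doc_a, doc_b))
print("late interaction:", late_interaction_score(doc_a, doc_b))
```

In this sketch the mean-pooled score sees the tokens only through their average, whereas the MaxSim score compares each token of the first text with its best match individually. That is one way to picture why the two scorers could send different training signal back to the encoder, though it should not be read as the paper's own derivation.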
