# HeRA: Head-Wise Representation Alignment for Multimodal Large Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-22 08:00
- AIHOT score: 43
- AIHOT link: https://aihot.virxact.com/items/cmr0zbl2e006psldxygcw8sgw
- Original paper: https://arxiv.org/abs/2606.23885

## AI Summary

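HeRA performs cross-modal alignment at the level of individual attention heads. Grounded in the Platonic Representation Hypothesis, it defines a contrastive loss based on the Mutual K-Nearest Neighbor (MKNN) metric, which serves as a differentiable proxy for matching local topological structure. During training, it selects the attention heads with the lowest MKNN alignment scores for alignment, and finds that the least aligned heads yield the largest gains. Evaluations across multiple MLLMs and 18 benchmarks show that HeRA consistently improves performance on vision-centric tasks and effectively mitigates visual hallucinations by naturally curbing over-reliance on linguistic priors. The code has been open-sourced.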

## Main Text

Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.
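
To make the head-selection step concrete, here is a minimal sketch of how a mutual k-nearest-neighbor alignment score could be computed between one attention head's features and a vision encoder's features, and how the least aligned heads might then be picked. This is not the authors' implementation: the function names (`knn_indices`, `mknn_score`, `least_aligned_heads`), the use of cosine similarity, and the assumption that each head yields one pooled feature vector per sample are all illustrative choices. The contrastive training objective itself is not shown, since the abstract does not specify its form.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k nearest neighbors of each row.

    Assumption: neighbors are ranked by cosine similarity, and rows are
    non-zero so that normalization is well defined.
    """
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mknn_score(a: np.ndarray, b: np.ndarray, k: int) -> float:
    """Average fraction of k-nearest neighbors shared between two spaces.

    `a` and `b` hold features of the same n samples (row i in both refers
    to sample i); their feature dimensions may differ.
    """
    neighbors_a = knn_indices(a, k)
    neighbors_b = knn_indices(b, k)
    overlap = [len(set(ra) & set(rb)) for ra, rb in zip(neighbors_a, neighbors_b)]
    return float(np.mean(overlap)) / k


def least_aligned_heads(head_feats, vision_feats, k, num_heads):
    """Score every head against the vision features and return the indices
    of the `num_heads` heads with the lowest scores, plus all scores."""
    scores = np.array([mknn_score(h, vision_feats, k) for h in head_feats])
    return np.argsort(scores)[:num_heads], scores
```

A score of 1 means both spaces agree on every sample's neighborhood, while a score near 0 means they share almost no neighbors. In this sketch, the heads returned by `least_aligned_heads` would be the ones targeted by the alignment objective, matching the abstract's finding that the least aligned heads benefit most.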
