# Is Positional Bias in Dense Retrievers Built In, or Learned from Data?

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 51
- AIHOT link: https://aihot.virxact.com/items/cmpqhg9d60514slnoqukt036s
- Original link: https://arxiv.org/abs/2605.26578

## AI Summary

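This study examines the causes of positional bias in dense retrievers, focusing on how the positional distribution of evidence in training data affects it. The authors construct synthetic training sets in which the evidence appears at the beginning, middle, or end of documents, and fine-tune eight pretrained models with differing architectures. The experiments show that a skewed training distribution leads models to favor information at the corresponding position. On position-aware benchmarks, position-balanced training reduces positional sensitivity by 57–87% while retrieval performance remains competitive. Representation-level analyses suggest that fine-tuning can reshape a model's positional preferences, although tendencies inherent to pretraining or architecture persist in some models. The study identifies the positional distribution of training data as a major controllable factor in retrieval position bias, and balanced data curation as an effective mitigation strategy.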

## Main Text

Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval performance when the information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects retrieval-level bias direction. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: skewed training distributions favor evidence at the corresponding positions. Position-balanced training reduces positional sensitivity by 57–87% on position-aware benchmarks, with competitive mean retrieval performance in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify training-position distribution as a major controllable factor in retrieval-level position bias and suggest balanced data curation as a practical mitigation strategy.
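The abstract does not say how the position-targeted documents are assembled or how positional sensitivity is computed, so the following is only a minimal sketch under stated assumptions: evidence is a single sentence inserted among filler sentences, a training distribution is a set of sampling weights over the three positions, and sensitivity is taken to be the range (max minus min) of a retrieval score across evidence positions. All function names here are hypothetical and are not taken from the paper.

```python
import random

POSITIONS = ("beginning", "middle", "end")


def place_evidence(evidence, fillers, position):
    """Insert the evidence sentence among filler sentences at a named position.

    Assumption: sentence-level insertion; the paper may build documents differently.
    """
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position}")
    index = {
        "beginning": 0,
        "middle": len(fillers) // 2,
        "end": len(fillers),
    }[position]
    return " ".join(fillers[:index] + [evidence] + fillers[index:])


def build_training_set(examples, weights, seed=0):
    """Sample one evidence position per (query, evidence, fillers) example.

    `weights` maps a position name to a sampling weight, for example
    {"beginning": 1.0} for a skewed set or equal weights for a balanced one.
    """
    rng = random.Random(seed)
    names = list(POSITIONS)
    probs = [weights.get(name, 0.0) for name in names]
    rows = []
    for query, evidence, fillers in examples:
        position = rng.choices(names, weights=probs, k=1)[0]
        rows.append(
            {
                "query": query,
                "document": place_evidence(evidence, fillers, position),
                "position": position,
            }
        )
    return rows


def positional_sensitivity(score_by_position):
    """Assumed metric: range of a retrieval score across evidence positions."""
    values = list(score_by_position.values())
    return max(values) - min(values)


def relative_reduction(baseline, mitigated):
    """Fractional drop in sensitivity relative to a baseline model."""
    return (baseline - mitigated) / baseline


# Toy usage: one skewed and one balanced training set from the same examples.
fillers = ["Filler one.", "Filler two.", "Filler three.", "Filler four."]
examples = [("toy query", "The evidence sentence.", fillers)] * 6
skewed = build_training_set(examples, {"beginning": 1.0})
balanced = build_training_set(examples, {p: 1.0 for p in POSITIONS})
```

Under this reading, a figure such as the reported reduction in positional sensitivity would correspond to `relative_reduction` applied to the sensitivity of a skewed-trained model and that of a balanced-trained model, with the per-position scores coming from whichever retrieval metric the benchmark uses.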