Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability
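This paper presents the first systematic study of the robustness of open-source LLM-based dense retrievers, evaluating them along two dimensions, generalizability and stability, on four benchmarks spanning 30 datasets. It finds that although instruction-tuned models perform well overall, models optimized for complex reasoning incur a "specialization tax" that limits their generalizability. Stability tests show that LLM-based retrievers are more robust to typos and corpus poisoning attacks than encoder-only baselines, but remain sensitive to semantic perturbations such as synonym substitution. Embedding geometry (e.g., angular uniformity) can predict lexical stability, and scaling up model size generally improves robustness.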
Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a "specialization tax," exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
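To make the notion of embedding uniformity concrete, the following is a minimal sketch of one common way to quantify how evenly a set of embeddings spreads over the unit hypersphere. The abstract does not define "angular uniformity", so this is an assumption: the sketch uses the uniformity measure of Wang and Isola (2020), the log of the mean Gaussian potential over all pairs of L2-normalized vectors, as a stand-in, and the paper's own metric may differ. The function names, the synthetic data, and the temperature `t=2.0` are illustrative choices, not taken from the paper or its repository.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def uniformity(embeddings, t=2.0):
    """Wang-Isola style uniformity (an assumed stand-in, see lead-in).

    Computes log(mean(exp(-t * ||u - v||^2))) over all distinct pairs of
    L2-normalized embeddings. The value is at most 0; lower values mean the
    vectors are spread more evenly over the hypersphere.
    """
    unit = [l2_normalize(v) for v in embeddings]
    n = len(unit)
    if n < 2:
        raise ValueError("need at least two embeddings")
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(unit[i], unit[j]))
            total += math.exp(-t * sq_dist)
            pairs += 1
    return math.log(total / pairs)


def random_embeddings(n, dim, rng, offset=0.0):
    """Synthetic Gaussian vectors; a large offset pushes them into a narrow cone."""
    return [[rng.gauss(0.0, 1.0) + offset for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
spread = random_embeddings(50, 16, rng)                   # roughly isotropic directions
collapsed = random_embeddings(50, 16, rng, offset=10.0)   # nearly parallel directions

spread_score = uniformity(spread)
collapsed_score = uniformity(collapsed)
print(f"spread:    {spread_score:.3f}")
print(f"collapsed: {collapsed_score:.3f}")
```

In this toy setup the isotropic set scores lower (more uniform) than the set squeezed into a narrow cone. With a real retriever, the inputs would be query or passage embeddings, and the resulting score could then be compared against a stability measure across models.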