SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges

2026-05-25 08:00 · 39 days ago

AI Summary

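The paper proposes SemBridge, an embedding initialization method designed for cross-lingual adaptation of sparse encoders. The method uses multilingual bridge models to establish semantic alignments between the source-language and target-language vocabularies. It initializes each target-language token from a small set of semantically related source-language tokens, which filters out semantic noise, accelerates convergence during fine-tuning, and improves training efficiency. Extensive experiments across five languages and four sparse architectures show that SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning, offering a practical solution for deploying high-performance sparse retrieval systems in multilingual settings.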

Original abstract · untranslated

Sparse encoders offer high-precision retrieval by representing term importance within a vocabulary space, yet their English-centric structures pose a critical impediment to language transfer for non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method designed for cross-lingual adaptation in sparse encoders by leveraging multilingual bridge models. SemBridge establishes semantic alignments between source and target vocabularies using multilingual dense embeddings as a bridge. Rather than directly relying on all source tokens, SemBridge selects a small set of semantically related source-language tokens and uses them to initialize each target-language token, effectively filtering out semantic noise and reconstructing target tokens as precise linear combinations of core synonyms. This accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures demonstrate that SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning compared to existing baselines. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
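
The initialization step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how source tokens are selected or weighted, so the top-k cosine-similarity selection, the softmax weighting, and the names `init_target_embeddings`, `k`, and `temperature` are all assumptions made for this example.

```python
import numpy as np


def init_target_embeddings(src_bridge, tgt_bridge, src_sparse, k=5, temperature=0.1):
    """Initialize target-token embeddings from a few related source tokens.

    src_bridge: (S, B) bridge-model (multilingual dense) embeddings of source tokens.
    tgt_bridge: (T, B) bridge-model embeddings of target tokens.
    src_sparse: (S, D) sparse-encoder input embeddings of the source tokens.
    k, temperature: hypothetical hyperparameters, not taken from the paper.
    Returns a (T, D) array of initial target-token embeddings.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    src_n = src_bridge / np.linalg.norm(src_bridge, axis=1, keepdims=True)
    tgt_n = tgt_bridge / np.linalg.norm(tgt_bridge, axis=1, keepdims=True)
    sim = tgt_n @ src_n.T  # (T, S) similarities in the bridge space

    # Keep only the k most similar source tokens per target token;
    # every other source token is dropped (the "noise filtering" step).
    top_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (T, k)
    top_sim = np.take_along_axis(sim, top_idx, axis=1)     # (T, k)

    # Assumed weighting: softmax over the selected similarities.
    logits = top_sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each target embedding is a linear combination of k source embeddings.
    return np.einsum("tk,tkd->td", weights, src_sparse[top_idx])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src_bridge = rng.normal(size=(100, 16))  # toy stand-ins for real embeddings
    tgt_bridge = rng.normal(size=(20, 16))
    src_sparse = rng.normal(size=(100, 32))
    init = init_target_embeddings(src_bridge, tgt_bridge, src_sparse, k=5)
    print(init.shape)
```

Restricting the combination to a handful of neighbors, instead of averaging over the whole source vocabulary, is what keeps unrelated source tokens from contributing to a target token's starting point. The toy arrays in the usage section stand in for embeddings that would, in practice, come from a multilingual dense model and from the sparse encoder's own embedding table.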

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
