# Concreteness-Driven Contrastive Negative Mining for Compositional Understanding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-14 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo8hlee105i5slmledtgfn61
- Original link: https://arxiv.org/abs/2604.13313

## AI Summary

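Vision-language models often struggle with compositional reasoning, showing fragility in word order and attribute binding. This stems from a lack of informative samples for distinguishing subtle semantic variations during contrastive pretraining. The study establishes lexical concreteness as a determinant of negative-sample efficacy and proposes ConcretePlant, a method that systematically manipulates perceptually grounded concepts: modifying highly concrete terms produces pronounced structural differences. To address the gradient imbalance in InfoNCE, it proposes the Cement loss, a margin-based objective that ties psycholinguistic scores to sample difficulty and dynamically calibrates penalty strength. The integrated framework, Slipform, achieves state-of-the-art accuracy on compositional evaluation benchmarks, cross-modal retrieval, and linear probing tasks.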
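As a rough illustration of concreteness-driven negative construction, the sketch below swaps the most concrete word in a caption for a substitute. It is not ConcretePlant's actual procedure, which the summary does not specify: the rating table, its values, the swap table, and the function name `make_hard_negative` are all hypothetical and chosen only for this example.

```python
from typing import Dict, Optional

# Toy concreteness ratings on a 1-5 scale. The values are invented for
# illustration and are NOT taken from any published norm or from the paper.
TOY_CONCRETENESS: Dict[str, float] = {
    "dog": 4.9, "ball": 4.8, "red": 4.2, "chases": 3.1, "a": 1.5, "the": 1.4,
}

# Hypothetical substitution table; a real system would need a principled
# way to propose replacements.
TOY_SWAPS: Dict[str, str] = {"dog": "cat", "ball": "box", "red": "blue"}


def make_hard_negative(
    caption: str,
    scores: Dict[str, float],
    swaps: Dict[str, str],
) -> Optional[str]:
    """Replace the most concrete swappable token; return None if none exists."""
    tokens = caption.split()
    # Visit token positions from most to least concrete (unknown words score 0).
    order = sorted(
        range(len(tokens)),
        key=lambda i: scores.get(tokens[i].lower(), 0.0),
        reverse=True,
    )
    for i in order:
        replacement = swaps.get(tokens[i].lower())
        if replacement is not None:
            return " ".join(tokens[:i] + [replacement] + tokens[i + 1:])
    return None


negative = make_hard_negative(
    "a red dog chases the ball", TOY_CONCRETENESS, TOY_SWAPS
)
print(negative)
```

With these toy tables, "dog" has the highest rating, so it is the word that gets replaced. Function words such as "a" and "the" sit at the bottom of the ranking and are never chosen while a more concrete candidate is available.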

## Main Text

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of the InfoNCE objective further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately dominate the optimization process and restrict the capacity available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.
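The margin-based idea can be illustrated with a small hinge-style sketch. The abstract does not give the Cement formulation, so everything here is an assumption made for illustration: the linear mapping from a 1-5 concreteness score to a margin, the choice that more concrete edits get a larger margin, and the names `concreteness_margin` and `margin_loss`.

```python
def concreteness_margin(
    score: float,
    base_margin: float = 0.2,
    min_score: float = 1.0,
    max_score: float = 5.0,
) -> float:
    """Map a concreteness score to a margin in [0, base_margin].

    Assumption: a linear mapping in which more concrete edits receive a
    larger required separation. The paper may use a different mapping.
    """
    norm = (score - min_score) / (max_score - min_score)
    norm = min(1.0, max(0.0, norm))  # clamp out-of-range scores
    return base_margin * norm


def margin_loss(
    sim_pos: float,
    sim_neg: float,
    score: float,
    base_margin: float = 0.2,
) -> float:
    """Hinge penalty: zero once the positive beats the negative by the margin."""
    margin = concreteness_margin(score, base_margin)
    return max(0.0, margin - (sim_pos - sim_neg))


# An easy pair (large similarity gap) contributes nothing, while a hard
# pair (small gap) still receives a penalty.
easy = margin_loss(sim_pos=0.8, sim_neg=0.3, score=5.0)
hard = margin_loss(sim_pos=0.8, sim_neg=0.7, score=5.0)
print(easy, hard)
```

Unlike a softmax-based objective, the hinge term is exactly zero for pairs that are already well separated, so such pairs stop contributing gradient. That is one simple way to keep easy pairs from dominating the update, which is the imbalance the abstract attributes to InfoNCE.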
