# SOCO: A Benchmark for Semantic Object Correspondence in Vision Foundation Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmpwmljjs0499slsn9aqcljrr
- Original link: https://arxiv.org/abs/2605.31597

## AI Summary

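To evaluate how well vision foundation models understand object parts at a fine-grained level, the study proposes a new benchmark, SOCO. The benchmark establishes a taxonomy of semantic correspondence types, provides consistent keypoint annotations spanning 100 categories and more than 1 million correspondence pairs, and includes language descriptions of the keypoints to support the evaluation of large vision-language models. Experiments find that vision foundation models encode strong semantic structure but transfer correspondences poorly across related categories; that large vision-language models are better at text-prompted part localization than at visual-reference cross-image matching; and that correspondence performance predicts performance on dense downstream tasks such as segmentation and tracking more strongly than ImageNet classification does.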

## Main Text

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
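To make the idea of semantic correspondence evaluation concrete, here is a minimal sketch of one common way such a test can be scored. The abstract does not state SOCO's matching protocol or metric, so everything below is an assumption made for illustration: keypoints are matched by nearest-neighbor cosine similarity between feature vectors, and accuracy is a PCK-style score (a prediction counts as correct if it lands within `alpha` times the larger bounding-box side of the ground truth). The function names `match_keypoint` and `pck`, the toy features, and the value `alpha=0.1` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_keypoint(src_feat, tgt_feats):
    """Return the target location whose feature is most similar to src_feat.

    tgt_feats maps (x, y) locations in the target image to feature vectors.
    Nearest-neighbor matching is an assumed protocol, not taken from the paper.
    """
    return max(tgt_feats, key=lambda xy: cosine(src_feat, tgt_feats[xy]))


def pck(predicted, ground_truth, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * max(bbox_size) of the ground truth."""
    threshold = alpha * max(bbox_size)
    correct = sum(
        1 for p, g in zip(predicted, ground_truth) if math.dist(p, g) <= threshold
    )
    return correct / len(ground_truth)


# Toy data: two source keypoint features and three candidate target locations.
src_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tgt_feats = {
    (0, 0): [0.9, 0.1, 0.0],
    (10, 0): [0.0, 1.0, 0.1],
    (5, 5): [0.0, 0.0, 1.0],
}
ground_truth = [(0, 1), (4, 0)]  # annotated target locations (made up)

predicted = [match_keypoint(f, tgt_feats) for f in src_feats]
score = pck(predicted, ground_truth, bbox_size=(20, 10), alpha=0.1)
print(predicted, score)  # [(0, 0), (10, 0)] 0.5
```

In this toy case the first keypoint is matched within the 2-pixel threshold and the second is not, giving a score of 0.5. In practice the features would come from a vision backbone's dense feature map rather than hand-written vectors.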
