InstructSAM: A Multi-Instance Segmentation Framework for Arbitrary Instructions

2026-05-25 08:00 · 39 days ago

AI Summary

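This paper proposes InstructSAM, a unified framework for multi-instance segmentation under arbitrary instructions. The method formulates the problem as a set-structured query prediction task: learnable instance queries are injected into a vision-language model, and a hybrid-attention mechanism is designed to interact with SAM3, so that multi-instance segmentation is completed in a single forward pass. The paper also builds Inst2Seg, a large-scale instruction-based instance segmentation dataset and benchmark. Experiments show that InstructSAM, at only 2B scale, achieves strong performance on the relevant benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline.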

Original abstract · untranslated

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
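The abstract does not spell out the exact pattern of the hybrid-attention mechanism. As a rough illustration only, one plausible reading is that the context tokens (visual and instruction) keep ordinary causal attention, while the appended instance queries attend to all context tokens and to each other bidirectionally, which would let the queries coordinate and avoid duplicate predictions. The function name `build_hybrid_attention_mask`, the token layout, and the masking rule below are all assumptions made for this sketch, not details taken from the paper.

```python
from typing import List


def build_hybrid_attention_mask(num_context: int, num_queries: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): [context tokens] + [instance queries],
    where context tokens are the visual and instruction tokens.
    Assumed rule (hypothetical): context tokens attend causally; instance
    queries attend to every context token and to every instance query.
    """
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context token: causal, sees only itself and earlier tokens.
                mask[i][j] = j <= i
            else:
                # Instance query: sees the full context and all queries.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # 3 context tokens followed by 2 instance queries.
    for row in build_hybrid_attention_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch a `True` entry means attention is allowed. Because the query rows are fully unmasked, every instance slot can see what the other slots encode within the same forward pass, which is the property the abstract associates with better instance enumeration.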

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
