MAOAM:统一对象与材质选择的视觉语言模型框架
阅读原文· arxiv.orgMAOAM是一个统一图像选择框架,通过文本或点击交互精确选择对象和材质。它利用视觉语言模型(VLM)与分割头生成像素级掩码。针对缺少带文本标注的材质选择数据集,作者提出可扩展的数据生成流水线:收集真实与合成图像及材质掩码,用VLM生成富含视觉语义的描述。模型以多任务目标同时训练点击与文本选择,并引入辅助VQA任务加深材质理解。实验表明,MAOAM在多种对象、材质和交互场景下实现准确连贯的选择,且推理时结合文本与点击可产生涌现式提升。
Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual-semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.