# PixelEyes：解耦感知与推理实现精准视觉证据定位

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-30 08:00
- AIHOT 分数：45
- AIHOT 链接：https://aihot.virxact.com/items/cmr3fdmun00d9sllxujnwdc3x
- 原文链接：https://arxiv.org/abs/2607.00115

## AI 摘要

PixelEyes是一种多轮视觉推理智能体，通过显式解耦推理与感知解决MLLMs因定位不准导致的冗余轨迹问题。推理器决定查找目标，专用感知工具采用掩码引导视觉搜索（Mask-guided Visual Search）和语义区域广度优先搜索（Semantic-region BFS）提供精确定位，消除重复裁剪错误子区域的循环。基于PixelEyes-6K数据集训练，并引入Pinpoint-Bench零提示视觉搜索基准，用于分离定位失败与推理失败。代码和模型已开源。

## 正文

This paper explores multi-turn visual reasoning and observes that MLLMs repeatedly fail to localize the target, leading to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model, the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory. To solve this problem, we propose PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception, i.e., the reasoner decides what to look for, while a specialized perception tool answers where it is. Specifically, PixelEyes introduces 1) Mask-guided Visual Search. A referring segmentation model is invoked to provide mask-precise localization, freeing the reasoner from the need to compensate for imprecise grounding. 2) Semantic-region Breadth-first Search (BFS). To eliminate redundant loops caused by repeatedly cropping incorrect sub-regions, we organize exploration as a breadth-first search over semantic regions. To internalize these capabilities, we construct the PixelEyes-6K dataset by resynthesizing expert trajectories from existing data. This explicitly embeds our mask-guided search and BFS logic into the model. We further introduce Pinpoint-Bench, a zero-hint visual search benchmark, i.e., no location cues are provided in the question, with instance-level masks and bounding boxes that separate localization failures from reasoning failures, enabling fine-grained analysis of failure modes such as inattentional blindness. Recent state-of-the-art MLLMs and visual reasoning agents leave large headroom on Pinpoint-Bench, demonstrating its quality and difficulty. Code and models are open-sourced.
