# Visual-Seeker: A Visual-Native Multimodal Deep Search Agent via Active Visual Reasoning

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-13 08:00
- AIHOT score: 54
- AIHOT link: https://aihot.virxact.com/items/cmqi29vj504vmslf01nxgnews
- Original link: https://arxiv.org/abs/2606.15231

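## AI Summary

Visual-Seeker is a visual-native multimodal deep search agent. Instead of treating vision as a static input, it uses active visual reasoning to dynamically collect fine-grained visual evidence and complete multi-hop, cross-modal searches. The researchers designed an active visual reasoning data pipeline and synthesized 5K high-quality multimodal trajectories for model training. On five challenging multimodal search benchmarks, Visual-Seeker achieves state-of-the-art performance and even surpasses some proprietary models. The code and dataset have been open-sourced.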

## Main Text

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by using external tools, the visual-native search paradigm remains underexplored. Existing methods rely primarily on simple images with explicit semantics and on text-only evidence trajectories, which limits the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, which validates robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.
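
The abstract does not specify the agent's tool interface, so the following is only a toy sketch of what "treating vision as something to query rather than a static input" could look like in code. Every name here (`crop`, `Trajectory`, `run_agent`, the `"crop"`/`"search"`/`"answer"` action strings, and the scripted policy) is a hypothetical placeholder, not taken from the Visual-Seeker repository; in a real system the policy would be an MLLM and the search function a web tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-region of `image` covered by `box`."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


@dataclass
class Trajectory:
    """Interleaved record of the visual and textual evidence gathered so far."""
    steps: List[Tuple[str, object]] = field(default_factory=list)


def run_agent(
    image: Image,
    policy: Callable[[Image, Trajectory], Tuple[str, object]],
    search: Callable[[str], str],
    max_steps: int = 8,
) -> Tuple[Optional[str], Trajectory]:
    """Let the policy alternate between looking closer and searching.

    Hypothetical action set (an assumption, not the paper's):
      ("crop", box)     -> store a zoomed-in region as visual evidence
      ("search", query) -> store the search result as textual evidence
      ("answer", text)  -> stop and return the final answer
    """
    traj = Trajectory()
    for _ in range(max_steps):
        action, arg = policy(image, traj)
        if action == "crop":
            traj.steps.append(("visual", crop(image, arg)))
        elif action == "search":
            traj.steps.append(("text", search(arg)))
        elif action == "answer":
            return arg, traj
        else:
            raise ValueError(f"unknown action: {action!r}")
    return None, traj  # step budget exhausted without an answer


def scripted_policy(image: Image, traj: Trajectory) -> Tuple[str, object]:
    """Stand-in for a model: crop first, then search, then answer."""
    if len(traj.steps) == 0:
        return "crop", (0, 1, 2, 3)
    if len(traj.steps) == 1:
        return "search", "object shown in the cropped region"
    return "answer", "placeholder answer"


def stub_search(query: str) -> str:
    """Stand-in for a web search tool; returns a canned string."""
    return f"results for: {query}"


toy_image = [[0, 9, 9], [0, 9, 9], [0, 0, 0]]
answer, trajectory = run_agent(toy_image, scripted_policy, stub_search)
```

The point of the sketch is the shape of the trajectory: visual evidence (crops) and textual evidence (search results) end up in one interleaved record, in contrast to the text-only evidence trajectories the abstract attributes to existing methods.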
