# ProMSA: A Progressive Multimodal Search Agent for Knowledge-Based Visual Question Answering

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-26 08:00
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmqynkmvl0056slo1v7pkvnzo
- Original link: https://arxiv.org/abs/2606.27974

## AI Summary

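ProMSA is a progressive multimodal search agent for knowledge-based visual question answering (KB-VQA). Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and a deduplication mechanism. Training first uses rejection-sampling SFT to learn valid tool-use formats, then optimizes with TN-GSPO, a sequence-level RL objective that normalizes updates by generation length and tool-interaction depth. On the E-VQA and InfoSeek benchmarks, ProMSA consistently outperforms strong RAG and agent baselines, improving both retrieval and end-to-end accuracy. The code is open source.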

## Main Text

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with deduplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines, and improved retrieval and end-to-end accuracy. The code is available at https://github.com/DingWu1021/Promsa.
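
The search loop described above (choose image search, text search, or stop, under per-tool budgets and with deduplication) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SearchState`, `run_agent`, and `scripted_policy`, the stub tools, the `max_steps` cap, and the choice to skip duplicate or over-budget calls without charging the budget are all assumptions made for illustration. In the paper the policy is a trained multimodal model; here it is a scripted stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Action = Tuple[str, str]  # (tool name or "stop", query string)


@dataclass
class SearchState:
    """Hypothetical episode state; field names are illustrative only."""
    question: str
    image_id: str
    evidence: List[str] = field(default_factory=list)
    seen_calls: Set[Tuple[str, str]] = field(default_factory=set)
    seen_docs: Set[str] = field(default_factory=set)
    calls_used: Dict[str, int] = field(default_factory=dict)


def run_agent(
    policy: Callable[[SearchState], Action],
    tools: Dict[str, Callable[[str], List[str]]],
    budgets: Dict[str, int],
    state: SearchState,
    max_steps: int = 8,
) -> SearchState:
    """Iterate until the policy stops or max_steps is reached."""
    for _ in range(max_steps):
        tool, query = policy(state)
        if tool == "stop" or tool not in tools:
            break  # assumption: an unknown tool name ends the episode
        key = (tool, query.strip().lower())
        if key in state.seen_calls:
            continue  # duplicate call: skipped, not charged to the budget
        if state.calls_used.get(tool, 0) >= budgets.get(tool, 0):
            continue  # budget for this tool is exhausted
        state.seen_calls.add(key)
        state.calls_used[tool] = state.calls_used.get(tool, 0) + 1
        for doc in tools[tool](query):
            if doc not in state.seen_docs:  # drop already-retrieved results
                state.seen_docs.add(doc)
                state.evidence.append(doc)
    return state


def scripted_policy(actions: List[Action]) -> Callable[[SearchState], Action]:
    """Stand-in for the learned policy: replays a fixed action list."""
    it = iter(actions)
    return lambda state: next(it, ("stop", ""))


# Stub tools returning fixed document IDs (hypothetical data).
tools = {
    "image_search": lambda q: ["doc_a", "doc_b"],
    "text_search": lambda q: ["doc_b", "doc_c"],
}
budgets = {"image_search": 1, "text_search": 2}
script = [
    ("image_search", "img_001"),
    ("text_search", "Who built it?"),
    ("text_search", "who built it? "),  # duplicate after normalization
    ("text_search", "height"),
    ("text_search", "architect"),       # exceeds the text_search budget
    ("stop", ""),
]
final = run_agent(
    scripted_policy(script),
    tools,
    budgets,
    SearchState(question="Who built it?", image_id="img_001"),
)
```

In this toy run the duplicate query and the over-budget query are both skipped, and overlapping results from the two tools are stored only once in `final.evidence`. Whether a skipped call should still consume budget is a design choice the abstract does not specify.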