# Toward Verifiable Multimodal Deep Research: A Multi-Agent Framework for Interleaved Report Generation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmpqjlfr105k1slnourh2sgma
- Original link: https://arxiv.org/abs/2605.29861

## AI Summary

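Large language models have advanced agents from deep search to deep research that can produce long-form reports. However, verifiable multimodal deep research remains challenging. To address this, the study proposes Ptah, a multi-agent framework. Through planning, research, and writing stages, it coordinates the full pipeline from user query to web report, with agents responsible for building plans, collecting evidence, and maintaining a visual memory. A verifier agent enforces factual grounding and cross-modal consistency throughout the pipeline. The study also introduces the PtahEval evaluation protocol. Experiments show that Ptah generates multimodal reports that are more reliable, more visually informative, and more usable than the baselines.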

## Main Text

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.
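To make the control flow concrete, the staged workflow with a verifier acting as an acceptance function might be sketched roughly as follows. This is only an illustration under stated assumptions, not Ptah's actual implementation: every name here (`ResearchState`, `Evidence`, `plan_stage`, `research_stage`, `write_stage`, `verify`, `run_pipeline`), the retry policy, and the `example.com` URLs are hypothetical, and the stage bodies are toy stand-ins for what would be LLM-driven agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Evidence:
    claim: str
    source_url: str
    image_ids: List[str] = field(default_factory=list)


@dataclass
class ResearchState:
    query: str
    plan: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    # Hypothetical stand-in for a visual working memory: image id -> source URL.
    visual_memory: Dict[str, str] = field(default_factory=dict)
    report: str = ""


def plan_stage(state: ResearchState) -> None:
    # Toy planner: two fixed sections derived from the query.
    state.plan = [f"Overview of {state.query}", f"Key figures for {state.query}"]


def research_stage(state: ResearchState) -> None:
    # Toy researcher: one claim and one source-aligned image per plan section.
    state.evidence, state.visual_memory = [], {}
    for i, section in enumerate(state.plan):
        url, image_id = f"https://example.com/{i}", f"img-{i}"
        state.visual_memory[image_id] = url
        state.evidence.append(Evidence(f"Claim for: {section}", url, [image_id]))


def write_stage(state: ResearchState) -> None:
    # Toy writer: interleave each claim with its citation and image references.
    lines = []
    for n, ev in enumerate(state.evidence, start=1):
        images = " ".join(f"![{i}]({i})" for i in ev.image_ids)
        lines.append(f"{ev.claim} [{n}]({ev.source_url}) {images}")
    state.report = "\n".join(lines)


def verify(state: ResearchState) -> List[str]:
    """Return a list of problems; an empty list means the state is accepted."""
    problems = []
    for ev in state.evidence:
        if not ev.source_url:
            problems.append(f"ungrounded claim: {ev.claim}")
        for image_id in ev.image_ids:
            if state.visual_memory.get(image_id) != ev.source_url:
                problems.append(f"image {image_id} not aligned with its source")
        if state.report and ev.source_url not in state.report:
            problems.append(f"missing citation: {ev.source_url}")
    return problems


Stage = Callable[[ResearchState], None]
DEFAULT_STAGES: Sequence[Stage] = (plan_stage, research_stage, write_stage)


def run_pipeline(query: str, stages: Sequence[Stage] = DEFAULT_STAGES,
                 max_retries: int = 1) -> ResearchState:
    state = ResearchState(query)
    for stage in stages:
        # Re-run a stage until the verifier accepts it or retries run out.
        for _ in range(max_retries + 1):
            stage(state)
            problems = verify(state)
            if not problems:
                break
        else:
            raise RuntimeError(f"{stage.__name__} rejected: {problems}")
    return state
```

Calling `run_pipeline("solar power")` returns a state whose `report` interleaves each claim with a citation and an image reference. The design choice worth noting is that the verifier runs after every stage rather than once at the end, which mirrors the abstract's description of checks applied throughout the workflow.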