# TVIR：面向文本-视觉交错报告生成的深度研究智能体构建

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-01 08:00
- AIHOT 分数：60
- AIHOT 链接：https://aihot.virxact.com/items/cmpwxbmhb01ljsl799h14lmqx
- 原文链接：https://arxiv.org/abs/2606.02320

## AI 摘要

针对现有深度研究系统以文本为中心、视觉元素可靠性与对齐性评估不足的问题，本文提出了TVIR框架，包括TVIR-Bench基准测试和TVIR-Agent多智能体框架。TVIR-Bench包含100个要求视觉元素服务于特定分析目标的多模态任务。TVIR-Agent采用分层多智能体设计，负责构建大纲、检索图像、生成可溯源图表并进行上下文感知写作。研究进一步开发了结合文本与视觉评估的双路径评估框架。对九个系统的实验表明，TVIR-Agent表现优异，凸显了多模态设计对于证据驱动报告生成的重要性。

## 正文

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text--Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.