# InterleaveThinker：强化智能体交错生成管线

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-11 08:00
- AIHOT 分数：63
- AIHOT 链接：https://aihot.virxact.com/items/cmqac9xug0k4sslld8lugkikt
- 原文链接：https://arxiv.org/abs/2606.13679

## AI 摘要

InterleaveThinker 提出多智能体管线，通过规划智能体组织图像-文本输入序列、批评智能体评估生成结果并修正指令，使任意现有图像生成器具备交错生成能力。构建 Interleave-Planner-SFT-80k 和 Interleave-Critic-SFT-112k 数据集进行冷启动，并利用 GRPO 在 Interleave-Critic-RL-13k 上强化批评智能体的逐步指令修正。提出 accuracy reward 和 step-wise reward，使单步强化学习有效引导整个生成轨迹。在交错生成基准上性能与 Nano Banana 和 GPT-5 相当；在 4-step FLUX.2-klein 推理基准上，WISE 和 RISE 指标显著提升。

## 正文

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
