MultiWorld: Scalable Multi-Agent Multi-View Video World Models
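Source: arxiv.org
MultiWorld is a unified framework for multi-agent, multi-view video world modeling that moves beyond the limits of existing single-agent methods. It introduces a Multi-Agent Condition Module for precise control and a Global State Encoder to keep observations consistent across views. The system scales flexibly in the number of agents and views, and it can synthesize different views in parallel for efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks show that the model outperforms baselines in video fidelity, action-following ability, and multi-view consistency.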
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
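As a rough illustration of the interface the abstract describes (historical frames and current actions in, future frames out, with a variable number of agents and views), the following toy sketch may help. It is not the authors' implementation: the function names, tensor shapes, and the placeholder arithmetic are all assumptions made for illustration, and the two helper functions are only stand-ins for the roles that the Multi-Agent Condition Module and Global State Encoder are said to play.

```python
import numpy as np

def encode_global_state(history):
    """Hypothetical stand-in for a global state encoder.

    history: array of shape (V, T, H, W, C) holding T past frames for V views.
    Returns a (C,) vector pooled from the latest frame of every view.
    """
    return history[:, -1].mean(axis=(0, 1, 2))

def condition_on_agents(actions):
    """Hypothetical stand-in for multi-agent conditioning.

    actions: array of shape (A, D), one action vector per agent.
    Summing over agents accepts any agent count A.
    """
    return actions.sum(axis=0)

def predict_next_frames(history, actions):
    """Return one predicted frame per view, shape (V, H, W, C).

    Each view depends only on its own last frame plus the shared global
    state and action signal, so views could be computed in parallel.
    """
    global_state = encode_global_state(history)    # (C,)
    action_signal = condition_on_agents(actions)   # (D,)
    # Placeholder dynamics (assumption): a scalar brightness shift from actions.
    shift = 0.01 * action_signal.sum()
    last = history[:, -1]                          # (V, H, W, C)
    return 0.5 * last + 0.5 * global_state + shift

# Usage: 3 views, 4 history frames of 8x8 RGB, 2 agents with 5-dim actions.
history = np.zeros((3, 4, 8, 8, 3))
actions = np.ones((2, 5))
next_frames = predict_next_frames(history, actions)
print(next_frames.shape)
```

Because the agent axis is reduced before it reaches the frame computation and the view axis is carried through by broadcasting, the same function runs unchanged for other agent or view counts, which is the scaling property the abstract highlights.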