# PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqiwbe8g04vssl5w9xrscs6h
- Original link: https://arxiv.org/abs/2606.18375

## AI Summary

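PAIWorld is a world foundation model built on a diffusion Transformer that addresses multi-view 3D inconsistency with three components: Geometry-Aware Cross-View Attention blocks establish explicit communication between views, Geometric Rotary Position Embedding encodes camera ray directions and extrinsic poses into the attention mechanism, and Latent 3D-REPA distills 3D-aware features from frozen 3D foundation models. It achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranks 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, and supports downstream applications such as model-based planning, world action models, and multi-view policy post-training.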

## Main Text

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
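The abstract says that camera ray directions and extrinsic poses are encoded into the attention mechanism, but it does not spell out how those quantities are obtained. As a minimal sketch, assuming a pinhole camera model and a 3x3 camera-to-world rotation, the per-pixel ray direction could be computed as below. The function names (`pixel_ray_direction`, `rotation_about_y`) and the intrinsics values are illustrative assumptions, not taken from PAIWorld, and the sketch covers only the geometric input, not the position embedding itself.

```python
import math


def pixel_ray_direction(u, v, fx, fy, cx, cy, rotation_c2w):
    """Return the unit ray direction in world coordinates for pixel (u, v).

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), plus a 3x3 camera-to-world rotation as nested lists.
    Hypothetical helper for illustration; not PAIWorld's implementation.
    """
    # Back-project the pixel onto the z = 1 plane of the camera frame.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate into the world frame: d_world = R * d_cam.
    d_world = [
        sum(rotation_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)
    ]
    # Normalize so that only the direction is kept.
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]


def rotation_about_y(angle_rad):
    """Rotation matrix about the y axis, used here as a stand-in extrinsic."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Two cameras looking through the same (center) pixel with different poses.
identity = rotation_about_y(0.0)
center_ray = pixel_ray_direction(320, 240, 500.0, 500.0, 320.0, 240.0, identity)
side_ray = pixel_ray_direction(
    320, 240, 500.0, 500.0, 320.0, 240.0, rotation_about_y(math.pi / 2)
)
```

The same pixel yields different world-space rays once the camera pose changes. That is the kind of view-dependent signal that plain token concatenation leaves implicit and that a geometry-aware position encoding can make explicit.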
