# World Pilot: Steering Vision-Language-Action Models with World-Action Priors

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-10 08:00
- AIHOT score: 63
- AIHOT link: https://aihot.virxact.com/items/cmq8ws2lk06inslldx1n6uw83
- Original link: https://arxiv.org/abs/2606.12403

## AI Summary

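World Pilot is a vision-language-action (VLA) framework in which a World-Action Model (WAM) supplies two priors, a scene-evolution latent and an anticipated trajectory, which are injected into the decision chain through Latent Steering and Action Steering respectively. On the LIBERO-Plus zero-shot OOD benchmark it reaches a Total success rate of 84.7%, and it achieves the highest success rate across four real-robot manipulation tasks, with the largest advantages under shifts in viewpoint, geometry, deformable state, and pose.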

## Full Text

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
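
The data flow of the two pathways can be illustrated with a toy sketch. It is not the authors' implementation. Every name here (`WorldActionPriors`, `latent_steering`, `action_steering`) is hypothetical. Concatenation as the conditioning mechanism and additive residuals on top of the anticipated trajectory are assumptions made for simplicity, since the abstract does not specify either mechanism.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class WorldActionPriors:
    """Hypothetical container for the two priors attributed to the WAM."""
    scene_latent: Vector                  # scene-evolution latent
    anticipated_trajectory: List[Vector]  # one action vector per future step


def latent_steering(visual_features: Vector, scene_latent: Vector) -> Vector:
    """Condition perception features on the scene-evolution latent.

    Assumption: conditioning is plain concatenation, chosen for simplicity.
    """
    return list(visual_features) + list(scene_latent)


def action_steering(residuals: List[Vector], prior: List[Vector]) -> List[Vector]:
    """Use the anticipated trajectory as a motion prior.

    Assumption: the action generator predicts per-step residuals that are
    added to the prior trajectory.
    """
    if len(residuals) != len(prior):
        raise ValueError("residuals and prior must cover the same horizon")
    return [
        [p + r for p, r in zip(prior_step, residual_step)]
        for prior_step, residual_step in zip(prior, residuals)
    ]


# Toy values standing in for real model outputs.
priors = WorldActionPriors(
    scene_latent=[0.5],
    anticipated_trajectory=[[1.0, 2.0], [1.5, 2.5]],
)
conditioned = latent_steering([1.0, 2.0], priors.scene_latent)
actions = action_steering([[0.25, -0.5], [0.0, 0.5]], priors.anticipated_trajectory)
print(conditioned)  # [1.0, 2.0, 0.5]
print(actions)      # [[1.25, 1.5], [1.5, 3.0]]
```

In this sketch the perception side sees the scene latent alongside its visual features, while the action side only corrects a trajectory it is handed. That mirrors the abstract's split between an anticipated view of the scene and a trajectory-level motion hint.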