Geometric Action Model (GAM) for Robot Policy Learning
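Source: arxiv.org
GAM (Geometric Action Model) is a language-conditioned manipulation policy that directly uses a pretrained geometric foundation model (GFM) as a shared substrate. It splits the GFM at an intermediate layer: the shallow layers act as the observation encoder, a causal future predictor inserted at the split forecasts future latent tokens, and the remaining GFM blocks decode them. This design gives the GFM language-conditioned temporal world modeling with minimal architectural changes while preserving its rich geometric priors. On simulated and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.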
Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
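The data flow described in the abstract (encode with the shallow blocks, predict future latents at the split, decode with the remaining blocks) can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`make_toy_block`, `causal_predict`, `SplitBackbonePolicy`) is hypothetical, the "blocks" are simple scaling functions standing in for GFM transformer blocks, the predictor is a plain average over past frames, and a single vector `cond` stands in for the fused language, proprioception, and action-history conditioning.

```python
from typing import Callable, List

Token = List[float]                           # one latent token (toy: a short float vector)
Block = Callable[[List[Token]], List[Token]]  # one backbone block: tokens in, tokens out


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for one GFM block: scales every token element."""
    def block(tokens: List[Token]) -> List[Token]:
        return [[scale * x for x in tok] for tok in tokens]
    return block


def run_blocks(blocks: List[Block], tokens: List[Token]) -> List[Token]:
    """Apply a stack of blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def causal_predict(history: List[List[Token]], cond: Token) -> List[Token]:
    """Toy causal predictor (assumption, not the paper's model).

    Forecasts the next frame's latent tokens using only past frames:
    the per-token mean over the history, shifted by the conditioning vector.
    """
    n_frames = len(history)
    n_tokens = len(history[0])
    dim = len(history[0][0])
    return [
        [sum(frame[i][d] for frame in history) / n_frames + cond[d] for d in range(dim)]
        for i in range(n_tokens)
    ]


class SplitBackbonePolicy:
    """One backbone, split at `split`: shallow blocks encode, the rest decode."""

    def __init__(self, blocks: List[Block], split: int) -> None:
        self.encoder = blocks[:split]   # shallow layers: observation encoder
        self.decoder = blocks[split:]   # remaining layers: propagation and decoding

    def step(self, obs_history: List[List[Token]], cond: Token) -> List[Token]:
        # 1) Encode each past observation with the shallow blocks.
        latents = [run_blocks(self.encoder, obs) for obs in obs_history]
        # 2) Predict future latent tokens at the split layer.
        future = causal_predict(latents, cond)
        # 3) Route the predicted tokens through the remaining blocks.
        return run_blocks(self.decoder, future)


# Usage: three toy blocks, split after the first one.
policy = SplitBackbonePolicy(
    [make_toy_block(2.0), make_toy_block(3.0), make_toy_block(0.5)], split=1
)
decoded = policy.step(obs_history=[[[1.0, 2.0]], [[3.0, 4.0]]], cond=[0.5, -0.5])
print(decoded)
```

In a real system the decoded tokens would feed separate geometry and action heads; the sketch stops at the decoded features because the abstract does not specify how those heads are built.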