DynaFLIP: Rethinking Robot Perception through Tri-Modal Dynamics-Guided Representations
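Source: arxiv.org
DynaFLIP is a dynamics-aware multimodal pre-training framework that moves motion understanding upstream into the perception stage. It builds image-language-3D flow triplets from heterogeneous human and robot videos and uses them as training supervision. With simplex-volume minimization as its core idea, it combines a cosine regularizer and a contrastive objective to align a single image encoder in a shared hyperspherical space. Analyses show that the model focuses on the control-relevant regions that are critical for robot-arm manipulation. The resulting visual representations serve as a reusable backbone and outperform baselines across diverse downstream policies, including vision-language-action models (VLAs), with gains of up to +22.5% in out-of-distribution scenarios.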
Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
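The abstract does not give the exact loss, so the following is only a minimal NumPy sketch of how a simplex-volume term, a cosine regularizer, and a contrastive term might be combined. Everything here is an assumption made for illustration: the function names, the use of the Gram determinant of the three unit embeddings (the simplex formed with the origin) as the volume, the mean of `1 - cos` over the three pairs as the regularizer, an InfoNCE loss as the contrastive objective, and the weights and temperature. It should not be read as the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Project embeddings of shape (B, D) onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def simplex_volume(img, txt, flow):
    """Volume of the simplex spanned by the origin and three unit vectors.

    Uses sqrt(det(G)) / 3!, where G is the 3x3 Gram matrix of the triplet.
    Inputs are (B, D) arrays of already-normalized embeddings.
    """
    v = np.stack([img, txt, flow], axis=1)       # (B, 3, D)
    gram = v @ v.transpose(0, 2, 1)              # (B, 3, 3)
    det = np.clip(np.linalg.det(gram), 0.0, None)  # guard tiny negatives
    return np.sqrt(det) / 6.0


def cosine_regularizer(img, txt, flow):
    """Mean of (1 - cosine similarity) over the three modality pairs."""
    cos_it = np.sum(img * txt, axis=-1)
    cos_if = np.sum(img * flow, axis=-1)
    cos_tf = np.sum(txt * flow, axis=-1)
    return ((1 - cos_it) + (1 - cos_if) + (1 - cos_tf)) / 3.0


def info_nce(anchor, other, temperature=0.07):
    """InfoNCE: row i of `anchor` should match row i of `other`."""
    logits = anchor @ other.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def combined_loss(img, txt, flow, w_vol=1.0, w_cos=1.0, w_con=1.0):
    """Hypothetical weighted sum of the three terms named in the abstract."""
    img, txt, flow = (l2_normalize(e) for e in (img, txt, flow))
    vol = simplex_volume(img, txt, flow).mean()
    cos = cosine_regularizer(img, txt, flow).mean()
    con = 0.5 * (info_nce(img, txt) + info_nce(img, flow))
    return w_vol * vol + w_cos * cos + w_con * con


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img, txt, flow = (rng.normal(size=(8, 32)) for _ in range(3))
    print("combined loss:", combined_loss(img, txt, flow))
```

Under these assumptions the sketch also shows why volume alone is ambiguous: the triplet `(e1, -e1, e1)` is collinear and has zero volume even though the text embedding points away from the other two, whereas the cosine term stays large for it. The contrastive term, in turn, penalizes the collapse in which every sample maps to the same point.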