DynaFLIP: Rethinking Robot Perception through Tri-Modal Dynamics-Guided Representations

2026-05-28 08:00 · 36 days ago

AI Summary

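DynaFLIP is a dynamics-aware multimodal pre-training framework that moves motion understanding upstream into the perception stage. It builds image-language-3D flow triplets from heterogeneous human and robot videos and uses them as training supervision. Its core idea is simplex-volume minimization, combined with a cosine regularizer and a contrastive objective, to align a single image encoder with the other modalities in a shared hyperspherical space. Analyses show that the model focuses on the control-relevant regions that matter most for robot-arm manipulation. The resulting visual representations serve as a reusable backbone and outperform baselines across a range of downstream policies, including vision-language-action models (VLAs). Under out-of-distribution scenarios, gains reach up to +22.5%.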

Original abstract · untranslated

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
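The abstract does not give the loss in closed form, so the following is only a minimal NumPy sketch of one plausible reading. It assumes the simplex is anchored at the origin, so that its volume is sqrt(det G) / 3! for the 3x3 Gram matrix G of the unit-norm image, language, and flow embeddings. It also assumes the cosine regularizer is the mean of (1 - cos) over the three modality pairs and that the contrastive objective is an InfoNCE term from the image to each of the other two modalities. All function names, the loss weights, and the temperature are hypothetical and are not taken from the paper.

```python
import math
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def simplex_volume(img, txt, flow):
    """Per-sample volume of the simplex with vertices {0, img, txt, flow}.

    Inputs are (B, D) arrays of unit vectors. The volume is
    sqrt(det(G)) / 3!, where G is the 3x3 Gram matrix of the three vectors.
    (Assumption: origin-anchored simplex.)
    """
    v = np.stack([img, txt, flow], axis=1)        # (B, 3, D)
    gram = v @ v.transpose(0, 2, 1)               # (B, 3, 3)
    det = np.clip(np.linalg.det(gram), 0.0, None)  # guard against tiny negatives
    return np.sqrt(det) / math.factorial(3)


def cosine_regularizer(img, txt, flow):
    """Per-sample mean of (1 - cosine similarity) over the three pairs."""
    cos_it = np.sum(img * txt, axis=-1)
    cos_if = np.sum(img * flow, axis=-1)
    cos_tf = np.sum(txt * flow, axis=-1)
    return ((1 - cos_it) + (1 - cos_if) + (1 - cos_tf)) / 3.0


def info_nce(a, b, temperature=0.07):
    """One-directional InfoNCE: row i of `a` should match row i of `b`."""
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def combined_loss(img, txt, flow, w_vol=1.0, w_cos=1.0, w_con=1.0):
    """Weighted sum of the three terms (weights are illustrative only)."""
    img, txt, flow = (l2_normalize(m) for m in (img, txt, flow))
    vol = simplex_volume(img, txt, flow).mean()
    cos = cosine_regularizer(img, txt, flow).mean()
    con = 0.5 * (info_nce(img, txt) + info_nce(img, flow))
    return w_vol * vol + w_cos * cos + w_con * con
```

Under these assumptions the sketch also illustrates why volume alone is ambiguous: the Gram determinant vanishes for any linearly dependent triplet, so an image embedding v paired with a language embedding -v has zero volume even though the two point in opposite directions. The cosine term is nonzero in that case, and the contrastive term discourages the trivial solution in which every sample collapses to the same point.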

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org
