# UniT: A Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-21 08:00
- AIHOT link: https://aihot.virxact.com/items/cmoccay5702q8slsjshwuccvp
- Original link: https://arxiv.org/abs/2604.19734

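## AI Summary

The UniT (Unified Latent Action Tokenizer via Visual Anchoring) framework establishes a unified cross-embodiment physical language through visual anchoring. Built on the core idea that heterogeneous kinematics share universal visual consequences, it uses a tri-branch cross-reconstruction mechanism to produce a shared discrete latent space that is embodiment-agnostic. In policy learning, VLA-UniT leverages human data to achieve SOTA data efficiency and OOD generalization, and demonstrates zero-shot task transfer. In world modeling, WM-UniT achieves direct human-to-humanoid action transfer. t-SNE visualizations confirm that human and humanoid features converge to a shared manifold.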

## Main Text

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
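
To make the tri-branch cross-reconstruction idea concrete, here is a minimal NumPy sketch of how the three loss terms could be computed over a shared codebook. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the architecture, so the linear encoders and decoders, the nearest-neighbour quantizer, the mean-squared-error losses, all dimensions, and every name (`quantize`, `tri_branch_losses`, `params`) are hypothetical. Only the forward pass is shown; a real tokenizer would need a training loop and a gradient path through the quantizer.

```python
import numpy as np


def quantize(z, codebook):
    """Map each latent row in z (N, D) to its nearest codebook entry (K, D)."""
    # Squared Euclidean distance between every latent and every code: (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]


def mse(a, b):
    """Mean squared error as a plain Python float."""
    return float(np.mean((a - b) ** 2))


def tri_branch_losses(actions, visual, params):
    """Forward pass of three branches sharing one discrete codebook.

    Assumed reading of the abstract (hypothetical details):
      - action branch: action tokens must predict the visual outcome
      - vision branch: visual tokens must reconstruct the action
      - fusion branch: tokens from both inputs must reconstruct both
    """
    cb = params["codebook"]

    # Action branch: actions -> tokens -> predicted visual change.
    _, q_a = quantize(actions @ params["enc_a"], cb)
    loss_a2v = mse(q_a @ params["dec_v"], visual)

    # Vision branch: visual change -> tokens -> reconstructed actions.
    _, q_v = quantize(visual @ params["enc_v"], cb)
    loss_v2a = mse(q_v @ params["dec_a"], actions)

    # Fusion branch: both inputs -> shared tokens -> both outputs.
    fused = np.concatenate([actions, visual], axis=1)
    idx_f, q_f = quantize(fused @ params["enc_f"], cb)
    loss_fuse = mse(q_f @ params["dec_a"], actions) + mse(q_f @ params["dec_v"], visual)

    return {"a2v": loss_a2v, "v2a": loss_v2a, "fusion": loss_fuse, "tokens": idx_f}


# Toy setup with made-up sizes: N samples, action dim, visual-feature dim,
# latent dim, and codebook size.
N, A_DIM, V_DIM, D, K = 8, 6, 16, 4, 32
rng = np.random.default_rng(0)

params = {
    "enc_a": rng.normal(size=(A_DIM, D)) * 0.1,
    "enc_v": rng.normal(size=(V_DIM, D)) * 0.1,
    "enc_f": rng.normal(size=(A_DIM + V_DIM, D)) * 0.1,
    "dec_a": rng.normal(size=(D, A_DIM)) * 0.1,
    "dec_v": rng.normal(size=(D, V_DIM)) * 0.1,
    "codebook": rng.normal(size=(K, D)) * 0.1,
}

actions = rng.normal(size=(N, A_DIM))  # stand-in for embodiment-specific actions
visual = rng.normal(size=(N, V_DIM))   # stand-in for visual-change features

out = tri_branch_losses(actions, visual, params)
print({k: v for k, v in out.items() if k != "tokens"})
print("fusion tokens:", out["tokens"])
```

In this sketch the fusion branch's token indices play the role of the unified tokens that a downstream policy or world model would consume. Because the codebook and decoders are shared across branches, a human action encoder could in principle be swapped in for `enc_a` while the visual side stays fixed. That swap is an assumption about how the cross-embodiment sharing might work, not a detail given in the abstract.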
