# μ_0: A Scalable World Model of 3D Interaction Traces

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-11 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmqemsqub03pgsluns22h0hl6
- Original link: https://arxiv.org/abs/2606.13769

## AI Summary

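μ_0 is a scalable world model based on 3D traces. It predicts smooth 3D trajectories for key interaction points such as objects, tools, hands, and contact regions, forming a compact, embodiment-agnostic motion interface. Its companion TraceExtract system automatically extracts 3D supervision from diverse video sources. μ_0 combines a pretrained vision-language backbone with a modular trace expert, representing each query with B-spline control points and predicting future traces. Experiments show that μ_0 outperforms baseline models in both 2D and 3D trace prediction. Once frozen, μ_0 can be paired with downstream robot action experts; despite action-free pretraining, the resulting policies perform comparably to VLA models pretrained with action supervision.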

## Main Text

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present μ_0, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, μ_0 forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains μ_0 by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that μ_0 outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because μ_0 is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as π_0. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
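To make the B-spline trace representation concrete, here is a minimal sketch of how a handful of 3D control points can be decoded into a smooth trajectory. The abstract does not state the spline degree, knot layout, or number of control points, so the clamped cubic spline and the names `clamped_knots`, `de_boor`, and `sample_trace` below are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def clamped_knots(n_ctrl: int, degree: int) -> List[float]:
    """Clamped uniform knot vector on [0, 1] (assumed layout, not from the paper).

    Repeating the end knots makes the curve start at the first control
    point and end at the last one.
    """
    n_inner = n_ctrl - degree - 1
    inner = [(i + 1) / (n_inner + 1) for i in range(n_inner)]
    return [0.0] * (degree + 1) + inner + [1.0] * (degree + 1)


def de_boor(t: float, ctrl: Sequence[Point3], degree: int = 3) -> Point3:
    """Evaluate the B-spline at parameter t in [0, 1] with De Boor's algorithm."""
    n = len(ctrl)
    if n <= degree:
        raise ValueError("need at least degree + 1 control points")
    knots = clamped_knots(n, degree)

    # Locate the knot span k with knots[k] <= t < knots[k + 1];
    # t == 1.0 is assigned to the last non-empty span.
    k = degree
    while k < n - 1 and t >= knots[k + 1]:
        k += 1

    # Repeated linear interpolation between the degree + 1 active control points.
    d = [list(ctrl[j + k - degree]) for j in range(degree + 1)]
    for r in range(1, degree + 1):
        for j in range(degree, r - 1, -1):
            lo = knots[j + k - degree]
            hi = knots[j + 1 + k - r]
            alpha = (t - lo) / (hi - lo)
            d[j] = [(1.0 - alpha) * a + alpha * b for a, b in zip(d[j - 1], d[j])]
    return (d[degree][0], d[degree][1], d[degree][2])


def sample_trace(ctrl: Sequence[Point3], n_samples: int = 16, degree: int = 3) -> List[Point3]:
    """Decode control points into n_samples evenly spaced points along the trace."""
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    return [de_boor(i / (n_samples - 1), ctrl, degree) for i in range(n_samples)]


# Hypothetical control points for one tracked keypoint, in metres.
control_points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.05),
    (0.2, 0.05, 0.1),
    (0.3, 0.05, 0.1),
    (0.4, 0.0, 0.05),
]
trace = sample_trace(control_points, n_samples=8)
```

Under these assumptions, five control points (15 numbers) describe a trajectory that can be sampled at any temporal resolution, which is one plausible reason a control-point output is more compact than predicting every waypoint directly.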