VideoMDM： Towards 3D Human Motion Generation From 2D Supervision

2026-06-11 08:00·22天前

AI 摘要

VideoMDM是一个基于扩散的框架，从单目视频的精确2D姿态训练3D人体运动先验，无需3D真值。它利用预训练的2D-to-3D提升器提供近似3D序列作为噪声教师，经扩散和去噪后重投影到2D并与准确关键点对比以监督训练。理论证明深度加权的2D重投影损失在期望上等价于直接3D监督。在HumanML3D基准上，VideoMDM几乎缩小了与完全3D监督MDM的差距（FID 0.88 vs 0.54）；在真实视频数据集Fit3D和NBA上，生成的运动获得人类一致偏好。

原文 · 未翻译

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); On real video datasets Fit3D and NBA the method learns to generate motions consistently preferred by humans, with strong quantitative results.

HuggingFace Daily Papers（社区热门论文）

55导出 Markdown

VideoMDM： Towards 3D Human Motion Generation From 2D Supervision

2026-06-11 08:00·22天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译