# VideoMDM： Towards 3D Human Motion Generation From 2D Supervision

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-11 08:00
- AIHOT 分数：55
- AIHOT 链接：https://aihot.virxact.com/items/cmqac9xug0k4vslldigo5pn6m
- 原文链接：https://arxiv.org/abs/2606.13364

## AI 摘要

VideoMDM是一个基于扩散的框架，从单目视频的精确2D姿态训练3D人体运动先验，无需3D真值。它利用预训练的2D-to-3D提升器提供近似3D序列作为噪声教师，经扩散和去噪后重投影到2D并与准确关键点对比以监督训练。理论证明深度加权的2D重投影损失在期望上等价于直接3D监督。在HumanML3D基准上，VideoMDM几乎缩小了与完全3D监督MDM的差距（FID 0.88 vs 0.54）；在真实视频数据集Fit3D和NBA上，生成的运动获得人类一致偏好。

## 正文

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); On real video datasets Fit3D and NBA the method learns to generate motions consistently preferred by humans, with strong quantitative results.