The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction
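Source: arxiv.org. ViDiHand uses the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. It adapts the diffusion features with a hand-overlay rendering objective, which specializes them for hands while preserving the model's world priors; a decoder then recovers metric-scale pose. The entire pipeline processes full frames directly, with no detector, infiller, or test-time optimization. On the ARCTIC, HOT3D, and HOI4D benchmarks, ViDiHand significantly outperforms existing methods, suggesting that video diffusion models can serve as a new foundation for hand motion reconstruction and offer a route to scalable in-the-wild data collection for embodied AI.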
4D hand motion reconstruction from egocentric video is bottlenecked by the limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a signal too narrow to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt the model via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames: no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.