# AnyMo: Arbitrary-Modality Conditional Motion Generation via Masked Modeling

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmpupuvkw007usl3tmktzz9r1
- Original link: https://arxiv.org/abs/2605.29488

## AI Summary

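The paper introduces OmniHuMo, a large-scale, high-quality dataset containing over 5,000 hours of motion data and 3.2 million sequences, with precisely aligned multimodal annotations covering text, speech, music, and trajectory. Building on it, the authors construct AnyMo, a unified multimodal framework that combines a Residual FSQ motion tokenizer with a scalable masked modeling Transformer. AnyMo supports high-fidelity, real-time motion generation under arbitrary combinations of modalities and allows flexible control over the spatial and stylistic attributes of the motion.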

## Body

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
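
The abstract names a Residual FSQ-based motion tokenizer but gives no implementation details. As a rough illustration of the general idea only, the following minimal NumPy sketch shows residual finite scalar quantization: each stage rounds the bounded latent to a small per-channel grid, and the next stage quantizes what is left over at a finer scale. Everything here is an assumption made for illustration and not taken from the paper: the function names `fsq_quantize` and `residual_fsq`, the hard clip used as the bounding function, the restriction to odd level counts, the per-stage scale `(levels - 1) ** (-stage)`, and the toy latent shape.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Round each channel of z (clipped to [-1, 1]) onto a uniform grid.

    Hypothetical helper: channel i gets levels[i] grid points in [-1, 1].
    A hard clip is used as the bounding function to keep the sketch simple.
    """
    levels = np.asarray(levels, dtype=np.float64)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch assumes an odd number of levels per channel")
    half = (levels - 1) / 2.0
    return np.round(np.clip(z, -1.0, 1.0) * half) / half


def residual_fsq(z, levels, num_stages):
    """Quantize z in several stages, each one refining the previous residual.

    Returns the reconstruction and the list of per-stage quantized outputs.
    The per-stage scale is an assumed choice that matches the grid spacing
    left over by the previous stage.
    """
    levels_arr = np.asarray(levels, dtype=np.float64)
    residual = np.asarray(z, dtype=np.float64)
    recon = np.zeros_like(residual)
    stage_outputs = []
    for stage in range(num_stages):
        scale = (levels_arr - 1.0) ** (-stage)  # stage 0 uses scale 1
        q = fsq_quantize(residual / scale, levels) * scale
        stage_outputs.append(q)
        recon = recon + q
        residual = residual - q
    return recon, stage_outputs


# Toy latents standing in for encoded motion: 8 frames, 4 channels in [-1, 1].
rng = np.random.default_rng(0)
latents = rng.uniform(-1.0, 1.0, size=(8, 4))
for stages in (1, 2, 3):
    recon, _ = residual_fsq(latents, levels=[5, 5, 5, 5], num_stages=stages)
    print(stages, float(np.abs(latents - recon).max()))
```

Under these assumptions, with 5 levels per channel the worst-case reconstruction error shrinks by a factor of 4 with each added stage, which is the usual motivation for stacking residual stages rather than enlarging a single codebook. How AnyMo actually bounds, scales, and indexes its tokens is not stated in the abstract.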