# MotionVLA: A Vision-Language-Action Model for Humanoid Motion

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-13 08:00
- AIHOT score: 40
- AIHOT link: https://aihot.virxact.com/items/cmqhiqg2b04rzsle1em1zgzgl
- Original paper: https://arxiv.org/abs/2606.15142

## AI Summary

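MotionVLA builds on Qwen3.5 and uses DSFT, a dual-stream frequency tokenizer, to decompose motion into a Base stream and a Physical stream. Each stream is compressed independently with DCT truncation and BPE, and the tokens are predicted in Base → Physical order within a unified sequence. On HumanML3D and MBench, with only a lightweight 2B-parameter backbone, the model reduces the Diversity gap on HumanML3D by over 50% and improves Motion-Condition Consistency on MBench by 3.8%, demonstrating the effectiveness of frequency-aware dual-stream decoupling for autoregressive motion generation.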

## Main Text

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. To address both challenges, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and Physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and Physical tokens in a unified sequence, where Physical tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
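
The energy statistic quoted in the abstract (a handful of DCT coefficients capturing most position energy but far less velocity energy) can be illustrated with a minimal sketch. This is not the authors' analysis code: the synthetic trajectory, the finite-difference velocity, the choice of the five lowest-frequency coefficients, and the helper names `dct2_ortho`, `energy_fraction`, and `finite_difference` are all assumptions made for illustration, and the numbers it produces are unrelated to the 93% and 37% figures reported on real motion data.

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(N^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def energy_fraction(x, keep):
    """Fraction of signal energy held by the first `keep` DCT coefficients."""
    coeffs = dct2_ortho(x)
    total = sum(c * c for c in coeffs)
    if total == 0.0:
        return 1.0  # an all-zero signal is trivially fully captured
    return sum(c * c for c in coeffs[:keep]) / total


def finite_difference(x):
    """Per-frame velocity approximated as the difference of adjacent samples."""
    return [b - a for a, b in zip(x, x[1:])]


# Hypothetical single-joint trajectory: a slow sway plus a small fast jitter.
N = 64
position = [
    math.sin(2 * math.pi * t / N) + 0.1 * math.sin(2 * math.pi * 12 * t / N)
    for t in range(N)
]
velocity = finite_difference(position)

KEEP = 5  # assumed truncation length, mirroring the "five coefficients" in the text
pos_frac = energy_fraction(position, KEEP)
vel_frac = energy_fraction(velocity, KEEP)

print(f"position energy in first {KEEP} DCT coefficients: {pos_frac:.3f}")
print(f"velocity energy in first {KEEP} DCT coefficients: {vel_frac:.3f}")
```

Differentiation scales each frequency component by roughly its frequency, so the small fast jitter that is negligible in the position signal carries a much larger share of the velocity energy. In this toy setting that is why the same truncation keeps noticeably less of the velocity signal, which is the kind of mismatch the abstract cites as motivation for tokenizing the two streams separately.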
