# MotiMotion：基于视觉推理的运动控制视频生成

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-21 08:00
- AIHOT 分数：55
- AIHOT 链接：https://aihot.virxact.com/items/cmpn76n430v1fsl01sy9z2kfg
- 原文链接：https://arxiv.org/abs/2605.22818

## AI 摘要

该研究指出当前运动控制视频生成模型存在轨迹僵硬、因果不完整的问题。为此，MotiMotion框架将运动控制重新定义为“先推理再生成”的任务。其核心是利用一个无需训练的视觉语言推理器来完善主轨迹坐标，并“幻想”出合理的次要运动。同时，框架引入置信度感知控制方案，根据计划的可信度调整引导强度。为系统评估，研究还构建了新的运动交互基准MotiBench。评估表明，MotiMotion能生成物体行为和交互更合理可信的视频，效果优于现有方法。

## 正文

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.