# minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmpqd5x5k03y0slno8ld11blq
- Original link: https://arxiv.org/abs/2605.30263

## AI Summary

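minWM is an open-source, full-stack framework that converts existing bidirectional video diffusion foundation models (such as Wan2.1-T2V-1.3B and HY1.5-TI2V-8B) into few-step autoregressive world models that support camera control and low-latency rollout. It provides a modular end-to-end pipeline comprising controllable fine-tuning, the Causal Forcing++ pipeline, and distillation steps, and it can also adapt existing models such as HY-WorldPlay. The project has open-sourced the associated scripts, weights, and code.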

## Main Text

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: [https://github.com/shengshu-ai/minWM](https://github.com/shengshu-ai/minWM)
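
To make the stage ordering and the idea of causal, few-step streaming rollout concrete, here is a minimal toy sketch in plain Python. It is not minWM's code or API: the stage names merely paraphrase the abstract, and `RolloutConfig`, `toy_denoiser`, and `rollout` are hypothetical names invented for illustration. Frames are stand-in lists of floats, and the "denoiser" is a trivial arithmetic placeholder for a distilled causal generator. The step count of 4 is also an assumption; the abstract only says "few-step".

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence

# Stage order paraphrased from the abstract. These are illustrative labels,
# NOT the names of minWM's actual scripts or configuration keys.
PIPELINE_STAGES = [
    "camera_control_finetune",                 # fine-tune the bidirectional model
    "ar_diffusion_training",                   # autoregressive diffusion training
    "causal_ode_or_consistency_distillation",  # one of the two distillation routes
    "asymmetric_dmd",                          # final few-step distillation
    "streaming_inference",                     # low-latency rollout
]

Chunk = List[float]  # toy stand-in for a chunk of video latents
Denoiser = Callable[[Chunk, Sequence[Chunk], float, int], Chunk]


@dataclass
class RolloutConfig:
    num_chunks: int = 4     # how many chunks to generate
    denoise_steps: int = 4  # assumed "few-step" budget per chunk
    chunk_size: int = 3     # toy number of values per chunk


def toy_denoiser(noisy: Chunk, context: Sequence[Chunk],
                 camera: float, step: int) -> Chunk:
    """Hypothetical placeholder: pulls values toward (last past value + camera)."""
    prev = context[-1][-1] if context else 0.0
    return [0.5 * x + 0.5 * (prev + camera) for x in noisy]


def rollout(cfg: RolloutConfig, camera_actions: Sequence[float],
            denoiser: Denoiser) -> Iterator[Chunk]:
    """Yield chunks one at a time, each conditioned only on past chunks."""
    history: List[Chunk] = []
    for i in range(cfg.num_chunks):
        chunk = [1.0] * cfg.chunk_size  # deterministic placeholder for noise
        for step in range(cfg.denoise_steps):
            chunk = denoiser(chunk, history, camera_actions[i], step)
        history.append(chunk)
        yield chunk  # a chunk can be displayed before later ones exist


if __name__ == "__main__":
    cfg = RolloutConfig(num_chunks=3)
    for idx, chunk in enumerate(rollout(cfg, [0.0, 1.0, 2.0], toy_denoiser)):
        print(idx, chunk)
```

The property the sketch is meant to show is causality: because each chunk sees only earlier chunks and its own camera action, changing a later action cannot alter chunks that were already emitted, which is what allows interactive, streaming use.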