# MIMFlow: An End-to-End Image Generation Framework Fusing Masked Image Modeling and Normalizing Flows

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-24 08:00
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmr00z6hf02uxslkiohzw1lh9
- Original paper: https://arxiv.org/abs/2606.26016

## AI Summary

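MIMFlow is a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. It uses a VAE encoder to infer semantic latents from masked images, so that the normalizing flow focuses on modeling a simplified low-frequency semantic manifold while a dedicated decoder handles high-frequency synthesis, which addresses the capacity bottleneck of normalizing flows. On ImageNet 256×256, MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over NF baselines of similar scale. The code is open source.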

## Full Text

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latents from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
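
To make "exact density estimation" and "strict invertibility" concrete, here is a minimal sketch of a single affine coupling layer in the RealNVP style, written in plain Python. It is a generic illustration of a normalizing flow and not the MIMFlow architecture. The function names (`make_coupling`, `forward`, `inverse`, `log_prob`) and the tiny linear maps standing in for neural networks are assumptions made for this example.

```python
import math
import random

def make_coupling(half, seed=0, scale=0.1):
    """Hypothetical toy parameters: two small linear maps (not a real network)."""
    rng = random.Random(seed)
    w_s = [[rng.gauss(0.0, scale) for _ in range(half)] for _ in range(half)]
    w_t = [[rng.gauss(0.0, scale) for _ in range(half)] for _ in range(half)]
    return w_s, w_t

def _linear(w, v):
    """Matrix-vector product on plain lists."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]

def forward(params, x):
    """Map data x to latent z; return z and log|det dz/dx|."""
    w_s, w_t = params
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s = [math.tanh(u) for u in _linear(w_s, x_a)]  # bounded log-scale
    t = _linear(w_t, x_a)
    z_b = [xb * math.exp(ls) + ti for xb, ls, ti in zip(x_b, log_s, t)]
    # The first half passes through unchanged, so the Jacobian is triangular
    # and its log-determinant is just the sum of the log-scales.
    return x_a + z_b, sum(log_s)

def inverse(params, z):
    """Exact inverse of forward: recover x from z."""
    w_s, w_t = params
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s = [math.tanh(u) for u in _linear(w_s, z_a)]
    t = _linear(w_t, z_a)
    x_b = [(zb - ti) * math.exp(-ls) for zb, ls, ti in zip(z_b, log_s, t)]
    return z_a + x_b

def log_prob(params, x):
    """Exact log-density via change of variables with a standard normal prior."""
    z, log_det = forward(params, x)
    log_pz = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)
    return log_pz + log_det

params = make_coupling(half=4)
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(8)]
z, log_det = forward(params, x)
x_rec = inverse(params, z)
max_err = max(abs(a - b) for a, b in zip(x, x_rec))
```

Because the layer must stay invertible, every input dimension has to be carried through to the latent. This is the constraint the abstract refers to when it says that NFs spend capacity on low-level pixel detail.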
