Representation Forcing: Achieving Bottleneck-Free Unified Multimodal Models
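Read the original: arxiv.org
Existing unified multimodal models (UMMs) still rely on a frozen, separately pretrained VAE for image generation, which creates a structural bottleneck. This paper proposes Representation Forcing (RF), a technique that forces the decoder to autoregressively predict visual representations as intermediate tokens before generating pixels, and keeps those tokens in context to guide pixel diffusion within the same backbone. This turns representations from perception outputs into generation targets, which eliminates the need for an external generative latent space. Experiments show that RF strengthens both understanding and generation: its pixel-space model matches state-of-the-art VAE-based unified models on image generation, and generally outperforms its VAE-based variant on image understanding.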
Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.
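The two-phase ordering described above (representation tokens first, then pixel diffusion conditioned on them in the same backbone) can be illustrated with a toy sketch. This is not the paper's implementation: the `backbone` stub, the sizes `D`, `N_REP`, `N_PATCH`, `N_STEPS`, and the denoising update are all hypothetical stand-ins chosen only to show how the context is built and reused.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only for illustration.
D = 8        # shared hidden width
N_REP = 4    # number of intermediate representation tokens
N_PATCH = 6  # number of pixel patches
N_STEPS = 5  # number of denoising steps

# Stand-in for the weights of the single shared backbone (assumption).
W = rng.normal(scale=0.1, size=(D, D))


def backbone(context):
    """Toy backbone: mean-pool the context tokens, then project."""
    return np.tanh(context.mean(axis=0) @ W)


def generate(prompt):
    """Return (context, pixels) for a prompt of shape (n_tokens, D)."""
    context = prompt.copy()

    # Phase 1: autoregressively predict representation tokens.
    # Each new token is appended, so it stays in context afterwards.
    for _ in range(N_REP):
        rep_token = backbone(context)
        context = np.vstack([context, rep_token])

    # Phase 2: denoise pixel patches with the same backbone,
    # conditioned on prompt tokens plus representation tokens.
    pixels = rng.normal(size=(N_PATCH, D))
    for step in range(N_STEPS):
        target = backbone(context)
        # Toy update: move the noisy patches toward the conditioned
        # prediction; a real model would use a learned diffusion step.
        pixels = pixels + (target - pixels) / (N_STEPS - step)

    return context, pixels


prompt = np.ones((3, D))
context, pixels = generate(prompt)
print(context.shape, pixels.shape)
```

The point of the sketch is the data flow: no external encoder or latent space appears anywhere, and the representation tokens produced in phase 1 are part of the context that phase 2 reads.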