MMCORE: Multimodal Connection with Representation-Aligned Latent Embeddings
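Source: arxiv.org
MMCORE is a unified framework for multimodal image generation and editing. It uses a pre-trained vision-language model (VLM) to predict semantic visual embeddings, which then serve as conditioning signals for a diffusion model. This design requires neither deep fusion of autoregressive and diffusion models nor training from scratch, which significantly reduces computational cost while preserving high-fidelity synthesis. The framework supports text-to-image generation and interleaved image generation, shows strong multimodal understanding in complex scenarios such as spatial reasoning and visual grounding, and outperforms existing state-of-the-art baselines on multiple text-to-image and single/multi-image editing benchmarks.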
We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.
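The data flow described in the abstract (learnable query tokens, VLM-predicted visual embeddings, diffusion conditioning) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not specify how the queries enter the VLM or how the diffusion model consumes the embeddings, so the sketch assumes the queries are appended to the input sequence and that the denoiser uses cross-attention. All names (`toy_vlm`, `predict_visual_embeddings`, `toy_denoise_step`), dimensions, and the single-layer stand-ins are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention for 2-D arrays (tokens x dim)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def toy_vlm(tokens):
    """Hypothetical stand-in for a pre-trained VLM: one residual self-attention layer."""
    return tokens + attend(tokens, tokens, tokens)

def predict_visual_embeddings(prompt_tokens, queries, proj):
    """Assumed mechanism: append query tokens, run the VLM, keep the query outputs."""
    seq = np.concatenate([prompt_tokens, queries], axis=0)
    hidden = toy_vlm(seq)
    query_out = hidden[-len(queries):]   # outputs at the query positions
    return query_out @ proj              # project to the conditioning width

def toy_denoise_step(latents, cond):
    """Hypothetical stand-in for one diffusion step: latents cross-attend to cond."""
    return latents - 0.1 * attend(latents, cond, cond)

# Illustrative sizes only; none of these values come from the paper.
VLM_DIM, COND_DIM, NUM_QUERIES = 16, 8, 4
prompt_tokens = rng.normal(size=(10, VLM_DIM))                 # encoded text/image prompt
queries = rng.normal(size=(NUM_QUERIES, VLM_DIM)) * 0.02       # learnable query tokens
proj = rng.normal(size=(VLM_DIM, COND_DIM)) * 0.1              # learnable projection

cond = predict_visual_embeddings(prompt_tokens, queries, proj)
latents = rng.normal(size=(6, COND_DIM))                       # noisy image latents
latents_next = toy_denoise_step(latents, cond)
print(cond.shape, latents_next.shape)
```

In this sketch only `queries` and `proj` would be trained, which mirrors the abstract's point that the design avoids deep fusion of the two models and avoids training from scratch. Which components are actually frozen or fine-tuned in MMCORE is not stated in the abstract.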