# Nemotron-Labs-Diffusion-Image: Text-to-Image Synthesis with a Masked Discrete Diffusion Model

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-29 08:00
- AIHOT score: 48
- AIHOT link: https://aihot.virxact.com/items/cmr034hsc03faslkiblz0knek
- Original link: https://arxiv.org/abs/2606.29814

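## AI Summary

Nemotron-Labs-Diffusion-Image is a masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. It introduces a token-editing mechanism that lets the model revise already-unmasked discrete tokens at inference time, compensating for the lack of self-correction in standard MDMs. It also proposes a Grouped Cross-Entropy (GCE) objective that assigns positive learning signal to tokens neighboring the ground-truth token in embedding space, easing the sparse training signal caused by large-vocabulary discrete image tokenizers. A custom fused operator for GCE significantly reduces VRAM usage in large-vocabulary settings. Reported results are 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.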
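
The token-editing idea can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions: the source does not describe the actual editing rule, so the `predict` interface, the `edit_threshold` parameter, and the confidence-based overwrite rule below are all hypothetical, and `toy_predict` is a hand-written stand-in for a real model.

```python
MASK = -1  # hypothetical sentinel for a still-masked position


def decode_with_editing(predict, length, steps, edit_threshold=0.9):
    """Iteratively unmask tokens, optionally revising ones already revealed.

    `predict(tokens)` is an assumed interface returning one
    (token, confidence) pair per position.
    """
    tokens = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    for _ in range(steps):
        preds = predict(tokens)
        # Edit step (assumed rule): overwrite a revealed token when the
        # model now confidently predicts a different one.
        for i, (tok, conf) in enumerate(preds):
            if tokens[i] != MASK and tok != tokens[i] and conf >= edit_threshold:
                tokens[i] = tok
        # Unmask step: reveal the most confident still-masked positions.
        masked = [i for i in range(length) if tokens[i] == MASK]
        masked.sort(key=lambda i: -preds[i][1])
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens):
    """Stand-in model: makes one wrong, confident guess before any context exists."""
    target = [3, 1, 4, 1]
    revealed = sum(t != MASK for t in tokens)
    preds = []
    for i, t in enumerate(target):
        if i == 0 and revealed == 0:
            preds.append((9, 0.99))  # deliberately wrong early guess
        else:
            preds.append((t, 0.95 - 0.01 * i))
    return preds


with_editing = decode_with_editing(toy_predict, length=4, steps=4)
without_editing = decode_with_editing(toy_predict, length=4, steps=4, edit_threshold=2.0)
```

With editing enabled, the wrong first token is overwritten once the toy model changes its mind. Setting `edit_threshold` above 1.0 disables editing, which mimics a standard MDM where the early mistake is frozen in place.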

## Main Text

We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models, which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving scores of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
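
To make the GCE idea concrete, here is a minimal pure-Python sketch of one plausible reading of it. The abstract does not give the exact formula, so everything here is an assumption: the group is taken to be the `k` codebook entries nearest to the ground-truth token in embedding space, and the loss is the negative log of the total softmax probability assigned to that group. The function names, the toy codebook, and `k` are illustrative, and the fused operator mentioned above is not reproduced.

```python
import math


def nearest_group(codebook, target, k):
    """Indices of the k codebook entries closest to codebook[target].

    The target itself is always included (it sorts first on ties).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    order = sorted(
        range(len(codebook)),
        key=lambda j: (sq_dist(codebook[j], codebook[target]), j != target),
    )
    return order[:k]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def grouped_cross_entropy(logits, group):
    """Assumed GCE form: -log of the softmax mass placed on the group."""
    return log_sum_exp(logits) - log_sum_exp([logits[j] for j in group])


# Toy 4-entry codebook: entries 0, 1, and 3 are close together, entry 2 is far away.
codebook = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2]]
logits = [1.0, 2.0, 0.5, 0.0]
target = 0

group = nearest_group(codebook, target, k=2)
gce = grouped_cross_entropy(logits, group)
ce = grouped_cross_entropy(logits, [target])  # group of one = standard cross-entropy
```

Under this reading, a prediction that lands on a near neighbor of the ground truth is penalized less than under standard cross-entropy, and a group of size one recovers standard cross-entropy exactly.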
