UDM-GRPO:面向均匀离散扩散模型的稳定高效群体相对策略优化
阅读原文· arxiv.org本文提出UDM-GRPO框架,首次实现均匀离散扩散模型与强化学习的稳定结合。针对训练不稳定问题,该方法将最终干净样本作为动作,并通过扩散前向过程重建轨迹以对齐预训练分布。此外,引入Reduced-Step和CFG-Free策略提升效率。实验表明,GenEval准确率从69%提升至96%,PickScore从20.46提升至23.81,OCR基准准确率从8%跃升至57%,在文本到图像任务中达到SOTA性能。
Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose \Ours, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. \Ours significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO{https://github.com/Yovecent/UDM-GRPO}.