# GDSD: Reinforcement Learning for Diffusion Language Models via Guided Denoiser Self-Distillation

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-05-28 08:00
- AIHOT score: 62
- AIHOT link: https://aihot.virxact.com/items/cmpv4x8oc03ohsl0zf94ulefa
- Original link: https://arxiv.org/abs/2605.29398

## AI Summary

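This paper proposes GDSD to address the problem that reinforcement learning for diffusion large language models is limited by the intractability of the policy likelihood. The method derives an advantage-guided self-teacher from the closed-form optimum of reverse-KL regularized RL and distills the denoiser directly from it. GDSD matches the student's logits through a normalization-free objective, turning RL into likelihood-free self-distillation and thereby avoiding the training-inference mismatch bias that earlier methods incur by using the evidence lower bound (ELBO) as a likelihood surrogate. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD shows more stable training rewards and consistently outperforms prior ELBO-based methods, with test-accuracy improvements of up to +19.6%.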

## Main Text

Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although well aligned with pre-training, these approaches use the ELBO as a likelihood surrogate and thereby introduce bias through training-inference mismatch (TIM), which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of dLLMs from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
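
The "closed-form optimum of reverse-KL regularized RL" mentioned above is a standard result: over a finite set of outcomes, the distribution that maximizes expected advantage minus `beta` times the KL divergence to a reference policy is the reference reweighted by `exp(advantage / beta)`. The sketch below checks this on a toy four-outcome problem. It is only an illustration of that general result, not the GDSD objective or the authors' code; the function names, logits, advantages, and `beta` value are all hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

def tilted_teacher(ref_logits, advantages, beta):
    """Closed-form maximizer of E_pi[A] - beta * KL(pi || pi_ref) over a finite set:
    pi*(y) is proportional to pi_ref(y) * exp(A(y) / beta).
    Working in logit space means the normalizing constant is never formed explicitly."""
    ref_logits = np.asarray(ref_logits, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    return softmax(ref_logits + advantages / beta)

def objective(pi, ref, advantages, beta):
    """Expected advantage minus beta times the reverse KL to the reference."""
    return float(np.dot(pi, advantages) - beta * np.sum(pi * np.log(pi / ref)))

# Hypothetical toy values (not taken from the paper).
ref_logits = np.array([0.0, 0.5, -1.0, 0.2])
advantages = np.array([1.0, -0.5, 0.3, 0.0])
beta = 0.5

ref = softmax(ref_logits)
teacher = tilted_teacher(ref_logits, advantages, beta)

# Compare the teacher against many random distributions on the same support.
rng = np.random.default_rng(0)
best_random = max(
    objective(rng.dirichlet(np.ones(4)), ref, advantages, beta)
    for _ in range(1000)
)
print(objective(teacher, ref, advantages, beta) >= best_random)
```

In this toy setting the teacher differs from the reference only by an additive `advantages / beta` term in logit space. How GDSD constructs the teacher at the denoiser level and which divergence it uses for distillation are specified in the paper, not here.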