# OPDLM: Data-Efficient Autoregressive-to-Diffusion Language Model Conversion via On-Policy Distillation

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-04 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmq5ic4gi088pslt2185qbapv
- Original link: https://arxiv.org/abs/2606.06712

## AI Summary

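Existing methods for converting an autoregressive language model (ARLM) into a diffusion language model (DLM) face two distribution shifts: switching the training objective causes knowledge loss, and the randomly masked sequences seen during training do not match the confidence-based decoding trajectories seen at inference. The researchers propose OPDLM, which performs the conversion with On-Policy Distillation (OPD). The student (the ARLM with bidirectional attention) generates its own trajectories, and the teacher (the frozen original ARLM) provides target logits on those trajectories to distill its knowledge. Because OPDLM is trained on-policy, it eliminates the DLM train-inference mismatch, and the distillation mechanism preserves the knowledge of the original ARLM. Experiments show that OPDLM needs 15x to 7,000x fewer training tokens while performing strongly across diverse tasks, which positions DLM conversion as a form of ARLM post-training.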

## Main Text

We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
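The self-OPD loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the causal teacher scores partially masked states or which divergence is used. The sketch assumes that the teacher scores the completed student sequence with prefix-only conditioning, and that the loss is a forward KL averaged over every still-masked position in every visited state. The names `MASK`, `student_logits`, `teacher_logits`, `rollout`, and `opd_loss` are hypothetical, and both "models" are hand-written stand-ins rather than neural networks.

```python
import math
import random

MASK = -1      # hypothetical mask token id
VOCAB = 5      # toy vocabulary size
SEQ_LEN = 6    # toy sequence length


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def student_logits(state, pos, weights):
    """Toy bidirectional student: sees visible tokens on both sides of pos."""
    ctx = sum(t + 1 for i, t in enumerate(state) if t != MASK and i != pos)
    return [weights[pos][v] + 0.1 * ((ctx + v) % VOCAB) for v in range(VOCAB)]


def teacher_logits(prefix):
    """Toy frozen causal teacher: next-token logits depend on the prefix only."""
    ctx = sum(t + 1 for t in prefix)
    return [float((ctx + 2 * v) % VOCAB) for v in range(VOCAB)]


def rollout(weights):
    """Confidence-based decoding: commit the most confident masked position
    at each step. Returns the visited states and the final sequence."""
    state = [MASK] * SEQ_LEN
    trajectory = []
    while MASK in state:
        trajectory.append(list(state))
        best = None
        for pos in range(SEQ_LEN):
            if state[pos] != MASK:
                continue
            probs = softmax(student_logits(state, pos, weights))
            conf = max(probs)
            if best is None or conf > best[0]:
                best = (conf, pos, probs.index(conf))
        _, pos, tok = best
        state[pos] = tok
    return trajectory, state


def kl(p, q):
    """Forward KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opd_loss(weights):
    """Average KL(teacher || student) over masked positions of on-policy states.
    Assumption: the teacher conditions on the prefix of the student's final sequence."""
    trajectory, final = rollout(weights)
    total, count = 0.0, 0
    for state in trajectory:
        for pos in range(SEQ_LEN):
            if state[pos] != MASK:
                continue
            p_teacher = softmax(teacher_logits(final[:pos]))
            q_student = softmax(student_logits(state, pos, weights))
            total += kl(p_teacher, q_student)
            count += 1
    return total / count


random.seed(0)
weights = [[random.gauss(0.0, 1.0) for _ in range(VOCAB)] for _ in range(SEQ_LEN)]
print("on-policy distillation loss:", opd_loss(weights))
```

The point of the sketch is that the states entering the loss come from the student's own confidence-based decoding rather than from random masking, so training sees the same kind of partially masked sequences that inference does. A real implementation would presumably replace the toy functions with the two networks and backpropagate the loss through the student only, keeping the teacher frozen.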