# Learning from the Self-Future: On-Policy Self-Distillation for dLLMs

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmqhrcr2j01m1slf05emq2z0x
- Original link: https://arxiv.org/abs/2606.18195

## AI Summary

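d-OPSD is the first on-policy self-distillation framework proposed for diffusion large language models (dLLMs). Its core contributions are twofold: it uses self-generated answers as suffix conditioning, so that the student model learns from its own future experience; and it shifts supervision from the token level to the step level, in line with the iterative denoising process of dLLMs. On four reasoning benchmarks, d-OPSD consistently outperforms RLVR and SFT baselines while needing only about 10% of the optimization steps that RLVR requires, which indicates markedly better sample efficiency. The code is open source.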

## Main Text

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
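The abstract does not spell out how the inputs or the loss are built, so the following is only a toy sketch of one possible reading of the two ideas, not the authors' implementation. It assumes that the teacher sees the same masked response as the student plus the self-generated answer appended as a suffix, and that "step-level" supervision means averaging a teacher-to-student KL divergence within each denoising step before averaging across steps. The names `MASK`, `build_inputs`, `kl`, and `step_level_loss` are hypothetical and do not come from the paper or its repository.

```python
import math

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def build_inputs(prompt, response_len, answer):
    """Return (student_input, teacher_input) as token lists.

    Assumption: the student sees the prompt plus a fully masked response,
    while the teacher sees the same sequence with the self-generated
    answer appended as a suffix.
    """
    masked_response = [MASK] * response_len
    student_input = prompt + masked_response
    teacher_input = student_input + answer
    return student_input, teacher_input


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def step_level_loss(steps):
    """Average KL(teacher || student) per denoising step, then across steps.

    `steps` is a list with one entry per denoising step. Each entry is a
    list of (teacher_dist, student_dist) pairs, one pair per position that
    is predicted at that step. This grouping is an assumption.
    """
    per_step = []
    for pairs in steps:
        if not pairs:
            continue  # skip steps with no supervised positions
        per_step.append(sum(kl(t, s) for t, s in pairs) / len(pairs))
    return sum(per_step) / len(per_step) if per_step else 0.0


if __name__ == "__main__":
    student, teacher = build_inputs(["Q:", "2+2?"], 3, ["4"])
    print(student)
    print(teacher)
    toy_steps = [
        [([0.9, 0.1], [0.5, 0.5])],                          # step 1: one position
        [([0.5, 0.5], [0.5, 0.5]), ([0.2, 0.8], [0.3, 0.7])],  # step 2: two positions
    ]
    print(step_level_loss(toy_steps))
```

In a real dLLM the distributions would come from forward passes of the same model under the two conditionings; here they are hand-written lists, so the sketch only illustrates how the inputs and the per-step averaging could be arranged.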
