Residual Context Diffusion Language Models

2026-07-02 08:00

AI Summary

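Diffusion large language models (dLLMs) can decode multiple tokens in parallel, but existing block-wise dLLMs rely on a remasking mechanism that keeps only the most confident tokens and discards the rest, wasting computation. This paper proposes the Residual Context Diffusion (RCD) module, which converts the representations of discarded tokens into contextual residuals and injects them into the next denoising step, and which uses a decoupled two-stage training pipeline to bypass memory bottlenecks. The method is validated on long CoT reasoning (SDAR) and short CoT instruction-following (LLaDA) models, where a standard dLLM can be efficiently converted to RCD with only about 1 billion tokens. Across multiple benchmarks, RCD improves the accuracy of frontier dLLMs by 5–10 points with minimal extra computation, nearly doubles baseline accuracy on the most challenging AIME tasks, and needs up to 4–5x fewer denoising steps at equivalent accuracy.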

Research area: Speech and Natural Language Processing · Conference: ICML

Content type: paper · Published: July 2026

Authors: Yuezhou Hu†*, Harman Singh†*, Monishwaran Maheswaran†*, Haocheng Xi†, Coleman Hooper†, Jintao Zhang†, Aditya Tomar†, Michael W. Mahoney†, Sewon Min†, Mehrdad Farajtabar, Kurt Keutzer†, Amir Gholami†‡, Chenfeng Xu†‡

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a “remasking” mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction-following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with only ∼1 billion tokens. Across a wide range of benchmarks, RCD consistently improves frontier dLLMs by 5–10 points in accuracy with minimal computational overhead. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and requires up to 4–5x fewer denoising steps at equivalent accuracy levels.

† University of California, Berkeley
* Equal contribution
‡ Equal advising
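
To make the remasking step and the idea of recycling discarded predictions more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how a contextual residual is computed, so this sketch assumes it is a probability-weighted average of token embeddings, scaled by a hypothetical weight `alpha` and added to the mask embedding. The names `MASK`, `remask_step`, `residual_inputs`, `embed`, and `mask_vec` are all invented for this example.

```python
import math

MASK = -1  # hypothetical sentinel id for a still-masked position


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def remask_step(tokens, logits, k):
    """Commit the k most confident masked positions and remask the rest.

    Returns the updated tokens plus the predicted distributions at positions
    that stay masked, which a plain remasking decoder would throw away.
    """
    probs = {i: softmax(logits[i]) for i, t in enumerate(tokens) if t == MASK}
    # Rank masked positions by the probability of their top-1 token.
    ranked = sorted(probs, key=lambda i: max(probs[i]), reverse=True)
    out = list(tokens)
    for i in ranked[:k]:
        out[i] = probs[i].index(max(probs[i]))
    discarded = {i: probs[i] for i in ranked[k:]}
    return out, discarded


def residual_inputs(tokens, discarded, embed, mask_vec, alpha=0.5):
    """Build next-step input vectors (assumed form, not taken from the paper).

    Committed positions use their token embedding. Still-masked positions use
    the mask embedding plus alpha times the probability-weighted average of
    token embeddings from the discarded prediction.
    """
    inputs = []
    for i, t in enumerate(tokens):
        if t != MASK:
            inputs.append(list(embed[t]))
            continue
        vec = list(mask_vec)
        for tok_id, p in enumerate(discarded.get(i, [])):
            for d in range(len(vec)):
                vec[d] += alpha * p * embed[tok_id][d]
        inputs.append(vec)
    return inputs


# Toy example: 3-token vocabulary, 2-dimensional embeddings.
embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask_vec = [0.0, 0.0]
tokens = [MASK, MASK, 2]
logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 0.0, 0.0]]

tokens, discarded = remask_step(tokens, logits, k=1)
inputs = residual_inputs(tokens, discarded, embed, mask_vec)
```

In this toy run, position 0 is the most confident and gets committed, while position 1 stays masked. With plain remasking, position 1 would re-enter the next step as the bare mask embedding; in the sketch it instead carries a soft summary of the prediction that was just discarded.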

Related readings and updates.

Learning Unmasking Policies for Diffusion Language Models

July 2, 2026 · Research areas: Methods and Algorithms, Speech and Natural Language Processing · Conference: ICML

Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the sampling procedure that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token…

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

January 21, 2026 · Research area: Speech and Natural Language Processing · Conference: ICLR

Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding,…