多块扩散语言模型

2026-06-30 08:00·2天前

AI 摘要

MBD-LMs通过多块教师强制（MultiTF）后训练块扩散语言模型（BD-LMs）得到。MultiTF结合教师强制与扩散强制，在干净前缀上训练有界噪声组，采用随机噪声调度器匹配多块扩散推理状态。基于Block Buffer的优化解码实现前缀缓存复用和输入形状静态化，将更高并行度转为实际加速。MBD-LLaDA2-Mini的TPF从3.47提升至6.19，准确率从79.95%提升至81.03%；结合DMax后TPF

原文 · 未翻译

Block Diffusion Language Models (BD-LMs) improve diffusion-based text generation with KV caching and flexible-length generation. A natural next step is to extend them from Single-Block Diffusion (SingleBD) to Multi-Block Diffusion (MultiBD), where a running-set of consecutive blocks is decoded concurrently for inter-block parallelism. However, existing BD-LMs are mostly trained under teacher forcing, where the model observes only one noisy block conditioned on a clean prefix. While the recent diffusion forcing strategy introduces visibility among multiple noisy blocks, its training states still differ from MultiBD inference, where decoding operates on a bounded running-set with heterogeneous slot-wise noise patterns. To bridge this gap, we propose Multi-Block Diffusion Language Models (MBD-LMs), obtained by post-training BD-LMs with Multi-block Teacher Forcing (MultiTF). MultiTF integrates teacher forcing and diffusion forcing by training on bounded noise-groups conditioned on clean prefixes, with randomized noise-schedulers that better match MultiBD inference states. To make MultiBD practically executable, we further introduce an optimized decoding algorithm based on the Block Buffer mechanism that preserves prefix-cache reuse, keeps input shapes static, and translates increased decoding parallelism into wall-clock acceleration. Empirically, MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%; when combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.

HuggingFace Daily Papers（社区热门论文）

40导出 Markdown

多块扩散语言模型

2026-06-30 08:00·2天前

阅读原文· arxiv.org

AI 摘要

原文 · 保持原样，未翻译