# Domino: A Speculative Decoding Framework That Decouples Causal Modeling from Autoregressive Drafting

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmpwqvwfq05cvslsnk7f1cw1x
- Original link: https://arxiv.org/abs/2605.29707

## AI Summary

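Domino is a speculative decoding framework for accelerating large language model inference. It decouples causal dependency modeling from the costly autoregressive drafting process. The framework first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, then applies a lightweight Domino head that refines those distributions with prefix-dependent causal information. To stabilize training, the paper proposes a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to 5.49× end-to-end speedup under the Transformers backend and up to 5.8× throughput speedup under SGLang serving.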

## Main Text

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
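
The two-stage drafting idea described above can be illustrated with a toy sketch. This is not the paper's implementation: `parallel_backbone`, `domino_head`, the "penalize the previous token" correction rule, the greedy `verify` step, and the linear `base_anchor_weight` schedule are all hypothetical stand-ins chosen only to show the control flow (one parallel pass for the whole block, then a cheap prefix-conditioned refinement per position, then verification against the target model's tokens).

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_backbone(context, block_size, vocab_size):
    """Hypothetical stand-in for the parallel draft backbone.

    A single call returns preliminary logits for every position in the
    block; here they are just pseudo-random values in [-1, 1].
    """
    rng = random.Random(sum(context) + 31 * len(context))
    return [[rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]
            for _ in range(block_size)]


def domino_head(base_logits, prefix_tokens):
    """Hypothetical lightweight correction using the drafted prefix.

    Toy rule (an assumption, not from the paper): penalize repeating the
    token drafted at the previous position.
    """
    corrected = list(base_logits)
    if prefix_tokens:
        corrected[prefix_tokens[-1]] -= 3.0
    return corrected


def draft_block(context, block_size, vocab_size):
    """Draft a block: one backbone pass, then cheap per-position refinement."""
    base = parallel_backbone(context, block_size, vocab_size)
    drafted = []
    for i in range(block_size):
        probs = softmax(domino_head(base[i], drafted))
        drafted.append(max(range(vocab_size), key=probs.__getitem__))
    return drafted


def verify(drafted, target_tokens):
    """Greedy verification (a simplification): count the longest prefix of
    drafted tokens that matches the target model's own choices."""
    accepted = 0
    for d, t in zip(drafted, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted


def base_anchor_weight(step, total_steps):
    """Assumed linear curriculum: weight on the backbone (base) loss decays
    from 1 to 0, so the loss could be w * L_base + (1 - w) * L_final."""
    return max(0.0, 1.0 - step / total_steps)


block = draft_block(context=[3, 1, 4], block_size=6, vocab_size=8)
print(block, verify(block, block[:3] + [-1]))
```

In this sketch the only sequential work is the loop inside `draft_block`, which calls the small head rather than a full drafter forward pass; that is the cost trade-off the abstract describes. A real system would typically replace the greedy `verify` with the target model's acceptance rule.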
