# Speculative Pipeline Decoding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmpvz0l3702w5slukt06fmqba
- Original link: https://arxiv.org/abs/2605.30852

## AI Summary

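The paper proposes SPD, a speculative decoding framework that accelerates decoding by partitioning the target large language model (LLM) into n pipeline stages so that n tokens are processed in parallel. SPD uses a speculation module that aggregates intermediate features across pipeline depths to predict the next token. The module runs strictly in parallel with the target model's pipeline step, which yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Experiments show that SPD's theoretical speedup is significantly higher than that of mainstream baselines.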

## Main Text

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into n pipeline stages, SPD allows the LLM to process n tokens in parallel to accelerate decoding. To continuously fill the pipeline in single-sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token. It executes strictly in parallel with the target model's pipeline step, which yields bounded difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup compared to mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
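The pipeline-filling idea can be illustrated with a toy step counter. This is only a sketch under stated assumptions, not the authors' implementation. It assumes every stage costs one unit per pipeline step, the speculation module's cost is fully hidden behind that step, and a wrong guess is discovered when its predecessor leaves the last stage, at which point all in-flight tokens are discarded. The names `spd_steps` and `toy_speedup`, and the caller-supplied `guesses_ok` flags, are hypothetical.

```python
from collections import deque
from typing import Iterable


def spd_steps(num_tokens: int, n_stages: int, guesses_ok: Iterable[bool]) -> int:
    """Count pipeline steps needed to produce `num_tokens` verified tokens.

    Toy model (assumption): guesses_ok[k] says whether the k-th speculated
    token matches what the target model would have produced. Missing
    entries are treated as wrong guesses.
    """
    ok = iter(guesses_ok)
    in_flight = deque([True])  # the first token is known, not speculated
    committed = 0
    steps = 0
    while committed < num_tokens:
        steps += 1  # every stage advances the token it currently holds
        if len(in_flight) == n_stages:
            in_flight.popleft()  # oldest token leaves the last stage
            committed += 1       # its output is one verified token
            if not in_flight or not in_flight[0]:
                # Wrong guess (or nothing queued): discard dependants and
                # restart the pipeline from the verified token.
                in_flight.clear()
                in_flight.append(True)
                continue
        # A newly speculated token enters the first stage for the next step.
        in_flight.append(next(ok, False))
    return steps


def toy_speedup(num_tokens: int, n_stages: int, guesses_ok: Iterable[bool]) -> float:
    """Ratio of unpipelined cost (n_stages steps per token) to pipelined cost."""
    baseline = num_tokens * n_stages
    return baseline / spd_steps(num_tokens, n_stages, guesses_ok)
```

In this toy model, 10 tokens on 4 stages take 13 steps when every guess is right (4 steps to fill the pipeline, then one token per step) and 40 steps when every guess is wrong, which equals the unpipelined cost. Real speedups also depend on stage imbalance, communication, and the speculation module's actual cost, none of which are modelled here.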
