SemiAnalysis @SemiAnalysis_

2026-06-30 22:30

AI summary

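JetSpec is a speculative decoding method that jointly optimizes drafting cost and drafting quality through causal parallel tree drafting, using a parallel draft tree and tree-causal verification. It reports a 9.64x end-to-end speedup on MATH-500 and a 4.58x speedup in open-ended chat, while remaining lossless. Combined with CUDA graphs and kernel optimizations, it reaches about 1000 TPS (tokens per second) on a single B200. SemiAnalysis looks forward to its deeper integration with the inference engines vLLM and SGLang.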
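
The summary names two generic ingredients, a draft tree and tree-causal verification. As a hedged illustration of those ideas only (this is not JetSpec's algorithm), the sketch below builds a tree-causal attention mask, where each draft token sees only its own ancestors, and then accepts the longest tree path that agrees with the target model's greedy choices. The parent-index tree encoding, the function names, and the assumption that the target's predictions for every node come from one forward pass are all illustrative assumptions. Under greedy decoding the result is identical to decoding with the target model token by token, which is one common sense of "lossless"; sampled decoding needs a rejection-sampling acceptance rule instead.

```python
# Hedged sketch of generic tree-based speculative verification (greedy acceptance).
# NOT JetSpec's implementation: the parent-index tree encoding and the function
# names are illustrative assumptions.
from typing import Dict, List, Sequence


def tree_causal_mask(parents: Sequence[int]) -> List[List[bool]]:
    """Build a tree-causal attention mask over draft-tree nodes.

    parents[i] is the index of node i's parent, or -1 if node i hangs directly
    off the already-verified prefix. mask[i][j] is True iff node j is node i or
    one of its ancestors, so each draft token attends only to its own branch
    (every node would also attend to the full prefix, not modeled here).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up toward the prefix, marking each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


def verify_greedy(
    tokens: Sequence[int],
    parents: Sequence[int],
    root_next: int,
    node_next: Sequence[int],
) -> List[int]:
    """Accept the longest draft path that matches the target's greedy choices.

    root_next:    target model's greedy token after the prefix alone.
    node_next[i]: target model's greedy token after prefix + path to node i
                  (assumed to come from one forward pass using the mask above).
    Returns the accepted draft tokens plus one bonus token from the target.
    """
    children: Dict[int, List[int]] = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    accepted: List[int] = []
    cur, want = -1, root_next
    while True:
        # Descend into the child whose draft token equals the target's choice.
        match = next((c for c in children.get(cur, []) if tokens[c] == want), None)
        if match is None:
            break
        accepted.append(tokens[match])
        cur, want = match, node_next[match]
    return accepted + [want]  # the target's own next token is always correct


if __name__ == "__main__":
    # Toy tree: two branches (5, 7) off the prefix; node 0 has children 9 and 2;
    # node 3 has child 4.
    parents = [-1, -1, 0, 0, 3]
    tokens = [5, 7, 9, 2, 4]
    print(verify_greedy(tokens, parents, root_next=5, node_next=[2, 8, 1, 4, 6]))
    # prints [5, 2, 4, 6]: three draft tokens accepted plus one bonus token
```

In this toy run one verification step emits four tokens instead of one. In general, a wider or deeper tree raises the chance that a long branch matches, but it also adds drafting and verification work per step.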

Parallel draft tree, tree-causal verification. Looking forward to its deeper integration with inference engines vLLM/SGLang! Great work @Lanxiang_Hu!

Hao AI Lab: Introducing JetSpec: we find speculative decoding can push LLM generation latency to extreme by co-optimizing drafting cost and drafting quality with causal par...

Inference · Papers/Research · Deployment/Engineering
