AI Summary
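JetSpec is a speculative decoding method that co-optimizes drafting cost and drafting quality through causal parallel tree drafting, using a parallel draft tree and tree-causal verification. It achieves a 9.64x end-to-end speedup on MATH-500 and a 4.58x speedup in open-ended chat while remaining lossless. Combined with CUDA graphs and kernel optimizations, a single B200 can reach roughly 1000 TPS. SemiAnalysis looks forward to its deeper integration with the inference engines vLLM and SGLang.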
Parallel draft tree, tree-causal verification. Looking forward to its deeper integration with inference engines vLLM/SGLang! Great work @Lanxiang_Hu!
Introducing JetSpec: we find speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal par...
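
For readers unfamiliar with the terms "draft tree" and "tree-causal verification", here is a minimal sketch of the generic idea as it appears in tree-based speculative decoding in general. It is not JetSpec's implementation, which the posts above do not describe. The function names (`tree_causal_mask`, `greedy_accept`), the parent-index encoding of the tree, and the greedy acceptance rule are all assumptions made for illustration. The mask lets each draft node attend only to itself and its ancestors, so one forward pass of the target model can score every branch; the acceptance walk then keeps the branch that matches what the target model would have produced, which is why the output stays lossless under greedy decoding.

```python
from typing import Dict, List


def tree_causal_mask(parents: List[int]) -> List[List[bool]]:
    """Build an attention mask for a draft tree (illustrative, not JetSpec's code).

    parents[i] is the index of node i's parent, or -1 for the root.
    Nodes must be listed so that every parent comes before its children.
    mask[i][j] is True when node i may attend to node j (j is i or an ancestor).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        if p >= i:
            raise ValueError("parents must precede their children")
        mask[i][i] = True
        if p >= 0:
            # Inherit everything the parent can see.
            for j in range(n):
                if mask[p][j]:
                    mask[i][j] = True
    return mask


def greedy_accept(parents: List[int], tokens: List[int],
                  target_next: List[int]) -> List[int]:
    """Walk the tree from the root, keeping children that match the target model.

    tokens[i] is the token held by node i (node 0 is the last committed token).
    target_next[i] is the target model's greedy next token after node i's path
    (hypothetical input; in practice it comes from the masked forward pass).
    Returns the newly committed tokens, including the final token from the target.
    """
    children: Dict[int, List[int]] = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    node, accepted = 0, []
    while True:
        want = target_next[node]
        accepted.append(want)  # the target's own token is always kept
        match = next((c for c in children.get(node, []) if tokens[c] == want), None)
        if match is None:
            return accepted
        node = match


# Example: a root with two children, each with further drafts.
parents = [-1, 0, 0, 1, 1, 2]
tokens = [10, 11, 12, 13, 14, 15]
target_next = [11, 14, 0, 0, 99, 0]

mask = tree_causal_mask(parents)
committed = greedy_accept(parents, tokens, target_next)
```

In this toy example the walk follows nodes 0, 1, and 4, so three tokens are committed after a single verification pass. The speedup of any real system depends on how cheaply the tree is drafted and how often its branches are accepted, which is the cost/quality trade-off the post refers to.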