Hao AI Lab @haoailab

2026-06-26 03:18 · 7 days ago

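AI Summary

Sky Computing Lab introduces JetSpec, a speculative decoding method that jointly optimizes drafting cost and drafting quality through causal parallel tree drafting, pushing LLM generation latency to the extreme. It reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. Combined with CUDA graph and kernel optimizations, it achieves around 1000 TPS on a single B200.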

Introducing JetSpec: we find that speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting.

JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, these gains translate to around 1000 TPS on a single B200. ⚡️

Check out our project page for demos and a blog post on how we built it 👇 https://jetspec-project.github.io/jetspec-web/ https://haoailab.com/blogs/parallel-tree-decoding/

Inference · Paper/Research · Deployment/Engineering
