# JetSpec: Pushing LLM Generation Latency to the Extreme with Causal Parallel Tree Drafting for Speculative Decoding

- Source: Hao AI Lab (@haoailab)
- Published: 2026-06-26 03:18
- AIHOT score: 52
- AIHOT link: https://aihot.virxact.com/items/cmqtw7cks07g5sl0elpckdwje
- Original link: https://x.com/haoailab/status/2070225035403694408

## AI Summary

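Sky Computing Lab introduces JetSpec, a speculative decoding method that jointly optimizes drafting cost and drafting quality through causal parallel tree drafting, pushing LLM generation latency to the extreme. It achieves up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. Combined with CUDA graph and kernel optimizations, it reaches about 1000 TPS (tokens per second) on a single B200.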

## Main Text

Introducing JetSpec: we find that speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality with causal parallel tree drafting.
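The post does not spell out how causal parallel tree drafting itself works. As background only, here is a minimal sketch of the generic verification step that makes tree-based speculative decoding lossless under greedy decoding: the target model checks a tree of drafted tokens, accepts the longest root-to-leaf path that agrees with its own greedy choices, and then appends one token of its own. This is not JetSpec's drafting algorithm, and it covers greedy decoding only (lossless sampling needs a rejection-sampling rule instead). `DraftNode`, `verify_tree`, `greedy_decode`, and the `target_next_token` callable are hypothetical names; a real system scores every tree node in one target forward pass with a tree attention mask, whereas this sketch queries the target one position at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: maps a context (tuple of token ids) to the target
# model's greedy next token.
TargetFn = Callable[[Tuple[int, ...]], int]


@dataclass
class DraftNode:
    """One drafted token; children are alternative continuations after it."""
    token: int
    children: List["DraftNode"] = field(default_factory=list)


def verify_tree(roots: Sequence[DraftNode], target_next_token: TargetFn,
                prefix: Sequence[int]) -> List[int]:
    """Accept the longest draft path matching the target's greedy tokens,
    then append one token chosen by the target itself."""
    context = list(prefix)
    accepted: List[int] = []
    candidates = roots
    while True:
        # In a real system the target predictions for all tree nodes come
        # from one batched forward pass; here we query them one at a time.
        target_tok = target_next_token(tuple(context))
        match = next((n for n in candidates if n.token == target_tok), None)
        if match is None:
            # No branch agrees (or a leaf was reached): the target's own
            # token is the correction/bonus token, and this step ends.
            accepted.append(target_tok)
            return accepted
        accepted.append(match.token)
        context.append(match.token)
        candidates = match.children


def greedy_decode(target_next_token: TargetFn, prefix: Sequence[int],
                  n: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    context = list(prefix)
    out: List[int] = []
    for _ in range(n):
        tok = target_next_token(tuple(context))
        out.append(tok)
        context.append(tok)
    return out


if __name__ == "__main__":
    # Toy "target model": always predicts (last token + 1) mod 10.
    def toy_target(ctx: Tuple[int, ...]) -> int:
        return (ctx[-1] + 1) % 10

    tree = [DraftNode(2, [DraftNode(3, [DraftNode(9)]), DraftNode(4)]),
            DraftNode(5)]
    step = verify_tree(tree, toy_target, [1])
    # Lossless check: identical to plain greedy decoding of the same length.
    assert step == greedy_decode(toy_target, [1], len(step))
    print(step)  # → [2, 3, 4]
```

Here one verification round yields three tokens: two accepted from the `2 → 3` branch plus the target's own `4`, where plain decoding would need three separate target steps.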

JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, JetSpec further translates this into around 1000 TPS on a single B200. ⚡️
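End-to-end speedups like these depend on how many tokens each verification step yields relative to what the step costs, which is why drafting cost and drafting quality have to be tuned together. The back-of-the-envelope model below is a sketch under stated assumptions, not JetSpec's analysis: it assumes a fixed draft time and verify time per step, and the example numbers are made up, expressed in units of one ordinary target decoding step. The helper name `estimated_speedup` is hypothetical.

```python
def estimated_speedup(mean_tokens_per_step: float, draft_time: float,
                      verify_time: float, baseline_step_time: float) -> float:
    """Speculative tokens/sec divided by plain autoregressive tokens/sec.

    mean_tokens_per_step: average tokens emitted per verification step
        (accepted draft tokens plus the target's own token).
    draft_time, verify_time: cost of one draft + verify round.
    baseline_step_time: cost of one ordinary target decoding step.
    All times use the same (arbitrary) unit.
    """
    spec_tokens_per_time = mean_tokens_per_step / (draft_time + verify_time)
    baseline_tokens_per_time = 1.0 / baseline_step_time
    return spec_tokens_per_time / baseline_tokens_per_time


if __name__ == "__main__":
    # Illustrative numbers only. Assumption: verifying a small draft tree
    # costs about one ordinary decoding step (memory-bound regime).
    cheap_draft = estimated_speedup(6, draft_time=0.2, verify_time=1.0,
                                    baseline_step_time=1.0)
    costly_draft = estimated_speedup(7, draft_time=0.6, verify_time=1.0,
                                     baseline_step_time=1.0)
    print(f"cheap draft: {cheap_draft:.2f}x, costlier draft: {costly_draft:.2f}x")
```

With these invented numbers, the drafter that gets 7 tokens per step but costs 0.6 units ends up slower (roughly 4.4x) than one that gets 6 tokens at 0.2 units (5x), illustrating the cost-versus-quality tension. For scale, 1000 TPS corresponds to roughly 1 ms per generated token.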

Check out our project page for demos and a blog post on how we built it 👇
https://jetspec-project.github.io/jetspec-web/
https://haoailab.com/blogs/parallel-tree-decoding/
