SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention
We are excited to announce that SGLang supports DeepSeek-V3.2 on Day 0! According to the DeepSeek tech report, it equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. For more details about upcoming features, please check our Roadmap.
Installation and QuickStart
To get started, simply pull the container and launch SGLang as follows:
At the heart of DeepSeek-V3.2 is DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that redefines long-context efficiency.
Instead of performing quadratic full attention over all tokens, DSA introduces:
Lightning Indexer (ultra-light FP8 scorer) to identify the most relevant tokens for each query.
Top-k Token Selection to focus computation only on the most impactful key-value entries.
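As a rough illustration of the two components above, the following pure-Python sketch scores every key with a cheap indexer, keeps the top-k positions, and runs softmax attention over only those positions. The function names, the ReLU-weighted form of the indexer score, and the toy vectors are assumptions made for this example; causal masking, FP8 arithmetic, and batching are omitted, and this is not SGLang's or DeepSeek's actual kernel code.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def indexer_scores(index_query_heads, head_weights, index_keys):
    """Cheap relevance score per key: weighted sum of ReLU(q_head . key).

    The exact scoring form is an assumption for this sketch.
    """
    return [
        sum(w * max(0.0, dot(q, key))
            for q, w in zip(index_query_heads, head_weights))
        for key in index_keys
    ]


def sparse_attention(query, keys, values, index_scores, k):
    """Attend only to the k positions with the highest indexer scores."""
    k = min(k, len(keys))
    # Top-k token selection over positions, ranked by indexer score.
    selected = heapq.nlargest(k, range(len(keys)),
                              key=lambda i: index_scores[i])
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected))
           for d in range(dim)]
    return out, sorted(selected)


# Toy usage: three cached tokens, keep only the single best-scoring one.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out, picked = sparse_attention([1.0, 0.0], keys, values, [0.1, 0.9, 0.5], k=1)
```

With k equal to the number of cached tokens, the sketch degenerates to ordinary dense attention, which is a convenient sanity check.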
This design reduces the complexity of core attention from O(L^2) to O(Lk), where L is the context length and k (much smaller than L) is the number of selected tokens. This delivers dramatic improvements in both training and inference efficiency at context lengths up to 128K, with negligible loss of model quality.
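To make the O(L^2) versus O(Lk) comparison concrete, the sketch below counts how many query-key pairs core attention has to process with and without top-k selection, assuming causal masking. The value k = 2048 is an illustrative assumption rather than a figure taken from this post, and the indexer's own scoring cost is deliberately not counted.

```python
def dense_pairs(seq_len):
    """Query-key pairs under causal full attention: token t sees t keys."""
    return seq_len * (seq_len + 1) // 2


def sparse_pairs(seq_len, top_k):
    """Query-key pairs when each query attends to at most top_k keys."""
    k = min(top_k, seq_len)
    # The first k tokens see fewer than k keys; the rest see exactly k.
    return k * (k + 1) // 2 + (seq_len - k) * k


SEQ_LEN = 128 * 1024  # 128K context
TOP_K = 2048          # assumed selection budget, for illustration only

reduction = dense_pairs(SEQ_LEN) / sparse_pairs(SEQ_LEN, TOP_K)
print(f"core attention work reduced by about {reduction:.1f}x")
```

Under these assumptions the core attention work at 128K shrinks by roughly a factor of 32, and the gap widens linearly as the context grows while k stays fixed.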
To support this breakthrough, SGLang implements and integrates:
Lightning Indexer Support – with a dedicated key&key_scale cache in the memory pool for ultra-fast token scoring.
Native Sparse Attention (NSA) Backend – a new backend purpose-built for sparse workloads, featuring:
FlashMLA (DeepSeek's optimized Multi-head Latent Attention kernel)
FlashAttention-3 Sparse (adapted for compatibility and maximum kernel reuse)
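The idea of caching a quantized key together with its scale can be sketched in a few lines. Everything here is hypothetical: the class name, the per-token scale granularity, and the use of an int8 range as a stand-in for FP8 (plain Python has no FP8 type) are assumptions chosen for illustration, not a description of SGLang's memory pool.

```python
class IndexKeyCache:
    """Toy cache that stores quantized index keys next to their scales."""

    QMAX = 127  # int8 range, used here only as a stand-in for FP8

    def __init__(self):
        self.keys = []        # quantized integer vectors, one per token
        self.key_scales = []  # one float scale per token

    def append(self, key):
        # Per-token symmetric quantization: scale so the largest
        # magnitude maps to QMAX.
        amax = max(abs(x) for x in key)
        scale = amax / self.QMAX if amax > 0 else 1.0
        self.keys.append([round(x / scale) for x in key])
        self.key_scales.append(scale)

    def score(self, query):
        """Approximate query-key dot products from the quantized keys."""
        return [
            scale * sum(q * x for q, x in zip(query, qkey))
            for qkey, scale in zip(self.keys, self.key_scales)
        ]


cache = IndexKeyCache()
for key in ([1.0, -0.5], [0.1, 0.2], [3.0, 1.0]):
    cache.append(key)
approx_scores = cache.score([1.0, 1.0])
```

Keeping the scale next to each key lets the scorer work on compact low-precision data and recover approximate magnitudes with a single multiply per key.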
Together, these innovations enable DeepSeek-V3.2-Exp to deliver GPU-optimized sparse attention and dynamic cache management, cutting memory overhead while scaling seamlessly to 128K contexts.
The result is a runtime that preserves state-of-the-art reasoning quality while dramatically lowering inference costs, making long-context LLM deployment practical at scale.
Future Work
Future work will be tracked here. Planned items include:
Multi-token prediction (MTP) support: MTP will speed up decoding, especially at small batch sizes.
FP8 KV Cache: Compared to the traditional BF16 KV cache, this will almost double the number of tokens that fit in the KV cache and halve the memory access pressure of attention kernels, making it possible to serve longer or more requests faster.
TileLang support: TileLang kernels are useful for flexible development.
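The "almost double" figure for the FP8 KV cache item above follows from simple byte counting: FP8 uses one byte per element instead of BF16's two, but quantized entries typically carry some extra scale metadata. The sketch below works through that arithmetic. The per-token element count and the scale overhead are hypothetical values picked for illustration, not SGLang's actual cache layout.

```python
def bytes_per_token(num_elems, bytes_per_elem, scale_bytes=0):
    """Bytes of KV cache one token occupies in one layer."""
    return num_elems * bytes_per_elem + scale_bytes


def tokens_that_fit(budget_bytes, per_token_bytes):
    return budget_bytes // per_token_bytes


NUM_ELEMS = 576      # assumed cached elements per token per layer
SCALE_BYTES = 16     # assumed per-token scale metadata for FP8
BUDGET = 1 << 30     # 1 GiB of cache memory for one layer

bf16 = bytes_per_token(NUM_ELEMS, 2)
fp8 = bytes_per_token(NUM_ELEMS, 1, SCALE_BYTES)

capacity_gain = tokens_that_fit(BUDGET, fp8) / tokens_that_fit(BUDGET, bf16)
print(f"FP8 fits about {capacity_gain:.2f}x as many tokens as BF16")
```

Under these assumptions the gain lands just below 2x, with the shortfall coming entirely from the scale metadata.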
Acknowledgments
We sincerely thank the DeepSeek team for their outstanding contributions to open model research, which have greatly benefited the open-source community, as well as for their highly efficient kernels that are now integrated into the SGLang inference engine.
From the SGLang community, we thank Tom Chen, Ziyi Xu, Liangsheng Yin, Biao He, Baizhou Zhang, Henry Xiao, Hubert Lu, Wun-guo Huang, Zhengda Qin and Fan Yin for their contributions to DeepSeek-V3.2-Exp support.
We also thank NVIDIA, AMD, and Nebius Cloud for sponsoring the GPU machines used in the development of this work.