DeepSeek 提出 DSpark,一种半并行推测解码系统,使 DeepSeek-V4 在相同吞吐量下每用户生成速度提升约 60% 至 85%。核心创新在于选择性验证:草稿模型并行生成多个候选 token,再由一个小型马尔可夫头根据前一个 token 微调每个猜测,弥补纯并行推测后段 token 组合质量下降的缺陷。置信度调度器基于接受概率和 GPU 负载,动态决定每个请求需验证的 token 数量,避免无效计算。
Fantastic, @deepseek_ai just published their new inference optimization method.
Proposes DSpark, a semi-parallel speculative decoding system that gave DeepSeek-V4 about 60% to 85% faster per-user generation at matched throughput.
The biggest idea in DSpark is that faster inference is not just about drafting more tokens, but about deciding which drafted tokens are worth checking.
Speculative decoding already had the basic trick: a smaller draft model guesses several next tokens, then the real model checks them in 1 pass.
The problem is that long draft blocks often waste work, because later guesses are more likely to be wrong, and checking bad guesses still uses GPU capacity.
DSpark's breakthrough is to make this process selective: it drafts a block, scores how likely each prefix is to survive, then verifies only the part that is likely to pay off.
The mechanism has 2 linked parts: a strong parallel draft model makes many token guesses quickly, then a tiny Markov head adjusts each guess using the token right before it.
That small sequential piece matters because pure parallel drafting are fast, but their later tokens decay because each position guesses without knowing what the earlier sampled token actually was.
i.e. Fully parallel drafters guesses every position too independently, which can create bad token combinations later in the block.
Then the confidence scheduler estimates how many drafted tokens should be checked for each request, based on both acceptance chance and current GPU load.