Deepseek releases DSpark inference framework, boosting AI response speed by up to 85 percent

2026-06-30 16:28 · Matthias Bastian

AI summary

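Deepseek has introduced DSpark, an inference framework that uses speculative decoding: a small model generates candidate answers, a large model verifies them in batches, and several tokens are generated at a time instead of one, raising per-user response speed by 60 to 85 percent. The system dynamically adjusts verification depth based on confidence, reducing wasted computation. DSpark and the Deepseek-V4-Pro model (developed jointly with Peking University) are open-sourced on Hugging Face and GitHub under the MIT license. More efficient inference lowers demand for high-end chips, helping China and the EU get more AI performance under chip constraints, which is a strategic advantage in the short term.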

Original article

Deepseek's DSpark boosts AI speed by up to 85 percent, a strategic win under tightening US export controls

Matthias Bastian

Jun 30, 2026

Nano Banana Pro prompted by THE DECODER

Deepseek has released DSpark, a new method that boosts per-user response speed for its AI models by 60 to 85 percent, according to the company.

Most LLMs generate text one token at a time. That leads to low GPU utilization and long wait times for lengthy responses, Deepseek says. Its new framework, DSpark, uses speculative decoding, where a small, lightweight model proposes candidate tokens that the larger model then checks in batches. It also generates small groups of tokens instead of single tokens, boosting overall efficiency. A confidence-based system adjusts verification depth on the fly depending on compute load, cutting wasted processing on rejected token proposals.
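The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal toy illustration of greedy speculative decoding in general, not DSpark's implementation: `target_model`, `draft_model`, `max_draft`, and `min_conf` are hypothetical names, the "models" are simple arithmetic rules, and the confidence cutoff is only one assumed way to limit how many proposals get verified.

```python
from typing import Callable, List, Tuple

Token = int
# A toy "model" maps a token prefix to (next token, confidence in that token).
Model = Callable[[List[Token]], Tuple[Token, float]]


def target_model(prefix: List[Token]) -> Tuple[Token, float]:
    """Hypothetical large model: a deterministic rule standing in for greedy decoding."""
    return (sum(prefix) * 31 + len(prefix)) % 50, 1.0


def draft_model(prefix: List[Token]) -> Tuple[Token, float]:
    """Hypothetical small drafter: matches the target except at every 4th position."""
    token, _ = target_model(prefix)
    if len(prefix) % 4 == 3:
        return (token + 1) % 50, 0.3  # wrong guess, reported with low confidence
    return token, 0.9


def speculative_step(prefix: List[Token], draft: Model, target: Model,
                     max_draft: int = 8, min_conf: float = 0.0):
    """Run one decoding round; return (accepted tokens, number of drafted tokens)."""
    # Draft phase: propose tokens until the budget is used up or confidence is too low.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(max_draft):
        tok, conf = draft(ctx)
        if conf < min_conf:
            break
        proposed.append(tok)
        ctx.append(tok)

    # Verify phase: a real system scores all positions in one batched forward pass;
    # this sketch checks them in a loop instead.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected, _ = target(ctx)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted, len(proposed)
        accepted.append(tok)
        ctx.append(tok)
    # Every proposal was accepted (or none was made): the target adds one more token.
    accepted.append(target(ctx)[0])
    return accepted, len(proposed)


def generate(prompt: List[Token], n_tokens: int, draft: Model, target: Model, **kwargs):
    """Generate n_tokens; return (tokens, decoding rounds, total drafted tokens)."""
    out = list(prompt)
    rounds = drafted = 0
    while len(out) - len(prompt) < n_tokens:
        accepted, n_proposed = speculative_step(out, draft, target, **kwargs)
        out.extend(accepted)
        rounds += 1
        drafted += n_proposed
    return out[len(prompt):len(prompt) + n_tokens], rounds, drafted


def greedy(prompt: List[Token], n_tokens: int, target: Model) -> List[Token]:
    """Baseline: the target model alone, one token per step."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out)[0])
    return out[len(prompt):]


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(greedy(prompt, 20, target_model))
    print(generate(prompt, 20, draft_model, target_model))
    print(generate(prompt, 20, draft_model, target_model, min_conf=0.5))
```

In this toy setup the speculative output is identical to plain greedy decoding but needs fewer rounds of the large model, and raising `min_conf` reduces the number of drafted tokens that end up rejected. How DSpark actually scores confidence and ties verification depth to compute load is described in the paper, not here.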

Figure: Scatter plots of throughput (tokens per second per GPU) vs. per-user generation speed (TPS) for DeepSeek-V4-Flash and DeepSeek-V4-Pro under live traffic. DSpark (green) pushes the performance frontier for both throughput and interactivity well beyond the MTP baseline (blue), with throughput improvements of up to 661 percent and TPS gains of up to 85 percent. | Image: Deepseek

Deepseek also tested DSpark with open models from Google DeepMind (Gemma) and Alibaba (Qwen), suggesting the approach works broadly. The framework and Deepseek-V4-Pro model, developed jointly with Peking University, are available on Hugging Face and GitHub under the MIT license. Technical details are in the paper.

Table: Speculative decoding results across math, code, and chat benchmarks for Qwen3-4B, Qwen3-8B, Qwen3-14B, and Gemma4-12B. The DSpark drafter achieves the highest accepted token length per decoding round across all models and categories, outperforming the Eagle3 and DFlash drafters. | Image: Deepseek
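To see why accepted token length per decoding round is the metric that matters, the following sketch uses the standard back-of-the-envelope estimate from the speculative decoding literature. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per round, and that one drafter step costs `cost_ratio` times one step of the large model. The function names and the example numbers are illustrative assumptions, not figures from Deepseek's table.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens produced per round, including the target's own token.

    Assumes independent acceptance with probability alpha for each of gamma drafts.
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Tokens per round divided by the round's cost in units of one target step."""
    return expected_tokens_per_round(alpha, gamma) / (gamma * cost_ratio + 1)


if __name__ == "__main__":
    # Illustrative values only: compare a weaker and a stronger hypothetical drafter.
    for alpha in (0.6, 0.8, 0.9):
        tokens = expected_tokens_per_round(alpha, gamma=4)
        speedup = estimated_speedup(alpha, gamma=4, cost_ratio=0.05)
        print(f"alpha={alpha}: {tokens:.2f} tokens/round, ~{speedup:.2f}x")
```

Under these assumptions, a drafter whose proposals are accepted more often yields more tokens per large-model pass, which is why a longer accepted length translates into faster generation.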

Less chip pressure or faster scaling

This release matters strategically for China. Faster inference lowers chip requirements and cuts infrastructure costs. That's good news for China and potentially for the EU, both of which trail the US in data center buildout and high-performance chips.

But the Jevons paradox could kick in. More efficient inference does reduce chip demand per query. Yet the freed-up compute will likely get absorbed immediately by more AI requests, longer contexts, or new applications. Total chip demand could stay flat or even grow. Deepseek itself says that DSpark "enables performance tiers that were previously unattainable, shifting the Pareto frontier of our serving system."
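The arithmetic behind that argument is simple enough to sketch. The numbers below are purely illustrative assumptions (a hypothetical fleet of 1,000 chips and a 1.5x efficiency gain), not figures from Deepseek or from this article.

```python
def total_chip_demand(baseline_chips: float, efficiency_gain: float,
                      demand_growth: float) -> float:
    """Chips needed once per-query cost falls by efficiency_gain
    and query volume grows by demand_growth (both as multipliers)."""
    return baseline_chips * demand_growth / efficiency_gain


if __name__ == "__main__":
    # Hypothetical baseline of 1,000 chips and a 1.5x efficiency gain.
    for growth in (1.0, 1.5, 2.0):
        chips = total_chip_demand(1000, 1.5, growth)
        print(f"demand x{growth}: {chips:.0f} chips")
```

If usage stays flat, fewer chips are needed; if usage grows as fast as efficiency, demand is unchanged; if it grows faster, total demand rises despite the cheaper queries.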

Still, in the short term, these efficiency gains help China and the EU. They can squeeze more AI performance out of fewer high-end chips. Given tight chip supply and US export restrictions, that's a strategic advantage, reducing the US's ability to use chips as a geopolitical lever.

The Decoder: AI News (RSS)
