Speculative decoding (SD) techniques have proliferated recently. SD accelerates autoregressive generation by letting a lightweight draft model propose future tokens, while the target model verifies them in parallel.
Among recent efforts, DSpark and JetSpec emerged almost concurrently around the same bottleneck: once drafting becomes cheap, how do we preserve enough causal consistency for parallel proposals to survive verification?
This naturally raises the question: which one is better? Or, more interestingly, are they actually complementary?
The fact that both works converge on this direction suggests that causality is becoming a central lever for next-generation speculative decoding. They approach it from complementary sides of the throughput-latency frontier. DSpark targets high-concurrency serving: on Qwen3-8B and AIME25, it uses a causal recurrent state for confidence-scheduled verification and improves accepted length from 4.07 (DFlash) to 5.01 at budget 7. JetSpec targets the latency-oriented, compute-rich regime: by building causality directly into the parallel draft head, it turns larger draft budgets into longer accepted prefixes for low-latency generation. In the same setting, its accepted length scales from 7.23 at budget 16 to 9.82 at budget 128, compared with 7.34 for DFlash and 8.66 for DDTree at budget 128.
Causality in DSpark and JetSpec
Traditional drafters such as the EAGLE series often preserve draft quality through autoregressive generation, but this means longer drafts require more sequential draft steps. DFlash changes the cost structure: by using a lightweight block-parallel drafter to predict many future positions in one pass, it makes drafting cheap.
But cheap drafting is not enough. Once the draft cost drops, the bottleneck shifts to whether parallel proposals can survive verification. When future positions are only weakly conditioned on earlier draft tokens, each may appear plausible in isolation while the sequence as a whole is inconsistent. This is where causality becomes important.
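To make "surviving verification" concrete, here is a minimal sketch of prefix acceptance under greedy decoding. It is a simplification: real systems score all draft positions in one parallel target pass and often use rejection sampling rather than exact matching. The names `accepted_prefix_length` and `toy_target_next` are hypothetical and do not come from DSpark, JetSpec, or DFlash.

```python
from typing import Callable, List, Sequence


def accepted_prefix_length(
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
    context: Sequence[int],
) -> int:
    """Count leading draft tokens that match the target's greedy choice.

    Assumption: greedy (exact-match) verification. The loop is sequential
    only for readability; a real target model verifies positions in parallel.
    """
    prefix = list(context)
    accepted = 0
    for token in draft:
        if target_next(prefix) != token:
            break  # everything after the first mismatch is discarded
        prefix.append(token)
        accepted += 1
    return accepted


def toy_target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for a target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


# Draft token 8 is locally plausible (it follows 7), but 7 itself breaks
# the sequence after 5, so only the first two tokens survive.
print(accepted_prefix_length([4, 5, 7, 8], toy_target_next, [3]))  # 2
```

The toy draft shows the failure mode described above: a later token can look reasonable given its immediate neighbor and still be wasted because an earlier position was wrong.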
DSpark keeps the parallel drafting backbone cheap, while adding a lightweight sequential head and confidence estimation to better decide which proposals should be sent for verification, thereby controlling the per-request compute budget. As a result, DSpark consistently improves throughput over MTP-style pure autoregressive drafting, where longer drafts require more sequential draft steps (Figure 1).
On the other hand, under a latency-oriented Service Level Objective (SLO) with low concurrency, the system is more FLOPs-rich, so the goal shifts toward maximizing accepted tokens per verification step. In this regime, we can afford to spend more on draft compute to raise the acceptance rate and maintain high acceptance at deeper positions. This is where causal parallel drafting, as in JetSpec, becomes especially important: the draft budget is spent on generating a path-conditioned tree, making it more likely to produce long accepted prefixes.
How Causality Helps
Once drafting becomes cheap, the next question is how to spend limited compute intensity: should we squeeze more throughput under high concurrency, or push lower latency when more FLOPs are available per request? This is where causality becomes the key lever.
Pushing the Throughput Limit: DSpark for Budget-Aware Correction
DSpark targets the high-concurrency, budget-constrained regime. It uses a lightweight Markov-style correction head and a confidence head (or an RNN-head variant that carries recurrent prefix state across positions). For each draft position i, the parallel drafter first produces base logits z_i^0 and a corresponding draft hidden state h_i. The confidence head estimates a prefix-dependent confidence score c_i, and the Markov head then injects a small causal correction from the previous draft token to generate the corrected logits.
The verification budget is then scheduled by keeping only the longest confident prefix under budget B and threshold rho.
This makes it suitable for budget-aware serving: the draft backbone stays parallel, while the correction path improves local or prefix-dependent consistency.
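The scheduling rule can be sketched in a few lines. This is one plausible reading, not DSpark's confirmed implementation: it assumes a position stays in the prefix only while its confidence c_i is at least rho, and that the prefix is capped at B positions. The function name `schedule_prefix` and the example scores are hypothetical.

```python
from typing import Sequence


def schedule_prefix(confidences: Sequence[float], budget: int, rho: float) -> int:
    """Return how many leading draft positions to send for verification.

    Assumption: the prefix ends at the first position whose confidence
    falls below rho, and never exceeds the budget.
    """
    kept = 0
    for score in confidences[:budget]:
        if score < rho:
            break  # a low-confidence position ends the confident prefix
        kept += 1
    return kept


# Hypothetical confidence scores for one request.
scores = [0.95, 0.90, 0.60, 0.80]
print(schedule_prefix(scores, budget=7, rho=0.7))  # 2
print(schedule_prefix(scores, budget=1, rho=0.7))  # 1
```

Under this reading, a request with a weak third position spends only two verification slots, leaving the rest of the budget for other requests in the batch.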
Pushing the Latency Limit: JetSpec Turns Draft Budget into Higher Acceptance
With low concurrency, modern AI accelerators have more spare FLOPs, so the key question becomes: how do we translate a higher compute budget into more accepted tokens per draft-verification step? This is where JetSpec takes a different path. JetSpec uses a causal parallel draft head to produce a path-conditioned draft tree, where deeper nodes are conditioned on earlier tokens along the same branch.
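As an illustration of why a tree helps, the sketch below stores draft candidates as a trie and walks it along the target's greedy choices. It is a generic tree-verification toy under greedy matching, not JetSpec's draft head or tree construction, which the text does not specify. `DraftNode`, `add_path`, `longest_accepted_path`, and `toy_target_next` are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence


class DraftNode:
    """One node of a draft tree; the path from the root is its prefix."""

    def __init__(self, token: int) -> None:
        self.token = token
        self.children: Dict[int, "DraftNode"] = {}


def add_path(root: DraftNode, tokens: Sequence[int]) -> None:
    """Insert one candidate continuation, sharing common prefixes."""
    node = root
    for token in tokens:
        if token not in node.children:
            node.children[token] = DraftNode(token)
        node = node.children[token]


def longest_accepted_path(
    root: DraftNode,
    target_next: Callable[[List[int]], int],
    context: Sequence[int],
) -> List[int]:
    """Follow the branch that matches the target's greedy token at each depth."""
    prefix = list(context)
    path: List[int] = []
    node = root
    while True:
        wanted = target_next(prefix)
        child = node.children.get(wanted)
        if child is None:
            return path  # no branch continues with the target's token
        path.append(wanted)
        prefix.append(wanted)
        node = child


def toy_target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for a target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


root = DraftNode(token=3)  # root holds the last verified token
add_path(root, [4, 5, 6])
add_path(root, [4, 9])
add_path(root, [7, 8])
print(longest_accepted_path(root, toy_target_next, [3]))  # [4, 5, 6]
```

Spending more draft budget here means adding more branches, and each extra branch only helps if its deeper tokens are consistent with the tokens above them on the same path.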
The effect shows up clearly in the depth-wise acceptance profile (Figure 4). JetSpec consistently maintains higher acceptance than DFlash on both coding and math reasoning workloads.
On AIME25, JetSpec starts with near-perfect acceptance at draft depth 1 (q_1 at around 99%) and still maintains roughly 50% acceptance at depth 8 (q_8). Here q_i denotes the survival probability that at least the first i draft tokens are accepted. The empirical acceptance length follows from summing these survival probabilities over draft depth.
Under the constant per-token acceptance rate assumption used in the original speculative decoding analysis, with per-token rate alpha, q_i = alpha^i and the acceptance length becomes a geometric series in alpha.
We define alpha_eff as the value of alpha at which this theoretical acceptance length matches the empirical one.
This corresponds to an estimated effective per-token acceptance rate of about 93%, substantially higher than DFlash. In this low-cost, high-acceptance regime, even a 5% gain in per-token acceptance can have an outsized impact on speculative decoding: it significantly increases the maximum theoretical acceptance length (Figure 4), which in turn directly reduces generation latency.
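As a rough numerical check of this fit, the sketch below solves for alpha_eff by bisection. It assumes the acceptance-length formula from the original speculative decoding analysis, the sum of alpha^i for i from 0 to gamma (which counts the bonus token produced by the target), and a draft depth of gamma = 15. Neither gamma nor the function names come from the source; the measured length 9.82 is the JetSpec budget-128 figure quoted earlier.

```python
def expected_length(alpha: float, gamma: int) -> float:
    """Theoretical acceptance length under a constant per-token rate.

    Assumption: sum_{i=0}^{gamma} alpha^i, i.e. up to gamma draft tokens
    plus the bonus token from the target model.
    """
    return sum(alpha ** i for i in range(gamma + 1))


def fit_alpha_eff(measured_length: float, gamma: int, tol: float = 1e-9) -> float:
    """Find alpha in [0, 1] whose theoretical length matches the measurement."""
    if not 1.0 <= measured_length <= gamma + 1:
        raise ValueError("measured_length must lie between 1 and gamma + 1")
    lo, hi = 0.0, 1.0
    # expected_length is increasing in alpha, so bisection converges.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_length(mid, gamma) < measured_length:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


GAMMA = 15  # assumed draft depth, not stated in the text
alpha_eff = fit_alpha_eff(9.82, GAMMA)
print(f"alpha_eff = {alpha_eff:.4f}")

# Sensitivity: compare the fitted rate with one 5 points lower.
for alpha in (alpha_eff - 0.05, alpha_eff):
    print(f"alpha = {alpha:.3f} -> length {expected_length(alpha, GAMMA):.2f}")
```

Under these assumptions the fit lands at roughly 0.93, in line with the figure above, and lowering alpha by 5 points shortens the theoretical acceptance length by more than two tokens.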
Up Next: Enabling Both Throughput- and Latency-Oriented Parallel Drafting
A foreseeable next step is to build a dynamic serving framework that can push both ends of the throughput-latency Pareto frontier: low-concurrency settings that demand higher per-user tokens per second (TPS), and high-concurrency settings that require higher aggregate throughput under tight verification budgets.
In this direction, JetSpec and DSpark are naturally complementary: JetSpec strengthens the parallel drafting backbone for low-latency budget scaling, while DSpark adds lightweight sequential confidence checking and budget control for high-concurrency serving.