AI Infrastructure

Speculative Decoding, and How DeepSeek DSpark Made Inference Up to 5x Faster

DeepSeek open-sourced DSpark, and the internet rounded it up to “400x faster,” which overstates the gain by roughly two orders of magnitude. The real number is still excellent: 50 to 400 percent faster, no retraining, and output that is mathematically identical to the slow version. Here is the trick that makes that possible, explained without the hand-waving.

Jithin Kumar Palepu · June 27, 2026 · 12 min read

On June 27, 2026, DeepSeek open-sourced DSpark, a speculative decoding framework that makes its production V4 model generate text 60 to 85 percent faster, with no retraining, no weight changes, no new hardware, and output identical to the slow path. The number bouncing around social media, “400x faster,” is wrong. The honest figure is 50 to 400 percent, which is up to roughly five times the throughput. Less dramatic, still a very big deal.

Worth getting the units right, because the gap between “400x” and “400%” is the gap between magic and engineering. 400x would mean a response that took 40 seconds now takes a tenth of a second. That is not what happened, and no speculative method does that. 400% means the same response comes back up to about five times sooner under ideal conditions. That is what happened, and it is exactly the kind of win that quietly changes what agents and chat apps feel like to use.
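The unit conversion is easy to check by hand. Here is a minimal sketch of the arithmetic; the helper name is made up for illustration, and the 40-second response is the example from the paragraph above.

```python
def percent_faster_to_multiplier(percent_faster: float) -> float:
    """'N percent faster' means throughput is multiplied by 1 + N/100."""
    return 1.0 + percent_faster / 100.0


baseline_seconds = 40.0  # example response time from the text

# "400% faster": throughput x5, so the response takes one fifth as long.
print(baseline_seconds / percent_faster_to_multiplier(400))  # 8.0

# "400x faster": throughput x400.
print(baseline_seconds / 400)  # 0.1

# Low end of the reported range, "50% faster": throughput x1.5.
print(round(baseline_seconds / percent_faster_to_multiplier(50), 1))  # 26.7
```

So the best case turns 40 seconds into about 8, not into a tenth of a second.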

“400x” is a typo for “400%.” The truth is up to ~5x faster, with byte-for-byte identical output. You do not have to exaggerate this to be impressed by it.

Why is generating text slow at all?

A large language model generates one token at a time, and each token depends on every token before it. To write the tenth word it has to have already written the ninth, so the work is stubbornly sequential. The frustrating part is that this is not a compute problem. A modern GPU could do far more math than a single token requires. The bottleneck is memory: for every single token, the model has to stream its entire multi-hundred-billion-parameter weight set through the chip. You are paying the full memory cost of a giant model to produce one little token, then doing it again, and again.

That asymmetry is the opening. Verifying a batch of tokens at once costs almost the same memory trip as generating a single one, because the expensive part, hauling the weights through, happens once either way. So if you could somehow guess the next several tokens cheaply and then check them all in one pass, you would get many tokens for the price of one memory trip. That is the entire idea behind speculative decoding.
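A back-of-envelope version of that argument, as a rough sketch. The weight size and memory bandwidth below are hypothetical round numbers chosen to make the arithmetic concrete, not DeepSeek's figures, and the model idealizes by assuming a verification pass over several tokens costs the same as a pass over one.

```python
def pass_time_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one forward pass when streaming weights dominates."""
    return weight_bytes * 1000.0 / bandwidth_bytes_per_s


# Hypothetical hardware and model size (assumptions, not measured values).
weight_bytes = 100e9   # 100 GB of weights read per forward pass
bandwidth = 2e12       # 2 TB/s of memory bandwidth

one_pass = pass_time_ms(weight_bytes, bandwidth)
print(one_pass)  # 50.0


def ms_per_token(tokens_per_pass: float) -> float:
    """Amortized cost per token if one pass yields several accepted tokens."""
    return one_pass / tokens_per_pass


print(ms_per_token(1))  # 50.0  (plain decoding: one token per memory trip)
print(ms_per_token(5))  # 10.0  (five tokens accepted per memory trip)
```

The memory trip costs the same either way; the only thing that changes is how many tokens you get out of it.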

How speculative decoding works

You run two models. A small, fast draft model sprints ahead and proposes the next handful of tokens. Then the big, accurate target model checks all of those proposals in a single parallel pass. Every token the draft got right is accepted for free. The first token it got wrong is corrected by the target, and the draft starts sprinting again from there. You keep the big model's judgment but skip most of its sequential plodding.

draft model:   the  cat  sat  on  the  [mat]     ← proposes 6 tokens, fast
target model:  the  cat  sat  on  the  ____      ← verifies all 6 in ONE pass
               ✓    ✓    ✓    ✓   ✓    ✗(rug)    ← keeps 5, fixes the 6th
result:        5 drafted tokens accepted + 1 corrected token, for ~1 big-model step
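The diagram above can be written as a short loop. This is a toy sketch of the greedy case only, not DSpark's implementation: the "models" are hypothetical functions that recite a fixed sentence, and the verification loop stands in for what a real system does in one batched forward pass.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def recite(sentence: str) -> NextToken:
    """Toy stand-in for a model: always continues the same fixed sentence."""
    words = sentence.split()

    def model(context: List[Token]) -> Token:
        return words[len(context)] if len(context) < len(words) else "<eos>"

    return model


def speculative_step(context: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens added to the output."""
    # 1. The draft proposes k tokens, one after another (cheap).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real system scores all k
    #    positions in a single parallel pass; the loop is for readability.
    emitted: List[Token] = []
    for i, tok in enumerate(proposed):
        expected = target(context + proposed[:i])
        if tok != expected:
            emitted.append(expected)  # first miss: use the target's token
            return emitted
        emitted.append(tok)           # match: accepted for free

    # 3. Everything matched, so the same pass also yields one extra token.
    emitted.append(target(context + proposed))
    return emitted


target = recite("the cat sat on the rug")
draft = recite("the cat sat on the mat")
print(speculative_step([], draft, target, k=6))
# ['the', 'cat', 'sat', 'on', 'the', 'rug']
```

Five drafted tokens survive, the sixth is replaced by the target's own choice, and the result is exactly what the target would have written alone.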

The part people miss, and the reason this is safe to turn on in production, is that it is lossless. The acceptance test is built so the final output distribution is mathematically identical to what the target model would have produced on its own. With greedy decoding that means the same tokens; with sampling it means the same probabilities for every token. A wrong guess gets rejected, never smuggled into the answer. You are not trading quality for speed. You are getting the same answer, sooner. The only thing that varies is how much sooner, which depends on how often the little draft model guesses right.
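The lossless claim can be checked numerically. The sketch below uses the standard accept-or-resample rule from the speculative decoding literature: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. The source does not spell out DSpark's exact rule, so treat this as the textbook version. The two distributions are made-up examples, and the code assumes the draft gives nonzero probability to every token.

```python
from typing import Dict

Dist = Dict[str, float]


def accept_prob(p: Dist, q: Dist, token: str) -> float:
    """Probability that the target accepts a token drafted from q."""
    return min(1.0, p[token] / q[token])


def residual(p: Dist, q: Dist) -> Dist:
    """Resampling distribution after a rejection: normalized max(0, p - q)."""
    raw = {t: max(0.0, p[t] - q[t]) for t in p}
    total = sum(raw.values())
    if total == 0.0:          # p == q: nothing is ever rejected
        return raw
    return {t: v / total for t, v in raw.items()}


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the emitted token under accept/resample."""
    p_reject = sum(q[t] * (1.0 - accept_prob(p, q, t)) for t in q)
    res = residual(p, q)
    return {t: q[t] * accept_prob(p, q, t) + p_reject * res[t] for t in p}


# Hypothetical next-token distributions (assumptions for illustration).
target_p = {"mat": 0.5, "rug": 0.3, "sofa": 0.2}
draft_q = {"mat": 0.7, "rug": 0.2, "sofa": 0.1}

out = output_distribution(target_p, draft_q)
print({t: round(v, 6) for t, v in out.items()})
# {'mat': 0.5, 'rug': 0.3, 'sofa': 0.2}
```

The draft is overconfident about "mat", yet the emitted distribution matches the target's exactly. A worse draft changes how often tokens are rejected, not what comes out.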

So what did DSpark actually add?

Speculative decoding is not new. The reason DSpark is news is that it pushes acceptance length higher than the previous best methods, and it does so with a clever three-part draft design built with researchers from Peking University. Here is the stack without the jargon.

DFlash: a parallel backbone that guesses all positions at once

Instead of a tiny draft model that itself decodes one token at a time, DFlash produces hidden states for many positions in parallel. That makes the drafting step fast and gives it high accuracy on the very next token. The weakness is that the further out it guesses, the shakier the guess gets.

Why it matters — Great at the first guess, but its later guesses drift, a problem the team calls suffix decay.

A Markov head: cheap correlation between neighbors

On top of DFlash sits an extremely lightweight sequential head that injects the dependency between adjacent tokens, the thing a fully parallel guess loses. It uses a low-rank factorization (rank 256) so it stays nearly free even across a huge vocabulary. This is what props up the later, drift-prone guesses and lifts the overall acceptance length.

Why it matters — The fix for suffix decay, bought with a table lookup and a single matrix multiply.
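To see why a low-rank head stays cheap, here is a rough sketch of one way such a head could look. Only the rank of 256 and the "table lookup plus one matrix multiply" shape come from the text. The class name, the way the correction is added to the parallel logits, the random weights, and the 128,000-token vocabulary are all assumptions for illustration.

```python
import random
from typing import List


def full_table_params(vocab_size: int) -> int:
    """A dense previous-token -> next-token table needs V * V entries."""
    return vocab_size * vocab_size


def low_rank_params(vocab_size: int, rank: int) -> int:
    """Low-rank version: a V x r lookup table plus an r x V projection."""
    return 2 * vocab_size * rank


class MarkovHead:
    """Hypothetical low-rank head: adjusts logits using the previous token."""

    def __init__(self, vocab_size: int, rank: int, seed: int = 0) -> None:
        rnd = random.Random(seed)
        # Random weights stand in for trained ones.
        self.prev_table = [[rnd.gauss(0.0, 0.02) for _ in range(rank)]
                           for _ in range(vocab_size)]
        self.out_proj = [[rnd.gauss(0.0, 0.02) for _ in range(vocab_size)]
                         for _ in range(rank)]

    def adjust(self, parallel_logits: List[float], prev_token_id: int) -> List[float]:
        z = self.prev_table[prev_token_id]                 # table lookup
        bias = [sum(z[r] * self.out_proj[r][v] for r in range(len(z)))
                for v in range(len(parallel_logits))]      # one vector-matrix multiply
        return [logit + b for logit, b in zip(parallel_logits, bias)]


# Parameter counts for a hypothetical 128,000-token vocabulary at rank 256.
print(full_table_params(128_000))     # 16384000000
print(low_rank_params(128_000, 256))  # 65536000

# Tiny demo sizes so the pure-Python loops run instantly.
head = MarkovHead(vocab_size=50, rank=4)
adjusted = head.adjust([0.0] * 50, prev_token_id=7)
print(len(adjusted))  # 50
```

Under those assumed sizes the low-rank head is 250 times smaller than a dense transition table, which is what lets it carry neighbor-to-neighbor dependence at close to no cost.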

A confidence head + load-aware scheduler: guess harder when the GPU is bored

A confidence head scores each drafted token by how likely it is to survive verification. A scheduler then watches GPU utilization and verifies more speculative tokens when the hardware is idle and fewer when it is slammed. Speculative decoding costs a little extra compute to save a lot of memory trips; doing it adaptively means you spend that compute only when you actually have it to spare.

Why it matters — Speculation is adaptive, not fixed. It speeds up under light load and backs off under heavy load.

Earlier methods asked “how many tokens should we guess?” once and lived with the answer. DSpark re-asks it every step, based on how busy the GPU is right now.
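One way to picture the per-step decision is the sketch below. The policy is invented for illustration and is not DSpark's scheduler: the function name, the linear budget rule, and the survival threshold are all assumptions. It only shows the shape of the idea, which is that confidence scores and current load together decide how many drafted tokens go to verification.

```python
from typing import List


def tokens_to_verify(confidences: List[float], gpu_utilization: float,
                     max_draft: int = 8, min_survival: float = 0.3) -> int:
    """Pick how many drafted tokens to send to the target this step.

    confidences[i] is the confidence head's estimate that drafted token i
    survives verification; gpu_utilization is a load reading in [0, 1].
    """
    if not 0.0 <= gpu_utilization <= 1.0:
        raise ValueError("gpu_utilization must be between 0 and 1")

    # Load-aware budget: more spare capacity, more speculative tokens.
    budget = max(1, int(max_draft * (1.0 - gpu_utilization)))

    # Confidence cut-off: stop once the chance that the whole prefix
    # survives drops below the threshold.
    survival = 1.0
    count = 0
    for conf in confidences[:budget]:
        survival *= conf
        if survival < min_survival:
            break
        count += 1
    return count


scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.5, 0.4, 0.3]  # made-up confidences
print(tokens_to_verify(scores, gpu_utilization=0.1))  # 5
print(tokens_to_verify(scores, gpu_utilization=0.5))  # 4
print(tokens_to_verify(scores, gpu_utilization=0.9))  # 1
```

With the same draft, an idle GPU verifies five tokens (the confidence cut-off is what stops it), a half-loaded one is capped at four by the budget, and a slammed one checks only one.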

The honest numbers

Put together, DSpark reports meaningfully longer acceptance than the prior state of the art. On Qwen3-series targets (4B, 8B, 14B), average accepted length improves 26.7% to 30.9% over EAGLE3, the previous strong baseline, and 16.3% to 18.4% over DFlash alone. In production on DeepSeek-V4, that translates to:

Setting	Speedup vs MTP-1	Output quality
V4-Flash (per-user generation)	+60% to +85%	Identical
V4-Pro (per-user generation)	+57% to +78%	Identical
Reported range (throughput, ideal cases)	+50% to +400%	Identical

DeepSeek-reported figures, June 2026. MTP-1 is the baseline that speculates a single extra token per step using multi-token prediction.

The reason the speedup is a range and not one number is acceptance length again. On predictable, boilerplate-heavy text, the draft guesses right constantly and you ride the high end. On dense, surprising, high-entropy text, more guesses get rejected and you fall toward the low end. The “400%” ceiling is a best case under ideal conditions and light load, not a number you should expect on every prompt. Anyone quoting a single figure is quoting the best slice.
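The link between guess accuracy and tokens per step has a simple closed form in the standard analysis of speculative decoding. If each drafted token is accepted independently with probability alpha and the draft proposes k tokens, the expected number of tokens emitted per target pass is (1 - alpha^(k+1)) / (1 - alpha). The independence assumption is a simplification, and the alpha values below are illustrative, not DSpark measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass, counting the corrected or bonus token."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if alpha == 1.0:
        return float(k + 1)  # every guess accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, k=6), 2))
# 0.5 1.98
# 0.7 3.06
# 0.9 5.22
```

Tokens per step is an upper bound on the speedup, since drafting and verification are not free, but it shows why predictable text rides the high end of the range and high-entropy text falls toward the low end.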

You can use it: the DeepSpec toolkit

DSpark is the method; DeepSpec is the open-source, MIT-licensed codebase for training and evaluating your own speculative decoding drafters. The crucial detail is that it is not DeepSeek-only. The team trained and tested drafters against Qwen3 (4B, 8B, 14B) and tested Gemma4-12B offline, so the approach generalizes to other target models. Ready-made checkpoints, DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark, reuse the existing V4 weights with a draft module bolted on, which is exactly why no retraining of the big model is required.

The asterisk: no independent verification yet

Be a good skeptic. As of late June 2026, every number above comes from DeepSeek's own measurements. There is no third-party reproduction of the acceptance-length or production-speed claims published yet. The math behind speculative decoding being lossless is well established and not in doubt, so the “identical output” guarantee is solid. The exact magnitude of the speedup on your workload is the part to verify yourself, which, helpfully, the MIT license lets you do.

The bigger picture is where this fits. Model quality keeps climbing, but the next frontier is quietly the cost and speed of running these things, and that is where margins and user experience actually live. A cheaper, faster token changes the economics of everything built on top, which is the same force behind Anthropic pricing Claude Sonnet 5 so aggressively and behind the whole move toward leaner agent harnesses. When tokens get cheap and fast enough, you stop rationing them, and that is when agents start doing things that were previously too slow or too expensive to bother with.