eazyware
Engineering · June 3, 2024 · 10 min read

Speculative decoding: 2-3x faster inference, same model quality

Draft model proposes tokens; main model verifies. 2-3x speedup on decoding with identical output distribution. Deployment patterns and pitfalls.

Kushal R.
Engineering lead

Speculative decoding is one of the cleverest inference optimizations in the modern LLM stack. A small draft model proposes several tokens; a large target model verifies them in a single parallel pass. When the draft is right, you get several tokens for roughly the cost of one target pass, which typically works out to 2-3x faster decoding. When it is wrong, you pay the normal cost plus a small draft overhead. The math works out to substantial speedups for most generation workloads. This post covers the mechanics, the deployment options in 2026, and when it earns its added complexity.

Speculative mechanics
Figure: Speculative decoding, draft + verify.
1. Draft model (small, fast, 1-2B params) proposes N tokens ahead as cheap predictions.
2. Main model (big, slow, 70B+) verifies all N in parallel and accepts or rejects them.
3. Accepted tokens are output and the draft resumes from there, with the same output distribution.
Why it works and what to know:
· Main model verifies N tokens in one forward pass, which is cheaper than N sequential passes.
· Acceptance rate depends on draft/main alignment: 50-80% typical.
· Best speedups when the draft model is trained on main model outputs (self-distillation).
· vLLM, TensorRT-LLM, and TGI all support it; 2-3x throughput gain is common.
Caption: Small draft model generates K tokens quickly. Large target model verifies all K in one parallel forward pass. Accepted tokens are retained; rejected tokens trigger backoff.

How it works

Traditional autoregressive decoding: large model predicts token, predicts next token, predicts next token. Sequential. Each prediction requires a full forward pass. Latency per token is the model's forward pass time.

Speculative decoding: small draft model predicts K tokens quickly (much cheaper per token). Large target model verifies all K tokens in a single forward pass. If all K predictions match what the large model would have generated, you get K tokens, plus one more that the verification pass produces anyway, for the cost of one large-model forward pass.

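When predictions disagree, you accept tokens up to the first mismatch and take the large model's own token at that position, which the verification pass has already computed. Drafting then resumes from there. Worst case: you wasted the draft model's work but still got one correct token from the pass, never incorrect output.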
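The draft-and-verify loop can be sketched in a few lines. This is a minimal illustration under greedy decoding, not how any particular serving stack implements it: `toy_target` and `toy_draft` are made-up stand-ins for real models, and the verification step loops over positions where a real stack would score all of them in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify round and return the newly emitted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target predicts the next token at each of the k+1 positions.
    #    (Looped here for clarity; real stacks do this in one forward pass.)
    target_preds = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept draft tokens up to the first mismatch, then append the
    #    target's own token: a correction on mismatch, a bonus otherwise.
    accepted: List[Token] = []
    for i in range(k):
        if proposed[i] != target_preds[i]:
            break
        accepted.append(proposed[i])
    accepted.append(target_preds[len(accepted)])
    return accepted


def generate(prompt: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Generate tokens speculatively; also return the number of target passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
        passes += 1
    return out[:len(prompt) + max_new_tokens], passes


def greedy_decode(prompt: List[Token], model: NextToken, n: int) -> List[Token]:
    """Plain one-token-per-pass decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


# Hypothetical toy models, for illustration only.
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] * 7 + 3) % 11


def toy_draft(prefix: List[Token]) -> Token:
    # Agrees with the target except when the prefix length is a multiple of 4.
    guess = toy_target(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 11


if __name__ == "__main__":
    spec, passes = generate([1], toy_draft, toy_target, k=4, max_new_tokens=20)
    print(spec == greedy_decode([1], toy_target, 20), passes)
```

The speculative output matches plain greedy decoding token for token, while the number of target passes is well below the number of generated tokens.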

Speedup math

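Expected speedup depends on draft model accuracy. Typical production: the draft model matches the target on 60-80% of tokens. With K=5 draft tokens per step, and treating each acceptance as independent with probability a, the expected yield per large-model pass is (1 - a^(K+1)) / (1 - a), counting the token the target always contributes. That gives roughly 2.4 tokens at 60% and 3.7 at 80%, versus 1 token per pass without speculation. After draft overhead, that lines up with the 2-3x wall-clock gains commonly reported.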

Draft model quality matters more than draft model size. Distilled models specifically trained to match target output work better than generic small models. Medusa heads (attached to target model) are another approach.

Cost: running the draft model adds inference cost, but draft models are typically 10-100x cheaper per token. Net cost impact is minimal.
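A back-of-the-envelope calculator makes the trade-off concrete. This is a sketch under two simplifying assumptions that are not from any benchmark: each draft token is accepted independently with the same probability `alpha`, and verifying K tokens costs about the same as one ordinary target pass (the memory-bound regime). The function names and the `draft_cost_ratio` parameter are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha, and that the target always contributes one token
    (a correction on mismatch, a bonus token when all k are accepted).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Rough wall-clock speedup over one-token-per-pass decoding.

    draft_cost_ratio is draft pass time divided by target pass time,
    e.g. 0.02 for a draft model that is 50x cheaper per token.
    """
    cost_per_round = k * draft_cost_ratio + 1.0  # k draft passes + 1 target pass
    return expected_tokens_per_pass(alpha, k) / cost_per_round


if __name__ == "__main__":
    for alpha in (0.6, 0.8):
        tokens = expected_tokens_per_pass(alpha, k=5)
        speedup = estimated_speedup(alpha, k=5, draft_cost_ratio=0.02)
        print(f"alpha={alpha}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x")
```

With K=5 and a draft model 50x cheaper than the target, this model gives about 2.2x at 60% acceptance and about 3.4x at 80%. Real acceptance is not independent across positions, so treat the numbers as a rough guide.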

Implementations in 2026

vLLM. Supports speculative decoding natively. Widely deployed. For self-hosted model serving, vLLM is a common default.

TensorRT-LLM. NVIDIA's optimized serving stack. Speculative decoding with multiple draft model options. Lower-level than vLLM but often faster.

Commercial providers. Some inference APIs use speculative decoding under the hood. You benefit automatically; nothing to configure. Together AI, Fireworks, and Anyscale are examples.

Frontier APIs (OpenAI, Anthropic, Google). Speculative decoding is likely used internally but isn't exposed as a user-facing option.

Technique variants

Medusa heads. Instead of a separate draft model, attach multiple parallel prediction heads to the target model. Each head predicts a future token. Simpler to train and deploy than separate draft models.

Lookahead decoding. Uses n-gram patterns from existing generation as speculation. No draft model needed. Works well for repetitive or structured outputs (code, JSON).
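To show why draft-free speculation suits repetitive output, here is a deliberately simplified n-gram lookup. It is closer to what is often called prompt-lookup drafting than to full lookahead decoding, which builds its n-gram pool differently. The function name and parameters are illustrative, and the proposed tokens would still be verified by the target model as usual.

```python
from typing import List, Sequence


def ngram_propose(tokens: Sequence[str], n: int = 2, k: int = 4) -> List[str]:
    """Propose up to k draft tokens without a draft model.

    Finds the most recent earlier occurrence of the last n tokens and
    copies whatever followed it. Returns [] when there is no match.
    """
    if len(tokens) < n:
        return []
    suffix = list(tokens[-n:])
    # Scan right to left, starting just before the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == suffix:
            return list(tokens[start + n:start + n + k])
    return []


if __name__ == "__main__":
    context = ["x", "=", "foo", "(", ")", ";", "y", "=", "foo"]
    print(ngram_propose(context, n=2, k=4))
```

In the example, the context ends with `= foo`, which appeared earlier followed by `( ) ; y`, so those four tokens become the speculation. On free-form prose the lookup rarely matches, and decoding falls back to one token per pass.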

Tree-based speculation. Draft model generates multiple candidate continuations; target model evaluates the tree; best path accepted. Better acceptance rate than linear speculation, at the cost of more compute per verification pass.
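A rough sketch of tree verification under greedy decoding follows. The nested-dict tree format and the `toy_next` model are assumptions made for illustration. Real implementations score every node of the tree in one forward pass using a tree-shaped attention mask, where this sketch calls the target once per level.

```python
from typing import Callable, Dict, List

Token = int
# Candidate tree: each key is a proposed token, each value the subtree after it.
Tree = Dict[Token, "Tree"]


def verify_tree(prefix: List[Token], tree: Tree,
                target: Callable[[List[Token]], Token]) -> List[Token]:
    """Follow the branch the target agrees with and return the emitted tokens."""
    accepted: List[Token] = []
    node = tree
    while node:
        choice = target(prefix + accepted)
        accepted.append(choice)
        if choice not in node:
            # No candidate matched: keep the target's token and stop.
            return accepted
        node = node[choice]
    # Reached a leaf with every step accepted: add the target's bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


# Hypothetical toy target, for illustration only.
def toy_next(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


if __name__ == "__main__":
    candidates: Tree = {4: {5: {}, 9: {}}, 7: {8: {}}}
    print(verify_tree([3], candidates, toy_next))
```

With two first-token candidates, the tree still contains the target's choice when the top draft guess would have missed, which is where the higher acceptance rate comes from.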

When speculative decoding earns its complexity

Self-hosted model serving with throughput or latency sensitivity. If you're running vLLM already, speculative decoding is a config change that often yields 2x throughput. Large upside for little effort.

High-volume API-backed systems. If you're paying for inference at scale and the provider supports speculation (or you can swap to one that does), the cost savings compound.

Not relevant if your inference cost or latency isn't a bottleneck. For most applications using OpenAI/Anthropic with moderate volume, you don't need to think about this — it's happening (or not) at the infrastructure level you don't control.

Quality preservation

Correctly implemented speculative decoding produces identical output distribution to non-speculative. It's a speed optimization, not a quality compromise. If your provider or serving stack supports it, enabling it should not regress quality.
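The "identical output distribution" claim rests on a rejection-sampling rule applied to each draft token when sampling rather than decoding greedily. The sketch below shows that rule for a single token over a tiny vocabulary and checks analytically that the result follows the target distribution. The distributions `p` (target) and `q` (draft) are made-up numbers, and the function names are illustrative.

```python
import random
from typing import List


def speculative_sample(p: List[float], q: List[float], rng: random.Random) -> int:
    """Return one token distributed according to p, using a proposal from q."""
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]       # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of speculative_sample's output, computed analytically."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]   # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    return [a + (reject_mass * r / z if z > 0 else 0.0)
            for a, r in zip(accept, residual)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]   # hypothetical target distribution
    q = [0.2, 0.5, 0.3]   # hypothetical draft distribution
    print(output_distribution(p, q))
```

For these example distributions the analytic output equals `p` up to floating-point rounding, even though the draft `q` disagrees with it substantially. A worse draft lowers the acceptance rate, not the output quality.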

Always verify with evals after any inference stack change. Implementation bugs exist; don't assume theoretical correctness means no regression. See eval post.
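One cheap check to run alongside a fuller eval suite is a direct output comparison at temperature 0. The harness below is a sketch: `baseline` and `candidate` are placeholders for whatever functions call your serving stack with speculation off and on. Exact-match comparison only makes sense under greedy decoding, and even then small numerical differences between configurations can cause occasional divergence, so treat mismatches as items to inspect, not automatic failures.

```python
from typing import Callable, List, Sequence, Tuple


def find_regressions(prompts: Sequence[str],
                     baseline: Callable[[str], str],
                     candidate: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (prompt, baseline_output, candidate_output) wherever outputs differ."""
    diffs: List[Tuple[str, str, str]] = []
    for prompt in prompts:
        expected, actual = baseline(prompt), candidate(prompt)
        if expected != actual:
            diffs.append((prompt, expected, actual))
    return diffs


if __name__ == "__main__":
    # Stand-in generators for illustration; replace with real client calls.
    def fake_baseline(prompt: str) -> str:
        return prompt.upper()

    def fake_candidate(prompt: str) -> str:
        return prompt.upper() if prompt != "edge case" else "DIFFERENT"

    for prompt, expected, actual in find_regressions(
            ["hello", "edge case"], fake_baseline, fake_candidate):
        print(f"{prompt!r}: expected {expected!r}, got {actual!r}")
```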

Speculative decoding is one of several techniques (flash attention, continuous batching, paged attention) that modern serving stacks combine for maximum throughput. See GPU hosting post. Most teams benefit from all of these simultaneously once they self-host at scale.
