Speculative decoding is one of the cleverest inference optimizations in the modern LLM stack. A small draft model predicts tokens; a large target model verifies them in parallel. When the draft is mostly right, decoding can run 2-4x faster. When it is wrong, you pay roughly the normal cost plus the draft model's overhead. The math works out to substantial speedups for many generation workloads. This post covers the mechanics, the deployment options in 2026, and when it earns its added complexity.
How it works
Traditional autoregressive decoding: the large model predicts one token, then the next, then the next. The process is sequential, and each token requires a full forward pass, so latency per token is the model's forward-pass time.
Speculative decoding: a small draft model predicts K tokens quickly (much cheaper per token). The large target model then verifies all K tokens in a single forward pass. If all K predictions match what the large model would have generated, you get K tokens, plus one bonus token the target computes at the end of that same pass, for roughly the cost of one large-model forward pass and K cheap draft passes.
When predictions disagree, you accept tokens up to the first mismatch, take the target model's own token at that position (the verification pass has already computed it), and resume drafting from there. Worst case: you wasted the draft model's work, but the output is still what the target model alone would have produced.
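The draft-then-verify loop above can be sketched for greedy decoding as follows. This is a minimal illustration, not how any particular serving stack implements it: the `NextToken` type, `speculative_step`, `generate`, and the two toy models are all hypothetical names invented for the example, and the verification loop only simulates what a real server does in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" is any function that maps a token prefix
# to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy)."""
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. Verify. A real server scores all k+1 positions in ONE target forward
    #    pass; this loop only simulates that pass position by position.
    emitted: List[int] = []
    for i in range(k + 1):
        expected = target(prefix + drafted[:i])
        emitted.append(expected)  # always the target's own token
        if i == k or drafted[i] != expected:
            break  # bonus token after k matches, or correction at a mismatch
    return emitted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new_tokens]


# Toy models, purely illustrative: the draft agrees with the target except
# when the prefix length is a multiple of 4.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    guess = toy_target(tokens)
    return (guess + 1) % 7 if len(tokens) % 4 == 0 else guess


if __name__ == "__main__":
    print(generate([2], toy_draft, toy_target, k=4, max_new_tokens=12))
```

Because every emitted token is the target's own choice, the sequence matches plain greedy decoding with the target; the draft only changes how many tokens each step yields.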
Speedup math
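Expected speedup depends on how often the target accepts the draft's tokens. Typical production: the draft model matches the target on 60-80% of tokens. With K=5 draft tokens per step, and assuming each token is accepted independently at rate a, the expected yield is (1 - a^(K+1)) / (1 - a) tokens per large-model pass, which comes to roughly 2.4 to 3.7 tokens versus 1 token per pass for plain decoding. Wall-clock speedup is somewhat lower than that ratio once the draft model's time is counted.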
Draft model quality matters more than draft model size. Distilled models specifically trained to match target output work better than generic small models. Medusa heads (attached to target model) are another approach.
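Cost: the draft model adds K cheap forward passes per step, and any rejected tokens are wasted work. Draft models are typically 10-100x cheaper per token, so the overhead stays small when acceptance is high; it grows with K and with every rejected token.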
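The estimates in this section can be reproduced with a back-of-the-envelope model. The sketch below rests on two simplifying assumptions that real workloads only approximate: each draft token is accepted independently with the same probability `alpha`, and verifying K+1 positions costs the same as one ordinary target pass (reasonable when decoding is memory-bound). The function names and the `draft_cost_ratio` parameter are made up for this example.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass, counting the bonus or
    correction token, under an independent per-token acceptance rate."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per step divided by the step's cost in target-pass units.

    draft_cost_ratio is (draft pass time) / (target pass time), assumed
    constant; a 20x cheaper draft model corresponds to 0.05.
    """
    step_cost = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_pass(alpha, k) / step_cost


if __name__ == "__main__":
    for alpha in (0.6, 0.7, 0.8):
        tokens = expected_tokens_per_pass(alpha, k=5)
        speedup = estimated_speedup(alpha, k=5, draft_cost_ratio=0.05)
        print(f"alpha={alpha:.1f}  tokens/pass={tokens:.2f}  speedup={speedup:.2f}x")
```

The model makes the trade-off in K visible: a larger K raises the ceiling on tokens per pass but adds draft cost to every step, so past some point extra draft tokens cost more than they return.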
Implementations in 2026
vLLM. Supports speculative decoding natively. Widely deployed. For self-hosted model serving, vLLM is the default.
TensorRT-LLM. NVIDIA's optimized serving stack. Speculative decoding with multiple draft model options. Lower-level than vLLM but often faster.
Commercial providers. Some inference APIs use speculative decoding under the hood. You benefit automatically; there is nothing to configure. Together AI, Fireworks, and Anyscale are examples.
Frontier APIs (OpenAI, Anthropic, Google). Speculative decoding is likely used internally but isn't exposed as a user-facing option.
Technique variants
Medusa heads. Instead of a separate draft model, attach multiple parallel prediction heads to the target model. Each head predicts a future token. Simpler to train and deploy than separate draft models.
Lookahead decoding. Uses n-gram patterns from existing generation as speculation. No draft model needed. Works well for repetitive or structured outputs (code, JSON).
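The "no draft model" idea can be illustrated with a simple n-gram lookup drafter: find the most recent earlier occurrence of the last few tokens and propose whatever followed it. This is a deliberately simplified sketch of n-gram-based speculation, not the full lookahead decoding algorithm; `ngram_draft` and its parameters are hypothetical names for this example. The proposed tokens would still go through the usual target-model verification.

```python
from typing import List


def ngram_draft(tokens: List[int], ngram_size: int = 2, k: int = 4) -> List[int]:
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the trailing n-gram. Returns [] if none is found."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left, starting just before the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + k]
    return []


if __name__ == "__main__":
    # Token ids standing in for a repetitive structure such as JSON keys.
    history = [10, 11, 12, 13, 10, 11]
    print(ngram_draft(history, ngram_size=2, k=2))
```

On repetitive output the lookup hits often and the drafts are nearly free; on free-form prose it mostly returns nothing, and decoding falls back to one token per pass.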
Tree-based speculation. The draft model generates multiple candidate continuations; the target model evaluates the whole tree; the longest path the target agrees with is accepted. This accepts more tokens per step than a single linear draft, at the cost of more verification compute.
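A toy version of the tree acceptance rule, for greedy decoding, looks like this. The nested-dict tree format and the names `accept_longest_path` and `toy_target` are assumptions made for the illustration. Real systems score every tree node in one target pass using a tree-shaped attention mask, whereas this sketch calls the target once per level for clarity.

```python
from typing import Callable, Dict, List

# Hypothetical tree format: each key is a candidate token, each value is the
# subtree of candidates that may follow it. An empty dict is a leaf.
Tree = Dict[int, "Tree"]


def accept_longest_path(prefix: List[int], tree: Tree,
                        target: Callable[[List[int]], int]) -> List[int]:
    """Walk the candidate tree, following the branch the target agrees with.

    Every emitted token is the target's greedy choice; the tree only decides
    how many of those choices were already covered by a drafted branch.
    """
    accepted: List[int] = []
    node = tree
    while True:
        expected = target(prefix + accepted)
        accepted.append(expected)
        if expected not in node:
            break  # no drafted branch covers this token: stop here
        node = node[expected]
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Illustrative stand-in for the target model's greedy next token."""
    return (tokens[-1] * 3 + 1) % 7


if __name__ == "__main__":
    # Two candidates at depth 1, two under token 0, one more under 0 -> 1.
    candidates: Tree = {0: {5: {}, 1: {2: {}}}, 3: {}}
    print(accept_longest_path([2], candidates, toy_target))
```

In this example a single linear draft of 0 then 5 would have stopped after one accepted token; the extra branch 0 then 1 lets the step get one token further.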
When speculative decoding earns its complexity
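Self-hosted model serving with throughput or latency sensitivity. If you're running vLLM already, speculative decoding is a config change that often yields around 2x faster generation, so it is an easy win to try. The gains are largest at low to moderate batch sizes, where decoding is memory-bound; on a saturated, compute-bound server the extra verification work can shrink or erase them, so measure on your own traffic.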
High-volume API-backed systems. If you're paying for inference at scale and the provider supports speculation (or you can swap to one that does), the cost savings compound.
Not relevant if inference cost or latency isn't a bottleneck. For most applications using OpenAI/Anthropic at moderate volume, you don't need to think about this: it's happening (or not) at an infrastructure level you don't control.
Quality preservation
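Correctly implemented speculative decoding produces the same output distribution as non-speculative decoding. With greedy decoding, draft tokens are kept only when they exactly match the target's choice; with sampling, a rejection-sampling rule accepts or resamples each draft token so the result is still distributed according to the target. It's a speed optimization, not a quality compromise. If your provider or serving stack supports it, enabling it should not regress quality.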
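For sampled (non-greedy) decoding, the distribution-preserving property comes from the standard speculative sampling acceptance rule: accept a draft token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized positive part of p - q. The sketch below applies that rule to a single position with made-up probabilities; `p` stands for the target distribution and `q` for the draft distribution, and all function names are illustrative.

```python
import random
from typing import List


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q): the distribution used after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # If p == q the draft is never rejected; return p to stay well defined.
    return [r / total for r in raw] if total > 0 else list(p)


def speculative_sample(p: List[float], q: List[float], rng: random.Random) -> int:
    """Sample one token: draft from q, accept with probability min(1, p/q),
    otherwise resample from the residual distribution."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):  # q[x] > 0 since x was drawn from q
        return x
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]


def exact_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Analytic distribution of speculative_sample, for comparison with p."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x]/q[x])
    reject_mass = 1.0 - sum(accept)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accept, residual)]


if __name__ == "__main__":
    target_p = [0.5, 0.3, 0.2]  # made-up target probabilities
    draft_q = [0.2, 0.5, 0.3]   # made-up draft probabilities
    print(exact_output_distribution(target_p, draft_q))
    rng = random.Random(0)
    print([speculative_sample(target_p, draft_q, rng) for _ in range(10)])
```

The analytic check recovers `p` regardless of how poor `q` is; a bad draft lowers the acceptance rate and therefore the speed, not the quality.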
Always verify with evals after any inference stack change. Implementation bugs exist; don't assume theoretical correctness means no regression. See eval post.
Related inference optimizations
Speculative decoding is one of several techniques (flash attention, continuous batching, paged attention) that modern serving stacks combine for maximum throughput. See GPU hosting post. Most teams benefit from all of these simultaneously once they self-host at scale.