The conventional wisdom in 2023 was: bigger is better. Grow parameters. Grow context. Grow compute. In 2026, smaller is quietly winning. Not because size stopped mattering, but because for most production workloads, a task-tuned 8B model delivers results indistinguishable from a frontier 70B model at a tenth of the inference cost. This post makes the case for small, and spells out when that case actually holds.

Closing the gap

Chart: task quality over time for the 70B frontier vs. a task-tuned 8B. The gap closed dramatically between 2024 and 2026: a tuned 8B in 2026 roughly matches the frontier 70B of 2024.

Why small is winning

Three forces are pushing small-model quality up faster than large-model quality. First, fine-tuning and distillation from stronger models produce small models that inherit frontier-level ability on specific tasks. Second, better training data (curated, high-quality, task-representative) gives small models a bigger quality lift than adding parameters does. Third, inference optimizations (quantization, speculative decoding, custom kernels) make small models not just cheaper but categorically faster, opening up UX patterns that bigger models can't serve.

The net effect: on a specific, bounded production task, a well-tuned 8B model in 2026 often ties or beats a frontier model in blinded evaluation, while running 10x cheaper and 3x faster.

Where small wins clearly

Classification and extraction at scale

Categorizing support tickets, extracting fields from invoices, identifying intents in search queries. Fine-tune an 8B on a few thousand labeled examples and it matches the frontier model on the task while running at 1/10th the cost. At production volumes, that can add up to savings on the order of $100K/year per use case, which is real money.
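To make the savings arithmetic concrete, here is a minimal sketch of the calculation. The request volume, token count, and per-token price are hypothetical values chosen only to illustrate the math; the one figure taken from this post is the 1/10th cost ratio.

```python
def annual_inference_cost(requests_per_day: int,
                          tokens_per_request: int,
                          usd_per_million_tokens: float) -> float:
    """Yearly spend in USD for one workload at a flat per-token price."""
    tokens_per_year = requests_per_day * tokens_per_request * 365
    return tokens_per_year / 1_000_000 * usd_per_million_tokens


# Hypothetical workload and price, for illustration only.
REQUESTS_PER_DAY = 500_000
TOKENS_PER_REQUEST = 1_000
LARGE_USD_PER_MTOK = 0.60                      # assumed large-model price
SMALL_USD_PER_MTOK = LARGE_USD_PER_MTOK / 10   # the post's "1/10th the cost"

large = annual_inference_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, LARGE_USD_PER_MTOK)
small = annual_inference_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, SMALL_USD_PER_MTOK)
savings = large - small

print(f"large: ${large:,.0f}/yr  small: ${small:,.0f}/yr  savings: ${savings:,.0f}/yr")
```

Under these assumed inputs the large model costs about $109,500 a year and the small one about $10,950, so the difference lands near the $100K mark. Your own volumes and prices will move that number considerably in either direction.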

Retrieval-heavy RAG responses

When the answer is in the retrieved context and the model's job is to synthesize, not to know, smaller models do fine. The heavy lifting is in retrieval quality; the generation quality is secondary. RAG patterns favor smaller models because the quality bar on generation is modest.

Latency-sensitive interactive features

Autocomplete, real-time suggestions, voice response, streaming UX. The latency budget doesn't accommodate a 70B call. Smaller models running on optimized inference stacks hit the sub-200ms bars these use cases need.
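A rough way to reason about the latency budget is to add the wait for the first token to the time spent decoding the rest. The sketch below does that; the serving profiles are hypothetical numbers, not measurements, and the 200 ms default simply mirrors the bar mentioned above.

```python
def response_latency_ms(time_to_first_token_ms: float,
                        output_tokens: int,
                        tokens_per_second: float) -> float:
    """Time until the last token arrives: first-token wait plus decode time."""
    decode_ms = output_tokens / tokens_per_second * 1000
    return time_to_first_token_ms + decode_ms


def fits_budget(latency_ms: float, budget_ms: float = 200.0) -> bool:
    """True when the whole response lands inside the latency budget."""
    return latency_ms <= budget_ms


# Hypothetical serving profiles for a 16-token autocomplete suggestion.
small_latency = response_latency_ms(time_to_first_token_ms=40, output_tokens=16,
                                    tokens_per_second=200)
large_latency = response_latency_ms(time_to_first_token_ms=150, output_tokens=16,
                                    tokens_per_second=40)

print(f"small: {small_latency:.0f} ms, fits: {fits_budget(small_latency)}")
print(f"large: {large_latency:.0f} ms, fits: {fits_budget(large_latency)}")
```

With these assumed profiles the small model finishes in roughly 120 ms and the large one in roughly 550 ms. Real numbers depend on hardware, batching, and prompt length, so measure your own stack before drawing conclusions.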

Edge deployment

On-device, in-browser, on-gateway. The device has memory and compute constraints that rule out large models. Smaller models open deployment patterns that were previously impossible — entirely offline AI experiences, privacy-preserving by construction.
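The memory constraint is easy to estimate for the weights alone: parameter count times bits per weight, divided by eight. The sketch below applies that to the two model sizes discussed in this post. It deliberately ignores the KV cache, activations, and runtime overhead, so treat the results as a lower bound.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} at {bits}-bit: {gb:.0f} GB")
```

An 8B model quantized to 4 bits needs about 4 GB for weights, which is within reach of many laptops and some phones. A 70B model at 16 bits needs about 140 GB, and even at 4 bits it is around 35 GB.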

Where big still wins

Reasoning-heavy work. Multi-step math, hard coding, complex planning. Frontier models still outperform, often by significant margins. See the reasoning models post.

Open-ended generation without strong grounding. Creative writing, complex analysis, synthesis across disparate contexts. Bigger models show more depth and fewer slip-ups.

Complex tool use and agent orchestration. The model needs to reason about which tools to use in what order, handle tool failures, recover from errors. Frontier models have more of this ability baked in; small models often need elaborate scaffolding.

Anything requiring broad world knowledge. Small models have less of it. If your task depends on knowing obscure facts, big models recall them from parametric memory more reliably.

The production pattern we deploy

Multi-model routing: a classifier (itself small) routes queries to the right model tier. Simple queries to an 8B. Medium to a 32B or 72B. Hard reasoning to a frontier model. Result: 70-85% of traffic lands on cheaper tiers; overall cost drops 40-60% without noticeable quality regression. See the multi-model routing post.
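The routing idea can be sketched in a few lines. Everything named here is hypothetical: `toy_classifier` is a keyword heuristic standing in for the small trained classifier, and the relative costs per tier are assumptions picked for illustration, not figures from our deployments.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tier:
    name: str
    relative_cost: float  # assumed cost per request, relative to the frontier tier


# Tier names follow the post; the relative costs are illustrative assumptions.
TIERS: Dict[str, Tier] = {
    "simple": Tier("8b", 0.1),
    "medium": Tier("32b-or-72b", 0.5),
    "hard": Tier("frontier", 1.0),
}


def toy_classifier(query: str) -> str:
    """Hypothetical stand-in for the small routing classifier."""
    text = query.lower()
    if any(cue in text for cue in ("prove", "debug", "step by step")):
        return "hard"
    if len(text.split()) > 30:
        return "medium"
    return "simple"


def route(query: str, classify: Callable[[str], str] = toy_classifier) -> Tier:
    """Pick a tier; unknown labels fall back to the frontier tier."""
    return TIERS.get(classify(query), TIERS["hard"])


def blended_cost(traffic_share: Dict[str, float]) -> float:
    """Average cost per request relative to sending everything to the frontier."""
    return sum(share * TIERS[label].relative_cost
               for label, share in traffic_share.items())


# Hypothetical traffic mix: 75% of requests land on the cheaper tiers.
share = {"simple": 0.50, "medium": 0.25, "hard": 0.25}
print(route("What is your refund policy?").name)
print(f"blended cost vs all-frontier: {blended_cost(share):.3f}")
```

With this assumed mix the blended cost comes to about 0.425 of the all-frontier cost, a drop of a bit under 60%. Falling back to the frontier tier on an unrecognized label is a deliberate choice in this sketch: a misroute then costs money rather than quality.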

Distillation when it's worth the investment. Use a frontier model to generate high-quality training data; use that data to fine-tune a small model for a specific task. The resulting small model inherits most of the frontier model's task-specific quality at a fraction of the serving cost. See the synthetic data post.
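The data-generation half of that loop might look like the sketch below. The `teacher` callable is a placeholder for whatever frontier model client you use, `fake_teacher` exists only so the example runs offline, and the prompt/completion JSONL layout is one common convention rather than a requirement of any particular fine-tuning tool.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_distillation_set(prompts: Iterable[str],
                           teacher: Callable[[str], str],
                           keep: Callable[[str, str], bool]) -> List[Dict[str, str]]:
    """Label each prompt with the teacher and keep only pairs that pass the filter."""
    records = []
    for prompt in prompts:
        completion = teacher(prompt)
        if keep(prompt, completion):
            records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt: str) -> str:
    """Offline stand-in for a frontier model call (hypothetical)."""
    return "" if "???" in prompt else "category: billing"


def non_empty(prompt: str, completion: str) -> bool:
    """Minimal quality filter: drop blank teacher outputs."""
    return bool(completion.strip())


prompts = ["I was charged twice this month", "???"]
records = build_distillation_set(prompts, fake_teacher, non_empty)
print(to_jsonl(records))
```

In practice the filter carries most of the weight. Checks such as schema validation, agreement between multiple teacher samples, or spot review by a human would replace the trivial `non_empty` used here.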

The cultural shift this requires

Engineering organizations habituated to 'use the best model' have to shift to 'use the right-sized model for each workload.' This involves eval infrastructure (to know when small is enough), routing (to direct traffic), and an operational culture that measures cost-per-task alongside quality. The teams that have made this shift earliest are systematically ahead on AI economics.
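As a minimal sketch of what "knowing when small is enough" can mean, the snippet below compares two eval runs on quality and cost per task. The `EvalResult` shape, the scores, and the 0.02 tolerance are all assumptions for illustration; a real harness would also want confidence intervals and a larger eval set.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EvalResult:
    model: str
    scores: Sequence[float]   # per-example quality score in [0, 1]
    total_cost_usd: float     # spend for the whole eval run

    def __post_init__(self) -> None:
        if not self.scores:
            raise ValueError("an eval run needs at least one scored example")

    @property
    def quality(self) -> float:
        return sum(self.scores) / len(self.scores)

    @property
    def cost_per_task(self) -> float:
        return self.total_cost_usd / len(self.scores)


def small_is_enough(small: EvalResult, large: EvalResult,
                    max_quality_drop: float = 0.02) -> bool:
    """True when the small model trails the large one by at most the tolerance."""
    return large.quality - small.quality <= max_quality_drop


# Hypothetical eval runs on the same five examples.
small_run = EvalResult("tuned-8b", [1, 1, 1, 0, 1], total_cost_usd=0.05)
large_run = EvalResult("frontier", [1, 1, 1, 0, 1], total_cost_usd=0.50)

print(f"{small_run.model}: quality {small_run.quality:.2f}, "
      f"${small_run.cost_per_task:.3f}/task")
print(f"{large_run.model}: quality {large_run.quality:.2f}, "
      f"${large_run.cost_per_task:.3f}/task")
print("small is enough:", small_is_enough(small_run, large_run))
```

Reporting cost per task next to quality in the same result object is the point of the design: it makes the trade-off visible in every eval report instead of leaving cost to a separate finance conversation.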
