eazyware
Opinion·June 9, 2025·8 min read

Small models are back — and that changes the economics

Sub-10B models now do tasks that needed 70B a year ago. Routing to the smallest-capable-model is the new default.

Kushal R.
Engineering lead

The conventional wisdom in 2023 was: bigger is better. Grow parameters. Grow context. Grow compute. In 2026, smaller is quietly winning. Not because size stopped mattering, but because for most production workloads, a task-tuned 8B model delivers results indistinguishable from a frontier 70B model at a tenth of the inference cost. This post is the case for small, and when it actually holds.

Closing the gap
[Figure: Small model comeback, quality per billion parameters. Task quality (y-axis) over time, '23 to '26 (x-axis), for the 70B frontier and a tuned 8B. Annotation: gap closing fast; a task-tuned 8B in 2026 ≈ the frontier 70B in 2024, at 1/10 the inference cost.]
Task quality over time for the 70B frontier vs a task-tuned 8B. The gap has closed dramatically in 2024-2026. A tuned 8B in 2026 roughly matches the frontier 70B of 2024.

Why small is winning

Three forces are pushing small-model quality up faster than large-model quality. First, fine-tuning and distillation from stronger models produce small models that inherit frontier-level ability on specific tasks. Second, better training data (curated, high-quality, task-representative) gives small models a bigger quality lift than scale does. Third, inference optimizations (quantization, speculative decoding, custom kernels) make small models not just cheaper but categorically faster, enabling UX patterns that bigger models are too slow to serve.

The net effect: on a specific, bounded production task, a well-tuned 8B model in 2026 often ties or beats a frontier model in blinded evaluation, while running about 10x cheaper and 3x faster.

Where small wins clearly

Classification and extraction at scale

Typical workloads: categorizing support tickets, extracting fields from invoices, identifying intents in search queries. Fine-tune an 8B on a few thousand labeled examples and it can match the frontier model on the task at roughly one tenth of the cost. At production volumes, that can add up to savings on the order of $100K/year per use case, which is real money.
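The savings arithmetic is easy to sanity-check for your own workload. The sketch below is a back-of-the-envelope calculation, not a pricing model: the request volume, tokens per request, and per-token price are hypothetical placeholders, and only the 10:1 price ratio comes from this post.

```python
# Back-of-the-envelope savings estimate for one classification use case.
# ASSUMPTION: every number below is an illustrative placeholder, not a
# quote from any provider. Substitute your own measured values.

def annual_cost(requests_per_day: int, tokens_per_request: int,
                usd_per_million_tokens: float) -> float:
    """Yearly inference spend in USD for a single use case."""
    tokens_per_year = requests_per_day * tokens_per_request * 365
    return tokens_per_year / 1_000_000 * usd_per_million_tokens

REQUESTS_PER_DAY = 500_000      # hypothetical traffic
TOKENS_PER_REQUEST = 600        # hypothetical prompt + completion size
FRONTIER_USD_PER_M = 1.00       # hypothetical blended price per 1M tokens
SMALL_USD_PER_M = FRONTIER_USD_PER_M / 10  # the "1/10th the cost" ratio

frontier_cost = annual_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, FRONTIER_USD_PER_M)
small_cost = annual_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, SMALL_USD_PER_M)
savings = frontier_cost - small_cost

print(f"frontier: ${frontier_cost:,.0f}/year")
print(f"small:    ${small_cost:,.0f}/year")
print(f"savings:  ${savings:,.0f}/year")
```

The calculation ignores fine-tuning and hosting overhead, so treat the result as an upper bound on savings rather than a forecast.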

Retrieval-heavy RAG responses

When the answer is in the retrieved context and the model's job is to synthesize, not to know, smaller models do fine. The heavy lifting is in retrieval quality; the generation quality is secondary. RAG patterns favor smaller models because the quality bar on generation is modest.

Latency-sensitive interactive features

Autocomplete, real-time suggestions, voice response, streaming UX: the latency budget for these features doesn't accommodate a 70B call. Smaller models running on optimized inference stacks can hit the sub-200ms targets these use cases need.

Edge deployment

On-device, in-browser, on-gateway. The device has memory and compute constraints that rule out large models. Smaller models open deployment patterns that were previously impossible — entirely offline AI experiences, privacy-preserving by construction.
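The memory constraint can be made concrete with a rough weight-size estimate: parameter count times bits per weight, divided by eight. The sketch below covers weights only (no KV cache, activations, or runtime overhead), and the 6 GB device budget is a hypothetical figure chosen for illustration.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 = bytes.
# Covers model weights only; KV cache, activations, and runtime overhead
# come on top, so real requirements are higher.

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes)."""
    return params_billions * bits_per_weight / 8

def fits_on_device(params_billions: float, bits_per_weight: int,
                   device_budget_gb: float) -> bool:
    """True if the weights alone fit within the given memory budget."""
    return weight_memory_gb(params_billions, bits_per_weight) <= device_budget_gb

DEVICE_BUDGET_GB = 6.0  # ASSUMPTION: hypothetical memory available on the device

for params in (8, 70):
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        verdict = "fits" if fits_on_device(params, bits, DEVICE_BUDGET_GB) else "too large"
        print(f"{params}B @ {bits}-bit: {size:.1f} GB ({verdict})")
```

Under these assumptions only the 4-bit 8B configuration fits, which is why quantization and small parameter counts tend to go together on edge hardware.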

Where big still wins

Reasoning-heavy work. Multi-step math, hard coding, complex planning. Frontier models still outperform, often by significant margins. See the reasoning models post.

Open-ended generation without strong grounding. Creative writing, complex analysis, synthesis across disparate contexts. Bigger models show more depth and fewer slip-ups.

Complex tool use and agent orchestration. The model needs to reason about which tools to use in what order, handle tool failures, recover from errors. Frontier models have more of this ability baked in; small models often need elaborate scaffolding.

Anything requiring broad world knowledge. Small models have less of it. If your task depends on knowing obscure facts, big models recall them from parametric memory more reliably.

The production pattern we deploy

Multi-model routing: a classifier (itself small) routes queries to the right model tier. Simple queries go to an 8B, medium ones to a 32B or 72B, and hard reasoning to a frontier model. Result: 70-85% of traffic lands on cheaper tiers, and overall cost drops 40-60% without noticeable quality regression. See the multi-model routing post.
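A minimal sketch of the routing idea follows. It is not the production router described here: `classify_difficulty` is a keyword heuristic standing in for the small classifier, and the tier names, relative costs, and traffic shares are assumptions chosen only to show how the blended cost is computed.

```python
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class Tier:
    name: str
    relative_cost: float  # ASSUMPTION: cost per request, frontier = 1.0

# Hypothetical tiers and cost ratios, for illustration only.
TIERS = {
    "simple": Tier("8b", 0.1),
    "medium": Tier("32b-72b", 0.4),
    "hard": Tier("frontier", 1.0),
}

HARD_HINTS = ("prove", "plan", "debug", "step by step")

def classify_difficulty(query: str) -> str:
    """Stand-in for the small routing classifier.

    A real router would be a trained model; this keyword and length
    heuristic only keeps the example self-contained.
    """
    text = query.lower()
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    if len(text.split()) > 40:
        return "medium"
    return "simple"

def route(query: str) -> Tier:
    """Pick the model tier for a query."""
    return TIERS[classify_difficulty(query)]

def blended_cost(traffic_share: Mapping[str, float]) -> float:
    """Average cost per request relative to sending everything to the frontier."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * TIERS[label].relative_cost
               for label, share in traffic_share.items())

print(route("What is your refund policy?").name)
print(route("Prove that this schedule is optimal").name)

# Hypothetical mix: 75% of traffic on the two cheaper tiers.
mix = {"simple": 0.50, "medium": 0.25, "hard": 0.25}
print(f"cost vs. all-frontier: {blended_cost(mix):.0%}")
```

With this assumed mix and these assumed cost ratios, the blended cost is 40% of the all-frontier baseline. The real figure depends on your own traffic distribution and prices.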

Distillation when it's worth the investment. Use a frontier model to generate high-quality training data; use that data to fine-tune a small model for a specific task. The resulting small model inherits most of the frontier model's task-specific quality at a fraction of the serving cost. See the synthetic data post.
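The data-generation half of distillation can be sketched in a few lines. In the sketch below, `stub_teacher` is a hypothetical placeholder for a frontier-model call, the label set is invented for the example, and the prompt/completion JSONL layout is one common fine-tuning format rather than a requirement of any particular trainer.

```python
import json
from typing import Callable, Dict, Iterable, List

ALLOWED_LABELS = {"billing", "bug", "other"}  # hypothetical task labels

def stub_teacher(ticket: str) -> str:
    """Placeholder for a frontier-model call (ASSUMPTION: not a real API).

    Returns a label for a support ticket, or "unsure" when it cannot decide.
    """
    text = ticket.lower()
    if "invoice" in text or "charge" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "unsure"

def build_distillation_set(inputs: Iterable[str],
                           teacher: Callable[[str], str],
                           keep: Callable[[str, str], bool]) -> List[Dict[str, str]]:
    """Label each input with the teacher and keep only outputs that pass the filter."""
    records = []
    for text in inputs:
        output = teacher(text)
        if keep(text, output):
            records.append({"prompt": text, "completion": output})
    return records

def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)

tickets = ["I was charged twice this month", "App crashes on login", "Hello?"]
records = build_distillation_set(
    tickets,
    teacher=stub_teacher,
    keep=lambda _text, label: label in ALLOWED_LABELS,  # drop low-confidence outputs
)
jsonl = to_jsonl(records)
print(jsonl)
```

The filter step matters as much as the teacher: discarding outputs the teacher is unsure about keeps its mistakes out of the student's training set.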

The cultural shift this requires

Engineering organizations accustomed to 'use the best model' have to shift to 'use the right-sized model for each workload.' This involves eval infrastructure (to know when small is enough), routing (to direct traffic), and an operational culture that measures cost-per-task alongside quality. The teams that made this shift earliest are consistently ahead on AI economics.
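One way to turn 'right-sized' into a decision rule is to pick the cheapest model whose eval quality stays within a tolerance of the baseline. The sketch below assumes you already have per-model eval results. The model names, accuracies, and costs are made-up placeholders, not measurements, and the one-point tolerance is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class EvalResult:
    model: str
    accuracy: float           # task accuracy on your own eval set
    cost_per_task_usd: float  # measured or estimated serving cost

def smallest_capable(results: List[EvalResult], baseline_model: str,
                     max_quality_drop: float = 0.01) -> EvalResult:
    """Return the cheapest model within max_quality_drop of the baseline's accuracy."""
    baseline = next(r for r in results if r.model == baseline_model)
    threshold = baseline.accuracy - max_quality_drop
    candidates = [r for r in results if r.accuracy >= threshold]
    return min(candidates, key=lambda r: r.cost_per_task_usd)

# ASSUMPTION: illustrative placeholder numbers, not real benchmark results.
results = [
    EvalResult("frontier-70b", accuracy=0.940, cost_per_task_usd=0.0100),
    EvalResult("mid-32b", accuracy=0.940, cost_per_task_usd=0.0040),
    EvalResult("tuned-8b", accuracy=0.935, cost_per_task_usd=0.0010),
    EvalResult("base-8b", accuracy=0.860, cost_per_task_usd=0.0009),
]

choice = smallest_capable(results, baseline_model="frontier-70b")
print(f"{choice.model}: accuracy={choice.accuracy:.3f}, "
      f"cost/task=${choice.cost_per_task_usd:.4f}")
```

The baseline always qualifies, so the function falls back to it when no smaller model clears the bar.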

Read next
Open-source models in production: what actually holds up
Multi-model routing: cutting LLM costs 40-60% with zero quality loss
When to fine-tune (and when RAG is fine)
Tags
small models, efficiency, cost, routing