eazyware
Engineering·January 5, 2026·11 min read

Multi-model routing: cutting LLM costs 40-60% with zero quality loss

Route by task, not by vendor. A deep dive into how we classify queries and route them to the cheapest capable model — with real cost data from production.

Kushal R.
Engineering lead

One of the cleanest cost optimizations in production LLM systems is multi-model routing. The idea: don't send every query to the most expensive model. Send easy queries to cheap models, hard queries to expensive ones, and route intelligently in the middle. Done well, this cuts LLM bills 40-60% with zero perceivable quality loss. Done badly, it cuts quality in ways users notice and ops teams spend weeks unwinding.

We've implemented multi-model routing for a dozen clients in the last 18 months. What follows is the playbook that works, plus the five mistakes we most commonly see when teams try it without guidance.

Routing flow
[Figure: Multi-model routing decision flow. An incoming query hits a rule check (task type, length, tier). Simple queries go to the cheap model (Haiku / mini, ~2% of best price), uncertain ones go to a small-LLM classifier (85-95% routing accuracy), and queries that need the best go to the expensive model (Opus / GPT-4-tier), which also serves as the escalation fallback. Typical savings at equivalent quality: 40-60% of API cost.]
Three-layer router: rules for obvious cases, classifier for ambiguous ones, expensive model as escalation fallback. Delivers 40-60% savings at quality parity.

Why routing works

LLM pricing has a 10-50x spread between the cheapest and most expensive capable models. A GPT-4o-mini call costs roughly 2% of a GPT-4o call. Claude Haiku 4.5 is 4% the price of Opus 4.6. For a large fraction of real queries — classification, extraction, simple rephrasing, structured parsing — the cheap model produces output indistinguishable from the expensive model. You are paying 25-50x for capability you don't use.
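
The arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope model, not billing code: the function name is made up for illustration, the 4% price ratio is the figure quoted above, and the 60% traffic share is a hypothetical mix. It assumes similar token counts on both paths and ignores router overhead.

```python
def blended_cost_ratio(cheap_share: float, cheap_price_ratio: float) -> float:
    """Cost of a routed workload relative to sending everything to the expensive model.

    cheap_share: fraction of calls routed to the cheap model (0..1).
    cheap_price_ratio: cheap model price as a fraction of the expensive one.
    Assumes similar token counts on both paths and ignores router overhead.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price_ratio + (1.0 - cheap_share)


# Hypothetical mix: 60% of calls go to a model priced at 4% of the best one.
ratio = blended_cost_ratio(cheap_share=0.60, cheap_price_ratio=0.04)
savings = 1.0 - ratio
print(f"relative cost: {ratio:.3f}, savings: {savings:.1%}")
```

With a price gap this wide, the savings are driven almost entirely by the share of traffic that can safely move, which is why the rest of this post is about identifying that share.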

The catch: some queries genuinely need the strongest model. Complex reasoning, nuanced judgment, long-form generation with tight constraints — these degrade noticeably on smaller models. The routing challenge is correctly identifying which queries are which, cheaply and reliably, without human intervention.

Three routing strategies

Strategy 1: Rule-based routing

The simplest approach. Define rules based on task type, input length, user tier, or feature flag. Example rules: 'classification tasks go to Haiku,' 'requests with an expected output under 50 tokens go to mini,' 'enterprise users get Opus, free tier gets Sonnet.' Because these signals are known before the call is made, this captures 60-70% of routing value with essentially zero added latency or complexity.

When to use: MVP systems, systems with clean task-type signals, systems where the classification is already explicit in the call site. Start here before getting fancier.
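
A minimal sketch of what such a rule table might look like in code. The model names are placeholders, the `Request` fields are hypothetical, and the order in which the rules are checked is an assumption; a real system would pick the priority that matches its own product tiers.

```python
from dataclasses import dataclass

CHEAP, MID, BEST = "cheap-model", "mid-model", "best-model"  # placeholder names


@dataclass(frozen=True)
class Request:
    task_type: str          # e.g. "classification", "extraction", "chat"
    user_tier: str          # e.g. "free", "enterprise"
    max_output_tokens: int  # caller's expected output budget


def route_by_rules(req: Request) -> str:
    """Return a model name using static rules, checked in priority order."""
    if req.task_type in {"classification", "extraction"}:
        return CHEAP
    if req.max_output_tokens < 50:
        return CHEAP
    if req.user_tier == "enterprise":
        return BEST
    return MID


print(route_by_rules(Request("chat", "free", 400)))
```

Everything the rules inspect is available at the call site, so the router adds no network hop and no extra model call.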

Strategy 2: Classifier-based routing

A cheap LLM (or a small classifier model) inspects the query and routes it. Input: the query and relevant context. Output: 'easy' or 'hard' (or multi-class). Easy goes to cheap model; hard goes to expensive. The classifier itself is a ~$0.0001 call. Good classifiers route correctly 85-95% of the time on well-defined task domains.

When to use: when queries vary significantly in difficulty and rules can't capture the pattern. This is our default for copilot and chat systems. Requires eval infrastructure to measure routing accuracy and spot drift.
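
One way the classifier step could be wired up, as a sketch. The prompt wording, function names, and model names are all illustrative assumptions. The classifier is injected as a plain callable so the example runs without any vendor SDK; `fake_classify` is a stand-in length heuristic, not a real model.

```python
from typing import Callable

CLASSIFIER_PROMPT = (
    "Label the user query as 'easy' or 'hard' for a language model to answer. "
    "Reply with one word.\n\nQuery: {query}"
)


def route_by_classifier(
    query: str,
    classify: Callable[[str], str],
    cheap_model: str = "cheap-model",
    expensive_model: str = "best-model",
) -> str:
    """Ask a small model for a difficulty label and map it to a model name.

    `classify` takes the rendered prompt and returns the raw label text.
    Anything other than a clean 'easy' is treated as hard, so a malformed
    classifier reply fails toward quality rather than toward cost.
    """
    label = classify(CLASSIFIER_PROMPT.format(query=query)).strip().lower()
    return cheap_model if label == "easy" else expensive_model


def fake_classify(prompt: str) -> str:
    """Stand-in for a small LLM call: a word-count heuristic, for demonstration only."""
    query = prompt.split("Query: ", 1)[1]
    return "easy" if len(query.split()) < 12 else "hard"


print(route_by_classifier("Translate 'hello' to French", fake_classify))
```

Defaulting to the expensive model on any unexpected label is a deliberate choice in this sketch: a misroute toward the expensive side costs money, while a misroute toward the cheap side costs quality.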

Strategy 3: Escalation routing

Always try the cheap model first. Evaluate the response (confidence score, output length, structured validation). If the response is weak, retry with the expensive model. This is 'optimistic' routing — the cheap path handles what it can, and the expensive model is a fallback.

When to use: tasks with cheap verification (the output is easy to validate) and high variance in difficulty. Works beautifully for extraction tasks, code generation with test feedback, and multi-turn chat where follow-up clarifications reveal earlier miscues.
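
For an extraction task, the escalation loop might look roughly like this. The required fields are a hypothetical schema, and both model calls are injected as callables so the sketch stays independent of any particular client library.

```python
import json
from typing import Callable

REQUIRED_FIELDS = {"invoice_number", "total"}  # hypothetical extraction schema


def is_valid_extraction(raw: str) -> bool:
    """Cheap structural check: parses as a JSON object with the required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS.issubset(data)


def extract_with_escalation(
    document: str,
    call_cheap: Callable[[str], str],
    call_expensive: Callable[[str], str],
) -> tuple[str, str]:
    """Try the cheap model first; retry on the expensive one if validation fails.

    Returns (output, path) where path is "cheap" or "escalated".
    """
    output = call_cheap(document)
    if is_valid_extraction(output):
        return output, "cheap"
    return call_expensive(document), "escalated"
```

Returning the path alongside the output makes the escalation rate easy to track. If that rate climbs, the cheap model is being asked to do too much and the double calls start eating into the savings.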

Which strategy for your workload

Most systems end up combining two or three strategies. A typical production stack: rule-based routing as the first filter (catches obvious cases), classifier-based for the rest, escalation as a safety net for critical flows. We've shipped this exact three-layer stack for voice AI systems and it reliably hits 50-60% cost reduction against a naive 'always use the best model' baseline.
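
The three layers compose naturally. The sketch below shows one possible arrangement with every dependency passed in as a callable; the task-type names (including `long_form_reasoning`) and the model names are hypothetical, and a production version would carry more context than a bare query string.

```python
from typing import Callable, Optional

CHEAP, BEST = "cheap-model", "best-model"  # placeholder model names


def layered_route(
    task_type: str,
    query: str,
    classify: Callable[[str], str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, list[str]]:
    """Rules first, classifier second, escalation last.

    Returns the final output and the list of models that were called.
    """
    # Layer 1: rules settle the obvious cases without any extra call.
    model: Optional[str] = None
    if task_type in {"classification", "extraction"}:
        model = CHEAP
    elif task_type == "long_form_reasoning":
        model = BEST

    # Layer 2: the classifier decides whatever the rules left open.
    if model is None:
        model = CHEAP if classify(query).strip().lower() == "easy" else BEST

    # Layer 3: escalate once if the cheap answer fails validation.
    called = [model]
    output = call_model(model, query)
    if model == CHEAP and not is_acceptable(output):
        called.append(BEST)
        output = call_model(BEST, query)
    return output, called
```

The classifier only runs when the rules are silent, so its cost and latency apply to the ambiguous slice of traffic rather than to every request.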

Implementation in practice

The key architectural move: put routing behind a single abstraction. Your application code calls a 'model gateway' with the query and task hints; the gateway handles routing. This lets you change routing logic without touching application code. It also centralizes observability — every decision is logged to one place.

Libraries that help: LiteLLM for basic gateway functionality, LangChain's model abstractions if you're in that ecosystem, or build your own thin wrapper (most of our clients build their own — it's 200-400 lines). The build-your-own approach wins on control and minimal dependencies.
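
To give a sense of how thin the wrapper can start out, here is a sketch of the gateway idea. `ModelGateway`, its fields, and the backend names are illustrative; the backends are plain callables standing in for real client calls, and this is not the API of LiteLLM or LangChain.

```python
from dataclasses import dataclass, field
from typing import Callable

Backend = Callable[[str], str]        # prompt -> completion
Router = Callable[[str, dict], str]   # (query, hints) -> backend name


@dataclass
class ModelGateway:
    backends: dict[str, Backend]
    router: Router
    decisions: list[dict] = field(default_factory=list)

    def complete(self, query: str, **hints) -> str:
        """Route the query, record the decision, and call the chosen backend."""
        model = self.router(query, hints)
        if model not in self.backends:
            raise KeyError(f"router chose unknown backend {model!r}")
        self.decisions.append({"model": model, "hints": hints})
        return self.backends[model](query)


gateway = ModelGateway(
    backends={
        "cheap-model": lambda q: f"[cheap] {q}",
        "best-model": lambda q: f"[best] {q}",
    },
    router=lambda query, hints: (
        "cheap-model" if hints.get("task_type") == "classification" else "best-model"
    ),
)
print(gateway.complete("Is this email spam?", task_type="classification"))
```

Application code only ever sees `complete`, so swapping the router function changes behavior everywhere at once without touching call sites.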

The observability requirement

You cannot route what you cannot measure. Log every routing decision: which model was chosen, why, and what the outcome quality was. Without this log you are routing blind, and quality can degrade silently. See our LLM observability stack post (/blog/observability-stack) for tooling.
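
As a starting point, a routing decision can be captured as one structured record per request. The field names below are assumptions chosen for illustration, not a standard schema; the quality score is left empty at decision time and filled in later by evals or user feedback.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RoutingDecision:
    request_id: str
    model: str
    reason: str                            # e.g. "rule:classification", "classifier:easy"
    escalated: bool = False
    latency_ms: Optional[float] = None
    quality_score: Optional[float] = None  # filled in later by evals or user feedback
    timestamp: float = 0.0                 # 0.0 means "stamp at serialization time"


def to_log_line(decision: RoutingDecision) -> str:
    """Serialize one decision as a single JSON line for any log pipeline."""
    record = asdict(decision)
    if not record["timestamp"]:
        record["timestamp"] = time.time()
    return json.dumps(record, sort_keys=True)


print(to_log_line(RoutingDecision("req-1", "cheap-model", "rule:classification")))
```

Recording the reason as a short machine-readable string makes it possible to group outcomes by rule or classifier label later, which is where drift tends to show up first.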

Measuring routing quality

Routing quality has two axes: cost reduction (easy to measure — just compare bills) and quality preservation (much harder). For quality, the gold standard is periodic A/B testing where a random sample of queries goes to the baseline 'always expensive' path, and the user-visible quality is compared against the routed path. If user ratings or downstream behavior diverge, routing is costing quality.

Cheaper proxies: run your eval dataset through both paths and compare pass rates. If pass rates match within 2-3 points, routing is safe. If they diverge by more than 5 points, the routing is too aggressive.
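
That comparison is simple enough to automate. The sketch below applies the thresholds from the paragraph above to two lists of pass/fail eval results; the function names are made up, and labeling the 3-to-5-point band as "review" is an assumption, since the text only defines the two ends.

```python
def pass_rate(results: list[bool]) -> float:
    """Percentage of eval cases that passed."""
    if not results:
        raise ValueError("no eval results")
    return 100.0 * sum(results) / len(results)


def routing_verdict(baseline: list[bool], routed: list[bool]) -> str:
    """Classify the pass-rate gap (baseline minus routed) in percentage points.

    Gaps up to 3 points count as safe and gaps above 5 as too aggressive;
    the band in between is labeled "review", which is an assumption here.
    """
    gap = pass_rate(baseline) - pass_rate(routed)
    if gap <= 3.0:
        return "safe"
    if gap > 5.0:
        return "too aggressive"
    return "review"


# Hypothetical eval run: 92/100 pass on the baseline path, 90/100 on the routed path.
baseline = [True] * 92 + [False] * 8
routed = [True] * 90 + [False] * 10
print(routing_verdict(baseline, routed))
```

With small eval sets a 2-3 point gap can be noise, so the verdict is only as trustworthy as the dataset is large and representative.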

Common mistakes

  1. Routing without evals. You need to measure quality before and after. Otherwise you're cutting cost blind, and discovering damage six weeks later from user complaints.
  2. Aggressive routing at launch. Start conservative — route 30% of queries to cheaper models in week one, expand based on measured quality. Teams that route 80% on day one always regress.
  3. Ignoring latency. Cheap models are often faster. Routing can improve both cost and latency — measure both.
  4. No escalation path. Even a well-tuned classifier misclassifies sometimes. Without escalation, those misclassifications show up as user-facing quality drops.
  5. Forgetting that prompts are model-specific. A prompt optimized for GPT-4 may not produce the same quality from a routed cheap model. Tune prompts per model.

Real numbers from production

Three representative deployments we've tracked over 12+ months:

  • SaaS copilot, 2M calls/month: 58% cost reduction via classifier routing. No measurable quality change on evals. User satisfaction unchanged.
  • Document extraction pipeline, 500K calls/month: 71% cost reduction via escalation routing. Pass rate on evals improved 3 points (the cheap model caught easy cases faster, with fewer timeouts).
  • Chatbot for an e-commerce site, 10M calls/month: 44% cost reduction via rule-based + classifier hybrid. Marginal quality drop on complex multi-turn conversations — tuned down the aggressiveness and landed at 38% reduction with quality parity.

Across our client base, the median routing system saves 40-50% of LLM cost at quality parity. The high end hits 60-70% for workloads with strong cheap/expensive task separation. The low end is still usually 25%+.

Routing enables migration

A side benefit of routing: when a new model comes out (and one does every few months), you can A/B test it as a routing option without rewriting application code. Route 10% of easy queries to the new model, measure, ramp. This makes model migration a routine ops task instead of a multi-week engineering effort. For the TCO impact, see our cost modeling post.
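
A percentage ramp like this is commonly implemented with deterministic hash bucketing. The sketch below is one such approach under stated assumptions: the function names and the salt are hypothetical, and the request key would typically be a user or session ID so that one user consistently sees one model.

```python
import hashlib


def in_rollout(request_key: str, percent: float, salt: str = "routing-v1") -> bool:
    """Deterministically place a request key in or out of a percentage rollout.

    The same key always lands in the same bucket, so raising `percent`
    only adds traffic and never reshuffles who is already included.
    """
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def pick_model(request_key: str, current: str, candidate: str, percent: float) -> str:
    """Send roughly `percent` of keys to the candidate model, the rest to the current one."""
    return candidate if in_rollout(request_key, percent) else current


print(pick_model("user-42", current="cheap-model", candidate="new-model", percent=10))
```

Changing the salt reshuffles the buckets, which is useful when starting a fresh experiment so the same users are not always the first to see a new model. The same mechanism covers the conservative launch ramp: start at 30%, measure, then raise the percentage.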

The cheapest capable model is the correct default. The most expensive model is the fallback. Every routing decision is a bet, and the bets compound.

When not to route

Some systems shouldn't route: single-task systems where only one model handles the task well (no cost upside), extremely low-volume systems (the engineering cost of routing exceeds the savings), and systems with latency budgets so tight that even a small routing-decision call matters. For anything with meaningful volume and query variance, routing pays off.

Closing

Multi-model routing is one of the few AI cost interventions with a clean cost-quality frontier — you can systematically dial up cost savings with modest quality impact. Combined with semantic caching and prompt optimization, 80%+ of naive LLM cost is usually recoverable. For teams sizing their first year of production LLM spend, assume routing will save half of your API cost by month six. If it doesn't, something is wrong.
