The pricing page on every LLM vendor's site shows a per-token cost. It's clean, it's simple, it's what every early conversation about AI budgets anchors on. It is also, in practice, only 30-40% of what you'll actually spend on a production LLM system. The other 60-70% is what nobody mentions at the start and what blindsides finance teams six months into deployment.
This post is the complete TCO framework we use when sizing engagements and when helping clients model their first year of LLM spend. Numbers come from our last two years of production deployments across 15+ clients, ranging from 10K calls/day startups to 50M calls/day enterprise systems. All figures are in USD.
The seven cost categories of LLM systems
Real LLM cost is not one line item. It is seven, and they scale differently as your system grows. Ignoring any one of them will understate your bill, by 5-15% for the smaller categories and by 30% or more for engineering time or human review.
1. Model API costs (30-40% of total)
The obvious one: input tokens times the input price plus output tokens times the output price, both quoted per million tokens. It is worth classifying workloads as input-heavy (RAG systems send big context windows and receive short responses, so input dominates) or output-heavy (creative generation and code generation, where output dominates). The ratio matters because input tokens are usually 3-5x cheaper than output tokens on major vendors.
For sizing: a typical SaaS copilot processes 800 input tokens and generates 300 output tokens per call, at roughly $0.003/call on a mid-tier model. At 10,000 calls/day, that's $900/month. At 1M calls/day, $90K/month. Model choice matters, but so does call volume: a 4x cheaper model cuts the API bill by 75%, while halving call volume through caching cuts it by 50% with no change in model quality, and the two stack.
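The sizing arithmetic above can be sketched as a small calculator. The per-million-token prices used here ($1.50 input, $6.00 output) are illustrative assumptions chosen to reproduce the roughly $0.003/call figure, not any vendor's list price, and the function names are hypothetical.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one call; prices are quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_api_cost(calls_per_day: int, per_call_usd: float, days: int = 30) -> float:
    """USD per month, assuming flat daily volume over a 30-day month."""
    return calls_per_day * per_call_usd * days


# Assumed prices for a hypothetical mid-tier model: $1.50/M input, $6.00/M output.
per_call = cost_per_call(800, 300, 1.50, 6.00)

for volume in (10_000, 1_000_000):
    print(f"{volume:>9,} calls/day -> ${monthly_api_cost(volume, per_call):,.0f}/month")
```

With these assumed prices the sketch reproduces the $900/month and $90K/month figures. Swapping in your own token counts and prices is the quickest way to see whether your workload is input-heavy or output-heavy.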
2. Embedding and retrieval costs (5-15%)
Every RAG system runs embedding calls for every query, often with reranking on top. Vector database hosting adds $200-$5,000/month depending on scale. These costs are usually forgotten in early planning and then show up as a surprise. The core trap: embeddings at ingestion time are cheap, but query-time embedding and reranking scale with call volume and, for RAG-heavy systems, can match or exceed LLM costs. We've written a fuller analysis in the RAG patterns post.
3. Infrastructure (10-20%)
Servers, databases, queues, caches, monitoring. A production LLM system typically needs: app servers ($500-$5K/month depending on load), a Postgres or equivalent ($200-$2K), Redis for caching ($100-$500), a vector DB ($300-$3K), and observability ($200-$2K). For most production deployments, infrastructure runs $2K-$15K/month depending on scale and HA requirements. Latency-sensitive deployments double this for multi-region.
4. Evaluation and monitoring (5-10%)
Eval runs are themselves LLM calls. A mature eval setup with 300 items and three LLM-as-judge scorers per item, running in CI on every PR and nightly on production, adds $200-$2K/month in LLM calls just for evals. Add observability tooling ($200-$1K), and total eval/monitoring runs 5-10% of the bill. This is a cost worth paying — it prevents the much larger cost of silent regression — but most planning omits it.
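A rough way to estimate the LLM spend on evals is to count calls per run and multiply by run frequency. In the sketch below, the run counts (40 PR-triggered runs plus 30 nightly runs a month) and the per-call prices are assumptions for illustration; only the 300 items and three judges per item come from the text.

```python
def monthly_eval_cost(items: int, judges_per_item: int, runs_per_month: int,
                      generation_usd: float, judge_usd: float) -> float:
    """USD per month spent on LLM calls for evals.

    Each eval item costs one generation call plus one call per judge.
    """
    per_item = generation_usd + judges_per_item * judge_usd
    return items * per_item * runs_per_month


# Assumptions (hypothetical): 40 PR runs + 30 nightly runs per month,
# $0.003 per generation call, $0.004 per judge call.
runs = 40 + 30
eval_usd = monthly_eval_cost(items=300, judges_per_item=3, runs_per_month=runs,
                             generation_usd=0.003, judge_usd=0.004)
print(f"${eval_usd:,.0f}/month in eval LLM calls")
```

Under these assumptions the estimate lands near the low end of the $200-$2K range. PR frequency and judge model price are the two inputs that move it most.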
5. Engineering time (30-50% in year one, dropping to 15-25%)
The biggest and least-discussed cost. A production LLM system requires ongoing engineering: prompt iteration, model migrations, eval maintenance, handling new edge cases, security patches, integration updates. Budget one senior engineer at 30-50% capacity during year one, dropping to 10-20% capacity in steady state. At $200K fully loaded, that's $60K-$100K of year-one engineering cost before any new features. Teams consistently underbudget this and then wonder why their roadmap never moves.
6. Data and compliance (5-15%)
Getting data into the system (ingestion pipelines, data cleaning), securing it (encryption at rest/in transit, access controls), and keeping it compliant (PII redaction, audit logs, DPA negotiations with vendors, SOC 2 evidence for AI usage). Often another $2K-$20K/month for regulated industries. Our clients in fintech and healthcare spend most on this line.
7. Human-in-the-loop operations (0-30%)
For systems with human review (fraud queues, content moderation, voice escalations), the humans are often the dominant cost. A single ops reviewer at $40/hour handling 60 calls/hour costs about $0.67 per reviewed call, far more than a GPT-4-tier model call on the same item. Most teams don't model this, then discover that AI didn't replace ops cost; it redistributed it, and sometimes increased it.
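The reviewer arithmetic can be sketched the same way as the API arithmetic. The $40/hour rate and 60 calls/hour throughput come from the text; the 5% review rate and the 10,000 calls/day volume are assumptions for illustration, and the function names are hypothetical.

```python
def reviewer_cost_per_case(hourly_rate_usd: float, cases_per_hour: float) -> float:
    """USD of reviewer time spent on one case."""
    return hourly_rate_usd / cases_per_hour


def monthly_hitl_cost(calls_per_day: int, review_rate: float,
                      hourly_rate_usd: float, cases_per_hour: float,
                      days: int = 30) -> float:
    """USD per month of reviewer time; review_rate is the share of calls escalated."""
    reviewed_cases = calls_per_day * review_rate * days
    return reviewed_cases * reviewer_cost_per_case(hourly_rate_usd, cases_per_hour)


per_case = reviewer_cost_per_case(40, 60)
# Assumed: 10,000 calls/day with 5% of calls routed to a human.
hitl_usd = monthly_hitl_cost(10_000, review_rate=0.05,
                             hourly_rate_usd=40, cases_per_hour=60)
print(f"${per_case:.2f} per reviewed case, ${hitl_usd:,.0f}/month at a 5% review rate")
```

At that assumed review rate, reviewer time comes to about $10,000/month against roughly $900/month of API spend at the same volume, which is why this line needs its own row in the model.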
For every $1 of API cost, budget $2-$3 of total system cost in year one, dropping to $1.50-$2 in year two and beyond. Teams that budget only API cost end up 3-5x over projection by month six.
How costs scale (and why this matters)
Not all seven categories scale linearly with usage. Understanding the shape of each curve is what separates good TCO modeling from bad.
- Model API costs scale linearly with calls. Double the calls, double the API bill.
- Embedding costs scale with calls, but can plateau once repeated queries are served from a cache.
- Infrastructure scales sublinearly — you pay for capacity, not per-call. Doubling load from 10K to 20K calls/day rarely doubles infrastructure cost.
- Eval and monitoring are largely fixed costs — they scale with dataset size, not production traffic.
- Engineering time scales with feature velocity, not usage. A system with 1M calls/day and a system with 10K calls/day both need the same engineer to maintain if they have the same feature surface.
- Data/compliance scales with data volume and regulatory scope.
- HITL ops scales linearly with cases requiring review.
Implication: at low scale, engineering dominates. At high scale, API and HITL dominate. The optimization playbook changes accordingly.
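A toy model makes the crossover visible. Everything below is a sketch under stated assumptions: the square-root infrastructure curve, the $500 fixed eval cost, the 40% engineer capacity, and the 1% review rate are illustrative choices, and embedding and data/compliance are left out for brevity.

```python
ENGINEER_USD_PER_YEAR = 200_000   # fully loaded, from the text
ENGINEER_CAPACITY = 0.40          # assumed midpoint of 30-50% in year one
REVIEW_RATE = 0.01                # assumed: 1% of calls need human review
REVIEWER_USD_PER_CASE = 40 / 60   # $40/hour at 60 cases/hour


def monthly_costs(calls_per_day: int) -> dict[str, float]:
    """Toy monthly cost model in USD for five of the seven categories."""
    scale = calls_per_day / 10_000  # relative to a 10K calls/day baseline
    return {
        "api": 900.0 * scale,                       # linear in calls
        "infrastructure": 2_000.0 * scale ** 0.5,   # sublinear (assumed exponent)
        "eval_monitoring": 500.0,                   # fixed (assumed)
        "engineering": ENGINEER_USD_PER_YEAR * ENGINEER_CAPACITY / 12,     # fixed
        "hitl": calls_per_day * 30 * REVIEW_RATE * REVIEWER_USD_PER_CASE,  # linear
    }


for volume in (10_000, 1_000_000):
    costs = monthly_costs(volume)
    top = max(costs, key=costs.get)
    print(f"{volume:>9,} calls/day: largest line is {top} (${costs[top]:,.0f}/month)")
```

With these inputs, engineering is the largest line at 10K calls/day, while HITL and API are the two largest at 1M calls/day. The exact crossover point depends entirely on the assumed parameters, so treat the shape as the takeaway and the numbers as placeholders.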
The cost reduction playbook
Once you can see the seven categories separately, reducing cost becomes tractable. Here's the playbook we apply in order — cheapest levers first.
- Semantic caching. Cache LLM responses by embedding similarity. Typical reduction: 30-50% on API cost for repetitive workflows. Detail in our semantic caching post.
- Multi-model routing. Route easy queries to cheaper models. Typical reduction: 40-60% on API cost without quality loss for mixed workloads. Detail in multi-model routing.
- Prompt compression. Remove redundant context before sending to the model. Typical reduction: 15-25% on input tokens.
- Batching. Group similar requests into single calls where latency allows. 10-30% reduction for background jobs.
- Cheaper embeddings. Move from OpenAI embeddings to open-source options like BGE or Nomic. 5-10x cost reduction at comparable quality for many domains.
- Context pruning. Retrieve less, and better. Every retrieved token is a token you pay for. Better retrieval (see hybrid search) cuts context size without hurting quality.
- Negotiated rates. At >$10K/month on any major vendor, call their sales team. Enterprise rates discount 20-40% off list.
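When stacking levers from the playbook above, note that percentage reductions compound rather than add, because each lever acts on the spend left over by the previous one. The sketch below assumes the levers are independent and uses the midpoints of the caching (40%) and routing (50%) ranges; the function name and baseline figure are hypothetical.

```python
def stacked_api_cost(baseline_usd: float, reductions: list[float]) -> float:
    """Apply each fractional reduction to the spend remaining after the previous one."""
    remaining = baseline_usd
    for reduction in reductions:
        remaining *= 1.0 - reduction
    return remaining


baseline = 90_000.0  # monthly API bill at 1M calls/day, from the sizing example
after = stacked_api_cost(baseline, [0.40, 0.50])  # caching, then routing
saved_pct = 100 * (1 - after / baseline)
print(f"${after:,.0f}/month remaining, {saved_pct:.0f}% saved")
```

Under these assumptions the combined saving is 70%, not the 90% that adding the two percentages would suggest. Levers that only touch part of the bill, such as prompt compression on input tokens, need to be applied to that part alone.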
The ROI framing that actually works with CFOs
Most AI ROI pitches fail because they frame benefits in soft terms ('faster,' 'better,' 'more intelligent') and costs in hard terms ('$X/month'). Hard beats soft every time in a finance meeting. The framing that works: pick one specific process the AI system replaces or augments. Measure baseline cost per unit (time spent, dollars paid, customers churned). Measure new cost per unit with AI. Compute the delta, multiply by volume, and subtract the full TCO (all seven categories). That number is the net return; divide it by the TCO if finance wants ROI as a ratio.
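The calculation can be sketched in a few lines. All figures below are hypothetical (a support-ticket process at $4.00 per ticket before, $1.50 of residual human time per ticket after, 20,000 tickets a month, $30,000/month of TCO). The sketch assumes the per-unit cost with AI excludes the AI system itself, which is counted once in the TCO to avoid double counting.

```python
def net_monthly_return(baseline_cost_per_unit: float, ai_cost_per_unit: float,
                       units_per_month: int, monthly_tco: float) -> float:
    """Per-unit saving times volume, minus the full TCO, in USD per month."""
    gross_saving = (baseline_cost_per_unit - ai_cost_per_unit) * units_per_month
    return gross_saving - monthly_tco


def roi_ratio(net_return: float, monthly_tco: float) -> float:
    """Net return per dollar of TCO."""
    return net_return / monthly_tco


# Hypothetical inputs for a support-ticket process.
net = net_monthly_return(baseline_cost_per_unit=4.00, ai_cost_per_unit=1.50,
                         units_per_month=20_000, monthly_tco=30_000)
print(f"net return ${net:,.0f}/month, ROI {roi_ratio(net, 30_000):.2f}x")
```

A negative net return is a useful result too: it tells you the process is either too low-volume or too cheap already to justify the full system cost.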
Our cost calculator walks through this model with your numbers. We built it specifically because the vendor pricing pages don't model anything close to true TCO.
Common budget mistakes we see
- Budgeting only for API costs. As covered — this is 30-40% of the real bill.
- Assuming engineering time drops to zero after launch. It drops, but to 10-20% of an FTE, not zero.
- Not budgeting for model migrations. Every 12-18 months, a better/cheaper model comes along. Migration takes 2-6 weeks of engineering. Plan for it.
- Missing the HITL line when the process has human review. Where review is required, this is often the largest line, so leaving it out throws off the whole model.
- Ignoring infrastructure inflation. At scale, a single additional region can double your infra bill.
- Skipping eval costs. "We'll add evals later" often becomes "we never added evals" and the silent regressions eat the savings many times over.
The vendor's per-token price is the tip. Real cost lives in the seven layers below. Budget the iceberg, not the tip.
Closing
Realistic TCO modeling is the difference between AI programs that continue and AI programs that get quietly shelved after the first finance review. The framework above is how we size every Eazyware engagement and how we help clients forecast year-two and year-three spend. Spend an hour with your actual numbers and it will change your planning. If you want a second pair of eyes on your model, we're available for a 30-minute call.