Fine-tuning feels powerful. The narrative is seductive: take a base model, feed it your data, and you get a custom model that knows your domain inside and out. In practice, fine-tuning is the wrong tool for 80% of the problems teams apply it to. Most 'we need fine-tuning' moments are actually 'we need better RAG' or 'we need better prompts' or 'we need an eval framework.'

This post is the framework we use to decide when fine-tuning is actually the right call — plus the three specific scenarios where it earns its cost. The goal is to save you from the 3-month detour we've watched teams take when they picked fine-tuning for the wrong reasons.

Decision tree

Most "we need fine-tuning" answers are actually "we need better RAG or prompts." Only proceed to fine-tune after exhausting those and only for format, style, or specialized fluency.

What fine-tuning actually does

Fine-tuning adjusts model weights based on training examples. It teaches the model new behaviors, response styles, or formats. It does not — contrary to popular belief — teach the model new facts. Facts change; facts belong in RAG. Behaviors, styles, and formats are harder to prompt consistently; those belong in fine-tuning (sometimes).

A concrete example: if you want the model to always respond in a specific JSON format with 12 specific fields, you could prompt-engineer this repeatedly and hope for the best, or you could fine-tune on 500 examples and get it reliably. If you want the model to know this month's sales numbers, fine-tuning won't help — the data goes stale immediately. Use RAG.

The three cases where fine-tuning wins

Case 1: Consistent output format or style

You need outputs that reliably conform to a specific structure — a JSON schema, a DSL, a specific writing voice. Prompt engineering gets you 85% there; the remaining 15% are long-tail cases where prompts fail. Fine-tuning on 300-1000 examples pushes the reliability to 98%+.

Real example from a recent client engagement: they needed the model to output valid GraphQL queries for a custom schema. Prompting produced valid queries 70% of the time; the other 30% had subtle syntax issues. Fine-tuning got this to 97% valid, which was the difference between production-ready and not.

Case 2: Specialized domain vocabulary

Your domain uses terminology that foundation models don't know well — medical specialties, legal clauses, proprietary product codes, industry jargon. RAG helps the model see the terms in context, but doesn't teach it how to use them fluently. For heavy usage of specialized vocabulary, fine-tuning genuinely adds value over RAG alone.

Caveat: only when you have high volume of domain-specific examples. 100 examples won't move the needle. 5,000 will. This is one of the 'data as moat' cases mentioned in our build vs buy post.

Case 3: Latency or cost constraints

You can fine-tune a smaller model to perform comparably to a larger one on a narrow task. The small model is faster and cheaper at inference. If you have genuinely tight latency requirements (voice AI with sub-500ms budgets) or cost requirements (billions of calls), fine-tuning a smaller model is sometimes the only path to the target.

Reality check: this case is rarer than people think. Most 'we need low latency' teams can solve it with multi-model routing and caching, and most 'we need low cost' teams can solve it with better routing. Fine-tuning for cost only wins after those optimizations.

When fine-tuning fails

"We need the model to know our company facts." Wrong tool. Use RAG.
"We want the model to be smarter on our domain." Usually wrong tool. Better prompting and better RAG cover 80% of this. Fine-tuning helps only after those are optimized.
"The model gives bad answers sometimes." Without evals, you can't tell if fine-tuning would help. Build eval infrastructure first; you'll probably find the issue is retrieval or prompting, not model capability.
"We want to reduce hallucinations." Fine-tuning can actually increase hallucinations if done badly. The root cause is usually missing or bad context; RAG fixes this.

The cost of fine-tuning

People underestimate this. Fine-tuning costs include: building the training dataset (usually 300-3000 examples, curated — this is 2-8 weeks of work), running the training (GPU cost, $500-$5000 depending on model and data), evaluating the fine-tuned model (needs a proper eval suite, 1-2 weeks), and ongoing maintenance (every model update from the provider invalidates your fine-tune; you re-run every 3-6 months).

Total: a fine-tuning project is typically 6-12 weeks of real work and $5K-$50K in direct costs, plus ongoing 15-25% of an engineer's time for maintenance. This is not 'click the fine-tune button.' Budget it accordingly.

LoRA and parameter-efficient fine-tuning

PEFT methods (LoRA, QLoRA) let you fine-tune without updating all model weights. Results are comparable to full fine-tuning in most cases at 10-20x lower compute cost. If you're going to fine-tune, almost always start with LoRA rather than full fine-tuning. Major providers (OpenAI, Anthropic) abstract this away — their 'fine-tuning' offering is LoRA-based under the hood.

The right order: optimize before fine-tuning

Before considering fine-tuning, exhaust the cheaper wins:

Better prompts. Systematic prompt engineering with good evals often closes 50% of the gap for free.
Better RAG. The six RAG patterns stack meaningfully and are cheaper to improve than model weights.
Better retrieval. Reranking and hybrid search are often the biggest wins in RAG systems.
Better model. Sometimes the issue is the base model — upgrading to a stronger tier closes the gap.

Then and only then, consider fine-tuning. 80% of teams never need to get past step 4.

Don't fine-tune without evals

Fine-tuning without an evaluation framework is committing to a change you can't measure. You'll spend weeks on it, ship a model that's probably similar but might be worse, and have no way to know. Every serious fine-tuning project starts with evals. Without them, you're guessing.

The 80% rule

Before fine-tuning, ask: what percentage of the gap between current performance and desired performance is explained by prompting and retrieval, and what percentage is genuinely model-capability? If you can't answer specifically, fine-tuning is premature. Better answer first, then fine-tune if the remaining gap is genuinely model-shaped.

Tools we use

For most clients: OpenAI's fine-tuning API or Anthropic's (when available) are fine for basic tuning. Together.ai and Fireworks have broader model coverage. For serious work including open-source models, we deploy on Modal or direct on cloud GPUs with Axolotl or Transformers + PEFT. The choice follows the use case — if latency and cost matter, open-source models with LoRA give more control.

Fine-tuning is the last 20% of AI capability improvement. Most teams try to do it first and then wonder why the first 80% is missing.

Closing

Fine-tuning is a powerful tool in a narrow band of situations. For most AI projects, the time and money spent on fine-tuning would return 2-5x more value if invested in evaluation, retrieval, and prompt engineering. Be disciplined about what problem you're actually trying to solve, and prefer the cheapest tool that solves it. If you're considering fine-tuning and want an outside view on whether it's the right call, ask us. We'll tell you honestly — and often point away from fine-tuning when it's not warranted.

When to fine-tune (and when RAG is fine)