eazyware
Engineering·December 22, 2025·10 min read

Synthetic data for AI: when to generate, when to buy

LLM-generated training data has gone from novelty to necessity. The patterns that work, the traps to avoid.

Kushal R.
Engineering lead

Synthetic training data went from a dubious research idea to standard production practice between 2023 and 2026. Teams now use LLMs to generate training data for smaller models, produce adversarial cases for evaluation, augment sparse real datasets, and fill in underrepresented categories. The techniques work, with caveats that matter. This post covers where synthetic data earns its keep and where it fails.

Figure: Synthetic data generation loop. Seed set (30-100 real samples) → generator LLM (diverse, strict) → critic / judge (rejects duplicates and errors) → human sample check (spot-check 5%) → back to seed with approved samples. Iterate until coverage holds.
Traps to avoid: mode collapse (all examples look similar), countered by enforcing diversity in the critic; generator bias, countered by using a different model for the generator and the target task.

When synthetic data actually helps

You have a small real dataset and need more

Classic case: you have 200 labeled customer support tickets for a specific category, and you need 2,000 to fine-tune a classifier. Use the 200 as seeds, ask an LLM to generate variations, and filter for quality. This works well for text classification, NER, and extraction tasks.

You need adversarial or edge-case examples

For eval sets, particularly for red-teaming, synthetic generation is a superpower. Ask a strong model to generate 'hard' examples of a category (prompt injections, jailbreak attempts, ambiguous cases). Real-world edge cases are rare; synthetic edge cases scale cheaply. See our red-teaming post.

You have privacy constraints on real data

This covers medical records, financial transactions, and other PII-heavy domains. The goal is synthetic data that preserves the statistical properties of the real data without reproducing real individuals. This is harder than it looks, because naive LLM generation tends to reproduce patterns from its training data, but carefully tuned domain-specific generators can work.

You need to balance a skewed class distribution

Fraud detection has a positive rate around 0.1%, and rare disease classification is similar. At that skew a supervised model sees too few positives to learn the class well, so synthetic positive examples are often what makes training workable. LLM generation is often higher quality than SMOTE-style upsampling.
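How many synthetic positives a target class balance implies is simple arithmetic. The sketch below is illustrative only: the function name and the 10% target are assumptions, not a recommendation from this post. It solves (positives + s) / (total + s) = target for s.

```python
import math
from fractions import Fraction


def synthetic_positives_needed(n_total, n_pos, target_rate):
    """Synthetic positives to add so positives make up target_rate of the set.

    Solves (n_pos + s) / (n_total + s) = target_rate for s and rounds up.
    Fractions are used so the result is not affected by float rounding.
    """
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    t = Fraction(str(target_rate))
    needed = (t * n_total - n_pos) / (1 - t)
    return max(0, math.ceil(needed))


# Hypothetical numbers: 100,000 transactions at a 0.1% positive rate,
# rebalanced to 10% positives.
extra = synthetic_positives_needed(100_000, 100, 0.10)
print(extra)
```

Whatever target you pick, evaluate on the original, skewed distribution, since that is what the model faces in production.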

The generation loop that works

Start with a small real dataset — 30-100 samples — covering the diversity you want the synthetic set to have. Use a strong LLM (frontier model, good at following instructions) as generator. Prompt for diversity explicitly: 'generate 10 variations, each substantively different in phrasing, tone, and specifics.'
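A minimal sketch of the prompting step, assuming seeds are plain strings. The function name, the number of seed examples shown per prompt, and the output format instruction are all assumptions. The call to the generator model is left out on purpose, since it depends on your provider.

```python
import random


def build_generation_prompt(seeds, n_variations=10, k_examples=3, rng=None):
    """Build a generator prompt from a random handful of real seed examples.

    Rotating which seeds appear in each prompt is one cheap way to push
    the generator toward different regions of the seed set.
    """
    rng = rng or random.Random()
    examples = rng.sample(seeds, min(k_examples, len(seeds)))
    lines = ["Here are real examples:"]
    lines += [f"{i}. {text}" for i, text in enumerate(examples, 1)]
    lines.append(
        f"Generate {n_variations} variations, each substantively different "
        "in phrasing, tone, and specifics. Return one per line."
    )
    return "\n".join(lines)
```

Calling it once per request with a fresh random draw means no two prompts anchor on exactly the same examples.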

Follow with a critic filter: a different LLM or the same model in a different role, judging each generated sample for (a) schema validity, (b) diversity (is this too similar to another sample?), (c) realism (could this be a real case?). Reject failures.
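The three checks can be sketched as a single filter. This is an illustration under assumptions: samples arrive as JSON strings with hypothetical `text` and `label` fields, near-duplicates are detected with word-trigram Jaccard similarity (a cheap stand-in for an LLM or embedding comparison), and the realism judge is a callable you supply, stubbed here to accept everything.

```python
import json

REQUIRED_FIELDS = {"text", "label"}  # hypothetical schema


def shingles(text, n=3):
    """Set of word n-grams; short texts fall back to one tuple of all tokens."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def critic_filter(candidates, max_similarity=0.6, is_realistic=lambda s: True):
    """Keep candidates that pass the schema, diversity, and realism checks."""
    kept, seen = [], []
    for raw in candidates:
        # (a) schema validity: must parse and carry the required string fields
        try:
            sample = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(sample, dict) or not REQUIRED_FIELDS <= sample.keys():
            continue
        if not isinstance(sample["text"], str):
            continue
        # (b) diversity: reject anything too close to an already kept sample
        sh = shingles(sample["text"])
        if any(jaccard(sh, prev) > max_similarity for prev in seen):
            continue
        # (c) realism: delegate to the judge (an LLM call in practice)
        if not is_realistic(sample):
            continue
        kept.append(sample)
        seen.append(sh)
    return kept
```

The 0.6 threshold is a guess. Tune it against pairs your reviewers consider duplicates.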

Sample 5-10% for human review at each iteration. Look for mode collapse, unrealistic patterns, unintended biases. Add approved samples back to the seed set. Iterate 3-5 rounds.
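The review step is mostly bookkeeping. A minimal sketch, assuming each batch is a list and the reviewer's verdicts come back as a parallel list of booleans (both function names are hypothetical):

```python
import random


def sample_for_review(batch, fraction=0.05, rng=None):
    """Pick a random subset of a generated batch for human spot-checking."""
    rng = rng or random.Random()
    if not batch:
        return []
    # Always review at least one sample, never more than the batch holds.
    k = min(len(batch), max(1, round(len(batch) * fraction)))
    return rng.sample(batch, k)


def merge_approved(seed_set, reviewed, approved_flags):
    """Return a new seed set extended with the samples a reviewer approved."""
    return list(seed_set) + [s for s, ok in zip(reviewed, approved_flags) if ok]
```

Random sampling matters here: reviewing only the first few samples of each batch tends to hide problems that show up later in a long generation run.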

The traps

Mode collapse

LLMs converge on a narrow set of patterns. Your 10,000 synthetic support tickets start sounding like 5 archetypes repeated with different entity names. Mitigation: prompt explicitly for diversity, use a critic that rejects near-duplicates, and vary the generator's temperature and sampling parameters.
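One cheap way to watch for collapse across rounds is a distinct-n ratio: unique word n-grams divided by total word n-grams over the whole set. The sketch below is a rough proxy, not a full diagnostic. Swapped entity names inflate the score, so the archetype problem described above can slip past it unless you mask entities first.

```python
def distinct_n(texts, n=2):
    """Ratio of unique word n-grams to total word n-grams across all texts.

    Values near 1.0 mean little repeated phrasing; values that drop as the
    set grows suggest the generator is reusing the same templates.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Track the number per batch rather than reading it in isolation. The trend is more informative than any single value.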

Generator bias leaking into student model

If you train a small model on data generated by GPT-4, the small model inherits GPT-4's quirks and biases. This is bad when the small model is supposed to model a different distribution (your domain, your users). Mitigation: use a generator that differs from the target model, and rely on the critic or human review to catch bias patterns.

Confidence in your eval set

Don't use the same LLM to generate training and eval data — you'll measure your model's ability to match the generator, not the real world. Keep a real-data eval set, always, regardless of how much synthetic training data you use.
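The simplest way to guarantee a real-data eval set is to carve it out before anything else touches the data. A minimal sketch, where the 20% holdout and the function name are assumptions: split the real samples once, draw seeds only from the non-eval side, and never feed the eval side to the generator.

```python
import random


def split_real_data(real_samples, eval_fraction=0.2, seed=0):
    """Split real samples into (seed_pool, eval_set) with no overlap.

    The eval set stays real-only; only seed_pool is ever shown to the
    generator, so synthetic data cannot be derived from eval examples.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(real_samples)
    rng.shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

With very small real datasets the holdout gets painfully small, but a small real eval set still measures the right thing.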

When to buy data instead

For tasks where high-quality human-labeled data exists commercially (common NLP tasks, standard categories), buying is often cheaper than generating and filtering. Scale AI, Surge AI, and Prolific all offer human data labeling or access to annotators. Do the math per 1,000 quality samples before committing to a generation pipeline.
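That math fits in a few lines. In the sketch below every number is made up for illustration. The model is an assumption too: generation cost scales with raw samples, only a fraction survive the critic and review, and pipeline engineering is a fixed cost spread over the samples you keep.

```python
def cost_per_1000_quality(cost_per_raw_sample, acceptance_rate,
                          fixed_cost=0.0, n_quality=1000):
    """Effective cost per 1,000 samples that survive filtering.

    cost_per_raw_sample: generation plus critic cost for one raw sample
    acceptance_rate:     fraction of raw samples that pass all filters
    fixed_cost:          one-off cost (pipeline build, review time)
    n_quality:           number of quality samples you actually need
    """
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    raw_needed = n_quality / acceptance_rate
    total = fixed_cost + raw_needed * cost_per_raw_sample
    return total / n_quality * 1000


# Hypothetical comparison: generate 10,000 quality samples versus a vendor quote.
generated = cost_per_1000_quality(0.02, 0.4, fixed_cost=500.0, n_quality=10_000)
vendor_quote_per_1000 = 150.0  # made-up figure, replace with a real quote
print(f"generate: {generated:.2f} per 1,000, buy: {vendor_quote_per_1000:.2f} per 1,000")
```

The fixed cost dominates at small volumes, which is where buying usually wins.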
