Synthetic training data went from dubious research idea to standard production practice between 2023 and 2026. Teams now use LLMs to generate training data for smaller models, produce adversarial cases for evaluation, augment sparse real datasets, and fill in underrepresented categories. The techniques work, with caveats that matter. This post covers where synthetic data earns its keep and where it fails.

Generation loop

Seed set → generator LLM → critic filter → human sample check → back to seed with approved samples. Iterate until coverage holds.

When synthetic data actually helps

You have a small real dataset and need more

Classic case: 200 labeled customer support tickets for a specific category, and you need 2000 to fine-tune a classifier. Use the 200 as seeds; ask an LLM to generate variations; filter for quality. This works well for text classification, NER, and extraction tasks.

You need adversarial or edge-case examples

For eval sets, particularly for red-teaming, synthetic generation is a superpower. Ask a strong model to generate 'hard' examples of a category (prompt injections, jailbreak attempts, ambiguous cases). Real-world edge cases are rare; synthetic edge cases scale cheaply. See our red-teaming post.

You have privacy constraints on real data

Medical records, financial transactions, PII-heavy domains. Generate synthetic data that preserves statistical properties without reproducing real individuals. This is harder than it looks: naive LLM generation tends to reproduce patterns from its training data. Carefully tuned domain-specific generators can work.

You need to balance a skewed class distribution

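Fraud detection: 0.1% positive rate. Rare disease classification: similar. With so few positives, a supervised model gets very little signal from the minority class, so synthetic positive examples often make the difference. LLM generation is often higher quality than SMOTE-style oversampling, which only interpolates between existing minority-class feature vectors.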
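To size the job, it helps to work out how many synthetic positives a target class balance implies. The sketch below is only arithmetic; the function name and the 10% target are illustrative assumptions, not a recommendation from this post.

```python
import math

def synthetic_positives_needed(n_negative: int, n_positive: int, target_rate: float) -> int:
    """How many synthetic positives to add so positives reach target_rate of the set.

    Hypothetical helper: solves (n_positive + x) / (n_negative + n_positive + x) = target_rate.
    """
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    total = n_negative + n_positive
    needed = (target_rate * total - n_positive) / (1 - target_rate)
    # Never return a negative count if the set is already balanced enough.
    return max(0, math.ceil(needed))

# 100,000 records at a 0.1% positive rate: 100 positives, 99,900 negatives.
extra = synthetic_positives_needed(n_negative=99_900, n_positive=100, target_rate=0.10)
print(extra)  # about 11,000 synthetic positives for a 10% positive rate
```

At that ratio the synthetic positives outnumber the real ones by roughly 100 to 1, which is why the filtering and review steps matter so much for this use case.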

The generation loop that works

Start with a small real dataset — 30-100 samples — covering the diversity you want the synthetic set to have. Use a strong LLM (frontier model, good at following instructions) as generator. Prompt for diversity explicitly: 'generate 10 variations, each substantively different in phrasing, tone, and specifics.'

Follow with a critic filter: a different LLM or the same model in a different role, judging each generated sample for (a) schema validity, (b) diversity (is this too similar to another sample?), (c) realism (could this be a real case?). Reject failures.

Sample 5-10% for human review at each iteration. Look for mode collapse, unrealistic patterns, unintended biases. Add approved samples back to the seed set. Iterate 3-5 rounds.
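The loop above can be sketched as a small driver function. Everything here is an assumption for illustration: `generate`, `critic_accepts`, and `human_approves` are hypothetical callables you would back with your own LLM calls and review tooling, and treating the human spot check as a pass/fail gate on the whole round is one possible policy, not the only one.

```python
import random
from typing import Callable, List, Optional

def run_generation_loop(
    seeds: List[str],
    generate: Callable[[List[str], int], List[str]],    # hypothetical: wraps the generator LLM
    critic_accepts: Callable[[str, List[str]], bool],   # hypothetical: wraps the critic LLM
    human_approves: Callable[[List[str]], bool],        # hypothetical: human spot check
    rounds: int = 3,
    per_round: int = 50,
    review_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Grow the seed pool over several generate -> critic -> human-check rounds."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(rounds):
        candidates = generate(pool, per_round)
        accepted: List[str] = []
        for sample in candidates:
            # The critic sees the pool plus this round's accepted samples,
            # so it can reject near-duplicates as well as invalid output.
            if critic_accepts(sample, pool + accepted):
                accepted.append(sample)
        if not accepted:
            continue
        # Send a small random slice of the accepted samples to human review.
        k = max(1, round(review_fraction * len(accepted)))
        review_batch = rng.sample(accepted, k)
        # Assumed policy: if the reviewers reject the slice, discard the round.
        if human_approves(review_batch):
            pool.extend(accepted)
    return pool
```

Keeping the three roles as plain callables makes it easy to swap the generator or critic model between rounds, and to log what the critic rejected.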

The traps

Mode collapse

LLMs converge on a limited set of patterns. Your 10,000 synthetic support tickets start sounding like 5 archetypes repeated with different entity names. Mitigation: prompt explicitly for diversity, use a critic that rejects near-duplicates, and vary generator temperature and other sampling parameters.
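A near-duplicate check does not need an LLM. As a minimal sketch, the filter below compares word bigrams with Jaccard similarity and drops any sample that overlaps too much with one already kept. The function names, the bigram size, and the 0.5 threshold are assumptions to tune on your own data.

```python
import re
from typing import List, Set, Tuple

def shingles(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams of a text (bigrams by default)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Size of the intersection over size of the union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(samples: List[str], threshold: float = 0.5) -> List[str]:
    """Keep a sample only if it is below the threshold against every sample kept so far."""
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for sample in samples:
        current = shingles(sample)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(sample)
            kept_shingles.append(current)
    return kept

tickets = [
    "My order 1234 never arrived and I want a refund",
    "My order 5678 never arrived and I want a refund",  # same archetype, new entity
    "How do I change the email on my account?",
]
print(drop_near_duplicates(tickets))
```

This pairwise comparison is quadratic in the number of samples, which is fine for a few thousand items; for larger sets an approximate method such as MinHash is the usual substitute.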

Generator bias leaking into student model

If you train a small model on data generated by GPT-4, the small model inherits GPT-4's quirks and biases. This is a problem when the small model is supposed to model a different distribution (your domain, your users). Mitigation: the generator should not be the same model as the target model, and the critic or human review should watch specifically for bias patterns.

Confidence in your eval set

Don't use the same LLM to generate training and eval data — you'll measure your model's ability to match the generator, not the real world. Keep a real-data eval set, always, regardless of how much synthetic training data you use.
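One way to enforce this is to tag every example with its provenance and build the eval split from real examples only. The sketch below assumes a hypothetical `source` field on each record; the class and function names are illustrative.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Example:
    text: str
    label: str
    source: str  # assumed provenance tag: "real" or "synthetic"

def split_with_real_eval(
    examples: List[Example], eval_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Return (train, eval) where eval contains real examples only."""
    real = [e for e in examples if e.source == "real"]
    synthetic = [e for e in examples if e.source != "real"]
    if not real:
        raise ValueError("no real examples available for the eval set")
    rng = random.Random(seed)
    rng.shuffle(real)
    n_eval = max(1, int(len(real) * eval_fraction))
    eval_set = real[:n_eval]
    # Synthetic data is only ever added to the training side.
    train_set = real[n_eval:] + synthetic
    return train_set, eval_set
```

A cheap safeguard is to assert `all(e.source == "real" for e in eval_set)` in the evaluation script itself, so a later pipeline change cannot silently leak synthetic data into the metric.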

When to buy data instead

For tasks where high-quality human-labeled data exists commercially (common NLP tasks, standard categories), buying is often cheaper than generating and filtering. Vendors such as Scale AI, Surge AI, and Prolific provide human labeling or access to annotators. Work out the cost per 1,000 quality samples for both routes before committing to a generation pipeline.
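The per-1,000 comparison can be sketched in a few lines. The cost model here is an assumption: every generated sample pays for one generator call and one critic call, only a fraction survives the critic, and a fraction of the survivors is sent to human review. All prices in the example are made up for illustration.

```python
def cost_per_1000_quality_samples(
    cost_per_generated: float,      # generator cost per sample (assumed)
    cost_per_critic_call: float,    # critic cost per sample (assumed)
    acceptance_rate: float,         # share of generated samples the critic keeps
    review_fraction: float,         # share of kept samples sent to human review
    cost_per_human_review: float,   # reviewer cost per sample (assumed)
) -> float:
    """Estimated cost of 1,000 samples that pass the critic."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    # Rejected samples still cost money, so divide by the acceptance rate.
    machine_cost = (cost_per_generated + cost_per_critic_call) / acceptance_rate
    human_cost = review_fraction * cost_per_human_review
    return 1000 * (machine_cost + human_cost)

# Made-up numbers: $0.002 to generate, $0.001 to critique, half accepted,
# 10% of accepted samples reviewed by a human at $0.50 each.
generated = cost_per_1000_quality_samples(0.002, 0.001, 0.5, 0.10, 0.50)
vendor_quote_per_1000 = 80.0  # hypothetical quote, not a real price
print(f"generate: ${generated:.2f} per 1,000, buy: ${vendor_quote_per_1000:.2f} per 1,000")
```

In this made-up example human review dominates the total, so the review fraction and the acceptance rate are the two inputs worth measuring before trusting the comparison.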

Synthetic data for AI: when to generate, when to buy

