Synthetic training data went from dubious research idea to standard production practice between 2023 and 2026. Teams now use LLMs to generate training data for smaller models, produce adversarial cases for evaluation, augment sparse real datasets, and fill in underrepresented categories. The techniques work, with caveats that matter. This post covers where synthetic data earns its keep and where it fails.
When synthetic data actually helps
You have a small real dataset and need more
Classic case: 200 labeled customer support tickets for a specific category, and you need 2000 to fine-tune a classifier. Use the 200 as seeds; ask an LLM to generate variations; filter for quality. This works well for text classification, NER, and extraction tasks.
You need adversarial or edge-case examples
For eval sets, particularly for red-teaming, synthetic generation is a superpower. Ask a strong model to generate 'hard' examples of a category (prompt injections, jailbreak attempts, ambiguous cases). Real-world edge cases are rare; synthetic edge cases scale cheaply. See our red-teaming post.
You have privacy constraints on real data
Medical records, financial transactions, and other PII-heavy domains. Generate synthetic data that preserves statistical properties without reproducing real individuals. This is harder than it looks: naive LLM generation tends to reproduce patterns from its training data, but carefully tuned domain-specific generators can work.
You need to balance a skewed class distribution
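Fraud detection: 0.1% positive rate. Rare disease classification: similar. With so few positives, a supervised model can score well by predicting the majority class every time, so synthetic positive examples are often what lets it learn the minority class at all. LLM-generated examples are often higher quality than SMOTE-style oversampling, which interpolates between existing minority-class points in feature space.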
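As a rough sizing aid, the arithmetic for how many synthetic positives it takes to reach a given positive rate can be sketched as follows. The function name and the 5% target are illustrative assumptions, not recommendations from this post.

```python
import math

def synthetic_positives_needed(n_total: int, n_pos: int, target_rate: float) -> int:
    """Smallest number x of added positives such that
    (n_pos + x) / (n_total + x) >= target_rate."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    # Rearranging the inequality gives x >= (target_rate * n_total - n_pos) / (1 - target_rate).
    needed = (target_rate * n_total - n_pos) / (1 - target_rate)
    return max(0, math.ceil(needed))

# Hypothetical dataset: 1,000,000 transactions at a 0.1% positive rate (1,000 positives).
# The 5% target is an arbitrary illustration.
extra = synthetic_positives_needed(n_total=1_000_000, n_pos=1_000, target_rate=0.05)
print(extra)  # 51579
```

Note that the added positives also grow the denominator, which is why the answer is larger than the naive 49,000.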
The generation loop that works
Start with a small real dataset — 30-100 samples — covering the diversity you want the synthetic set to have. Use a strong LLM (frontier model, good at following instructions) as generator. Prompt for diversity explicitly: 'generate 10 variations, each substantively different in phrasing, tone, and specifics.'
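A minimal sketch of a prompt builder for this step is shown below. The function name, the `category` parameter, and the choice to show three seeds per call are assumptions for illustration; the diversity instruction is the one quoted above. No LLM client is called here, since the choice of API is outside this post.

```python
import random
from typing import Optional, Sequence

def build_generation_prompt(
    seeds: Sequence[str],
    category: str,
    n_variations: int = 10,
    n_examples: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Build a prompt that shows a few seed examples and asks for diverse variations."""
    rng = rng or random.Random()
    # Show a different random subset of seeds on each call so that
    # successive generations start from different examples.
    shown = rng.sample(list(seeds), k=min(n_examples, len(seeds)))
    example_block = "\n".join(f"- {s}" for s in shown)
    return (
        f"Here are real examples of the category '{category}':\n"
        f"{example_block}\n\n"
        f"Generate {n_variations} variations, each substantively different "
        "in phrasing, tone, and specifics. Return one per line."
    )
```

Rotating which seeds appear in the prompt is one cheap way to push the generator toward different starting points; it does not replace the critic step.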
Follow with a critic filter: a different LLM or the same model in a different role, judging each generated sample for (a) schema validity, (b) diversity (is this too similar to another sample?), (c) realism (could this be a real case?). Reject failures.
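The three checks can be sketched as a single filter. This is an illustration under stated assumptions: samples arrive as JSON strings, the schema is just the hypothetical keys `text` and `label`, near-duplicates are detected with the standard library's `difflib.SequenceMatcher` at an arbitrary 0.9 threshold, and realism is delegated to a caller-supplied `judge_realism` function standing in for the LLM critic.

```python
import json
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List

REQUIRED_KEYS = {"text", "label"}  # assumed schema, for illustration only

def is_schema_valid(raw: str) -> bool:
    """(a) Schema validity: parseable JSON object with the required keys."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(record, dict)
        and REQUIRED_KEYS.issubset(record)
        and isinstance(record["text"], str)
    )

def is_near_duplicate(text: str, accepted: Iterable[str], threshold: float = 0.9) -> bool:
    """(b) Diversity: reject text too similar to an already accepted sample."""
    return any(
        SequenceMatcher(None, text.lower(), other.lower()).ratio() >= threshold
        for other in accepted
    )

def critic_filter(
    candidates: Iterable[str],
    judge_realism: Callable[[Dict], bool],  # (c) stand-in for the LLM critic
    threshold: float = 0.9,
) -> List[Dict]:
    accepted: List[Dict] = []
    for raw in candidates:
        if not is_schema_valid(raw):
            continue
        record = json.loads(raw)
        if is_near_duplicate(record["text"], (r["text"] for r in accepted), threshold):
            continue
        if not judge_realism(record):
            continue
        accepted.append(record)
    return accepted
```

The cheap deterministic checks run first so the LLM judge is only called on samples that already pass them. Pairwise string comparison is quadratic in the number of accepted samples; for large batches an embedding or hashing approach would be the usual substitute.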
Sample 5-10% for human review at each iteration. Look for mode collapse, unrealistic patterns, unintended biases. Add approved samples back to the seed set. Iterate 3-5 rounds.
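Put together, the loop might look like the sketch below. The `generate`, `critic`, and `human_review` callables are hypothetical placeholders for your own generator call, critic filter, and review process. It also assumes that "approved" means approved by the human reviewer, and that samples the reviewer rejects are dropped.

```python
import random
from typing import Callable, List, Optional, Tuple

def run_generation_rounds(
    seeds: List[str],
    generate: Callable[[List[str]], List[str]],  # seed pool -> raw candidates
    critic: Callable[[List[str]], List[str]],    # candidates -> samples that pass
    human_review: Callable[[str], bool],         # True if the reviewer approves
    rounds: int = 3,
    review_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Return (final seed pool, all synthetic samples kept across rounds)."""
    rng = rng or random.Random()
    seed_pool = list(seeds)
    synthetic: List[str] = []
    for _ in range(rounds):
        kept = critic(generate(seed_pool))
        if not kept:
            continue
        # Send a random 5-10% slice (here review_fraction) to a human.
        k = max(1, round(len(kept) * review_fraction))
        reviewed_idx = set(rng.sample(range(len(kept)), k))
        rejected_idx = {i for i in reviewed_idx if not human_review(kept[i])}
        # Human-approved samples become seeds for the next round.
        seed_pool.extend(kept[i] for i in sorted(reviewed_idx - rejected_idx))
        # Everything the critic kept stays, except what the human rejected.
        synthetic.extend(s for i, s in enumerate(kept) if i not in rejected_idx)
    return seed_pool, synthetic
```

A high human rejection rate in any round is a signal to stop and fix the prompt or critic rather than continue iterating; that policy is left to the caller here.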
The traps
Mode collapse
LLMs converge on a narrow set of patterns. Your 10,000 synthetic support tickets start sounding like 5 archetypes repeated with different entity names. Mitigation: explicit diversity prompting, a critic that rejects near-duplicates, and varied generator temperature and sampling parameters.
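One cheap way to watch for this at the corpus level is a distinct-n score: the share of n-grams across the whole set that are unique. This is a generic diversity metric swapped in here for illustration, not something this post prescribes, and whitespace tokenization is a simplifying assumption.

```python
from typing import Iterable

def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all texts.

    Values near 1.0 mean little repeated phrasing; low values suggest
    the set is reusing the same templates.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

collapsed = [
    "I cannot log in to my account",
    "I cannot log in to my profile",
    "I cannot log in to my dashboard",
]
varied = [
    "I cannot log in to my account",
    "Refund still missing after two weeks",
    "App crashes when uploading a photo",
]
print(distinct_n(collapsed) < distinct_n(varied))  # True
```

Tracking this score per generation round, alongside the same score on the real seed set, gives a baseline for how much repetition is normal in the domain.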
Generator bias leaking into student model
If you train a small model on data generated by GPT-4, the small model inherits GPT-4's quirks and biases. This is bad when the small model is supposed to model a different distribution (your domain, your users). Mitigation: use a generator that is different from the target model, and rely on the critic or human review to catch bias patterns.
Confidence in your eval set
Don't use the same LLM to generate training and eval data — you'll measure your model's ability to match the generator, not the real world. Keep a real-data eval set, always, regardless of how much synthetic training data you use.
When to buy data instead
For tasks where high-quality human-labeled data exists commercially (common NLP tasks, standard categories), buying is often cheaper than generating and filtering. Vendors such as Scale AI, Surge AI, and Prolific provide human labeling or access to annotators. Do the math per 1,000 quality samples before committing to a generation pipeline.
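The comparison can be sketched as a small calculation. Every number below is a made-up placeholder, not a quote from any vendor or model provider; substitute your own per-sample costs, critic acceptance rate, and review costs.

```python
def generation_cost_per_1000(
    cost_per_candidate: float,      # generator + critic cost for one raw candidate
    acceptance_rate: float,         # share of candidates that survive the critic
    review_fraction: float = 0.0,   # share of accepted samples a human reviews
    review_cost_per_sample: float = 0.0,
) -> float:
    """Cost of producing 1,000 accepted samples with a generate-and-filter pipeline."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    candidates_needed = 1000 / acceptance_rate
    generation_cost = candidates_needed * cost_per_candidate
    review_cost = 1000 * review_fraction * review_cost_per_sample
    return generation_cost + review_cost

# Placeholder inputs: $0.02 per candidate, 40% acceptance,
# 10% human review at $0.50 per reviewed sample.
synthetic_cost = generation_cost_per_1000(0.02, 0.40, 0.10, 0.50)

# Placeholder vendor price of $0.30 per labeled sample.
vendor_cost = 1000 * 0.30

print(f"synthetic: ${synthetic_cost:.2f} vs vendor: ${vendor_cost:.2f}")
```

The acceptance rate dominates the result: halving it doubles the generation cost, so measure it on a pilot batch before extrapolating. Engineering time to build and maintain the pipeline is not included in this sketch.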