Data cleaning has been painful forever. LLMs change the economics — fuzzy matching, entity resolution, format standardization at scale that previously required expensive rule-based systems or manual work. Used thoughtfully, AI accelerates cleaning significantly; used naively, it produces plausible-but-wrong outputs. This post is the patterns that work for AI-assisted data cleaning.
Deduplication
Fuzzy matching. 'John Smith' vs 'Jon Smith' vs 'John Smith Jr.' Rule-based fuzzy matching works; LLM-augmented context-aware matching works better for complex cases.
Entity resolution. Same entity across sources with different representations. Company databases: 'IBM' vs 'International Business Machines' vs 'I.B.M.'
Embedding similarity. Convert entities to embeddings; cluster by similarity. Good for bulk operations.
Human review for uncertain matches. Low-confidence pairs flagged for human decision.
Standardization
Address parsing. Breaking '123 Main Street, Apt 4B, Cityville, ST 12345' into structured components. AI handles many formats better than regex.
Name canonicalization. 'Dr. Jane Smith, MD' → 'Jane Smith'. Removing honorifics, handling variants.
Date/number format. Input in various formats; output in single canonical format. AI robust to edge cases.
Phone number standardization. International formats, extensions, country codes.
Enrichment
Missing field inference. Given available fields, infer missing ones. Industry code from company description; gender from name; category from product description.
Category assignment. Classify items into taxonomy. Products to categories, documents to topics.
Validation against rules. Check data against business rules; flag violations. AI detects subtle violations rule-based systems miss.
External data joins. Match records against external databases (company data, geographic data).
Risk controls — where AI cleaning goes wrong
Sample validation before trusting at scale. Run on 1000 records; manually validate 100. Quality acceptable? Scale up. Quality poor? Refine prompt or approach.
Confidence thresholds. Low-confidence AI decisions flagged for human review. High-confidence automated.
Audit trail. Track which cells were AI-modified, by which model, with what confidence. Reconstruct decisions; detect systematic issues.
Human-in-loop for high-stakes. Financial, legal, medical data — automate cleaning cautiously. Mistakes expensive.
AI-specific failure modes
Plausible-but-wrong outputs. AI generates reasonable-looking data that's wrong. Hard to detect without ground truth.
Inconsistent decisions. Same input, different calls, different outputs. Determinism matters for data cleaning; temperature=0 helps.
Systematic biases. AI may systematically miscategorize certain inputs. Audit for patterns in corrections.
Overreach. AI 'fixes' data that was actually correct in unusual format. Losing legitimate data variance.
Economic considerations
AI cost per record. Typically $0.001-$0.01 per record for LLM-based cleaning. At million-record scales, adds up.
Batch processing. Batch API (50% discount at most providers) for non-real-time cleaning.
Caching. Same inputs produce same outputs; cache aggressively.
Model selection. Smaller models (Haiku, GPT-4 mini) often sufficient for cleaning tasks. Much cheaper.
Pipeline integration
ETL/ELT pipeline insertion. AI cleaning steps added to existing data pipelines.
dbt models. AI-assisted transformations as dbt models.
Airflow/Dagster operators. Schedule AI cleaning steps.
MLflow lineage. Track model versions, prompts, outputs for reproducibility.
Tools
OpenRefine. Classic data cleaning; AI extensions emerging.
Trifacta (Alteryx). Visual data wrangling with ML assistance.
DIY with LangChain, LlamaIndex. Many data cleaning tasks straightforward with LLM APIs.
Specialized. Melissa (address validation), Dun & Bradstreet (company data), LinkedIn Talent Insights (people data).