eazyware
Engineering·May 22, 2023·10 min read

AI for data cleaning: moving past manual SQL

Duplicate detection, entity resolution, standardization, outlier correction. AI in the unglamorous but essential work of data cleaning.

KR
Kushal R.
Engineering lead

Data cleaning has been painful forever. LLMs change the economics — fuzzy matching, entity resolution, format standardization at scale that previously required expensive rule-based systems or manual work. Used thoughtfully, AI accelerates cleaning significantly; used naively, it produces plausible-but-wrong outputs. This post is the patterns that work for AI-assisted data cleaning.

Where it helps
AI for data cleaning — where it helps Deduplication Fuzzy matching Entity resolution Embedding similarity Standardization Address parsing Name canonicalization Date/number formats Enrichment Missing field inference Category assignment Validation against rules Risk controls Sample output validation before trusting at scale Confidence thresholds — low confidence flagged for human review Audit trail — track which cells were AI-modified, by which model
Deduplication: fuzzy matching, entity resolution. Standardization: addresses, names, dates. Enrichment: missing fields, categorization, validation.

Deduplication

Fuzzy matching. 'John Smith' vs 'Jon Smith' vs 'John Smith Jr.' Rule-based fuzzy matching works; LLM-augmented context-aware matching works better for complex cases.

Entity resolution. Same entity across sources with different representations. Company databases: 'IBM' vs 'International Business Machines' vs 'I.B.M.'

Embedding similarity. Convert entities to embeddings; cluster by similarity. Good for bulk operations.

Human review for uncertain matches. Low-confidence pairs flagged for human decision.

Standardization

Address parsing. Breaking '123 Main Street, Apt 4B, Cityville, ST 12345' into structured components. AI handles many formats better than regex.

Name canonicalization. 'Dr. Jane Smith, MD' → 'Jane Smith'. Removing honorifics, handling variants.

Date/number format. Input in various formats; output in single canonical format. AI robust to edge cases.

Phone number standardization. International formats, extensions, country codes.

Enrichment

Missing field inference. Given available fields, infer missing ones. Industry code from company description; gender from name; category from product description.

Category assignment. Classify items into taxonomy. Products to categories, documents to topics.

Validation against rules. Check data against business rules; flag violations. AI detects subtle violations rule-based systems miss.

External data joins. Match records against external databases (company data, geographic data).

Risk controls — where AI cleaning goes wrong

Sample validation before trusting at scale. Run on 1000 records; manually validate 100. Quality acceptable? Scale up. Quality poor? Refine prompt or approach.

Confidence thresholds. Low-confidence AI decisions flagged for human review. High-confidence automated.

Audit trail. Track which cells were AI-modified, by which model, with what confidence. Reconstruct decisions; detect systematic issues.

Human-in-loop for high-stakes. Financial, legal, medical data — automate cleaning cautiously. Mistakes expensive.

AI-specific failure modes

Plausible-but-wrong outputs. AI generates reasonable-looking data that's wrong. Hard to detect without ground truth.

Inconsistent decisions. Same input, different calls, different outputs. Determinism matters for data cleaning; temperature=0 helps.

Systematic biases. AI may systematically miscategorize certain inputs. Audit for patterns in corrections.

Overreach. AI 'fixes' data that was actually correct in unusual format. Losing legitimate data variance.

Economic considerations

AI cost per record. Typically $0.001-$0.01 per record for LLM-based cleaning. At million-record scales, adds up.

Batch processing. Batch API (50% discount at most providers) for non-real-time cleaning.

Caching. Same inputs produce same outputs; cache aggressively.

Model selection. Smaller models (Haiku, GPT-4 mini) often sufficient for cleaning tasks. Much cheaper.

Pipeline integration

ETL/ELT pipeline insertion. AI cleaning steps added to existing data pipelines.

dbt models. AI-assisted transformations as dbt models.

Airflow/Dagster operators. Schedule AI cleaning steps.

MLflow lineage. Track model versions, prompts, outputs for reproducibility.

Tools

OpenRefine. Classic data cleaning; AI extensions emerging.

Trifacta (Alteryx). Visual data wrangling with ML assistance.

DIY with LangChain, LlamaIndex. Many data cleaning tasks straightforward with LLM APIs.

Specialized. Melissa (address validation), Dun & Bradstreet (company data), LinkedIn Talent Insights (people data).

Read next
AI for data labeling: active learning, weak supervision, LLM labels
Read next
AI for schema matching: data integration at scale
Read next
AI for ETL pipelines: generation, monitoring, repair
Tags
data cleaningETLdata quality
/ Next step

Want to talk about this?

We love debating this stuff. 30-minute call, no pitch, just engineering conversation.

~4h
avg response
Q2 '26
next slot
100%
NDA on request