Schema matching, the work of figuring out how fields in different data sources correspond, has historically been labor-intensive. AI helps significantly, especially with initial proposals and semantic equivalents. This post covers the AI approaches to schema matching and the patterns that work at scale.

Signal sources

Name matching: fuzzy string, embedding, abbreviations. Content inspection: distributions, types, cardinality. LLM-assisted: semantic, domain context, transforms.

Name matching

Fuzzy string match. 'customer_name' ↔ 'cust_name' ↔ 'CustomerName'. Edit distance, Jaccard similarity.
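A minimal sketch of the two string signals, assuming field names are normalized first (separators and camelCase split, then lowercased). The helper names are illustrative, and difflib's SequenceMatcher ratio stands in for an edit-distance score here; a dedicated Levenshtein library would work the same way.

```python
import re
from difflib import SequenceMatcher


def tokens(name: str) -> list[str]:
    """Split a field name on separators and camelCase boundaries, lowercased."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the two names' token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def string_similarity(a: str, b: str) -> float:
    """Character-level ratio on the normalized names (stand-in for edit distance)."""
    return SequenceMatcher(None, "".join(tokens(a)), "".join(tokens(b))).ratio()
```

After normalization 'customer_name' and 'CustomerName' are identical on both measures. 'cust_name' shares only one of three distinct tokens with 'customer_name', so its Jaccard score is low even though the character-level ratio stays high, which is one reason to keep both signals.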

Embedding similarity. Field names as embeddings; similarity in semantic space. Catches 'client' ↔ 'customer'.
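As a rough illustration, embedding comparison reduces to cosine similarity between vectors. The sketch below assumes the vectors have already been produced by some embedding model; TOY_VECTORS holds hand-written placeholder numbers, not real model output, and best_match is a hypothetical helper.

```python
import math

# Placeholder vectors for illustration only; real ones come from an embedding model.
TOY_VECTORS = {
    "client": [0.9, 0.1, 0.0],
    "customer": [0.8, 0.2, 0.1],
    "order_date": [0.0, 0.1, 0.9],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def best_match(source: str, targets: list[str],
               vectors: dict[str, list[float]]) -> tuple[str, float]:
    """Return the target field whose vector is closest to the source field's."""
    scored = [(t, cosine(vectors[source], vectors[t])) for t in targets]
    return max(scored, key=lambda pair: pair[1])
```

With these placeholder vectors, best_match("client", ["customer", "order_date"], TOY_VECTORS) picks "customer". Whether a real model puts two field names close together depends on the model and on how much context (description, comments) is embedded along with the name.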

Abbreviation handling. Common abbreviations expanded or normalized. 'CNT' → 'COUNT'.
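One simple way to do this is a lookup table applied token by token. The table below is a hypothetical example; in practice it would be curated per domain.

```python
import re

# Hypothetical abbreviation table; a real one would be curated per domain.
ABBREVIATIONS = {
    "cnt": "count",
    "cust": "customer",
    "amt": "amount",
    "qty": "quantity",
    "dt": "date",
}


def expand(name: str) -> str:
    """Lowercase a field name, split on separators, expand known abbreviations."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name.lower()) if p]
    return "_".join(ABBREVIATIONS.get(p, p) for p in parts)
```

Running expand before fuzzy matching lets 'CUST_CNT' and 'customer_count' compare as equal. Abbreviations can be ambiguous ('cnt' might mean 'country' in some schemas), so a static table is a starting point rather than a guarantee.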

Multiple name signals. Name plus description plus column comments give richer input than the name alone.

Content inspection

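Value distributions. A column with values 1-100 and one with values 1-1000 probably do not hold the same quantity, even if the names look alike. Distributions reveal semantics.

Data type inference. String, number, date, boolean. Types must be compatible, or convertible, for a match to hold; a date stored as a string can still match a date column.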

Cardinality and uniqueness. Primary key columns vs categorical columns distinguishable by cardinality.

Sample values. Example values can reveal semantics. Emails in email column; amounts in currency column.
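The content signals above can be sketched as a small column profiler. This assumes values arrive as raw strings (as from a CSV); the type labels, the email pattern, and the profile keys are illustrative choices, not a standard.

```python
import re
from collections import Counter
from datetime import date

# Deliberately loose pattern; good enough to flag a column as "looks like emails".
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def infer_type(value: str) -> str:
    """Classify one raw string as boolean, number, date, email, or string."""
    v = value.strip()
    if v.lower() in {"true", "false"}:
        return "boolean"
    try:
        float(v)
        return "number"
    except ValueError:
        pass
    try:
        date.fromisoformat(v)
        return "date"
    except ValueError:
        pass
    return "email" if EMAIL.match(v) else "string"


def profile(values: list[str]) -> dict:
    """Summarize a column: dominant type, share of distinct values, null rate."""
    non_empty = [v for v in values if v.strip()]
    types = Counter(infer_type(v) for v in non_empty)
    return {
        "dominant_type": types.most_common(1)[0][0] if types else "empty",
        "uniqueness": len(set(non_empty)) / len(non_empty) if non_empty else 0.0,
        "null_rate": 1 - len(non_empty) / len(values) if values else 0.0,
    }
```

A uniqueness near 1.0 suggests a key-like column, while a low value suggests a categorical one. Two columns whose profiles disagree on dominant type or uniqueness are weak match candidates even when their names are similar.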

LLM-assisted matching

Semantic understanding. LLM can judge 'customer_id' and 'account_number' refer to same concept in context.

Domain context. LLM applies general industry knowledge; company-specific business context has to be supplied in the prompt.

Complex transforms. 'first_name + last_name' ↔ 'full_name'. LLM proposes transformations.

Natural language explanation. LLM justifies matches; human reviewer can validate reasoning.
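A hedged sketch of how such a request might be structured. No particular LLM API is assumed: call_llm is a hypothetical callable that takes a prompt string and returns the model's text reply, and the JSON keys ("sources", "target", "transform", "reason") are an illustrative convention.

```python
import json
from typing import Callable


def build_prompt(source_fields: list[str], target_fields: list[str], context: str) -> str:
    """Assemble a prompt asking for matches, optional transforms, and reasons."""
    return (
        "You are matching fields between two schemas.\n"
        f"Business context: {context}\n"
        f"Source fields: {json.dumps(source_fields)}\n"
        f"Target fields: {json.dumps(target_fields)}\n"
        'Reply with a JSON list of objects with keys "sources" (list of source '
        'field names), "target", "transform" (or null), and "reason".'
    )


def propose_matches(source_fields: list[str], target_fields: list[str], context: str,
                    call_llm: Callable[[str], str]) -> list[dict]:
    """Ask the model for proposals and keep only those naming real fields."""
    raw = call_llm(build_prompt(source_fields, target_fields, context))
    try:
        proposals = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable reply: treat as no proposals
    if not isinstance(proposals, list):
        return []
    valid = []
    for p in proposals:
        if not isinstance(p, dict) or not isinstance(p.get("sources"), list):
            continue
        # Drop proposals that mention fields not present in either schema.
        if all(s in source_fields for s in p["sources"]) and p.get("target") in target_fields:
            valid.append(p)
    return valid
```

Making "sources" a list lets the model propose many-to-one matches such as first_name + last_name to full_name. The "reason" field is what a human reviewer reads; filtering out field names the model invented is a cheap guard before any proposal reaches review.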

Practice

Multiple signals combined. Name + content + context weighted together. No single signal sufficient.

Confidence scoring. Match confidence based on signal agreement. High-confidence matches are accepted automatically; anything lower triggers review.

Human-in-loop for ambiguous. Matches with moderate confidence reviewed by human. Decisions stored as training data.
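The combination and routing steps might look like the following. The weights, the thresholds, and the extra "reject" tier for very low scores are all assumptions made for illustration; they would need tuning against reviewed matches.

```python
from dataclasses import dataclass

# Hypothetical weights and thresholds; tune them on reviewed matches.
WEIGHTS = {"name": 0.4, "content": 0.3, "llm": 0.3}
AUTO_ACCEPT = 0.85
REVIEW = 0.5


@dataclass
class Candidate:
    source: str
    target: str
    signals: dict[str, float]  # each score in [0, 1], keyed like WEIGHTS


def confidence(c: Candidate) -> float:
    """Weighted average over whichever signals are present for this candidate."""
    present = {k: w for k, w in WEIGHTS.items() if k in c.signals}
    total = sum(present.values())
    return sum(c.signals[k] * w for k, w in present.items()) / total if total else 0.0


def route(c: Candidate) -> str:
    """Send a candidate to automation, human review, or rejection."""
    score = confidence(c)
    if score >= AUTO_ACCEPT:
        return "auto_accept"
    return "human_review" if score >= REVIEW else "reject"
```

Renormalizing over the signals that are present keeps a candidate with no LLM score comparable to one with all three. Reviewer decisions on the "human_review" bucket are the natural data for adjusting the weights later.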

Iterate. Each project teaches the matching system. Accuracy can improve over time as reviewed decisions accumulate.

Applications

Data integration projects. Combining data from multiple systems (CRM integration, ERP migration, M&A data merger).

Data catalog curation. Enterprise data catalogs map fields across sources. AI accelerates initial mapping.

ETL pipeline development. Source-to-target mapping in ETL tools. AI proposes mappings; developer validates.

API integration. Matching API fields to internal schema. Large integration projects particularly benefit.

Transformations beyond matching

Type conversions. '2024-01-15' → Unix timestamp. AI infers target type from context.
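For instance, the date conversion could be written as below. The source string carries no time or zone, so this sketch assumes midnight UTC; a different assumption gives a different timestamp.

```python
from datetime import datetime, timezone


def iso_date_to_unix(value: str) -> int:
    """Convert 'YYYY-MM-DD' to a Unix timestamp, assuming midnight UTC."""
    dt = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


print(iso_date_to_unix("2024-01-15"))  # 1705276800
```

The time-zone choice is exactly the kind of detail a proposed transform should state explicitly so a reviewer can confirm it.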

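Unit conversions. Kilometers to miles; dollars to euros. Based on metadata or context. Currency conversion additionally needs an exchange rate for a specific date, which cannot be inferred from the schema.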
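A short sketch of the difference between the two kinds of conversion. The function names are illustrative; the currency function takes the rate as an argument because no rate is fixed.

```python
KM_PER_MILE = 1.609344  # exact, by the international definition of the mile


def km_to_miles(km: float) -> float:
    """Fixed-factor unit conversion."""
    return km / KM_PER_MILE


def convert_currency(amount: float, rate: float) -> float:
    """Multiply by a caller-supplied rate; the rate and its date are not inferred."""
    return round(amount * rate, 2)
```

A mapping that involves currency should record which rate and which date were used, since re-running it later with a different rate changes the data.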

Splitting and joining. 'name' → 'first_name' + 'last_name'. Requires transformation code.
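A naive version of that transformation code, to show why splitting is harder than joining. The rule used here (last whitespace-separated token is the last name) is an assumption and is wrong for many real names.

```python
def join_name(first: str, last: str) -> str:
    """Join name parts with a single space, skipping empty parts."""
    return " ".join(p for p in (first.strip(), last.strip()) if p)


def split_name(full: str) -> tuple[str, str]:
    """Naive split: treat the last whitespace-separated token as the last name."""
    parts = full.split()
    if len(parts) < 2:
        return (full.strip(), "")
    return (" ".join(parts[:-1]), parts[-1])
```

Joining is lossless in one direction only: split_name("Ludwig van Beethoven") puts "van" in the first name, so a round trip through split and join does not recover the original fields. Split mappings are good candidates for human review.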

Business logic. Customer segments derived from purchase history. Beyond simple mapping.

Tools

Data catalog platforms. Alation, Collibra, Atlan, Informatica with AI features.

Schema matching-specific. Less common as standalone; usually part of larger tools.

ETL tools. Fivetran, Stitch, Airbyte. AI-assisted mapping support varies by tool and version.

DIY with LLMs. Many teams build custom matching with LLM APIs. Flexible; often tailored to specific needs.

Pitfalls

False confidence. AI proposes a plausible match that's semantically wrong. Validate against sample values before accepting.

Historical mismatches embedded. If wrong matches were made in the past and stored as training data, the system may learn those patterns.

Temporal schema drift. Schemas change over time; matches may need updating.
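One way to notice drift is to diff periodic snapshots of each source schema and flag stored matches that touch changed fields. This sketch assumes a snapshot is a plain {field_name: type} dict and that matches are stored as {source_field: target_field}; both shapes are illustrative.

```python
def schema_drift(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {field_name: type} snapshots of the same source."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(f for f in old.keys() & new.keys() if old[f] != new[f]),
    }


def stale_matches(matches: dict[str, str], drift: dict[str, list[str]]) -> list[str]:
    """Source fields whose stored match should be re-reviewed."""
    changed = set(drift["removed"]) | set(drift["retyped"])
    return sorted(s for s in matches if s in changed)
```

A rename shows up as one removal plus one addition, so added fields are worth running back through the matcher rather than treating as entirely new.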

Permission and privacy. Schema matching may involve sensitive data; handle carefully.
