Schema matching, figuring out how fields in different data sources correspond, has historically been labor-intensive. AI helps significantly, especially for generating initial proposals and handling semantic equivalents. This post covers the AI approaches to schema matching and the patterns that work at scale.
Name matching
Fuzzy string match. 'customer_name' ↔ 'cust_name' ↔ 'CustomerName'. Edit distance on characters, Jaccard similarity on tokens.
Embedding similarity. Field names as embeddings; similarity in semantic space. Catches 'client' ↔ 'customer'.
Abbreviation handling. Common abbreviations expanded or normalized. 'CNT' → 'COUNT'.
Multiple name signals. Name plus description plus column comments — richer input.
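The name signals above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production matcher: the ABBREVIATIONS table is hypothetical, and difflib's SequenceMatcher ratio stands in for a true edit distance. Embedding similarity is left out because it depends on whichever embedding model a team has available.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be domain-specific.
ABBREVIATIONS = {"cust": "customer", "cnt": "count", "amt": "amount", "num": "number"}


def tokenize(name: str) -> list[str]:
    """Split snake_case, kebab-case, and CamelCase names into lowercase tokens."""
    # Insert a space at lower-to-upper transitions: 'CustomerName' -> 'Customer Name'
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]
    # Expand known abbreviations so 'cust' and 'customer' compare equal.
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def name_similarity(left: str, right: str) -> float:
    """Average of token-level Jaccard and a character-level similarity ratio."""
    lt, rt = tokenize(left), tokenize(right)
    token_score = jaccard(set(lt), set(rt))
    # SequenceMatcher.ratio() is a stand-in for a normalized edit distance.
    char_score = SequenceMatcher(None, " ".join(lt), " ".join(rt)).ratio()
    return (token_score + char_score) / 2
```

With this normalization, 'customer_name', 'cust_name', and 'CustomerName' all reduce to the same tokens. A pair like 'client' and 'customer' would still score low here, which is the gap embedding similarity is meant to fill.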
Content inspection
Value distributions. A column with values 1-100 and one with values 1-1000 are probably not the same thing. Distributions reveal semantics.
Data type inference. String, number, date, boolean. Types must be compatible for a match.
Cardinality and uniqueness. Primary key columns vs categorical columns distinguishable by cardinality.
Sample values. Example values can reveal semantics. Emails in email column; amounts in currency column.
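A rough sketch of content inspection: profile each column from a sample of raw string values, then compare profiles across sources. The function names and the profile fields are illustrative choices, and the date check assumes ISO 'YYYY-MM-DD' values only.

```python
from collections import Counter
from datetime import datetime


def infer_type(value: str) -> str:
    """Guess a coarse type for one raw string value."""
    v = value.strip()
    if v.lower() in {"true", "false"}:
        return "boolean"
    try:
        float(v)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(v, "%Y-%m-%d")  # assumption: ISO dates only
        return "date"
    except ValueError:
        return "string"


def profile_column(values: list[str]) -> dict:
    """Summarize a column: dominant type, distinct count, uniqueness, numeric range."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return {"type": "unknown", "distinct": 0, "uniqueness": 0.0}
    types = [infer_type(v) for v in non_empty]
    dominant = Counter(types).most_common(1)[0][0]
    distinct = len(set(non_empty))
    profile = {
        "type": dominant,
        "distinct": distinct,
        # Close to 1.0 suggests a key column; close to 0 suggests a categorical one.
        "uniqueness": distinct / len(non_empty),
    }
    if dominant == "number":
        numbers = [float(v) for v, t in zip(non_empty, types) if t == "number"]
        profile["min"], profile["max"] = min(numbers), max(numbers)
    return profile
```

Two columns whose profiles disagree on type, or whose numeric ranges differ by an order of magnitude, are weak match candidates even when their names look alike.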
LLM-assisted matching
Semantic understanding. An LLM can judge whether 'customer_id' and 'account_number' refer to the same concept in context.
Domain context. An LLM can apply industry-specific knowledge, particularly when the prompt supplies the business context.
Complex transforms. 'first_name + last_name' ↔ 'full_name'. LLM proposes transformations.
Natural language explanation. LLM justifies matches; human reviewer can validate reasoning.
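One way to wire this up, sketched without tying it to any particular vendor: build a prompt from both schemas, ask for JSON, and discard proposals that name fields that do not exist. Here call_llm is a hypothetical callable (prompt in, reply text out) that you would implement with your LLM client of choice, and the JSON shape is an assumption of this sketch rather than any standard.

```python
import json
from typing import Callable

Schema = dict[str, list[str]]  # field name -> a few sample values


def build_prompt(source: Schema, target: Schema) -> str:
    """Describe both schemas with sample values and request JSON-only output."""

    def describe(schema: Schema) -> str:
        return "\n".join(f"- {name}: e.g. {samples[:3]}" for name, samples in schema.items())

    return (
        "Propose matches from source fields to target fields.\n"
        f"Source fields:\n{describe(source)}\n"
        f"Target fields:\n{describe(target)}\n"
        "Reply with only a JSON list. Each item has keys: "
        '"source" (list of source field names), "target" (one target field name), '
        '"transform" (expression or null), "reason" (one sentence).'
    )


def propose_matches(source: Schema, target: Schema,
                    call_llm: Callable[[str], str]) -> list[dict]:
    """Ask the model for matches; keep only proposals that name real fields."""
    try:
        proposals = json.loads(call_llm(build_prompt(source, target)))
    except json.JSONDecodeError:
        return []  # unparseable reply: treat as no proposals
    if not isinstance(proposals, list):
        return []
    valid = []
    for p in proposals:
        if not isinstance(p, dict):
            continue
        src, tgt = p.get("source"), p.get("target")
        if not isinstance(src, list) or not src or not isinstance(tgt, str):
            continue
        # Models can invent field names, so check each one against the schemas.
        if tgt in target and all(isinstance(s, str) and s in source for s in src):
            valid.append(p)
    return valid
```

The "reason" field is what a human reviewer reads, and "transform" carries proposals such as joining first_name and last_name into full_name. Neither should be executed or trusted without review.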
Practice
Multiple signals combined. Name + content + context weighted together. No single signal sufficient.
Confidence scoring. Match confidence based on signal agreement. High confidence is accepted automatically; anything lower triggers review.
Human-in-the-loop for ambiguous matches. Matches with moderate confidence are reviewed by a human. Decisions are stored as training data.
Iterate. Each project teaches the matching system. Accuracy improves over time.
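The combination and routing steps can be sketched as a weighted average with two thresholds. The weights, the thresholds, and the extra "reject" tier for very low scores are illustrative assumptions. In practice they would be tuned on matches that reviewers have already confirmed or rejected.

```python
from dataclasses import dataclass

# Illustrative weights and thresholds, not recommended values.
WEIGHTS = {"name": 0.4, "content": 0.3, "llm": 0.3}
AUTO_ACCEPT = 0.85
REVIEW_FLOOR = 0.5


@dataclass
class Candidate:
    source: str
    target: str
    signals: dict[str, float]  # signal name -> score in [0, 1]


def confidence(candidate: Candidate) -> float:
    """Weighted average over whichever known signals are present."""
    used = {k: v for k, v in candidate.signals.items() if k in WEIGHTS}
    total_weight = sum(WEIGHTS[k] for k in used)
    if total_weight == 0:
        return 0.0
    return sum(WEIGHTS[k] * v for k, v in used.items()) / total_weight


def route(candidate: Candidate) -> str:
    """Return 'accept', 'review', or 'reject' based on the confidence score."""
    score = confidence(candidate)
    if score >= AUTO_ACCEPT:
        return "accept"
    if score >= REVIEW_FLOOR:
        return "review"
    return "reject"


# Reviewer decisions are kept so later runs can learn from them.
review_log: list[dict] = []


def record_decision(candidate: Candidate, approved: bool) -> None:
    """Store a human decision alongside the signals that led to the review."""
    review_log.append({
        "source": candidate.source,
        "target": candidate.target,
        "signals": dict(candidate.signals),
        "approved": approved,
    })
```

A weighted average is the simplest option. It does not penalize signals that disagree with each other, so a team that cares about agreement specifically might also look at the spread between signals before auto-accepting.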
Applications
Data integration projects. Combining data from multiple systems (CRM integration, ERP migration, M&A data merger).
Data catalog curation. Enterprise data catalogs map fields across sources. AI accelerates initial mapping.
ETL pipeline development. Source-to-target mapping in ETL tools. AI proposes mappings; developer validates.
API integration. Matching API fields to internal schema. Large integration projects particularly benefit.
Transformations beyond matching
Type conversions. '2024-01-15' → Unix timestamp. AI infers target type from context.
Unit conversions. Kilometers to miles; dollars to euros. Based on metadata or context. Currency conversion also needs an exchange rate and an as-of date.
Splitting and joining. 'name' → 'first_name' + 'last_name'. Requires transformation code.
Business logic. Customer segments derived from purchase history. Beyond simple mapping.
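A few of these transformations, sketched as plain functions. The function names are illustrative. The date conversion assumes midnight UTC, and the name split is deliberately naive, since real names do not reliably divide into two parts.

```python
from datetime import datetime, timezone


def iso_date_to_unix(value: str) -> int:
    """'YYYY-MM-DD' -> seconds since the epoch, assuming midnight UTC."""
    dt = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def km_to_miles(km: float) -> float:
    """One international mile is exactly 1.609344 km."""
    return km / 1.609344


def split_name(full_name: str) -> tuple[str, str]:
    """Naive split on the last space; single-word names get an empty last name."""
    first, _, last = full_name.strip().rpartition(" ")
    return (first, last) if first else (last, "")


def join_name(first_name: str, last_name: str) -> str:
    """Inverse direction: 'first_name + last_name' -> 'full_name'."""
    return f"{first_name} {last_name}".strip()
```

For example, iso_date_to_unix("2024-01-15") gives 1705276800 under the UTC assumption. A source that records local dates would need its timezone from metadata, which is exactly the kind of context the matching step has to surface.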
Tools
Data catalog platforms. Alation, Collibra, Atlan, Informatica with AI features.
Schema matching-specific. Less common as standalone; usually part of larger tools.
ETL tools. Fivetran, Stitch, Airbyte with AI-assisted mapping capabilities.
DIY with LLMs. Many teams build custom matching with LLM APIs. Flexible; often tailored to specific needs.
Pitfalls
False confidence. AI proposes a plausible match that is semantically wrong. Validate against sample values before accepting.
Historical mismatches embedded. If wrong matches were made in the past, the system may learn those patterns.
Temporal schema drift. Schemas change over time; matches may need updating.
Permission and privacy. Schema matching may involve sensitive data; handle carefully.
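Schema drift in particular is cheap to check for. A minimal sketch, assuming mappings are stored as a simple source-field to target-field dictionary (a hypothetical storage format): compare the stored mappings against the current field lists and flag anything that no longer lines up.

```python
def check_drift(mappings: dict[str, str],
                source_fields: set[str],
                target_fields: set[str]) -> dict:
    """Flag mappings that point at missing fields, plus new unmapped source fields."""
    # A mapping is broken if either end of it has disappeared.
    broken = {
        s: t for s, t in mappings.items()
        if s not in source_fields or t not in target_fields
    }
    # Source fields added since the mappings were written.
    unmapped = sorted(source_fields - set(mappings))
    return {"broken": broken, "unmapped": unmapped}
```

Running a check like this on a schedule, or whenever a source schema changes, turns silent drift into a review queue. It only catches renamed or removed fields. A field that keeps its name but changes meaning still needs content inspection.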