AI integrates into ETL and data pipelines for specific tasks (document parsing, classification, enrichment, anomaly detection) while leaving deterministic transforms alone. The key is knowing what AI does well and what is better left to traditional logic. This post covers the integration patterns and the cost/quality tradeoffs.

Where AI fits in ETL

Extract: document parsing, semi-structured data, entity extraction. Transform: cleaning, classification, enrichment. Load: quality validation, anomaly detection.

Extract stage

Document parsing. PDFs, Word docs, email. AI extracts structured data from unstructured sources.

Semi-structured sources. HTML pages, logs with complex formats. AI handles variance better than regex.

Entity extraction. People, companies, locations from text. LLMs competitive with specialized NER models.

Image / scanned document processing. OCR plus layout understanding. LLMs with vision capabilities.

Transform stage

Data cleaning. See data cleaning post.

Classification and tagging. Text to categories; products to taxonomy.

Enrichment. Adding fields from context (sentiment, topic, category) or external sources.

Schema matching. See schema matching post.

Load stage

Quality validation. Does output meet quality thresholds? AI can catch subtle quality issues.

Anomaly detection. Unusual patterns flagged for review. See anomaly detection post.

Lineage tracking. Which AI transformations applied to which records? Auditability.

Where AI fits, where it doesn't

AI fits. Unstructured sources, document parsing, classification, enrichment, fuzzy matching.

AI doesn't fit. Deterministic transforms where the rules are clear and cheap. Sum totals, concatenations, simple joins: don't use AI.

Context matters. The same transform (text normalization) is deterministic in some cases (lowercasing) and AI-appropriate in others (canonical name resolution, such as mapping "IBM Corp." and "International Business Machines" to one entity).

Cost management

AI steps can blow the pipeline budget. A million rows × $0.01/row = $10,000 per run; at daily runs that is roughly $300,000 a month.

Sample aggressively during development. Run full volumes only after the prompt and logic have stabilized.

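Batch API where possible. Some providers (OpenAI and Anthropic, for example) discount batch requests by about 50% in exchange for asynchronous completion, which suits non-real-time workloads.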

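Cache aggressively. Repeated inputs should not be paid for twice; cache by a hash of the input, and include the prompt and model version in the key so that changing either invalidates stale entries.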
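
Caching by input hash can be sketched as follows. This is a minimal in-memory version, not a production cache: `CachedStep`, `cache_key`, and the `call_model` callable are hypothetical names, and including the prompt version and model name in the key is an assumption added here so that changing either one does not return stale results.

```python
import hashlib
import json
from typing import Callable


def cache_key(text: str, prompt_version: str, model: str) -> str:
    """Stable SHA-256 key over the input plus everything that affects the output."""
    payload = json.dumps(
        {"text": text, "prompt": prompt_version, "model": model}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedStep:
    """Wraps a model call (hypothetical `call_model`) with an in-memory cache."""

    def __init__(self, call_model: Callable[[str], str], prompt_version: str, model: str):
        self.call_model = call_model
        self.prompt_version = prompt_version
        self.model = model
        self.store = {}   # key -> cached output; swap for Redis or a table in practice
        self.misses = 0   # number of real (billable) model calls

    def run(self, text: str) -> str:
        key = cache_key(text, self.prompt_version, self.model)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.call_model(text)
        return self.store[key]
```

A persistent store (a key-value service or a warehouse table keyed by the hash) would replace the dict in a real pipeline, so the cache survives across runs.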

Cheaper models for routine tasks. Haiku- or mini-class models are often sufficient for classification and simple cleaning.
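
The arithmetic and the levers in this section can be combined into a rough estimator. This is a back-of-the-envelope sketch: `estimate_run_cost` is a hypothetical helper, the 40% cache hit rate is an illustrative assumption, and real pricing is per token rather than per row.

```python
def estimate_run_cost(
    rows: int,
    cost_per_row: float,
    cache_hit_rate: float = 0.0,
    batch_discount: float = 0.0,
) -> float:
    """Rough cost of one pipeline run, in dollars."""
    if not 0.0 <= cache_hit_rate <= 1.0 or not 0.0 <= batch_discount <= 1.0:
        raise ValueError("rates must be between 0 and 1")
    billable_rows = rows * (1.0 - cache_hit_rate)  # cached rows cost nothing
    return billable_rows * cost_per_row * (1.0 - batch_discount)


# The figures from the text: one million rows at $0.01 per row.
baseline = estimate_run_cost(1_000_000, 0.01)
# Same run with an assumed 40% cache hit rate and a 50% batch discount.
optimized = estimate_run_cost(1_000_000, 0.01, cache_hit_rate=0.4, batch_discount=0.5)
print(f"baseline ${baseline:,.0f}/run, optimized ${optimized:,.0f}/run")
```

Under these assumptions the levers compound: caching removes rows from the bill, and the batch discount applies to whatever remains.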

Reliability

Retry logic. LLM calls can fail transiently; retry with exponential backoff.
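
A minimal sketch of retry with exponential backoff, assuming the provider client raises some identifiable exception for transient failures. `TransientError` is a hypothetical stand-in for that exception (rate limits, timeouts, 5xx responses), and the delay parameters are illustrative defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or timeout error."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with doubling delays plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the orchestrator
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Only errors known to be transient are retried here; a malformed request or an authentication failure would fail again on every attempt and is better raised immediately.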

Provider redundancy. Fall back to a secondary provider if the primary is down.
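
Provider fallback can be as simple as trying an ordered list of callables. In this sketch `call_with_fallback` and the provider names are hypothetical, and each provider is assumed to be wrapped in a function that takes text and returns text.

```python
from typing import Callable, List, Tuple


def call_with_fallback(
    text: str, providers: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, output) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(text)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the output matters because different models can label the same record differently; recording which one answered keeps the results auditable.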

Circuit breakers. If error rate exceeds threshold, fall back to non-AI path or halt pipeline.
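
A minimal circuit breaker over a sliding window of recent calls might look like the sketch below. `CircuitBreaker`, `classify_with_fallback`, and the threshold values are hypothetical, and this version has no recovery (half-open) state: once open it stays open until the object is recreated, which a production breaker would handle with a cooldown.

```python
from collections import deque
from typing import Callable


class CircuitBreaker:
    """Opens when the error rate over the last `window` calls exceeds the threshold."""

    def __init__(self, error_threshold: float = 0.5, window: int = 20, min_calls: int = 5):
        self.error_threshold = error_threshold
        self.min_calls = min_calls
        self.results = deque(maxlen=window)  # True = success, False = failure

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    @property
    def is_open(self) -> bool:
        if len(self.results) < self.min_calls:
            return False  # not enough evidence yet
        error_rate = self.results.count(False) / len(self.results)
        return error_rate > self.error_threshold


def classify_with_fallback(
    text: str,
    ai_classify: Callable[[str], str],
    rule_classify: Callable[[str], str],
    breaker: CircuitBreaker,
) -> str:
    """Use the AI path unless the breaker is open; fall back to rules on failure."""
    if breaker.is_open:
        return rule_classify(text)
    try:
        label = ai_classify(text)
    except Exception:
        breaker.record(False)
        return rule_classify(text)
    breaker.record(True)
    return label
```

Whether the fallback is a rule-based path or a hard stop depends on the pipeline: for enrichment a degraded label may be acceptable, while for compliance-sensitive fields halting is usually the safer choice.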

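Idempotency. Running the pipeline twice should produce the same results. LLM outputs can vary between identical calls, so pin model versions, lower the temperature where the API allows it, and persist outputs so reruns reuse them instead of regenerating.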

Orchestration patterns

Airflow, Dagster, Prefect. Mature orchestrators; add AI steps as custom operators.

dbt. SQL-first transformations; Python models for AI steps.

Streaming pipelines. Kafka + Flink + AI. Real-time enrichment, classification.

Hybrid. Streaming for low-latency needs; batch for cost-efficient bulk.

Observability

Track AI step metrics. Success rate, latency, cost, output quality.
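
A small wrapper is enough to start collecting these numbers. In this sketch `StepMetrics` and `timed_call` are hypothetical names, and cost is approximated as a flat per-call figure; a real implementation would compute it from token counts in the provider's response.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepMetrics:
    calls: int = 0
    failures: int = 0
    total_latency_s: float = 0.0
    total_cost_usd: float = 0.0

    @property
    def success_rate(self) -> float:
        if self.calls == 0:
            return 1.0
        return (self.calls - self.failures) / self.calls


def timed_call(
    metrics: StepMetrics,
    fn: Callable[[str], str],
    text: str,
    cost_per_call_usd: float,
) -> str:
    """Run one AI call and record its latency, cost, and outcome."""
    start = time.perf_counter()
    metrics.calls += 1
    try:
        return fn(text)
    except Exception:
        metrics.failures += 1
        raise
    finally:
        # Runs on both success and failure, so failed calls are still counted.
        metrics.total_latency_s += time.perf_counter() - start
        metrics.total_cost_usd += cost_per_call_usd
```

The accumulated values can be emitted to whatever metrics backend the orchestrator already uses at the end of each task.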

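Data quality over time. AI outputs may drift as models, prompts, or input distributions change; track output distributions and validation pass rates, and alert on shifts.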

Lineage. Which model version processed which records? Essential for reproducibility and debugging.
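
One way to answer that question is to write a small lineage record next to every AI output. The field names below are assumptions for illustration, not a standard schema, and `make_lineage` is a hypothetical helper.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    record_id: str
    step: str            # e.g. "classify" or "enrich"
    model: str           # exact model version string used for the call
    prompt_version: str
    input_sha256: str    # hash of the input, so the raw text need not be stored
    processed_at: str    # UTC timestamp, ISO 8601


def make_lineage(
    record_id: str, step: str, model: str, prompt_version: str, input_text: str
) -> LineageRecord:
    return LineageRecord(
        record_id=record_id,
        step=step,
        model=model,
        prompt_version=prompt_version,
        input_sha256=hashlib.sha256(input_text.encode("utf-8")).hexdigest(),
        processed_at=datetime.now(timezone.utc).isoformat(),
    )


# asdict() turns the record into a plain dict for a lineage table or log line.
row = asdict(make_lineage("r-1", "classify", "model-a", "v3", "Invoice from Acme"))
```

With this in place, a question like "which records were classified with prompt v3?" becomes a filter on the lineage table.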

Cost dashboards. Per-pipeline, per-team AI cost visibility. See cost attribution post.

Common patterns

Pre-classify then process. Classify input type; route to appropriate transformation. Reduces unnecessary AI calls.
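
A sketch of the routing idea, assuming the input type can be detected with cheap deterministic checks. The rules in `detect_kind` are purely illustrative, and `ai_enrich` is a hypothetical callable that wraps the model call and returns extra fields.

```python
from typing import Callable


def detect_kind(record: dict) -> str:
    """Cheap, deterministic input-type check (illustrative rules only)."""
    text = (record.get("text") or "").strip()
    if not text:
        return "empty"
    if text.replace(".", "", 1).isdigit():
        return "numeric"
    return "free_text"


def route(record: dict, ai_enrich: Callable[[str], dict]) -> dict:
    """Send only free text to the AI step; everything else skips the call."""
    kind = detect_kind(record)
    if kind == "free_text":
        return {**record, "kind": kind, **ai_enrich(record["text"])}
    return {**record, "kind": kind}
```

The pre-classifier can itself be a small model when rules are not enough; the point is that it should cost far less than the step it guards.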

Fan-out processing. Parallel AI calls; aggregate results. Faster than sequential.
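
Because LLM calls are I/O-bound, a thread pool from the standard library is one simple way to fan out. In this sketch `fan_out` and `call_model` are hypothetical names, and the worker count is an arbitrary default that would need to respect the provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fan_out(
    items: List[str], call_model: Callable[[str], str], max_workers: int = 8
) -> List[str]:
    """Run call_model over items in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order; the first exception raised by any
        # call is re-raised when its result is collected.
        return list(pool.map(call_model, items))
```

Since results keep the input order, they can be zipped back onto the source rows without carrying an explicit key through the call.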

Checkpoint aggressively. Long pipelines: checkpoint intermediate results. Avoid rerunning expensive AI steps on failures.
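
A minimal file-based checkpoint, assuming each record has a unique `id` and the step's output is JSON-serializable. `run_with_checkpoint` is a hypothetical helper; most orchestrators offer their own mechanisms for persisting intermediate results, which would normally be preferred.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_with_checkpoint(
    records: Iterable[dict],
    process: Callable[[dict], str],
    checkpoint_path: Path,
) -> Dict[str, str]:
    """Process records, appending each result to a JSONL file as it completes."""
    done: Dict[str, str] = {}
    if checkpoint_path.exists():
        # Reload results from a previous (possibly failed) run.
        with checkpoint_path.open() as f:
            for line in f:
                row = json.loads(line)
                done[row["id"]] = row["output"]
    with checkpoint_path.open("a") as f:
        for rec in records:
            if rec["id"] in done:
                continue  # already paid for; skip the expensive call
            output = process(rec)
            f.write(json.dumps({"id": rec["id"], "output": output}) + "\n")
            f.flush()  # make the result durable before moving on
            done[rec["id"]] = output
    return done
```

If a run fails partway, rerunning with the same checkpoint file only pays for the records that had not completed.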

Human review queues. Low-confidence outputs routed to review interface. Continuous improvement loop.
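
Routing by confidence can be sketched as a simple split. This assumes each prediction carries a `confidence` score between 0 and 1, which not every model provides in calibrated form; the 0.8 threshold and the `split_by_confidence` name are illustrative.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    predictions: List[Dict], threshold: float = 0.8
) -> Tuple[List[Dict], List[Dict]]:
    """Return (accepted, needs_review) based on each prediction's confidence."""
    accepted, needs_review = [], []
    for pred in predictions:
        if pred["confidence"] >= threshold:
            accepted.append(pred)
        else:
            needs_review.append(pred)
    return accepted, needs_review


accepted, needs_review = split_by_confidence(
    [
        {"id": 1, "label": "invoice", "confidence": 0.97},
        {"id": 2, "label": "receipt", "confidence": 0.55},
    ]
)
```

Corrections made in review can be fed back as few-shot examples or evaluation cases, which is what closes the improvement loop.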

Tools

Unstructured.io. Document parsing specialized for LLM pipelines.

LlamaIndex, LangChain. Frameworks for LLM-integrated pipelines.

Fivetran AI, Informatica AI. Commercial ETL with AI features.

DIY is typical. Many teams build custom pipelines that integrate provider APIs directly.
