AI integrates into ETL and data pipelines for specific tasks — document parsing, classification, enrichment, anomaly detection — while leaving deterministic transforms alone. The key is knowing what AI does well and what's better left to traditional logic. This post is the integration patterns and the cost/quality tradeoffs.
Extract stage
Document parsing. PDFs, Word docs, email. AI extracts structured data from unstructured sources.
Semi-structured sources. HTML pages, logs with complex formats. AI handles variance better than regex.
Entity extraction. People, companies, locations from text. LLMs competitive with specialized NER models.
Image / scanned document processing. OCR plus layout understanding. LLMs with vision capabilities.
Transform stage
Data cleaning. See data cleaning post.
Classification and tagging. Text to categories; products to taxonomy.
Enrichment. Adding fields from context (sentiment, topic, category) or external sources.
Schema matching. See schema matching post.
Load stage
Quality validation. Does output meet quality thresholds? AI can catch subtle quality issues.
Anomaly detection. Unusual patterns flagged for review. See anomaly detection post.
Lineage tracking. Which AI transformations applied to which records? Auditability.
Where AI fits, where it doesn"t
AI fits. Unstructured sources, document parsing, classification, enrichment, fuzzy matching.
AI doesn't fit. Deterministic transforms where rules clear and cheap. Sum totals, concatenations, simple joins — don't use AI.
Context matters. Same transform (text normalization) is deterministic for some cases (lowercase), AI-appropriate for others (canonical name resolution).
Cost management
AI steps can blow pipeline budget. Million rows × $0.01/row = $10,000/run. Adds up at daily runs.
Sample aggressively during development. Full runs after prompt and logic stabilized.
Batch API where possible. 50% discount for non-real-time workloads.
Cache aggressively. Same inputs produce same outputs; cache by input hash.
Cheaper models for routine tasks. Haiku, Mini-class models for classification and simple cleaning.
Reliability
Retry logic. LLM calls can fail transiently; retry with exponential backoff.
Provider redundancy. Fallback to secondary provider if primary down.
Circuit breakers. If error rate exceeds threshold, fall back to non-AI path or halt pipeline.
Idempotency. Running pipeline twice should produce same results. Handle AI non-determinism carefully.
Orchestration patterns
Airflow, Dagster, Prefect. Mature orchestrators; add AI steps as custom operators.
dbt. SQL-first transformations; Python models for AI steps.
Streaming pipelines. Kafka + Flink + AI. Real-time enrichment, classification.
Hybrid. Streaming for low-latency needs; batch for cost-efficient bulk.
Observability
Track AI step metrics. Success rate, latency, cost, output quality.
Data quality over time. AI outputs may drift; track and alert.
Lineage. Which model version processed which records? Essential for reproducibility and debugging.
Cost dashboards. Per-pipeline, per-team AI cost visibility. See cost attribution post.
Common patterns
Pre-classify then process. Classify input type; route to appropriate transformation. Reduces unnecessary AI calls.
Fan-out processing. Parallel AI calls; aggregate results. Faster than sequential.
Checkpoint aggressively. Long pipelines: checkpoint intermediate results. Avoid rerunning expensive AI steps on failures.
Human review queues. Low-confidence outputs routed to review interface. Continuous improvement loop.
Tools
Unstructured.io. Document parsing specialized for LLM pipelines.
LlamaIndex, LangChain. Frameworks for LLM-integrated pipelines.
Fivetran AI, Informatica AI. Commercial ETL with AI features.
DIY typical. Many teams build custom pipelines integrating provider APIs.