Document intelligence in 2026 is a mature category. The stack from OCR through structured extraction has settled into a standard five-layer pattern, and the failure modes are well understood. This post walks through that stack: where each layer fits, which tools we use, and where projects typically break.
The five layers
Layer 1 — Document ingest. PDF, image, scan, phone photo. Preprocessing: de-skew, denoise, split pages, normalize orientation. Standard libraries handle this. Quality-in equals quality-out.
Layer 2 — OCR + structure detection. AWS Textract, Azure Document Intelligence, or Google Document AI for managed; Tesseract for self-hosted. Vision LLMs handle this directly in some cases, but dedicated OCR is faster and cheaper at scale.
Layer 3 — Layout understanding. Tables with headers and rows. Multi-column layouts with reading order. Forms with labeled fields. Signature blocks, headers, footers. Managed services handle basic layout; complex documents often need custom logic on top.
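As an illustration of the reading-order problem on multi-column pages, here is a minimal sketch. It assumes a strictly two-column page and blocks that already carry bounding-box coordinates; the Block type and the reading_order function are hypothetical names, not part of any OCR service's API. Real pages with full-width headings or three columns need more than a midline split.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (smaller means higher on the page)
    x1: float  # right edge


def reading_order(blocks, page_width):
    """Sort blocks left column first, then right column, top to bottom.

    Assumption: exactly two columns separated at the page midline.
    """
    mid = page_width / 2

    def key(block):
        center_x = (block.x0 + block.x1) / 2
        column = 0 if center_x < mid else 1
        return (column, block.y0)

    return sorted(blocks, key=key)


blocks = [
    Block("right-top", 320, 50, 580),
    Block("left-bottom", 20, 400, 280),
    Block("left-top", 20, 50, 280),
    Block("right-bottom", 320, 400, 580),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
```

A plain top-to-bottom sort would interleave the two columns; grouping by column first is what keeps each column's text contiguous.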
Layer 4 — Semantic extraction. From structured text and layout to meaning. LLMs excel here when given a clear schema, relevant context, and explicit instructions on what to extract and what to skip.
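One way to make "clear schema, explicit instructions" concrete is to assemble the prompt from those parts. The sketch below only builds the prompt string; it does not call any model. The function name build_extraction_prompt, the wording of the instructions, and the example schema are all illustrative assumptions.

```python
import json


def build_extraction_prompt(document_text, schema, skip):
    """Assemble an extraction prompt from a schema, skip list, and document."""
    lines = [
        "Extract the fields described by this JSON schema from the document.",
        "Return a single JSON object and nothing else.",
        "Use null for any field that is not present; do not guess.",
        "",
        "Schema:",
        json.dumps(schema, indent=2, sort_keys=True),
        "",
        "Do not extract:",
    ]
    lines += [f"- {item}" for item in skip]
    lines += ["", "Document:", document_text]
    return "\n".join(lines)


# Hypothetical schema and skip list for an invoice.
schema = {
    "type": "object",
    "properties": {"net_amount": {"type": "number"}},
}
prompt = build_extraction_prompt(
    "Invoice INV-1 ...",
    schema,
    skip=["handwritten margin notes"],
)
```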
Layer 5 — Schema-aware structured output. Validate the result: known value ranges respected, required fields populated, enum values within allowed sets. Attach confidence scores where applicable. See the structured outputs post.
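The three checks named above (required fields, enum membership, value ranges) can be sketched in plain Python. The field names, the currency set, and the non-negative range rule are hypothetical examples chosen for an invoice; a production system would more likely lean on a schema validation library.

```python
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical enum
REQUIRED_FIELDS = ("invoice_number", "currency", "net_amount", "gross_amount")
AMOUNT_FIELDS = ("net_amount", "gross_amount")


def validate_invoice(record):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []

    # Required fields must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            errors.append(f"missing required field: {name}")

    # Enum values must fall within the allowed set.
    currency = record.get("currency")
    if currency not in (None, "") and currency not in ALLOWED_CURRENCIES:
        errors.append(f"currency not in allowed set: {currency}")

    # Assumed range rule: amounts must not be negative.
    for name in AMOUNT_FIELDS:
        value = record.get(name)
        if isinstance(value, (int, float)) and value < 0:
            errors.append(f"{name} out of range: {value}")

    return errors


good = {"invoice_number": "INV-1", "currency": "USD",
        "net_amount": 100.0, "gross_amount": 120.0}
bad = {"invoice_number": "", "currency": "XYZ",
       "net_amount": -5, "gross_amount": 120.0}
```

Returning a list of errors rather than raising on the first one lets a reviewer see everything wrong with a record in a single pass.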
Where the stack breaks
Most often at the boundary between layer 3 and layer 4. Layout gives you the cells of a table; extraction needs to know which cell is 'net amount' vs 'gross amount' vs 'tax.' Headers vary, aliases exist, and multi-language documents add complexity.
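A minimal sketch of bridging that boundary is an alias table that maps raw column headers to canonical field names. The alias sets below, including the German terms, are illustrative assumptions rather than a complete list, and canonical_header and map_row are hypothetical helper names.

```python
# Hypothetical alias table: canonical field name -> known header variants.
HEADER_ALIASES = {
    "net_amount": {"net amount", "net", "subtotal", "nettobetrag"},
    "gross_amount": {"gross amount", "total", "amount due", "bruttobetrag"},
    "tax": {"tax", "vat", "mwst", "sales tax"},
}


def canonical_header(raw):
    """Map a raw header to a canonical name, or None if unrecognized."""
    # Lowercase, drop ':' and '.', and collapse runs of whitespace.
    cleaned = " ".join(raw.lower().replace(":", " ").replace(".", " ").split())
    for canonical, aliases in HEADER_ALIASES.items():
        if cleaned in aliases:
            return canonical
    return None


def map_row(headers, cells):
    """Pair each cell with its canonical header, skipping unknown columns."""
    mapped = {}
    for header, cell in zip(headers, cells):
        name = canonical_header(header)
        if name is not None:
            mapped[name] = cell
    return mapped


row = map_row(
    ["Nettobetrag", "MwSt.", "Bruttobetrag", "Notes"],
    ["100,00", "19,00", "119,00", "-"],
)
```

Headers that return None are the interesting ones: they are candidates for a new alias or for handing the table to the LLM layer with the raw header text attached.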
Our pattern: a small library of known document types (each a specific vendor's invoice, a specific claim layout), with fallback to LLM-only processing for unknowns. Auto-classification routes to the right extractor. The library grows as volume reveals new types.
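The routing pattern can be sketched as a registry of extractors keyed by document type, with a fallback for anything unclassified. Everything here is a placeholder: the "acme_invoice" type, the keyword-based classify function, and the stub extractors stand in for a real classifier and real extraction logic.

```python
from typing import Callable, Dict, Optional

Extractor = Callable[[str], dict]
EXTRACTORS: Dict[str, Extractor] = {}


def register(doc_type):
    """Decorator that adds an extractor to the library under doc_type."""
    def decorator(fn):
        EXTRACTORS[doc_type] = fn
        return fn
    return decorator


@register("acme_invoice")  # hypothetical known document type
def extract_acme_invoice(text):
    # Stub: a real extractor would apply type-specific layout rules.
    return {"extractor": "acme_invoice", "chars": len(text)}


def llm_only_fallback(text):
    # Stub: a real fallback would send the text to an LLM.
    return {"extractor": "llm_fallback", "chars": len(text)}


def classify(text) -> Optional[str]:
    # Placeholder classifier: simple keyword match.
    if "ACME Corp" in text and "Invoice" in text:
        return "acme_invoice"
    return None


def route(text):
    """Send the document to its known extractor, else to the fallback."""
    extractor = EXTRACTORS.get(classify(text), llm_only_fallback)
    return extractor(text)


known = route("ACME Corp Invoice #12")
unknown = route("Unknown vendor statement")
```

Growing the library then means adding one decorated function per new document type; the routing code itself does not change.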
Tool recommendations
Managed: Azure Document Intelligence for form and invoice layout, AWS Textract for its strong table extraction, Google Document AI for healthcare-specific forms. Vision LLMs: GPT-4o and Claude Sonnet for the semantic layer.
Self-hosted: Marker or Nougat for PDF-to-Markdown (better than most managed for complex technical docs), LayoutLM variants for form-field extraction, open vision LLMs for semantic layer.
Deployment patterns
Human-in-the-loop for high-stakes extractions (legal, medical, financial). AI extracts; a reviewer confirms or corrects. Correction data feeds back to improve extraction over time, a powerful flywheel if the team commits.
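The feedback loop depends on capturing what the reviewer changed, not just the final values. Here is a minimal sketch, assuming both the extraction and the reviewed result are flat dictionaries; apply_review and the record layout are hypothetical.

```python
def apply_review(extracted, reviewed):
    """Merge reviewer values over the extraction and log each correction.

    Returns (final_record, corrections). A field counts as corrected when
    the reviewer's value differs from what was extracted.
    """
    corrections = []
    for field, final_value in reviewed.items():
        original = extracted.get(field)
        if original != final_value:
            corrections.append(
                {"field": field, "extracted": original, "corrected": final_value}
            )
    final = {**extracted, **reviewed}
    return final, corrections


extracted = {"net_amount": "100.00", "tax": "19.00"}
reviewed = {"net_amount": "100.00", "tax": "9.00"}
final, corrections = apply_review(extracted, reviewed)
```

The corrections list is the raw material for the flywheel: it can be aggregated per field and per document type to show where extraction is weakest.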
Batch vs streaming. Most workflows are batch, which is cheaper and easier to validate. Real-time document processing is a different architecture with tighter latency budgets.