Document intelligence in 2026 is a mature category. The stack from OCR through structured extraction has settled into a standard five-layer pattern, and the failure modes are well understood. This post walks through that stack: where each layer fits, which tools we use, and where projects typically break.

Five layers

Document ingest → OCR + structure detection → layout understanding → semantic extraction (LLM) → schema-aware structured output. Where the stack commonly breaks: the boundary between layers 3 and 4.

The five layers

Layer 1 — Document ingest. PDF, image, scan, phone photo. Preprocessing: de-skew, denoise, split pages, normalize orientation. Standard libraries handle this. Quality-in equals quality-out.

Layer 2 — OCR + structure detection. AWS Textract, Azure Document Intelligence, Google Document AI for managed; Tesseract for self-hosted. Vision LLMs handle this directly for some cases but dedicated OCR is faster and cheaper at scale.

Layer 3 — Layout understanding. Tables with headers and rows. Multi-column layouts with reading order. Forms with labeled fields. Signature blocks, headers, footers. Managed services handle basic layout; complex documents often need custom logic on top.
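As an illustration of the kind of custom logic a multi-column layout can need, here is a minimal sketch of reading-order recovery. It assumes the OCR layer returns text blocks with bounding boxes; the `Block` type, the `reading_order` function, and the `gutter_tolerance` parameter are hypothetical names for this example, not part of any managed service's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    x0: float  # left edge of the bounding box
    y0: float  # top edge (y grows downward, as in most OCR output)


def reading_order(blocks, page_width, gutter_tolerance=0.05):
    """Group blocks into columns by left edge, then read each column top to bottom."""
    tol = page_width * gutter_tolerance
    columns = []  # list of columns, created in left-to-right order
    for block in sorted(blocks, key=lambda b: b.x0):
        for col in columns:
            # A block joins a column when its left edge is close to the column's first block.
            if abs(col[0].x0 - block.x0) <= tol:
                col.append(block)
                break
        else:
            columns.append([block])
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b.y0))
    return ordered


page = [
    Block("A", 50, 100),
    Block("C", 320, 100),
    Block("B", 52, 300),
    Block("D", 318, 300),
]
ordered_text = [b.text for b in reading_order(page, page_width=600)]
```

This sketch ignores full-width elements such as titles or tables that span columns; real documents usually need those handled before column grouping.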

Layer 4 — Semantic extraction. From structured text and layout to meaning. LLMs excel here when given a clear schema, relevant context, and explicit instructions on what to extract and what to skip.
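One way to put schema, context, and skip instructions into a single prompt is sketched below. The `build_extraction_prompt` helper, the field names, and the prompt wording are assumptions for illustration only; the call to the model itself is left out because it depends on the provider.

```python
import json


def build_extraction_prompt(schema, document_text, skip=()):
    """Assemble a prompt that states the schema, the output format, and what to ignore."""
    lines = [
        "Extract the fields described by this JSON schema from the document.",
        "Return only a JSON object. Use null for any field that is not present.",
        "Schema:",
        json.dumps(schema, indent=2),
    ]
    if skip:
        lines.append("Do not extract: " + ", ".join(skip) + ".")
    lines += ["Document:", document_text]
    return "\n".join(lines)


# Hypothetical invoice schema used only for this example.
invoice_schema = {
    "type": "object",
    "properties": {
        "net_amount": {"type": "number"},
        "tax": {"type": "number"},
        "gross_amount": {"type": "number"},
    },
}
prompt = build_extraction_prompt(
    invoice_schema,
    "Net amount: 100.00 EUR",
    skip=["marketing footer text"],
)
```

Asking for null instead of a guess makes missing fields visible to the validation layer rather than hiding them behind plausible values.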

Layer 5 — Schema-aware structured output. Validate: known value ranges, required fields populated, enum values within allowed sets. Attach confidence scores where applicable. See the structured outputs post.
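The checks named above can be sketched in a few lines. The field names, the allowed currency set, and the `validate_invoice` function are hypothetical; the cross-field arithmetic check is an extra assumption of this example rather than something every schema needs.

```python
REQUIRED_FIELDS = ("invoice_number", "currency", "net_amount", "tax", "gross_amount")
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # example enum, adjust per deployment


def validate_invoice(record, tolerance=0.01):
    """Return a list of human-readable validation errors (empty list means valid)."""
    errors = []
    # Required fields must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            errors.append(f"missing required field: {name}")
    # Enum check: currency must come from the allowed set.
    currency = record.get("currency")
    if currency not in (None, "") and currency not in ALLOWED_CURRENCIES:
        errors.append(f"currency not allowed: {currency}")
    # Range check: amounts must not be negative.
    amounts = {k: record.get(k) for k in ("net_amount", "tax", "gross_amount")}
    for name, value in amounts.items():
        if isinstance(value, (int, float)) and value < 0:
            errors.append(f"negative amount: {name}")
    # Cross-field check (assumption of this sketch): net + tax should equal gross.
    if all(isinstance(v, (int, float)) for v in amounts.values()):
        if abs(amounts["net_amount"] + amounts["tax"] - amounts["gross_amount"]) > tolerance:
            errors.append("net_amount + tax does not equal gross_amount")
    return errors


good = {"invoice_number": "INV-1", "currency": "EUR",
        "net_amount": 100.0, "tax": 19.0, "gross_amount": 119.0}
bad = {"invoice_number": "", "currency": "XYZ",
       "net_amount": 100.0, "tax": 19.0, "gross_amount": 120.0}
good_errors = validate_invoice(good)
bad_errors = validate_invoice(bad)
```

Returning a list of errors instead of raising on the first one lets a reviewer see every problem with a document at once.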

Where the stack breaks

The boundary between layer 3 and layer 4. Layout gives you cells of a table; extraction needs to know which cell is 'net amount' vs 'gross amount' vs 'tax.' Headers vary; aliases exist; multi-language documents add complexity.
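A minimal sketch of bridging that boundary with an alias table follows. The alias sets and the `canonical_field` and `map_columns` helpers are illustrative assumptions; a production table would be larger and likely maintained per document type.

```python
import unicodedata

# Hypothetical alias table: canonical field name -> header spellings seen in documents.
HEADER_ALIASES = {
    "net_amount": {"net amount", "net", "subtotal", "nettobetrag", "montant ht"},
    "tax": {"tax", "vat", "mwst", "tva"},
    "gross_amount": {"gross amount", "total", "amount due", "bruttobetrag", "montant ttc"},
}


def normalize(header):
    """Lowercase, strip accents and basic punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", header)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().replace(".", " ").replace(":", " ").split())


def canonical_field(header):
    """Map a raw table header to a canonical field name, or None if unknown."""
    key = normalize(header)
    for field_name, aliases in HEADER_ALIASES.items():
        if key in aliases:
            return field_name
    return None


def map_columns(headers):
    """Map each column index to its canonical field (None for unrecognized headers)."""
    return {i: canonical_field(h) for i, h in enumerate(headers)}


columns = map_columns(["Net Amount:", "MwSt.", "Bruttobetrag", "Shipping"])
```

Headers that map to None are the interesting ones: they are candidates either for a new alias or for handing the table to the LLM with the raw header text.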

Our pattern: a small library of known document types (each a specific vendor's invoice, a specific claim layout), with fallback to LLM-only processing for unknowns. Auto-classification routes to the right extractor. The library grows as volume reveals new types.
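The routing idea can be sketched as a registry of extractors with a fallback. Everything here is a placeholder: the `acme_invoice` type, the keyword-based `classify` function, and the `min_confidence` threshold are assumptions for illustration, and a real classifier would be a trained model or an LLM call.

```python
EXTRACTORS = {}  # document type -> extractor function


def extractor(doc_type):
    """Decorator that registers an extractor for a known document type."""
    def register(fn):
        EXTRACTORS[doc_type] = fn
        return fn
    return register


@extractor("acme_invoice")
def extract_acme_invoice(text):
    # Placeholder for type-specific extraction logic.
    return {"source": "template:acme_invoice", "text": text}


def llm_fallback(text):
    # Placeholder for LLM-only processing of unknown document types.
    return {"source": "llm_fallback", "text": text}


def classify(text):
    """Toy keyword classifier returning (doc_type, confidence)."""
    if "ACME Corp" in text and "Invoice" in text:
        return "acme_invoice", 0.95
    return "unknown", 0.0


def route(text, min_confidence=0.8):
    """Send the document to its registered extractor, or fall back to the LLM."""
    doc_type, confidence = classify(text)
    handler = EXTRACTORS.get(doc_type)
    if handler is None or confidence < min_confidence:
        return llm_fallback(text)
    return handler(text)


known = route("ACME Corp Invoice #1001")
unknown = route("Handwritten delivery note")
```

Growing the library then means adding one decorated function per new document type, without touching the routing code.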

Tool recommendations

Managed: Azure Document Intelligence for form and invoice layout, AWS Textract for strong table extraction, Google Document AI for healthcare-specific forms. Vision LLMs: GPT-4o and Claude Sonnet for the semantic layer.

Self-hosted: Marker or Nougat for PDF-to-Markdown (better than most managed services for complex technical docs), LayoutLM variants for form-field extraction, open vision LLMs for the semantic layer.

Deployment patterns

Human-in-the-loop for high-stakes extractions (legal, medical, financial). AI extracts; a reviewer confirms or corrects. Corrections feed back into the extractors, so accuracy improves over time. It is a powerful flywheel, but only if the team commits to it.
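Capturing correction data can be as simple as diffing the reviewer's final values against the model's. The sketch below assumes both are flat dictionaries keyed by field name; `record_review`, `corrections_by_field`, and the log format are hypothetical names for this example.

```python
from collections import Counter


def record_review(doc_id, extracted, reviewed, log):
    """Append one log entry per field the reviewer changed; return the number of changes."""
    corrections = []
    for field_name, final_value in reviewed.items():
        model_value = extracted.get(field_name)
        if model_value != final_value:
            corrections.append({
                "doc_id": doc_id,
                "field": field_name,
                "model": model_value,
                "reviewer": final_value,
            })
    log.extend(corrections)
    return len(corrections)


def corrections_by_field(log):
    """Count corrections per field to show where extraction is weakest."""
    return Counter(entry["field"] for entry in log)


correction_log = []
changed = record_review(
    "doc-1",
    extracted={"net_amount": "100.00", "tax": "19.00"},
    reviewed={"net_amount": "100.00", "tax": "9.00"},
    log=correction_log,
)
weak_spots = corrections_by_field(correction_log)
```

The per-field counts point to where a new alias, a prompt change, or a dedicated extractor is most likely to pay off.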

Batch vs streaming. Most workflows are batch, which is cheaper and easier to validate. Real-time document processing is a different architecture with tighter latency requirements.
