eazyware
Engineering · March 31, 2025 · 11 min read

Document intelligence: beyond OCR into understanding

Extracting tables, layouts, and meaning from complex documents. The 2026 stack from OCR through structured output.

Kushal R.
Engineering lead

Document intelligence in 2026 is a mature category. The stack from OCR through structured extraction has settled into a standard five-layer pattern, and the failure modes are well understood. This post walks through that stack: where each layer fits, which tools we use, and where projects typically break.

Figure: Document intelligence stack, 2026. Document ingest (PDF, image, scan) → OCR + structure detection → layout understanding → semantic extraction (LLM) → schema-aware structured output. Each layer has mature tools; the stack most commonly breaks at the boundary between layers 3 and 4.

The five layers

Layer 1 — Document ingest. PDF, image, scan, phone photo. Preprocessing: de-skew, denoise, split pages, normalize orientation. Standard libraries handle this. Quality-in equals quality-out.

Layer 2 — OCR + structure detection. AWS Textract, Azure Document Intelligence, Google Document AI for managed; Tesseract for self-hosted. Vision LLMs handle this directly for some cases but dedicated OCR is faster and cheaper at scale.

Layer 3 — Layout understanding. Tables with headers and rows. Multi-column layouts with reading order. Forms with labeled fields. Signature blocks, headers, footers. Managed services handle basic layout; complex documents often need custom logic on top.
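As an illustration of the "custom logic on top" for reading order, here is a minimal sketch that groups text blocks into columns by horizontal overlap and then reads each column top to bottom. The `Block` type and `reading_order` function are hypothetical names, not part of any managed service's API, and the sketch assumes there are no full-width blocks (such as a page-wide header) that span several columns.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward, as in most OCR output)
    x1: float  # right edge


def reading_order(blocks):
    """Group blocks into columns left to right, then read each column top to bottom."""
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        # A block joins the current column if it starts before that column's right edge.
        if columns and block.x0 < max(b.x1 for b in columns[-1]):
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return [b.text for b in ordered]


# Two-column page with blocks supplied in arbitrary order.
blocks = [
    Block("right top", 320, 50, 580),
    Block("left bottom", 40, 400, 300),
    Block("left top", 40, 50, 300),
    Block("right bottom", 320, 400, 580),
]
ordered_text = reading_order(blocks)
```

A naive sort by `y0` alone would interleave the two columns here, which is the usual way reading order goes wrong on multi-column pages.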

Layer 4 — Semantic extraction. From structured text and layout to meaning. LLMs excel here with clear schema, relevant context, and explicit instructions on what to extract and what to skip.
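To make "clear schema, relevant context, and explicit instructions" concrete, the sketch below assembles an extraction prompt from those three parts. It stops at building the prompt string and does not call any model API. The function name, the example schema, and the instruction wording are all illustrative assumptions.

```python
import json


def build_extraction_prompt(schema, document_text, skip_notes):
    """Combine explicit instructions, a JSON schema, and the document into one prompt."""
    lines = [
        "Extract the fields described by the JSON schema below from the document.",
        "Return a single JSON object and nothing else.",
        "Use null for any field that is not present; do not guess.",
    ]
    # Explicit instructions on what to skip, one line per note.
    lines += [f"Skip: {note}" for note in skip_notes]
    lines += ["", "Schema:", json.dumps(schema, indent=2), "", "Document:", document_text]
    return "\n".join(lines)


# Hypothetical invoice schema used only for illustration.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "net_amount": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR"]},
    },
    "required": ["invoice_number", "net_amount"],
}

document_text = "Invoice INV-1042\nNet amount: 1,250.00 EUR"
prompt = build_extraction_prompt(
    invoice_schema,
    document_text,
    skip_notes=["marketing footer text", "payment terms boilerplate"],
)
```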

Layer 5 — Schema-aware structured output. Validate. Known value ranges, required fields populated, enum values within allowed sets. Confidence scores attached where applicable. See structured outputs post.
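The checks named above (required fields, value ranges, enum membership, confidence) can be sketched in a few lines. This is a minimal example under assumed conventions: each extracted field is a dict with `value` and an optional `confidence`, and the `RULES` table, field names, and the 0.8 threshold are hypothetical.

```python
# Hypothetical validation rules for an invoice record.
RULES = {
    "invoice_number": {"required": True},
    "currency": {"required": True, "enum": {"USD", "EUR", "INR"}},
    "net_amount": {"required": True, "min": 0, "max": 1_000_000},
    "tax_rate": {"min": 0.0, "max": 1.0},
}


def validate(record, rules=RULES, min_confidence=0.8):
    """Return (errors, low_confidence_fields) for one extracted record."""
    errors, low_confidence = [], []
    for field, rule in rules.items():
        entry = record.get(field)
        value = entry["value"] if entry else None
        if value is None:
            if rule.get("required"):
                errors.append(f"{field}: required but missing")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: {value!r} not in allowed set")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: {value} below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: {value} above {rule['max']}")
        # A missing confidence is treated as fully confident (an assumption of this sketch).
        if entry.get("confidence", 1.0) < min_confidence:
            low_confidence.append(field)
    return errors, low_confidence


record = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.99},
    "currency": {"value": "GBP", "confidence": 0.95},
    "net_amount": {"value": 1250.0, "confidence": 0.62},
}
errors, low_confidence = validate(record)
```

Here the record fails on `currency` (not in the allowed set) and `net_amount` is flagged as low confidence, while the optional `tax_rate` is simply absent.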

Where the stack breaks

The boundary between layer 3 and layer 4. Layout gives you cells of a table; extraction needs to know which cell is 'net amount' vs 'gross amount' vs 'tax.' Headers vary; aliases exist; multi-language documents add complexity.
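One way to bridge this boundary is an alias table that maps the header text found in a table to canonical field names before extraction runs. The sketch below is illustrative only: the `HEADER_ALIASES` entries (including the German ones and the choice to treat "Total" as the gross amount) are assumptions that a real deployment would curate per document type.

```python
import re

# Hypothetical alias table: canonical field -> header spellings seen in documents.
HEADER_ALIASES = {
    "net_amount": {"net amount", "net", "subtotal", "amount excl tax", "nettobetrag"},
    "gross_amount": {"gross amount", "total", "amount incl tax", "bruttobetrag"},
    "tax": {"tax", "vat", "gst", "mwst"},
}


def normalize(header):
    """Lowercase and collapse punctuation and whitespace so 'Net-Amount:' matches 'net amount'."""
    return re.sub(r"[^a-z0-9]+", " ", header.lower()).strip()


def map_headers(headers):
    """Return ({canonical_field: column_index}, [headers with no known alias])."""
    lookup = {
        alias: field
        for field, aliases in HEADER_ALIASES.items()
        for alias in aliases
    }
    mapped, unmapped = {}, []
    for index, header in enumerate(headers):
        field = lookup.get(normalize(header))
        if field is None:
            unmapped.append(header)
        else:
            mapped[field] = index
    return mapped, unmapped


mapped, unmapped = map_headers(["Description", "Net Amount", "VAT", "Total"])
```

Unmapped headers are returned rather than dropped, so they can be passed to the LLM layer or logged as candidates for new aliases.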

Our pattern: a small library of known document types (each a specific vendor's invoice, a specific claim layout), with fallback to LLM-only processing for unknowns. Auto-classification routes to the right extractor. The library grows as volume reveals new types.
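The routing described above can be sketched as a registry of extractors keyed by document type, with an LLM-only fallback for anything the classifier does not recognize. Everything here is a placeholder: the keyword classifier stands in for whatever auto-classification is actually used, and the extractor functions return stub results instead of doing real extraction.

```python
# Hypothetical markers for known document types; a real classifier would be a trained model.
KNOWN_MARKERS = {
    "acme_invoice": "acme corp invoice",
    "standard_claim_form": "claim form cf-7",
}


def classify(text):
    """Toy classifier: return a known document type, or None for unknown documents."""
    lowered = text.lower()
    for doc_type, marker in KNOWN_MARKERS.items():
        if marker in lowered:
            return doc_type
    return None


def extract_acme_invoice(text):
    return {"extractor": "acme_invoice", "chars": len(text)}  # stub result


def extract_standard_claim(text):
    return {"extractor": "standard_claim_form", "chars": len(text)}  # stub result


def extract_with_llm_only(text):
    return {"extractor": "llm_only", "chars": len(text)}  # stub for the fallback path


# The library of known types; it grows as new types are identified.
EXTRACTORS = {
    "acme_invoice": extract_acme_invoice,
    "standard_claim_form": extract_standard_claim,
}


def route(text):
    """Send the document to its type-specific extractor, or to the LLM-only fallback."""
    extractor = EXTRACTORS.get(classify(text), extract_with_llm_only)
    return extractor(text)


known = route("ACME Corp Invoice #12")
unknown = route("Handwritten delivery note")
```

Counting how often documents hit the fallback is one simple signal for which new type to add to the library next.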

Tool recommendations

Managed: Azure Document Intelligence for forms/invoices layout, AWS Textract for strong table extraction, Google Document AI for healthcare-specific forms. Vision LLMs: GPT-4o and Claude Sonnet for the semantic layer.

Self-hosted: Marker or Nougat for PDF-to-Markdown (better than most managed services for complex technical docs), LayoutLM variants for form-field extraction, open vision LLMs for the semantic layer.

Deployment patterns

Human-in-loop for high-stakes extractions (legal, medical, financial). AI extracts; reviewer confirms or corrects. Correction data feeds back to improve the extractors over time, which becomes a powerful flywheel if the team commits to capturing it.
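A minimal sketch of the confirm-or-correct step: the reviewer's values are merged over the AI extraction, and every difference is recorded as a correction that can later feed evaluation or retraining. The function name and record shapes are assumptions for illustration, not a description of a specific review tool.

```python
def apply_review(extraction, reviewed_values):
    """Merge reviewer input over the AI extraction.

    extraction: {field: predicted_value}
    reviewed_values: {field: value} for fields the reviewer changed;
                     fields left out count as confirmed.
    Returns (final_values, corrections).
    """
    final, corrections = {}, []
    for field, predicted in extraction.items():
        reviewed = reviewed_values.get(field, predicted)
        if reviewed != predicted:
            # Keep both values so the correction can be used as feedback data.
            corrections.append(
                {"field": field, "predicted": predicted, "corrected": reviewed}
            )
        final[field] = reviewed
    return final, corrections


extraction = {"policy_number": "PN-88213", "claim_amount": 4200.0, "currency": "USD"}
final, corrections = apply_review(extraction, {"claim_amount": 4700.0})
```

Storing the predicted value next to the corrected one is what makes the log usable later: it shows which fields and document types the extractor gets wrong most often.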

Batch vs streaming. Most workflows are batch, which is cheaper and easier to validate. Real-time document processing is a different architecture with much tighter latency requirements.
