Multimodal AI (LLMs that see images alongside text) matured from 'interesting demo' to 'production deployable' between 2024 and 2026. GPT-4o, Claude's vision, and Gemini's multimodal variants all handle standard tasks reliably. But failure modes still exist, and this post covers the patterns we deploy across clients to work around them.

Multimodal pipeline

Input → preprocess → vision LLM → validate → structured output. Each step exists because raw vision-LLM output still fails in specific, repeatable ways.
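As a rough illustration of the flow above, here is a minimal sketch of the pipeline for a hypothetical invoice task. The `InvoiceFields` schema, the prompt wording, the sanity bound on `total`, and the `call_vision_llm` callable are all assumptions made for the example; in practice that callable would wrap whichever provider SDK you use.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class InvoiceFields:
    # Hypothetical output schema, chosen only for this example.
    vendor: str
    total: float


def preprocess(image_bytes: bytes) -> bytes:
    # Placeholder: resizing, de-skewing, or tiling would happen here.
    return image_bytes


def validate(raw: str) -> InvoiceFields:
    """Parse the model's reply and reject anything off-schema or implausible."""
    data = json.loads(raw)  # raises ValueError if the reply is not JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    vendor = data.get("vendor")
    total = data.get("total")
    if not isinstance(vendor, str) or not vendor.strip():
        raise ValueError("vendor must be a non-empty string")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    if not 0 <= total <= 1_000_000:  # assumed sanity bound
        raise ValueError("total is outside the plausible range")
    return InvoiceFields(vendor=vendor.strip(), total=float(total))


def run_pipeline(
    image_bytes: bytes,
    call_vision_llm: Callable[[bytes, str], str],
) -> InvoiceFields:
    """Input -> preprocess -> vision LLM -> validate -> structured output."""
    prompt = 'Return only JSON with keys "vendor" (string) and "total" (number).'
    raw = call_vision_llm(preprocess(image_bytes), prompt)
    return validate(raw)
```

Passing the model call in as a parameter keeps the validation logic testable without network access, and a `ValueError` gives the caller a single signal for retrying or routing to human review.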

What vision LLMs do well

Describing images. Extracting information from documents where layout matters. Reading tables into structured data. Identifying objects and their relationships. Answering specific questions about image content. OCR on messy real-world images — often better than dedicated OCR for complex layouts.

These are production-ready tasks. Accuracy is high enough that, with validation and a human in the loop for high-stakes cases, you can ship.

Where vision LLMs still fail

Precise spatial reasoning. 'What is to the left of the red box' returns plausible but wrong answers. Counting things accurately: the numbers are close but often wrong. Fine-grained detail in high-res images, because the model sees a downsampled version. Hallucinated objects that are not in the image, particularly when the prompt suggests what might be there.

Mitigations: pair with specialized detectors (YOLOv8, Grounding DINO) for spatial tasks; use counting models for counts; crop and tile for detail; verify claims with critic models.
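One way to apply the detector pairing is to cross-check the objects the LLM says it sees against what a detector actually found. The sketch below assumes detections have already been reduced to `(label, score)` pairs and that both models use the same label vocabulary; the function name and the 0.5 threshold are illustrative choices, not part of any detector's API.

```python
def unsupported_claims(
    claimed: list[str],
    detections: list[tuple[str, float]],
    min_score: float = 0.5,
) -> list[str]:
    """Return the claimed labels that no confident detection backs up."""
    # Labels the detector found with at least the assumed confidence threshold.
    confirmed = {label.lower() for label, score in detections if score >= min_score}
    # Keep the original spelling of each claim so it can be shown to a reviewer.
    return [claim for claim in claimed if claim.lower() not in confirmed]
```

A non-empty result does not prove a hallucination, since the detector can miss objects too, but it is a cheap signal for sending an output to a critic model or a human reviewer.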

Production patterns we deploy

Preprocess for the model's strength. Split high-res images into tiles with positional context. De-skew and binarize documents before OCR-adjacent tasks. Sample video frames at intervals rather than passing long sequences.

Validate every output. Run schema checks, apply sanity bounds, and spot-check samples against ground truth.
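The tiling advice can be sketched as a small planning function that computes crop boxes and a text label for each tile, so the model is told where each crop sits in the full image. The 1024-pixel tile size, the 128-pixel overlap, and the label wording are assumptions for illustration; sensible values depend on the model's input resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    row: int
    col: int
    box: tuple[int, int, int, int]  # (left, upper, right, lower) in pixels
    label: str  # positional context to send alongside the crop


def _starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets along one axis; the last tile sits flush with the edge."""
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)
    return starts


def plan_tiles(width: int, height: int, tile: int = 1024, overlap: int = 128) -> list[Tile]:
    """Plan overlapping tiles that cover a width x height image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    xs = _starts(width, tile, overlap)
    ys = _starts(height, tile, overlap)
    tiles = []
    for r, y in enumerate(ys):
        for c, x in enumerate(xs):
            box = (x, y, min(x + tile, width), min(y + tile, height))
            label = (
                f"Tile row {r + 1} of {len(ys)}, column {c + 1} of {len(xs)}; "
                f"pixels x {box[0]}-{box[2]}, y {box[1]}-{box[3]} "
                f"of a {width}x{height} image."
            )
            tiles.append(Tile(r, c, box, label))
    return tiles
```

Each `box` uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so the crops can be produced with `image.crop(tile.box)` and sent together with `tile.label` in the prompt. The overlap reduces the chance that an object on a tile boundary is cut in half in every crop that contains it.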

Combine with specialized models. Vision LLM for semantic understanding and narration. Specialized models for precision work (object detection, OCR, document classification). The LLM orchestrates; specialists do the heavy lifting where precision matters.
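A simple form of this division of labor is to let a specialist produce the precise facts and hand them to the vision LLM as context for the narrative part. The sketch below builds such a prompt from detector output; the `(label, score)` input format, the threshold, and the prompt wording are assumptions for the example.

```python
from collections import Counter


def build_grounded_prompt(
    question: str,
    detections: list[tuple[str, float]],
    min_score: float = 0.5,
) -> str:
    """Prefix a question with object counts taken from a specialist detector."""
    # Count only detections at or above the assumed confidence threshold.
    counts = Counter(label for label, score in detections if score >= min_score)
    if counts:
        facts = ", ".join(f"{n} x {label}" for label, n in sorted(counts.items()))
    else:
        facts = "no objects above the confidence threshold"
    return (
        f"An object detector reports: {facts}.\n"
        "Treat these counts as more reliable than your own estimate.\n"
        f"Question: {question}"
    )
```

The LLM still writes the description, but the counts come from the model built for counting rather than from the LLM's own estimate.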

Cost management. Vision requests cost significantly more than text-only requests. Measure cost per image. Resize aggressively when the task allows. Cache results when the same image is processed multiple times. For bulk pipelines, consider smaller or self-hosted vision models.
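Caching and cost measurement can live in one thin wrapper around the model call. This is a minimal in-memory sketch: the `CachedVisionClient` name is hypothetical, and it assumes the wrapped callable returns the reply together with that request's cost, which in practice you would compute from the provider's usage data and your own pricing table.

```python
import hashlib
from typing import Callable


class CachedVisionClient:
    """Wraps a model call with a content-addressed cache and cost tracking."""

    def __init__(self, call_model: Callable[[bytes, str], tuple[str, float]]):
        # call_model is assumed to return (reply_text, cost_of_this_request).
        self._call_model = call_model
        self._cache: dict[str, str] = {}
        self.requests = 0    # every ask(), including cache hits
        self.paid_calls = 0  # only calls that reached the model
        self.total_cost = 0.0

    def ask(self, image_bytes: bytes, prompt: str) -> str:
        # Key on image content and prompt, so identical images hit the cache
        # regardless of filename, while a different prompt is a separate entry.
        key = (
            hashlib.sha256(image_bytes).hexdigest()
            + ":"
            + hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        )
        self.requests += 1
        if key not in self._cache:
            reply, cost = self._call_model(image_bytes, prompt)
            self._cache[key] = reply
            self.paid_calls += 1
            self.total_cost += cost
        return self._cache[key]

    @property
    def cost_per_request(self) -> float:
        """Average spend per ask(), with cache hits counted at zero cost."""
        return self.total_cost / self.requests if self.requests else 0.0
```

For a real deployment the dictionary would typically be replaced by a shared store, and the hash should be taken after any resizing so that the key matches what the model actually received.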

Use cases that ship reliably

Invoice and receipt processing, document intelligence (see next post), accessibility alt-text, product photo classification, medical imaging triage with regulatory controls, and content moderation.

Use cases to be careful with: anything requiring precise coordinates or measurements, and anything where a hallucination would be high-stakes without human review.

Multimodal AI in production: vision + text patterns
