Multimodal AI, meaning LLMs that see images alongside text, matured from 'interesting demo' to 'production deployable' between 2024 and 2026. GPT-4o, Claude's vision capabilities, and Gemini's multimodal variants all handle standard tasks reliably. But failure modes still exist, and this post covers the patterns we deploy across clients to work around them.
What vision LLMs do well
Describing images. Extracting information from documents where layout matters. Reading tables into structured data. Identifying objects and their relationships. Answering specific questions about image content. OCR on messy real-world images — often better than dedicated OCR for complex layouts.
These are production-ready tasks. Accuracy is high enough that, with validation and human-in-the-loop review for high-stakes cases, you can ship.
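As a rough illustration of what validation plus human review can look like, here is a minimal sketch for an invoice-extraction case. The field names (`vendor`, `total`, `line_items`), the `max_total` bound, and the two route labels are assumptions invented for this example, not a fixed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    ok: bool
    problems: list = field(default_factory=list)


def check_invoice(fields: dict, max_total: float = 100_000.0) -> Review:
    """Schema and sanity checks on fields a vision model extracted.

    Assumes (hypothetically) that line_items is a list of dicts,
    each with a numeric "amount" key.
    """
    problems = []

    # Schema check: required keys with the expected types.
    required = {"vendor": str, "total": (int, float), "line_items": list}
    for key, expected in required.items():
        if key not in fields:
            problems.append(f"missing field: {key}")
        elif not isinstance(fields[key], expected):
            problems.append(f"wrong type for {key}")
    if problems:
        return Review(False, problems)

    # Sanity bound: reject totals that are non-positive or implausibly large.
    if not 0 < fields["total"] <= max_total:
        problems.append("total out of bounds")

    # Cross-check: line items should add up to the stated total.
    items_sum = sum(item.get("amount", 0) for item in fields["line_items"])
    if abs(items_sum - fields["total"]) > 0.01:
        problems.append("line items do not sum to total")

    return Review(not problems, problems)


def route(fields: dict) -> str:
    """Send anything that fails a check to a person instead of shipping it."""
    return "auto_approve" if check_invoice(fields).ok else "human_review"
```

In a real pipeline the thresholds would come from the business rules for the document type, and high-stakes documents might go to review even when every check passes.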
Where vision LLMs still fail
Precise spatial reasoning: 'What is to the left of the red box?' returns plausible but wrong answers. Counting things accurately: the numbers come out close but are often wrong. Fine-grained detail in high-res images, because the model sees a downsampled version. Hallucinated objects that are not in the image, particularly when the prompt suggests what might be there.
Mitigations: pair with specialized detectors (YOLOv8, Grounding DINO) for spatial tasks; use counting models for counts; crop and tile for detail; verify claims with critic models.
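One way to picture the detector pairing is a cross-check that flags any object the LLM mentions but the detector never found. This is a sketch under assumptions: the detection format (a list of dicts with `label` and `score`) and the `min_score` threshold are hypothetical, and real detectors such as YOLOv8 or Grounding DINO return their own result types that would need adapting first.

```python
def unsupported_claims(claimed, detections, min_score=0.5):
    """Return the claimed labels that no confident detection supports.

    claimed:    object labels the vision LLM said are in the image.
    detections: hypothetical detector output, e.g.
                [{"label": "dog", "score": 0.91}, ...]
    """
    confirmed = {
        d["label"].lower() for d in detections if d["score"] >= min_score
    }
    # Preserve the order and original casing of the LLM's claims.
    return [label for label in claimed if label.lower() not in confirmed]


claimed = ["dog", "Bicycle", "helmet"]
detections = [
    {"label": "dog", "score": 0.91},
    {"label": "bicycle", "score": 0.30},  # below threshold, so not confirmed
]
flagged = unsupported_claims(claimed, detections)
```

Exact label matching is the weak point of this sketch: the LLM may say 'bike' where the detector says 'bicycle', so a production version would likely need a synonym map or an open-vocabulary detector prompted with the LLM's own terms.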
Production patterns we deploy
Preprocess for the model's strengths. Split high-res images into tiles and tell the model where each tile sits in the full image. De-skew and binarize documents before OCR-adjacent tasks. Sample video frames at intervals rather than passing long sequences. Then validate every output: schema checks, sanity bounds, and spot-checks of samples against ground truth.
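The tiling step can be sketched as follows. The tile size, the overlap, and the wording of the position note are assumptions chosen for illustration; the right values depend on the model's input resolution and on how much detail straddles tile borders.

```python
def _starts(size, tile, step):
    """Start offsets along one axis so the tiles cover [0, size)."""
    if size <= tile:
        return [0]
    offsets = list(range(0, size - tile, step))
    offsets.append(size - tile)  # final tile sits flush with the far edge
    return offsets


def tile_boxes(width, height, tile=1024, overlap=128):
    """Return tiles as dicts with grid position and a pixel box.

    Each box is (left, top, right, bottom) in full-image coordinates.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    step = tile - overlap
    rows = _starts(height, tile, step)
    cols = _starts(width, tile, step)
    tiles = []
    for r, top in enumerate(rows):
        for c, left in enumerate(cols):
            box = (left, top, min(left + tile, width), min(top + tile, height))
            tiles.append({"row": r, "col": c, "rows": len(rows),
                          "cols": len(cols), "box": box})
    return tiles


def position_note(t):
    """Text to send alongside a tile so the model knows where it sits."""
    left, top, _, _ = t["box"]
    return (f"This is tile row {t['row'] + 1} of {t['rows']}, "
            f"column {t['col'] + 1} of {t['cols']}. Its top-left corner is "
            f"at pixel ({left}, {top}) of the full image.")
```

The boxes use the (left, top, right, bottom) order that Pillow's `Image.crop` expects, so each one can be passed to it directly if Pillow is the imaging library in use. The overlap exists so that an object cut by one tile border appears whole in a neighbouring tile.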
Combine with specialized models. Vision LLM for semantic understanding and narration. Specialized models for precision work (object detection, OCR, document classification). The LLM orchestrates; specialists do the heavy lifting where precision matters.
Cost management. Vision requests cost significantly more than text-only requests. Measure cost per image. Resize aggressively when the task allows. Cache results when the same image is processed multiple times. For bulk pipelines, consider smaller or self-hosted vision models.
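Two of these levers, resizing and caching, are simple enough to sketch. Everything named here is hypothetical: `call_model` stands in for whatever client function actually sends the image, the `max_side` default is arbitrary, and the in-memory dict would be a persistent store in a real deployment.

```python
import hashlib


def fit_within(width, height, max_side=1024):
    """Target size that keeps aspect ratio with the longest side capped."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def cache_key(image_bytes, prompt, model):
    """Key on image content, prompt, and model, not on the filename."""
    h = hashlib.sha256()
    for part in (image_bytes, prompt.encode("utf-8"), model.encode("utf-8")):
        h.update(part)
        h.update(b"\x00")  # separator so adjacent parts cannot blur together
    return h.hexdigest()


class CachedVision:
    """Wraps a model call so repeated (image, prompt, model) hits are free."""

    def __init__(self, call_model):
        self._call = call_model  # hypothetical: call_model(image_bytes, prompt)
        self._store = {}
        self.misses = 0  # number of paid calls, useful for cost-per-image stats

    def describe(self, image_bytes, prompt, model="vision-model"):
        key = cache_key(image_bytes, prompt, model)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._call(image_bytes, prompt)
        return self._store[key]
```

Hashing the bytes rather than the path means a re-uploaded or renamed copy of the same file still hits the cache, while any change to the prompt or model produces a fresh call.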
Use cases that ship reliably
Invoice and receipt processing, document intelligence (see next post), accessibility alt-text, product photo classification, medical imaging triage with regulatory controls, and content moderation. Use cases to be careful with: anything requiring precise coordinates or measurements, and anything where a hallucination would be high-stakes and no human reviews the output.