Batch inference APIs from OpenAI, Anthropic, and Google cut LLM inference costs by 50%, yet most teams don't use them. The objections — 24-hour SLA feels slow, the integration is different, the workload seems to need real-time — are mostly wrong. Most AI workloads have a large batchable component. This post is the architecture for mixing batch and real-time inference, and the specific use cases where batch earns payback.

Batch vs real-time

Batch: document enrichment, bulk classification, nightly refreshes, eval runs. Real-time: chat, streaming, tool calls, voice. Hybrid pipelines route per-request.

The economics

OpenAI Batch API: 50% discount, 24-hour SLA (typically completes in 1-4 hours). Anthropic Message Batches API: 50% discount, 24-hour SLA. Google Gemini: similar. For a team running 10M+ tokens/day through batchable workloads, the savings are substantial — $150K/year on a $300K annual bill isn't theoretical.

The SLA is not 24 hours — it's a 24-hour ceiling. In practice batches of a few thousand requests complete in minutes to an hour. Unless your workflow specifically needs sub-second response, batch fits.

Workloads that fit batch

Document enrichment pipelines: extract entities, classify, generate summaries for uploaded files. User doesn't need the enrichment in their first second — they expect it to appear within minutes. Batch fits perfectly.

Bulk classification: categorizing support tickets, emails, customer feedback. These are processed in hourly or daily windows. Batch matches the cadence.

Nightly derived data: pre-computing tomorrow's recommendations, refreshing cached summaries, regenerating content for stale entries. The entire workload is inherently batch.

Eval runs: running your eval suite against a new model or prompt. No latency requirement at all. Batch cuts the eval budget in half. See eval infrastructure post.

Workloads that require real-time

Chat interactions, streaming generations, tool calls in agent loops, voice copilot interactions. Anything where user perceives wait time and expects response within a second or two.

Interactive UX: code completion, search autocomplete, inline document editing. Latency below 200ms is often the goal; batch is irrelevant.

The hybrid pipeline pattern

Real-time for the interactive surface; batch for follow-ups and enrichment. User uploads a document, sees a snippet summary in 3 seconds (real-time call), and gets full enrichment (chapter summaries, extracted entities, related documents) within an hour (batch).

The architectural pattern: a queue between the user-facing layer and the batch layer. Async workers drain the queue, send requests to batch APIs, handle the async completion callback, store results. User sees progressive enhancement as batch results land.

Implementation specifics

Batch APIs accept JSONL files of requests. Upload the file; get a batch_id; poll or subscribe for completion. Then download the results file. Each major provider has slightly different file format and polling semantics — abstract behind a common interface.

Error handling: individual requests within a batch can fail independently. Parse each response; requeue failures; alert on systemic failure patterns.

Batch pitfalls

Don't batch latency-critical workloads hoping 'it'll usually be fast.' Sometimes batches do take 20+ hours. Architect for the SLA ceiling, not the average case.

Don't batch tiny volumes (fewer than 50 requests per batch). Overhead dominates. Accumulate for at least a few minutes or until you hit a size threshold.

Don't forget observability. Batch pipelines fail in ways real-time doesn't — stuck batches, partial completions, quota exhaustion. Monitor queue depth, batch status, completion latency explicitly. See observability post.

When to adopt batch

When your monthly inference bill exceeds $10K and batchable workloads are at least 30% of traffic. Below that, the engineering effort doesn't pay back fast enough. Above that, it's leaving significant money on the table.

Batch inference for LLMs: the economics and the patterns

The economics

Workloads that fit batch

Workloads that require real-time

The hybrid pipeline pattern

Implementation specifics

Batch pitfalls

When to adopt batch

Continue the thread.

Total cost of ownership for LLM systems

Self-hosting vs managed: GPU decisions in 2026

Latency budgeting for LLM systems

Want to talk about this?