Batch inference APIs from OpenAI, Anthropic, and Google cut LLM inference costs by 50%, yet most teams don't use them. The objections — 24-hour SLA feels slow, the integration is different, the workload seems to need real-time — are mostly wrong. Most AI workloads have a large batchable component. This post is the architecture for mixing batch and real-time inference, and the specific use cases where batch earns payback.
The economics
OpenAI Batch API: 50% discount, 24-hour SLA (typically completes in 1-4 hours). Anthropic Message Batches API: 50% discount, 24-hour SLA. Google Gemini: similar. For a team running 10M+ tokens/day through batchable workloads, the savings are substantial — $150K/year on a $300K annual bill isn't theoretical.
The SLA is not 24 hours — it's a 24-hour ceiling. In practice batches of a few thousand requests complete in minutes to an hour. Unless your workflow specifically needs sub-second response, batch fits.
Workloads that fit batch
Document enrichment pipelines: extract entities, classify, generate summaries for uploaded files. User doesn't need the enrichment in their first second — they expect it to appear within minutes. Batch fits perfectly.
Bulk classification: categorizing support tickets, emails, customer feedback. These are processed in hourly or daily windows. Batch matches the cadence.
Nightly derived data: pre-computing tomorrow's recommendations, refreshing cached summaries, regenerating content for stale entries. The entire workload is inherently batch.
Eval runs: running your eval suite against a new model or prompt. No latency requirement at all. Batch cuts the eval budget in half. See eval infrastructure post.
Workloads that require real-time
Chat interactions, streaming generations, tool calls in agent loops, voice copilot interactions. Anything where user perceives wait time and expects response within a second or two.
Interactive UX: code completion, search autocomplete, inline document editing. Latency below 200ms is often the goal; batch is irrelevant.
The hybrid pipeline pattern
Real-time for the interactive surface; batch for follow-ups and enrichment. User uploads a document, sees a snippet summary in 3 seconds (real-time call), and gets full enrichment (chapter summaries, extracted entities, related documents) within an hour (batch).
The architectural pattern: a queue between the user-facing layer and the batch layer. Async workers drain the queue, send requests to batch APIs, handle the async completion callback, store results. User sees progressive enhancement as batch results land.
Implementation specifics
Batch APIs accept JSONL files of requests. Upload the file; get a batch_id; poll or subscribe for completion. Then download the results file. Each major provider has slightly different file format and polling semantics — abstract behind a common interface.
Error handling: individual requests within a batch can fail independently. Parse each response; requeue failures; alert on systemic failure patterns.
Batch pitfalls
Don't batch latency-critical workloads hoping 'it'll usually be fast.' Sometimes batches do take 20+ hours. Architect for the SLA ceiling, not the average case.
Don't batch tiny volumes (fewer than 50 requests per batch). Overhead dominates. Accumulate for at least a few minutes or until you hit a size threshold.
Don't forget observability. Batch pipelines fail in ways real-time doesn't — stuck batches, partial completions, quota exhaustion. Monitor queue depth, batch status, completion latency explicitly. See observability post.
When to adopt batch
When your monthly inference bill exceeds $10K and batchable workloads are at least 30% of traffic. Below that, the engineering effort doesn't pay back fast enough. Above that, it's leaving significant money on the table.