Every production AI system needs a gateway layer between application code and LLM providers. Without it, your routing logic, retries, fallbacks, observability, and cost attribution end up scattered across the codebase. With it, concerns separate cleanly and every cross-cutting policy lives in one place. This post covers the responsibilities a good gateway handles, what to build versus buy, and the specific tools we recommend as of 2026.

Gateway responsibilities

The gateway is middleware between the app and its providers. It handles routing, retries, fallbacks, rate limits, caching, observability, cost attribution, PII redaction, audit logging, and streaming passthrough.

Core responsibilities

Auth and tenant routing. Every request arrives with tenant context; the gateway enforces auth and propagates tenant_id throughout the pipeline. See multi-tenancy post.
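As a minimal sketch of what "enforces auth and propagates tenant_id" might look like, the snippet below resolves an API key to a tenant and returns a context object that later stages can read. The key table, `RequestContext`, and `authenticate` are hypothetical names for illustration; a real gateway would look keys up in a secrets store or identity provider.

```python
import hmac
from dataclasses import dataclass

# Hypothetical key table for illustration only; do not hard-code keys in practice.
API_KEYS = {"key-acme": "acme", "key-globex": "globex"}


class AuthError(Exception):
    """Raised when a request cannot be tied to a known tenant."""


@dataclass(frozen=True)
class RequestContext:
    tenant_id: str
    endpoint: str

    def labels(self) -> dict:
        """Labels that downstream stages (logging, limits, cache) can attach."""
        return {"tenant_id": self.tenant_id, "endpoint": self.endpoint}


def authenticate(api_key: str, endpoint: str) -> RequestContext:
    """Map an API key to a tenant, comparing keys in constant time."""
    for known_key, tenant_id in API_KEYS.items():
        if hmac.compare_digest(known_key.encode("utf-8"), api_key.encode("utf-8")):
            return RequestContext(tenant_id=tenant_id, endpoint=endpoint)
    raise AuthError("unknown API key")
```

Every later stage takes the `RequestContext` rather than re-deriving the tenant, which keeps tenant_id consistent across the pipeline.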

Model routing and fallbacks. Route to the right model for the task (simple queries to cheaper tiers, complex to frontier). Fall back automatically on provider errors or latency spikes. See multi-model routing post.
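One way to picture tiered routing with fallback is the sketch below. The prompt-length heuristic, the `Route` type, and the tier labels are assumptions for illustration, not a recommended classifier; real routers usually use task type or a learned signal.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """A provider failure worth falling back from (error or timeout)."""


@dataclass(frozen=True)
class Route:
    model: str                   # hypothetical tier label, not a real model id
    call: Callable[[str], str]   # sends the prompt to one provider


def pick_routes(prompt: str, cheap: Route, frontier: Route,
                max_cheap_chars: int = 200) -> List[Route]:
    """Toy heuristic: short prompts try the cheap tier first."""
    if len(prompt) <= max_cheap_chars:
        return [cheap, frontier]
    return [frontier, cheap]


def complete_with_fallback(prompt: str, routes: Sequence[Route]) -> Tuple[str, str]:
    """Try routes in order; return (model, response) from the first success."""
    last_error = None
    for route in routes:
        try:
            return route.model, route.call(prompt)
        except ProviderError as exc:
            last_error = exc  # remember the failure, then try the next route
    raise ProviderError(f"all routes failed: {last_error}")
```

Returning the model name alongside the response lets the observability layer record which route actually served the request.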

Rate limiting and quota management. Enforced at the gateway so every request path gets the same treatment. See rate limiting post.
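A per-tenant token bucket is one common way to enforce limits in a single place. The sketch below is an in-memory illustration with hypothetical class names; a multi-instance gateway would need shared state (for example, in a datastore) instead of a local dict.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity              # maximum burst size
        self.refill_per_sec = refill_per_sec  # sustained request rate
        self.tokens = capacity
        self.updated_at: Optional[float] = None

    def allow(self, now: float, cost: float = 1.0) -> bool:
        if self.updated_at is None:
            self.updated_at = now
        # Refill based on elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Keeps one bucket per tenant so tenants cannot starve each other."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self._capacity = capacity
        self._refill = refill_per_sec
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if tenant_id not in self._buckets:
            self._buckets[tenant_id] = TokenBucket(self._capacity, self._refill)
        return self._buckets[tenant_id].allow(now)
```

The `now` parameter exists only to make the behavior easy to test; production callers would omit it.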

Caching. Exact-match, template, and semantic layers as appropriate. See caching patterns post.
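For the exact-match layer, the core idea is a stable key over everything that affects the response. The sketch below assumes the key should include the tenant (so tenants never share cached answers); the function and class names are hypothetical, and the store is a plain dict rather than a real cache with TTLs.

```python
import hashlib
import json
from typing import Callable, Dict, Tuple


def cache_key(tenant_id: str, model: str, messages: list, params: dict) -> str:
    """Hash a canonical JSON form so dict ordering does not change the key."""
    payload = json.dumps(
        {"tenant": tenant_id, "model": model, "messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_call(self, key: str, call: Callable[[], str]) -> Tuple[str, bool]:
        """Return (value, was_hit); only call the provider on a miss."""
        if key in self._store:
            return self._store[key], True
        value = call()
        self._store[key] = value
        return value, False
```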

Observability and cost attribution. Every request tagged with tenant, user, endpoint, model, token counts, cost, latency. Dashboards derived from these labels tell you who's spending what, where.
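To make the tagging concrete, here is a sketch of a per-request record and a spend rollup. The price table is hypothetical (USD per million tokens, invented for the example), as are the tier labels and field names; real prices vary by provider and change over time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical (input, output) prices in USD per 1M tokens.
PRICES = {"cheap-tier": (0.50, 1.50), "frontier-tier": (5.00, 15.00)}


@dataclass(frozen=True)
class RequestRecord:
    tenant_id: str
    user_id: str
    endpoint: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        in_price, out_price = PRICES[self.model]
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1_000_000


def spend_by_tenant(records: Iterable[RequestRecord]) -> Dict[str, float]:
    """Aggregate cost per tenant; the same pattern works for user or endpoint."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.tenant_id] += record.cost_usd
    return dict(totals)
```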

Retries and timeouts. Provider APIs fail; the gateway abstracts this from app code. Configurable retry policies per endpoint; distinguish retriable errors (5xx, 429 rate limits, timeouts) from non-retriable ones (other 4xx).
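A retry policy along these lines can be sketched as follows. The exception type, the exact set of retriable statuses, and the backoff parameters are assumptions chosen for illustration; the jittered exponential backoff is a common choice, not something the providers mandate.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed retriable statuses: rate limiting plus common transient server errors.
RETRIABLE_STATUS = {429, 500, 502, 503, 504}


class ProviderHTTPError(Exception):
    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned {status}")
        self.status = status


def is_retriable(status: int) -> bool:
    return status in RETRIABLE_STATUS


def call_with_retries(call: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ProviderHTTPError as exc:
            # Give up immediately on client errors, or when attempts run out.
            if not is_retriable(exc.status) or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")
```

Note that 429 is a 4xx status but belongs in the retriable set, which is why the check is an explicit allowlist rather than a range test.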

PII redaction. Optional: scan requests for sensitive data before hitting external providers. See PII redaction post.
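As a rough sketch of pre-flight redaction, the snippet below masks two illustrative patterns (email addresses and US-style SSNs) before a prompt leaves the gateway. The patterns and placeholder labels are assumptions for the example; regexes alone miss a lot of real PII, so treat this as the shape of the hook rather than a complete detector.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only; production systems typically combine
# pattern matching with dedicated PII detection.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace matches with [LABEL] and report how many of each were found."""
    counts: Dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, found = pattern.subn(f"[{label}]", text)
        if found:
            counts[label] = found
    return text, counts
```

Returning the counts lets the gateway log that redaction happened without logging the sensitive values themselves.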

Audit logging. Immutable record of every request for compliance and debugging. Queryable by tenant.
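The post does not say how immutability is achieved; one possible approach is a hash-chained, append-only log, where each entry commits to the previous one so tampering is detectable. The sketch below is an in-memory illustration with hypothetical names, not a substitute for write-once storage.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body: Dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries: List[Dict] = []

    def append(self, tenant_id: str, event: Dict) -> Dict:
        """Add an entry whose hash covers its content and the previous hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {"tenant_id": tenant_id, "event": event, "prev_hash": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def for_tenant(self, tenant_id: str) -> List[Dict]:
        return [e for e in self._entries if e["tenant_id"] == tenant_id]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: entry[k] for k in ("tenant_id", "event", "prev_hash")}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True
```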

Guardrails stack. Content filters, output validators, safety checks applied centrally so every endpoint benefits. See guardrails post.
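Applying checks centrally can be as simple as running every output through an ordered list of validators. The two checks below (a length cap and a blocklist) are hypothetical stand-ins for real content filters; the point of the sketch is the stack shape, where each check returns a violation message or nothing.

```python
from typing import Callable, Iterable, List, Optional

Check = Callable[[str], Optional[str]]  # returns a violation message, or None


def max_length(limit: int) -> Check:
    def check(text: str) -> Optional[str]:
        return f"output exceeds {limit} characters" if len(text) > limit else None
    return check


def blocklist(terms: Iterable[str]) -> Check:
    lowered = [term.lower() for term in terms]

    def check(text: str) -> Optional[str]:
        haystack = text.lower()
        hit = next((term for term in lowered if term in haystack), None)
        return f"blocked term: {hit}" if hit else None
    return check


def run_guardrails(text: str, checks: Iterable[Check]) -> List[str]:
    """Run every check and collect violations; an empty list means pass."""
    results = (check(text) for check in checks)
    return [message for message in results if message is not None]
```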

Streaming passthrough. SSE forwarding with interception for observability, without breaking client-side streaming UX. See streaming UX post.
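The difference between intercepting and buffering can be shown with a generator that forwards each chunk the moment it arrives and only reports metrics once the stream ends. This is a transport-agnostic sketch: `passthrough`, the `on_done` callback, and the metric names are assumptions, and a real gateway would wrap the SSE response body of its HTTP framework.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def passthrough(chunks: Iterable[str],
                on_done: Callable[[Dict], None],
                clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Yield chunks unmodified while recording stream-level metrics."""
    started = clock()
    first_chunk_at = None
    count = 0
    chars = 0
    try:
        for chunk in chunks:
            if first_chunk_at is None:
                first_chunk_at = clock()
            count += 1
            chars += len(chunk)
            yield chunk  # forwarded immediately; nothing is accumulated
    finally:
        # Runs on normal completion and also if the client disconnects early.
        on_done({
            "chunks": count,
            "chars": chars,
            "time_to_first_chunk_s": None if first_chunk_at is None else first_chunk_at - started,
            "total_s": clock() - started,
        })
```

Because the wrapper is lazy, it pulls from the upstream only when the client consumes, so it adds no buffering delay of its own.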

Build or buy

Buy: Portkey, Helicone, LiteLLM. Portkey and Helicone are primarily hosted products, though both also publish open-source code; LiteLLM is OSS. These handle routing, caching, observability, cost attribution, rate limiting. Deployment takes about a day. For most teams under 100 engineers, buy is the right choice.

Build: when you need deep integration with your own auth, audit, or domain-specific guardrails that the off-the-shelf products don't support well. Common in regulated industries (healthcare, finance) with very specific compliance requirements.

Hybrid: thin internal wrapper over Portkey or LiteLLM. Wrapper handles your custom logic; the underlying tool handles the routine plumbing. Most sophisticated teams end up here.

Common pitfalls

Rolling your own when you don't need to. Building a full gateway from scratch is 6+ months of engineering. Very rarely justified versus adopting an existing tool and adding the custom layer on top.

Treating observability as an afterthought. Without instrumentation from day one, you'll be flying blind when issues hit. Every request through the gateway should be tagged and logged.

Ignoring streaming. Many gateways add latency to streaming responses because they buffer and re-emit. Use a gateway that genuinely passes through SSE without buffering.

Rollout pattern

Don't migrate the whole app at once. Start with new endpoints routed through the gateway. Migrate existing endpoints one at a time, validating observability and behavior at each step. Full migration typically takes 2-4 weeks for a mid-size app.
