Every production AI system needs a gateway layer between application code and LLM providers. Without it, your routing logic, retries, fallbacks, observability, and cost attribution end up scattered across the codebase. With it, concerns separate cleanly and every cross-cutting policy lives in one place. This post covers the responsibilities a good gateway handles, what to build versus buy, and the specific tools we recommend as of 2026.
Core responsibilities
Auth and tenant routing. Every request arrives with tenant context; the gateway enforces auth and propagates tenant_id throughout the pipeline. See multi-tenancy post.
Model routing and fallbacks. Route to the right model for the task (simple queries to cheaper tiers, complex ones to frontier models). Fall back automatically on provider errors or latency spikes. See multi-model routing post.
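To make the routing-plus-fallback idea concrete, here is a minimal sketch. The names (pick_tier, call_with_fallback, ProviderError) and the length-based heuristic are assumptions for illustration, not any particular gateway's API; real routers usually classify requests with more than prompt length.

```python
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error a provider call raises on a 5xx, timeout, etc."""


def pick_tier(prompt: str, long_prompt_chars: int = 2000) -> str:
    # Assumed heuristic: long prompts go to the frontier tier, the rest to a cheap one.
    return "frontier" if len(prompt) > long_prompt_chars else "cheap"


def call_with_fallback(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) in order; return (name, response) from the first success."""
    last_error: Optional[Exception] = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            last_error = exc  # remember the failure and move to the next provider
    raise ProviderError("all providers failed") from last_error
```

The chain is ordered by preference, so the app code only ever sees one response or one final error.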
Rate limiting and quota management. Enforced here so every request path gets the same treatment. See rate limiting post.
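One common way to enforce per-tenant limits at the gateway is a token bucket. The sketch below is an in-memory, single-process version with hypothetical names; a real deployment would typically keep this state in a shared store so every gateway instance sees the same counts.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: float, refill_per_sec: float, now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets: Dict[str, TokenBucket] = {}


def allow_request(tenant_id: str) -> bool:
    # Assumed default: 10-request burst, 1 request/second sustained, per tenant.
    bucket = buckets.setdefault(tenant_id, TokenBucket(capacity=10, refill_per_sec=1.0))
    return bucket.allow()
```

The `now` parameter exists only to make the bucket easy to test with a fake clock.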
Caching. Exact match, template, semantic layers as appropriate. See caching patterns post.
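For the exact-match layer, the main design decision is what goes into the cache key. A minimal sketch, assuming the key should cover the tenant, the model, the messages, and the sampling parameters (the function name and field choices are illustrative):

```python
import hashlib
import json
from typing import Any, Dict, List


def cache_key(
    tenant_id: str,
    model: str,
    messages: List[Dict[str, Any]],
    params: Dict[str, Any],
) -> str:
    """Deterministic key for an exact-match cache lookup."""
    payload = json.dumps(
        {"tenant": tenant_id, "model": model, "messages": messages, "params": params},
        sort_keys=True,          # dict ordering must not change the key
        separators=(",", ":"),   # stable, whitespace-free encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the tenant in the key is an assumption made here so that one tenant's cached response is never served to another; drop it only if responses are genuinely tenant-independent.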
Observability and cost attribution. Every request tagged with tenant, user, endpoint, model, token counts, cost, latency. Dashboards derived from these labels tell you who's spending what, where.
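As a rough illustration of how those labels turn into a spend breakdown, here is a sketch. The model names and per-token prices are made up for the example, and RequestRecord is a hypothetical shape, not a schema from any specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"small-model": (0.50, 1.50), "large-model": (5.00, 15.00)}


@dataclass(frozen=True)
class RequestRecord:
    tenant_id: str
    user_id: str
    endpoint: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


def cost_usd(rec: RequestRecord) -> float:
    input_price, output_price = PRICES[rec.model]
    return (rec.input_tokens * input_price + rec.output_tokens * output_price) / 1_000_000


def spend_by(records: Iterable[RequestRecord], label: str) -> Dict[str, float]:
    """Total cost grouped by any label field, e.g. 'tenant_id' or 'endpoint'."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[getattr(rec, label)] += cost_usd(rec)
    return dict(totals)
```

Calling spend_by(records, "tenant_id") or spend_by(records, "endpoint") answers the "who's spending what, where" question from the same set of records.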
Retries and timeouts. Provider APIs fail; the gateway abstracts this from app code. Configure retry policies per endpoint, and distinguish retriable errors (5xx, timeouts, 429 rate limits) from non-retriable ones (other 4xx).
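A minimal sketch of such a retry policy, using exponential backoff with jitter. UpstreamError and the exact set of retriable status codes are assumptions for illustration; tune both to what your providers actually return.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed retriable set: request timeout, rate limit, and common 5xx codes.
RETRIABLE_STATUS = {408, 429, 500, 502, 503, 504}


class UpstreamError(Exception):
    """Hypothetical error carrying the provider's HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except UpstreamError as exc:
            # Give up on non-retriable errors or when attempts are exhausted.
            if exc.status not in RETRIABLE_STATUS or attempt == max_attempts:
                raise
            # Exponential backoff with full jitter.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Passing `sleep` in as a parameter keeps the policy testable without real delays.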
PII redaction. Optional: scan requests for sensitive data before hitting external providers. See PII redaction post.
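As a deliberately simple illustration, the sketch below masks two patterns (email addresses and US-style SSNs) with regular expressions. This is an assumption-heavy toy: production redaction usually needs a dedicated detection library or service, since regexes miss names, addresses, and most free-form identifiers.

```python
import re

# Illustrative patterns only; not an exhaustive PII definition.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running this in the gateway means the redacted text, not the original, is what reaches the provider and the logs.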
Audit logging. Immutable record of every request for compliance and debugging. Queryable by tenant.
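One way to approximate immutability in application code is a hash chain, where each entry commits to the one before it. Note this makes tampering detectable rather than impossible, and it is a swapped-in technique for illustration; write-once storage is the more common production answer. All names here are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64


def _digest(body: Dict[str, Any]) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], tenant_id: str, event: Dict[str, Any]) -> Dict[str, Any]:
    """Append an entry whose hash covers its content and the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"tenant_id": tenant_id, "event": event, "prev_hash": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: List[Dict[str, Any]]) -> bool:
    """Return False if any entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("tenant_id", "event", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


def entries_for(log: List[Dict[str, Any]], tenant_id: str) -> List[Dict[str, Any]]:
    return [e for e in log if e["tenant_id"] == tenant_id]
```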
Guardrails stack. Content filters, output validators, safety checks applied centrally so every endpoint benefits. See guardrails post.
Streaming passthrough. SSE forwarding with interception for observability, without breaking client-side streaming UX. See streaming UX post.
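The interception part can be as small as a generator that forwards each chunk the moment it arrives and reports totals once the stream ends. This is a sketch with hypothetical names; it counts chunks and bytes only, and a real gateway would also parse the events to extract token usage.

```python
from typing import Callable, Iterable, Iterator


def passthrough(
    chunks: Iterable[bytes],
    on_done: Callable[[int, int], None],
) -> Iterator[bytes]:
    """Yield upstream chunks unchanged; call on_done(chunk_count, byte_count) at the end."""
    count = 0
    total_bytes = 0
    try:
        for chunk in chunks:
            count += 1
            total_bytes += len(chunk)
            yield chunk  # forwarded immediately, no buffering
    finally:
        # Runs on normal completion and when the consumer closes the generator early.
        on_done(count, total_bytes)
```

Because nothing is accumulated before yielding, the client's time to first token is unaffected by the observability hook.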
Build or buy
Buy: Portkey and Helicone (both offer hosted plans and publish open-source code), LiteLLM (OSS). These handle routing, caching, observability, cost attribution, and rate limiting. A basic deployment takes about a day. For most teams under 100 engineers, buy is the right choice.
Build: when you need deep integration with your own auth, audit, or domain-specific guardrails that the off-the-shelf products don't support well. Common in regulated industries (healthcare, finance) with very specific compliance requirements.
Hybrid: thin internal wrapper over Portkey or LiteLLM. Wrapper handles your custom logic; the underlying tool handles the routine plumbing. Most sophisticated teams end up here.
Common pitfalls
Rolling your own when you don't need to. Building a full gateway from scratch is 6+ months of engineering. Very rarely justified versus adopting an existing tool and adding the custom layer on top.
Treating observability as an afterthought. Without instrumentation from day one, you'll be flying blind when issues hit. Every request through the gateway should be tagged and logged.
Ignoring streaming. Many gateways add latency to streaming responses because they buffer and re-emit. Use a gateway that genuinely passes through SSE without buffering.
Rollout pattern
Don't migrate the whole app at once. Start with new endpoints routed through the gateway. Migrate existing endpoints one at a time, validating observability and behavior at each step. Full migration typically takes 2-4 weeks for a mid-size app.
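The per-endpoint migration can be driven by a small routing switch in the app. In the sketch below, the set of migrated endpoints follows the pattern above; the optional per-endpoint canary percentage is an added assumption for teams that want to validate an endpoint on a slice of traffic first. All names and example endpoints are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Set


def use_gateway(
    endpoint: str,
    request_id: str,
    migrated: Set[str],
    canary_percent: Optional[Dict[str, int]] = None,
) -> bool:
    """Decide whether this request goes through the gateway or the legacy direct path."""
    if endpoint in migrated:
        return True
    percent = (canary_percent or {}).get(endpoint, 0)
    # Stable bucket in [0, 100) so the same request ID always gets the same decision.
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent
```

Once an endpoint's dashboards and behavior look right, move it from the canary map into the migrated set and start on the next one.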