eazyware
Engineering · January 28, 2026 · 12 min read

Self-hosting vs managed: GPU decisions in 2026

When to pay for managed inference and when to run your own GPUs. Real costs from real deployments.

Kushal R.
Engineering lead

Self-host or use a managed API? The question gets asked every quarter at every client. The answer has genuinely changed over the last two years — Fal, Modal, Replicate, RunPod, and friends have turned bursty GPU hosting from a research project into a line item. This post is the decision tree and the real numbers behind each branch.

Decision tree
[Figure: GPU hosting decision tree. Start: inference need. Question 1: "Closed-source model enough?" Yes: use an API (OpenAI/Anthropic). No, or privacy rules it out: go to question 2, "Need > 8 GPUs utilization?" No, bursty: serverless GPU (Fal, Modal). Yes, steady: rent or own GPUs. Footer: serverless wins below 20% GPU utilization; owned wins above ~60%.]
Closed-source API if the model fits and privacy allows. Serverless GPU for bursty self-hosted. Dedicated or owned GPUs above ~60% utilization. The economics invert sharply around the utilization threshold.
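The tree can be written down as a small function. This is a minimal sketch, assuming the ~60% threshold from the caption above; the `Workload` fields and the tier labels are hypothetical names chosen for illustration, not part of any platform's API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    closed_model_ok: bool       # a closed-source API model meets the quality bar
    privacy_allows_api: bool    # data is allowed to leave your network boundary
    avg_gpu_utilization: float  # expected busy fraction of the GPUs, 0.0 to 1.0


def choose_tier(w: Workload, dedicated_threshold: float = 0.60) -> str:
    """Walk the decision tree; the 0.60 default is the caption's rough threshold."""
    if w.closed_model_ok and w.privacy_allows_api:
        return "managed-api"
    if w.avg_gpu_utilization >= dedicated_threshold:
        return "dedicated-or-owned"
    return "serverless-gpu"


# A bursty workload on an open-weights model lands on serverless.
print(choose_tier(Workload(closed_model_ok=False,
                           privacy_allows_api=True,
                           avg_gpu_utilization=0.10)))  # serverless-gpu
```

A single threshold is a simplification: between roughly 20% and 60% utilization the answer depends on the rates you can actually get.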

The three tiers

Tier 1: Managed API (OpenAI, Anthropic, Google)

Zero ops. Pricing is public and predictable. Scales from 0 to 10M requests/day without your involvement. Limitations: fine-tuning is unavailable or restricted for most of these models (Anthropic's, notably); you can't run offline or in an air-gapped environment; and data leaves your network boundary (with mitigations: BAA, enterprise tiers, Azure OpenAI). Cost model: per-token.

When to pick: default for anything that fits. Around 80% of our client projects never need to leave this tier.

Tier 2: Serverless GPU (Fal, Modal, Replicate, Together, Fireworks)

You provide the model (open weights or your own fine-tune), the platform hosts it and exposes it as an API, and you pay per second of GPU time, with sub-second cold starts. Great for bursty workloads, open-model hosting without operating GPUs, and prototyping new open models.

Cost model: per GPU-second. A single A100 is roughly $2-4/hour on demand on these platforms; an H100 is $4-8/hour. The math is clearly in your favor when average utilization is below about 20%: you pay only while serving, with no idle cost.
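To make per-second billing concrete, here is a rough sketch of the arithmetic. The $3/hour rate is an assumed midpoint of the A100 range quoted above, not a quote from any specific platform, and the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600


def busy_seconds_per_day(utilization: float) -> float:
    """GPU-seconds actually spent serving in one day at a given busy fraction."""
    return utilization * 24 * SECONDS_PER_HOUR


def serverless_cost(busy_gpu_seconds: float, hourly_rate_usd: float) -> float:
    """Per-second billing: only busy seconds are charged."""
    return busy_gpu_seconds * hourly_rate_usd / SECONDS_PER_HOUR


# Assumption: one A100 at $3/hour, busy 20% of the day.
cost = serverless_cost(busy_seconds_per_day(0.20), hourly_rate_usd=3.0)
always_on = 24 * 3.0  # the same GPU billed around the clock at the same rate
print(f"serverless: ${cost:.2f}/day, always-on: ${always_on:.2f}/day")
```

At 20% utilization the serverless bill is one fifth of what an always-on GPU would cost at the same hourly rate, which is why low-utilization workloads belong here.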

When to pick: any open-weights model where you can't or won't go to Tier 1; bursty workloads with high variance; teams without dedicated infra capacity.

Tier 3: Dedicated or owned GPUs

You rent GPU instances long-term (reserved instances on hyperscalers, dedicated instances on specialized GPU clouds) or you own hardware in a colo. Pay per-hour for the box whether you use it or not.

Cost model: the break-even vs serverless depends on utilization. Once average utilization reaches roughly 60-70% (the GPU is actively serving 14-17 hours out of 24), dedicated rental beats serverless. Above 70%, owned hardware in a colo beats rental over a 2-3 year amortization. Below 40%, serverless is almost always cheaper. In between, the answer depends on the specific rates you can negotiate.
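The break-even point follows from a one-line comparison. A dedicated GPU costs its hourly rate for all 24 hours, while serverless costs its hourly rate only for the busy fraction, so the two are equal when utilization equals the ratio of the rates. The sketch below assumes the same GPU type and throughput on both sides and ignores ops cost; the $2.60 and $4.00 rates are hypothetical.

```python
def break_even_utilization(dedicated_hourly_usd: float,
                           serverless_hourly_usd: float) -> float:
    """Utilization at which 24h of dedicated equals busy-time-only serverless.

    dedicated/day  = 24 * dedicated_rate
    serverless/day = 24 * utilization * serverless_rate
    Equal when utilization = dedicated_rate / serverless_rate.
    """
    if serverless_hourly_usd <= 0:
        raise ValueError("serverless rate must be positive")
    return dedicated_hourly_usd / serverless_hourly_usd


def cheaper_option(utilization: float,
                   dedicated_hourly_usd: float,
                   serverless_hourly_usd: float) -> str:
    threshold = break_even_utilization(dedicated_hourly_usd, serverless_hourly_usd)
    return "dedicated" if utilization > threshold else "serverless"


# Hypothetical rates: $2.60/hour reserved vs $4.00/hour serverless.
u = break_even_utilization(2.60, 4.00)
print(f"break-even at {u:.0%} utilization, about {u * 24:.1f} busy hours/day")
```

With these assumed rates the break-even sits at 65%, inside the 60-70% band above. A bigger reserved discount pulls the threshold down; a smaller one pushes it up.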

When to pick: high-volume, steady-state open-model workloads; cases where privacy requires on-premise; cases where you need custom hardware (very large models, specialized accelerators). This is the tier where ops costs get real — monitoring, patching, model updates, the GPU driver update that takes down production for two hours.

The economic crossovers that matter

From real client numbers, a moderate-volume workload (10M requests/day, 300-token average response, Llama 3.1 70B) costs roughly $5K/day on a managed API (where one is available), $2-3K/day on serverless GPU, $1K/day on dedicated rental at 70% utilization, and $400/day on amortized owned hardware at 85% utilization.
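Those daily figures are easier to compare as a price per million response tokens. The sketch below uses only the numbers quoted above, taking $2,500 as an assumed midpoint for the serverless range. It counts response tokens only, since prompt length is not given, so treat the results as rough.

```python
REQUESTS_PER_DAY = 10_000_000
AVG_RESPONSE_TOKENS = 300

# Daily costs in USD from the paragraph above.
DAILY_COST_USD = {
    "managed API": 5000,
    "serverless GPU": 2500,  # assumed midpoint of the $2-3K range
    "dedicated rental (70% util)": 1000,
    "owned hardware (85% util)": 400,
}


def usd_per_million_tokens(daily_cost_usd: float) -> float:
    """Daily cost divided by millions of response tokens served per day."""
    tokens_per_day = REQUESTS_PER_DAY * AVG_RESPONSE_TOKENS  # 3 billion
    return daily_cost_usd / (tokens_per_day / 1_000_000)


for tier, cost in DAILY_COST_USD.items():
    print(f"{tier:30s} ${usd_per_million_tokens(cost):.2f} per 1M response tokens")
```

The workload serves 3 billion response tokens a day, so the tiers run from about $1.67 per million tokens on a managed API down to about $0.13 on owned hardware.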

Each step down cuts cost by roughly 50-60% and brings 3-5x the ops burden. The savings only make sense if you can't use the lower-ops tier (model choice, privacy) or if the added ops cost is marginal for you (existing infra team, existing hardware).

Model selection changes the calculus

Open models keep getting better per parameter. Llama 3.3 70B, released in December 2024, delivers quality comparable to Llama 3.1 405B, which had shipped only about five months earlier. This shrinks the hardware footprint required for a given quality bar, and thus shrinks the break-even utilization. Every year the Tier 3 advantage grows for teams that keep up with model releases.
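A back-of-the-envelope sketch of what that footprint change means for GPU count. It assumes 16-bit weights (2 bytes per parameter) and 80 GB GPUs, and counts weights only. KV cache, activations, and batching headroom are ignored, so real deployments need more than this lower bound.

```python
import math


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone: 1e9 params * bytes each = that many GB."""
    return params_billions * bytes_per_param


def min_gpus_for_weights(params_billions: float,
                         gpu_memory_gb: float = 80.0,
                         bytes_per_param: float = 2.0) -> int:
    """Lower bound on GPU count: just enough memory to hold the weights."""
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / gpu_memory_gb)


for name, size_b in [("70B", 70), ("405B", 405)]:
    print(f"{name}: {weight_memory_gb(size_b):.0f} GB of weights, "
          f"at least {min_gpus_for_weights(size_b)} x 80 GB GPUs")
```

Under these assumptions a 70B model needs at least 2 GPUs and a 405B model at least 11, so reaching the same quality bar with the smaller model cuts the minimum hardware by more than 5x.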

Our current recommendation: Tier 1 as default; Tier 2 when model choice forces open weights or when Tier 1 pricing gets bad at scale; Tier 3 only when you have genuine scale (consistent multi-million requests/day), an ops team, and a cost model that shows meaningful savings. Don't skip Tier 2 on the way to Tier 3.
