Self-host or use a managed API? The question gets asked every quarter at every client. The answer has genuinely changed over the last two years — Fal, Modal, Replicate, RunPod, and friends have turned bursty GPU hosting from a research project into a line item. This post is the decision tree and the real numbers behind each branch.
The three tiers
Tier 1: Managed API (OpenAI, Anthropic, Google)
Zero ops. Prices well-known. Scales from 0 to 10M requests/day without your involvement. Limitations: fine-tuning is limited or unavailable for many of these models (Anthropic's first-party API, notably); you can't run offline or in an air-gapped environment; data leaves your network boundary (with mitigations: BAA, enterprise tiers, Azure OpenAI). Cost model: per-token.
When to pick: default for anything that fits. Around 80% of our client projects never need to leave this tier.
Tier 2: Serverless GPU (Fal, Modal, Replicate, Together, Fireworks)
You provide the model (open weights or your own fine-tune); the platform hosts it and exposes it as an API. You pay per second of GPU time, with cold starts that typically range from a few seconds to minutes depending on model size and platform. Great for bursty workloads, hosting open models without operating GPUs, and prototyping new open models.
Cost model: per GPU-second. A single A100 is roughly $2-4/hour on demand on these platforms; an H100 is $4-8/hour. The math works most clearly when your average utilization is below about 20%: you pay only while serving, with no idle cost.
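As a rough sketch of the per-GPU-second model, the snippet below computes what one GPU costs per day at a given average utilization. The $3/hour rate is the midpoint of the A100 range quoted above. The function name is made up for illustration, and the assumption that billing is strictly proportional to busy seconds (no minimum charge, no billed cold-start time) is a simplification, not any platform's actual billing rule.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def serverless_daily_cost(hourly_rate_usd: float, utilization: float) -> float:
    """Daily cost of one serverless GPU that is billed only for busy seconds.

    Assumes billing scales linearly with busy time, which is a simplification.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    busy_seconds = SECONDS_PER_DAY * utilization
    return busy_seconds * hourly_rate_usd / 3600


# $3/hour is the midpoint of the $2-4/hour A100 range quoted in the text.
for u in (0.05, 0.20, 0.50):
    print(f"{u:.0%} utilization: ${serverless_daily_cost(3.0, u):.2f}/day")
```

At 20% utilization the GPU costs about a fifth of what the same card would cost if it were billed around the clock at the same rate, which is the whole appeal of this tier.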
When to pick: any open-weights model where you can't or won't go to Tier 1; bursty workloads with high variance; teams without dedicated infra capacity.
Tier 3: Dedicated or owned GPUs
You rent GPU instances long-term (reserved instances on hyperscalers, dedicated instances on specialized GPU clouds) or you own hardware in a colo. Pay per-hour for the box whether you use it or not.
Cost model: the break-even vs serverless depends on utilization. At ~60-70% average utilization (i.e., the GPU is actively serving roughly 14-17 hours out of 24), dedicated rental beats serverless. Above 70%, owned hardware in colo beats rental over a 2-3 year amortization. Below 40%, serverless is almost always cheaper.
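The break-even point falls out of a simple ratio: a dedicated box costs its hourly rate regardless of load, while serverless costs its hourly rate times utilization. A minimal sketch follows. The $2.60 and $4.00 hourly rates are hypothetical values chosen only so the result lands in the 60-70% band described above, and the function names are illustrative.

```python
def break_even_utilization(dedicated_hourly_usd: float,
                           serverless_hourly_usd: float) -> float:
    """Utilization above which an always-on dedicated GPU beats serverless.

    Dedicated cost per hour is fixed. Serverless cost per hour is
    serverless_hourly_usd * utilization. Setting the two equal gives
    utilization = dedicated_hourly_usd / serverless_hourly_usd.
    """
    if serverless_hourly_usd <= 0:
        raise ValueError("serverless rate must be positive")
    return dedicated_hourly_usd / serverless_hourly_usd


def cheaper_option(utilization: float,
                   dedicated_hourly_usd: float,
                   serverless_hourly_usd: float) -> str:
    """Return which option is cheaper at the given average utilization."""
    threshold = break_even_utilization(dedicated_hourly_usd, serverless_hourly_usd)
    return "dedicated" if utilization > threshold else "serverless"


# Hypothetical rates: $2.60/hour reserved vs $4.00/hour serverless.
print(f"break-even: {break_even_utilization(2.60, 4.00):.0%}")
print(cheaper_option(0.30, 2.60, 4.00))
print(cheaper_option(0.80, 2.60, 4.00))
```

This ignores ops cost entirely, so the real break-even sits higher than the ratio suggests.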
When to pick: high-volume, steady-state open-model workloads; cases where privacy requires on-premise; cases where you need custom hardware (very large models, specialized accelerators). This is the tier where ops costs get real — monitoring, patching, model updates, the GPU driver update that takes down production for two hours.
The economic crossovers that matter
From real client numbers, a moderate-volume workload (10M requests/day, 300-token avg response, Llama 3.1 70B) costs roughly $5K/day on a managed API (if available), $2-3K/day on serverless GPU, $1K/day on dedicated rental at 70% utilization, and $400/day on amortized owned hardware at 85% utilization.
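To compare these figures with per-token price sheets, it helps to convert them to a unit cost. The sketch below does that arithmetic using only the numbers above. It counts output tokens only, since the workload description gives a response length but no prompt length, and it takes $2,500 as the midpoint of the serverless range; both are simplifying assumptions.

```python
REQUESTS_PER_DAY = 10_000_000
AVG_RESPONSE_TOKENS = 300

# Daily cost per tier, taken from the figures in the text.
# Serverless uses the midpoint of the quoted $2-3K range.
DAILY_COST_USD = {
    "managed API": 5_000,
    "serverless GPU": 2_500,
    "dedicated rental": 1_000,
    "owned hardware": 400,
}


def cost_per_million_tokens(daily_cost_usd: float, tokens_per_day: int) -> float:
    """Convert a daily bill into dollars per million output tokens."""
    return daily_cost_usd / (tokens_per_day / 1_000_000)


# 10M requests x 300 tokens = 3 billion output tokens per day.
tokens_per_day = REQUESTS_PER_DAY * AVG_RESPONSE_TOKENS

for tier, cost in DAILY_COST_USD.items():
    unit = cost_per_million_tokens(cost, tokens_per_day)
    print(f"{tier}: ${unit:.2f} per million output tokens")
```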
Each step down is roughly a 50-60% cost reduction at 3-5x the ops burden. The savings only make sense if you either can't use the lower-ops tier (model choice, privacy) or if ops cost is marginal for you (existing infra team, existing hardware).
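One way to make that trade-off concrete is to subtract the added ops spend from the infrastructure savings. The sketch below uses the serverless ($2,500/day midpoint) and dedicated rental ($1,000/day) figures from above. The two ops-cost values are purely hypothetical placeholders for "existing infra team" and "no infra team", not client data.

```python
def net_daily_savings(current_cost_usd: float,
                      next_tier_cost_usd: float,
                      added_ops_cost_usd: float) -> float:
    """Infrastructure savings from moving down a tier, minus the extra ops spend."""
    return (current_cost_usd - next_tier_cost_usd) - added_ops_cost_usd


# Hypothetical: an existing infra team absorbs most of the extra work.
with_team = net_daily_savings(2_500, 1_000, added_ops_cost_usd=200)

# Hypothetical: a new on-call rotation and hires are needed to run the boxes.
without_team = net_daily_savings(2_500, 1_000, added_ops_cost_usd=1_800)

print(f"with existing team: ${with_team:,.0f}/day")
print(f"without one:        ${without_team:,.0f}/day")
```

With these placeholder numbers the same move saves money for one team and loses money for the other, which is the point of the paragraph above.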
Model selection changes the calculus
Open models keep getting better per parameter. Meta positions Llama 3.3 70B (December 2024) as comparable in quality to Llama 3.1 405B, which was released only about five months earlier, in July 2024. This shrinks the hardware footprint required for a given quality bar, and thus shrinks the break-even utilization. Every year the Tier 3 advantage grows for teams that keep up with model releases.
Our current recommendation: Tier 1 as the default; Tier 2 when model choice forces open weights or when Tier 1 pricing becomes uncompetitive at scale; Tier 3 only when you have genuine scale (consistent multi-million requests/day), an ops team, and a cost model that shows meaningful savings. Don't skip Tier 2 on the way to Tier 3.
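The recommendation can be written down as a small decision function. This is one possible encoding, not a definitive rule set: the `Workload` fields and `recommend_tier` are hypothetical names, the 2,000,000 requests/day threshold is an assumed reading of "multi-million", and the on-premise shortcut comes from the Tier 3 "when to pick" criteria earlier in the post.

```python
from dataclasses import dataclass

# Assumed reading of "consistent multi-million requests/day".
GENUINE_SCALE_REQUESTS_PER_DAY = 2_000_000


@dataclass
class Workload:
    needs_open_weights: bool      # model choice forces open weights or a custom fine-tune
    must_stay_on_prem: bool       # privacy or air-gap requirement
    requests_per_day: int
    steady_traffic: bool          # consistent load rather than bursty
    has_ops_team: bool
    tier1_too_expensive: bool     # managed pricing is no longer acceptable at this scale
    tier3_savings_modeled: bool   # a cost model shows meaningful savings


def recommend_tier(w: Workload) -> int:
    """Return 1, 2, or 3 following the recommendation in the text."""
    # An on-premise requirement rules out both hosted tiers.
    if w.must_stay_on_prem:
        return 3
    genuine_scale = (w.steady_traffic
                     and w.requests_per_day >= GENUINE_SCALE_REQUESTS_PER_DAY)
    # Tier 3 needs all three: scale, an ops team, and a cost model.
    if genuine_scale and w.has_ops_team and w.tier3_savings_modeled:
        return 3
    if w.needs_open_weights or w.tier1_too_expensive:
        return 2
    return 1


small_app = Workload(False, False, 50_000, False, False, False, False)
print(recommend_tier(small_app))
```

A workload that has the volume but no ops team still lands on Tier 2 here, which matches the advice not to skip it.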