Self-hosted LLM serving is a different beast from calling a managed API. You own the availability, the scale, the failure modes. Good monitoring is not optional — it's the difference between a system you trust and one that silently degrades. This post covers the specific metrics that matter for vLLM, TGI, and TensorRT-LLM deployments, the dashboards to build, and the alerts that actually wake the right people up.
Why self-hosted monitoring is different
Managed APIs abstract away the infrastructure. You see latency and cost; the provider handles the rest. Self-hosted, you see everything — including all the ways a GPU deployment can silently underperform.
A healthy-looking system can be running at half its potential throughput because of KV cache pressure. A dashboard that doesn't surface this keeps you paying for 2x the compute you need.
Performance metrics
Tokens per second (throughput). The fundamental output metric. Aggregate across all requests; also track per-model if you're serving multiple.
Time to first token (TTFT). User-facing latency. For streaming responses, this is what perceived latency depends on. See the latency budgeting post.
End-to-end latency percentiles: p50, p95, p99. Track all three. p99 tail latency often moves before p50 when the system is stressed; it's the leading indicator.
Queue depth. How many requests are waiting to be served? A rising queue is the clearest sign of capacity pressure. Alert when queue depth stays above a threshold for a sustained period, not on a single spike.
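If your latency data arrives as Prometheus-style cumulative histogram buckets, percentiles have to be estimated rather than read off directly. Here is a minimal sketch of that estimate using linear interpolation inside a bucket, similar in spirit to Prometheus's histogram_quantile. The function name, bucket bounds, and counts are made up for illustration, and the sketch assumes the lowest bucket starts at 0.

```python
from bisect import bisect_left
from typing import Sequence


def estimate_quantile(
    q: float,
    upper_bounds: Sequence[float],
    cumulative_counts: Sequence[int],
) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    upper_bounds holds the finite bucket bounds in ascending order.
    cumulative_counts[i] is the number of observations <= upper_bounds[i],
    followed by one extra entry: the total count (the +Inf bucket).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("expected one count per bound plus the +Inf total")

    total = cumulative_counts[-1]
    if total == 0:
        return float("nan")

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = bisect_left(cumulative_counts, rank)
    if idx >= len(upper_bounds):
        # The rank falls in the +Inf bucket; the best we can report
        # is the highest finite bound.
        return upper_bounds[-1]

    lower = upper_bounds[idx - 1] if idx > 0 else 0.0
    prev_count = cumulative_counts[idx - 1] if idx > 0 else 0
    bucket_count = cumulative_counts[idx] - prev_count
    if bucket_count == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    fraction = (rank - prev_count) / bucket_count
    return lower + (upper_bounds[idx] - lower) * fraction


# Illustrative end-to-end latency buckets, in seconds.
bounds = [0.1, 0.5, 1.0, 2.0]
counts = [10, 60, 90, 99, 100]
p50, p95, p99 = (estimate_quantile(q, bounds, counts) for q in (0.50, 0.95, 0.99))
```

The estimate is only as precise as the bucket boundaries, so choose buckets that bracket your SLO threshold closely.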
GPU health
GPU utilization percentage. Sustained utilization below 50% suggests you're over-provisioned; above 95% suggests you're under-provisioned or hitting memory bandwidth limits.
GPU memory used and free. Out-of-memory is the most common failure mode. Headroom of at least 10% is healthy; less and you're one long prompt away from OOM.
Temperature and thermal throttling events. Throttled GPUs run slower; sustained throttling hurts throughput. Monitor temperature; alert on throttling events.
Power draw. Correlates with actual work. Unexpected dips can indicate idle GPUs or throttling.
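The rules of thumb above can be written down as a small check that runs over averaged GPU readings. This is a sketch only: the GpuSample schema and the function name are hypothetical, the thresholds are the ones from this section, and in practice the readings would come from whatever exporter you use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuSample:
    """One averaged reading for a single GPU (hypothetical schema)."""
    utilization_pct: float   # sustained average over the window
    memory_used_mib: int
    memory_total_mib: int
    throttle_events: int     # thermal throttling events in the window


def gpu_findings(sample: GpuSample) -> List[str]:
    """Apply this section's rules of thumb and return any findings."""
    findings: List[str] = []

    if sample.utilization_pct < 50:
        findings.append("low utilization: possibly over-provisioned")
    elif sample.utilization_pct > 95:
        findings.append("high utilization: possibly under-provisioned")

    free_mib = sample.memory_total_mib - sample.memory_used_mib
    if free_mib / sample.memory_total_mib < 0.10:
        findings.append("memory headroom below 10%")

    if sample.throttle_events > 0:
        findings.append("thermal throttling observed")

    return findings


stressed = GpuSample(utilization_pct=40.0, memory_used_mib=76_000,
                     memory_total_mib=80_000, throttle_events=2)
healthy = GpuSample(utilization_pct=80.0, memory_used_mib=60_000,
                    memory_total_mib=80_000, throttle_events=0)
```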
Serving internals
KV cache hit rate. vLLM and similar serving stacks reuse KV cache across requests with shared prefixes. Higher hit rate means better throughput. A low hit rate suggests the workload isn't benefiting from caching; consider restructuring prompts so that shared content comes first.
Batch size — average and p95. The serving stack batches requests for throughput. Small batches mean underutilized GPU; huge batches increase latency for individual requests. Target depends on your latency/throughput tradeoff.
Request preemptions. vLLM may preempt running requests when KV cache memory runs short, freeing space so the remaining requests can continue; preempted requests resume later at extra cost. Frequent preemption indicates memory pressure that is degrading service quality.
Paged attention block allocation. For vLLM specifically — block utilization tells you how efficiently KV cache memory is being used.
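Cache hit rate and preemption frequency are usually exposed as ever-increasing counters, so the useful number is the change between two scrapes rather than the raw value. Below is a rough sketch of that calculation. The CounterSnapshot fields are placeholders, not real metric names; check your serving stack's metrics output for the actual counters, since names vary by stack and version.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CounterSnapshot:
    """Counter values from one scrape (field names are placeholders)."""
    timestamp_s: float
    cache_queries: int   # tokens or blocks looked up in the prefix cache
    cache_hits: int      # lookups that were served from the cache
    preemptions: int     # cumulative preempted requests


def window_stats(
    prev: CounterSnapshot, curr: CounterSnapshot
) -> Optional[Dict[str, Optional[float]]]:
    """Derive per-window rates from two counter snapshots.

    Returns None if the window is invalid or a counter went backwards,
    which usually means the server restarted between scrapes.
    """
    elapsed_s = curr.timestamp_s - prev.timestamp_s
    d_queries = curr.cache_queries - prev.cache_queries
    d_hits = curr.cache_hits - prev.cache_hits
    d_preempt = curr.preemptions - prev.preemptions

    if elapsed_s <= 0 or min(d_queries, d_hits, d_preempt) < 0:
        return None

    return {
        # No lookups in the window means the hit rate is undefined.
        "cache_hit_rate": d_hits / d_queries if d_queries > 0 else None,
        "preemptions_per_min": d_preempt * 60 / elapsed_s,
    }


before = CounterSnapshot(timestamp_s=0.0, cache_queries=1_000, cache_hits=400, preemptions=3)
after = CounterSnapshot(timestamp_s=60.0, cache_queries=3_000, cache_hits=1_900, preemptions=9)
stats = window_stats(before, after)
```

Computing the rate per window, instead of dividing lifetime totals, keeps a long healthy history from masking a recent drop in hit rate.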
Alerts that matter
p95 latency breach. Your SLO defines the threshold; alert when p95 stays above it for more than a few minutes.
Error rate above baseline. Track HTTP 500 errors, OOM errors, and model-load failures. The baseline for these should be near zero, so any sustained non-zero rate warrants attention.
Queue depth sustained above threshold. Waiting users are unhappy users.
Free GPU memory below threshold (say, 5%). OOM is imminent.
Unusual cost patterns. Token throughput rising without a matching rise in request volume suggests a workload shift, such as longer prompts or longer outputs; worth investigating.
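Several of these alerts share one pattern: fire only when a value has stayed above its threshold for some duration, so a brief spike doesn't page anyone. Alerting systems such as Prometheus express this with a "for" clause on the rule; the sketch below shows the same logic in plain Python. The class name and the example numbers (a queue depth threshold of 20 held for 300 seconds) are assumptions for illustration.

```python
from typing import Optional


class SustainedThreshold:
    """Fires only after a value has exceeded a threshold continuously."""

    def __init__(self, threshold: float, hold_s: float) -> None:
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_start: Optional[float] = None

    def observe(self, timestamp_s: float, value: float) -> bool:
        """Record one sample; return True if the alert should be firing."""
        if value <= self.threshold:
            # Any sample back under the threshold resets the clock.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = timestamp_s
        return timestamp_s - self._breach_start >= self.hold_s


# Hypothetical rule: queue depth above 20 for at least 5 minutes.
queue_alert = SustainedThreshold(threshold=20, hold_s=300)
samples = [(0, 25), (120, 30), (300, 22), (360, 10), (420, 50), (720, 50)]
firing = [queue_alert.observe(t, depth) for t, depth in samples]
```

The hold duration is the tuning knob: too short and spikes page people, too long and users wait while the alert sits pending.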
Dashboards to build
Real-time ops dashboard. Throughput, latency percentiles, error rate, queue depth. On-screen for on-call engineers.
Capacity planning dashboard. Weekly trends on tokens served, GPU utilization, cache hit rates. Drives decisions about scaling up or down.
Per-model cost breakdown. If you serve multiple models, which ones are expensive? Which are underutilized? This informs deprecation and scaling decisions.
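A per-model cost breakdown usually comes down to one number: cost per million tokens served. Here is a back-of-the-envelope sketch, assuming each model has dedicated GPUs billed at a flat hourly rate. The function name, the $2.00 per GPU-hour price, and the token counts are made-up inputs, not benchmarks.

```python
def cost_per_million_tokens(
    gpu_count: int,
    hours: float,
    price_per_gpu_hour: float,
    tokens_served: int,
) -> float:
    """Cost of serving one million tokens over a billing window."""
    if tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    total_cost = gpu_count * hours * price_per_gpu_hour
    return total_cost * 1_000_000 / tokens_served


# Two hypothetical models on identical hardware over one day.
busy_model = cost_per_million_tokens(2, 24, 2.0, 120_000_000)
quiet_model = cost_per_million_tokens(2, 24, 2.0, 6_000_000)
```

With these inputs the quiet model costs 20 times more per token than the busy one on the same hardware, which is the kind of gap this dashboard should make obvious.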
Tooling
Prometheus + Grafana is standard. vLLM, TGI, TensorRT-LLM all expose Prometheus metrics out of the box.
DCGM for GPU metrics. NVIDIA's Data Center GPU Manager (DCGM) exports GPU health metrics to Prometheus through dcgm-exporter.
Datadog, New Relic, and other commercial APM tools all work with self-hosted LLM serving. Pick based on team preference and budget. See the observability stack post.
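Whichever tool you pick, it helps to know what the Prometheus text format looks like, because that is what a serving stack's metrics endpoint returns. The sketch below parses a simplified subset of that format. It ignores timestamps and escaped quotes inside label values, and the metric names in the sample are invented, so treat it as a debugging aid and not a replacement for a real Prometheus client.

```python
import re
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]

# metric_name, optional {labels}, then the sample value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> Dict[Tuple[str, Labels], float]:
    """Parse simple Prometheus text exposition lines into a dict."""
    samples: Dict[Tuple[str, Labels], float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels: Labels = tuple(sorted(_LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(raw_value)
    return samples


# Invented metric names, shaped like a typical metrics response.
SAMPLE = """\
# HELP demo_requests_waiting Requests waiting in the queue.
# TYPE demo_requests_waiting gauge
demo_requests_waiting{model="demo-7b"} 7
demo_generation_tokens_total{model="demo-7b"} 123456
demo_up 1
"""
parsed = parse_metrics(SAMPLE)
queue_depth = parsed[("demo_requests_waiting", (("model", "demo-7b"),))]
```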