eazyware
Ops·August 5, 2024·10 min read

Self-hosted LLM monitoring: the metrics that matter

GPU utilization, KV cache hit rate, tokens per second, TTFT, queue depth. The specific metrics for vLLM, TGI, and TensorRT-LLM deployments.

Kushal R.
Engineering lead

Self-hosted LLM serving is a different beast from calling a managed API. You own the availability, the scale, the failure modes. Good monitoring is not optional — it's the difference between a system you trust and one that silently degrades. This post covers the specific metrics that matter for vLLM, TGI, and TensorRT-LLM deployments, the dashboards to build, and the alerts that actually wake the right people up.

Metrics that matter
Figure: Self-hosted LLM metrics that matter.
Performance: tokens/sec throughput; TTFT (first token); p50/p95/p99 latency; queue depth.
GPU health: GPU utilization %; GPU memory used/free; temperature/throttling; power draw.
Serving internals: KV cache hit rate; batch size avg/p95; request preemptions; paged attention blocks.
Dashboards to build: top, real-time throughput and latency with p95 alerts wired to on-call; middle, per-GPU drilldown (utilization, memory, temperature); bottom, serving internals for capacity planning (cache hit, batch size); weekly, per-model cost breakdown (tokens × compute × hours).
Three layers: performance (throughput, latency), GPU health (utilization, memory, temperature), serving internals (KV cache, batch size, preemptions).

Why self-hosted monitoring is different

Managed APIs abstract away the infrastructure. You see latency and cost; the provider handles the rest. Self-hosted, you see everything — including all the ways a GPU deployment can silently underperform.

A healthy-looking system can be running at half its potential throughput because of KV cache pressure. A dashboard that doesn't surface this keeps you paying for 2x the compute you need.

Performance metrics

Tokens per second (throughput). The fundamental output metric. Aggregate across all requests; also track per-model if you're serving multiple.

Time to first token (TTFT). User-facing latency. For streaming responses, this is what perceived latency depends on. See the latency budgeting post.

End-to-end latency percentiles: p50, p95, p99. Track all three. p99 tail latency often moves before p50 when the system is stressed; it's the leading indicator.
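To make the tail-latency point concrete, here is a minimal sketch of a nearest-rank percentile over raw request latencies. The function names and the sample numbers are made up for illustration; in practice you would more likely compute these from Prometheus histograms than from raw samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(latencies_s):
    """Return p50, p95 and p99 for a list of latencies in seconds."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}

# Hypothetical window: 98 fast requests and 2 slow ones.
window = [0.2] * 98 + [3.0] * 2
summary = latency_summary(window)
```

In this invented window, p50 and p95 both stay at 0.2 s while p99 jumps to 3.0 s, which is the pattern described above: the tail moves first.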

Queue depth. How many requests are waiting to be served? A rising queue is the clearest sign of capacity pressure. Alert when queue depth stays above a threshold for a sustained period.
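One way to express "sustained" is to require that every sample since some point in time is above the threshold, and that this run is long enough. The sketch below assumes you already have (timestamp, queue depth) samples in time order; the threshold and duration are placeholders, not recommendations.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """samples: list of (unix_ts, value) ordered by time.

    Returns True if the value has stayed above the threshold, without
    dipping back, for at least min_duration_s up to the latest sample.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current run above threshold
        else:
            breach_start = None    # any dip resets the run
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration_s

# Hypothetical one-minute scrapes of queue depth.
rising = [(0, 5), (60, 12), (120, 15), (180, 20), (240, 25), (300, 30)]
spiky = [(0, 12), (60, 3), (120, 15)]
```

With a threshold of 10 and a three-minute window, the rising series triggers and the spiky one does not, so a single burst does not page anyone.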

GPU health

GPU utilization percentage. Sustained utilization below 50% suggests you're over-provisioned; above 95% suggests you're under-provisioned or hitting memory bandwidth limits.

GPU memory used and free. Out-of-memory is the most common failure mode. Headroom of at least 10% is healthy; less and you're one long prompt away from OOM.
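As a rough sketch, headroom can be classified from used and total memory. The 10% "healthy" figure comes from the paragraph above and the 5% "critical" figure from the alert suggested later in this post; the function names and the 80 GiB card are assumptions for the example.

```python
GIB = 1024 ** 3

def free_fraction(used_bytes, total_bytes):
    """Fraction of GPU memory that is still free."""
    return (total_bytes - used_bytes) / total_bytes

def classify_headroom(used_bytes, total_bytes, healthy=0.10, critical=0.05):
    """Map free memory to a coarse status using the post's thresholds."""
    free = free_fraction(used_bytes, total_bytes)
    if free >= healthy:
        return "healthy"
    if free < critical:
        return "critical"
    return "warning"

# Hypothetical 80 GiB GPU at three different usage levels.
statuses = [classify_headroom(u * GIB, 80 * GIB) for u in (60, 74, 77)]
```

The used and total values would normally come from your GPU exporter rather than being hard-coded.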

Temperature and thermal throttling events. Throttled GPUs run slower; sustained throttling hurts throughput. Monitor temperature; alert on throttling events.

Power draw. Correlates with actual work. Unexpected dips can indicate idle GPUs or throttling.

Serving internals

KV cache hit rate. vLLM and similar serving stacks can reuse KV cache across requests that share a prefix. A higher hit rate means less repeated prefill work and better throughput. A low hit rate suggests the workload isn't benefiting from caching; consider restructuring prompts so the shared content comes first.
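Serving stacks usually report cache activity as cumulative counters, so the hit rate for a time window is a ratio of deltas rather than of raw values. A minimal sketch, assuming two snapshots with hypothetical "hits" and "queries" counters (the real counter names depend on your stack and version):

```python
def windowed_hit_rate(prev, curr):
    """prev/curr: dicts holding cumulative 'hits' and 'queries' counters.

    Returns the hit rate over the interval between the two snapshots,
    or None when there was no traffic or a counter was reset.
    """
    d_hits = curr["hits"] - prev["hits"]
    d_queries = curr["queries"] - prev["queries"]
    if d_queries <= 0 or d_hits < 0:
        return None
    return d_hits / d_queries

# Hypothetical snapshots taken one scrape interval apart.
earlier = {"hits": 1000, "queries": 4000}
later = {"hits": 1600, "queries": 5000}
rate = windowed_hit_rate(earlier, later)
```

Here 600 of the 1000 new lookups were hits, so the windowed rate is 0.6 even though the lifetime ratio is lower.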

Batch size — average and p95. The serving stack batches requests for throughput. Small batches mean underutilized GPU; huge batches increase latency for individual requests. Target depends on your latency/throughput tradeoff.

Request preemptions. vLLM may preempt running requests to free KV cache space when memory is under pressure; the preempted requests resume later, which adds latency for them. Frequent preemption indicates memory pressure is affecting service quality.

Paged attention block allocation. For vLLM specifically — block utilization tells you how efficiently KV cache memory is being used.

Alerts that matter

p95 latency breach. Your SLO defines the threshold; alert once the breach has lasted more than a few minutes.

Error rate above baseline. 500 errors, OOM errors, model-load failures. Any sustained non-zero rate warrants attention.

Queue depth sustained above threshold. Waiting users are unhappy users.

Free GPU memory below threshold (say, 5% free). Imminent OOM.

Unusual cost patterns. Token throughput rising without a matching rise in request volume suggests a workload shift, such as longer prompts or longer outputs; worth investigating.
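A simple way to spot this kind of shift is to compare tokens per request against a baseline period. The sketch below is illustrative only: the function names, the 1.25 ratio, and the sample numbers are assumptions.

```python
def tokens_per_request_ratio(base_tokens, base_requests, cur_tokens, cur_requests):
    """Ratio of current tokens-per-request to the baseline value."""
    baseline = base_tokens / base_requests
    current = cur_tokens / cur_requests
    return current / baseline

def workload_shifted(ratio, tolerance=1.25):
    """Flag when requests have grown noticeably heavier than baseline."""
    return ratio > tolerance

# Hypothetical: same request volume, 50% more tokens than last week.
ratio = tokens_per_request_ratio(1_000_000, 2_000, 1_500_000, 2_000)
```

A ratio of 1.5 with flat request volume means each request now costs about half again as much compute as before.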

Dashboards to build

Real-time ops dashboard. Throughput, latency percentiles, error rate, queue depth. On-screen for on-call engineers.

Capacity planning dashboard. Weekly trends on tokens served, GPU utilization, cache hit rates. Drives decisions about scaling up or down.

Per-model cost breakdown. If serving multiple models, which is expensive? Which is underutilized? Information for deprecation and scaling decisions.
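One reading of the per-model breakdown is cost per million tokens: GPU count times hours times an hourly rate, divided by tokens served. The sketch below assumes a flat hourly GPU price; the model names, the $2 rate, and the token counts are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class ModelUsage:
    name: str
    gpu_count: int
    hours: float
    tokens: int

def cost_breakdown(usages, gpu_hourly_usd):
    """Return (name, total_cost, cost_per_million_tokens) rows,
    most expensive per token first."""
    rows = []
    for u in usages:
        cost = u.gpu_count * u.hours * gpu_hourly_usd
        per_million = cost / u.tokens * 1_000_000 if u.tokens else float("inf")
        rows.append((u.name, cost, per_million))
    return sorted(rows, key=lambda r: r[2], reverse=True)

# Hypothetical week (168 hours) for two models at an assumed $2 per GPU-hour.
week = [
    ModelUsage("chat", gpu_count=4, hours=168, tokens=2_000_000_000),
    ModelUsage("summarizer", gpu_count=1, hours=168, tokens=50_000_000),
]
rows = cost_breakdown(week, gpu_hourly_usd=2.0)
```

In this made-up week the summarizer costs less in total but roughly ten times more per token, which is the kind of signal that feeds deprecation and scaling decisions.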

Tooling

Prometheus + Grafana is standard. vLLM and TGI expose Prometheus metrics out of the box; TensorRT-LLM is typically served behind Triton Inference Server, which exposes them as well.
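If you want to sanity-check what a server exposes before wiring up Prometheus, the text format is simple enough to read by hand. The sketch below is a deliberately minimal parser, not a replacement for a real client library: it ignores optional timestamps, and the metric names in the example follow vLLM's naming but should be treated as assumptions and checked against your own endpoint's output.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {series: value}.

    The series key keeps its label set, e.g. 'name{label="x"}'.
    Comment lines (# HELP, # TYPE) are skipped.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        key, _, value = line.rpartition(" ")
        if not key:
            continue
        samples[key] = float(value)
    return samples

def metric_values(samples, name):
    """All values for a metric name, across label sets."""
    return [v for k, v in samples.items() if k == name or k.startswith(name + "{")]

# Example payload; the model name is a placeholder.
EXAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 7.0
vllm:num_requests_running{model_name="example-model"} 12.0
"""
parsed = parse_prometheus_text(EXAMPLE)
```

To try it against a live server, fetch the body of its metrics endpoint (for example with `urllib.request.urlopen`) and pass the decoded text in; the URL and port depend on how you launched the server.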

DCGM for GPU metrics. NVIDIA's Data Center GPU Manager (DCGM), via dcgm-exporter, exports GPU health metrics to Prometheus.

Datadog, New Relic, and other commercial APM tools all work with self-hosted LLM serving. Pick based on team preference and budget. See the observability stack post.
