You wouldn't run a web service without Datadog or New Relic. You wouldn't run a database without query logs. But the LLM in your production system? There's a decent chance you're running it blind. No traces, no cost attribution, no quality dashboards, no replay logs. This is the norm at companies where AI was shipped fast, and it's the first thing we fix when called in to stabilize a production AI system.

This post is a head-to-head comparison of the LLM observability options available in 2026: what each does well, where the tradeoffs are, and how to compose them. The goal: a shopping list you can take to your next planning meeting.

Tool matrix

Six observability options across five dimensions. Langfuse wins on breadth for most use cases; specialize where a single dimension matters most.

What you need to see in production

Before picking tools, be clear on what you're measuring. A complete LLM observability stack answers seven questions:

Cost per request, per user, per feature, trending over time.
Latency end-to-end and per-step, with percentile distributions.
Quality: success rate on evals, in production, by category.
Errors: rate, categories, impacted users.
User feedback: thumbs up/down, follow-up patterns, escalations.
Token usage: input and output distributions, prompts closest to limits.
Traces: the full request flow for any specific interaction, reproducible.

A good observability stack answers all seven without friction. A bad one answers two and makes the rest excruciating to reconstruct.
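One way to make the seven questions concrete is to check whether a single logged call carries enough fields to answer them. The sketch below is a minimal illustration, not any vendor's schema: `LLMCallRecord`, its field names, and `missing_signals` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LLMCallRecord:
    """One logged LLM call. Field names are illustrative, not a vendor schema."""
    trace_id: str                        # traces: ties all steps of one interaction together
    feature: str                         # cost: attribution per feature
    user_id: str                         # cost and errors: attribution per user
    input_tokens: int                    # token usage
    output_tokens: int                   # token usage
    cost_usd: float                      # cost per request
    latency_ms: float                    # latency for this call
    error: Optional[str] = None          # errors: category, None on success
    eval_passed: Optional[bool] = None   # quality: None if the call was not evaluated
    feedback: Optional[int] = None       # user feedback: +1, -1, or None
    steps: list = field(default_factory=list)  # per-step spans needed for replay


def missing_signals(record: LLMCallRecord) -> list:
    """List the optional signals this record cannot answer yet."""
    missing = []
    if record.eval_passed is None:
        missing.append("quality")
    if record.feedback is None:
        missing.append("user_feedback")
    if not record.steps:
        missing.append("trace_steps")
    return missing
```

Running `missing_signals` over a sample of production records is a quick way to see which of the seven questions your current logging would struggle to answer.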

The options

Langfuse

Open-source, self-hostable, broad feature set. Handles tracing, eval integration, dataset management, cost tracking. Our current default for most client deployments. Strengths: no vendor lock-in, active development, good integration with LangChain/LangGraph ecosystems. Weaknesses: self-hosting has real operational cost; managed cloud option is good but more expensive than you'd expect at scale.

Braintrust

Commercial, focused on eval-driven development. Excellent dataset and scoring tooling. Strengths: strong eval workflows, excellent comparison views between model versions. Weaknesses: commercial-only, pricing at scale, less focus on cost tracking compared to Langfuse.

LangSmith

LangChain's observability product. Strengths: deep integration with LangChain/LangGraph, excellent for teams already in that ecosystem. Weaknesses: tight coupling to LangChain makes it less ideal for pure API-based setups, and pricing scales aggressively at enterprise tier.

Helicone

Drop-in gateway that logs everything. Simplest to set up — a one-line code change. Strengths: zero-friction onboarding, good cost and latency dashboards. Weaknesses: less sophisticated eval tooling than Langfuse or Braintrust.

Arize Phoenix

Open-source, research-friendly, strong embedding and retrieval debugging. Strengths: excellent for RAG debugging — visualize embedding clusters, retrieval hit/miss patterns. Weaknesses: operational setup for self-hosting is non-trivial.

OpenTelemetry + custom dashboards

The escape hatch: emit OpenInference-compatible traces to your existing OpenTelemetry stack (Grafana, Datadog, etc.) and build custom dashboards. Strengths: full control, integration with your existing ops stack, no AI-specific vendor. Weaknesses: significant build cost, requires in-house observability expertise.

Our default stack for clients

For most production deployments we recommend Langfuse (self-hosted from their Docker image in your cloud) for traces, datasets, evals, and cost tracking, supplemented with Sentry for error monitoring and Grafana for infrastructure-level dashboards. This gives comprehensive LLM observability without vendor lock-in, at a predictable infrastructure cost of roughly $200-$800/month depending on volume.

For clients who want less operational overhead, managed Langfuse Cloud or Helicone both work well. Budget $500-$3,000/month for managed options depending on volume.

Integration patterns

Two ways to instrument:

Proxy-based (Helicone, LiteLLM-proxy): route all LLM calls through a proxy that logs everything. One-line setup, minimal code changes. Best for getting baseline observability fast.
SDK-based (Langfuse, Braintrust): instrument via their SDK at the call site. More granular (you can annotate with user ID, feature flags, etc.) but requires touching every call site. Best for long-term observability with rich context.

Most of our deployments use SDK-based for critical paths and proxy-based for catch-all. The combination gives rich context where it matters and baseline coverage for everything else.
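The difference between the two patterns can be sketched without any vendor SDK. In this minimal, offline illustration, `TRACE_LOG`, `traced_call`, `catch_all`, and `fake_llm` are hypothetical names: `catch_all` plays the role of the proxy (baseline logging, no context), while calling `traced_call` directly with `context=` plays the role of call-site SDK instrumentation.

```python
import time
from functools import wraps

TRACE_LOG = []  # hypothetical in-memory sink; a real setup ships entries to a backend


def traced_call(fn, *args, context=None, **kwargs):
    """Call fn and log latency, outcome, and any call-site context."""
    entry = {"name": fn.__name__, "context": dict(context or {}), "error": None}
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        entry["error"] = type(exc).__name__
        raise
    finally:
        # Runs on success and failure, so every call leaves a log entry.
        entry["latency_ms"] = (time.perf_counter() - start) * 1000
        TRACE_LOG.append(entry)


def catch_all(fn):
    """Proxy-style wrapper: baseline logging with no call-site context."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        return traced_call(fn, *args, **kwargs)
    return wrapper


def fake_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return f"echo: {prompt}"


# Catch-all path: wrap once, every call is logged with empty context.
llm = catch_all(fake_llm)
llm("hello")

# Critical path: the call site supplies rich context explicitly.
traced_call(fake_llm, "summarize this", context={"user_id": "u42", "feature": "summarize"})
```

Both paths write to the same log, which is the point of combining them: the critical-path entries carry user and feature context, and everything else still shows up with latency and error status.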

Cost attribution

Knowing your total LLM spend isn't enough. You need per-feature, per-user, per-customer attribution. Without it, cost optimization is blind — you can cut cost across the board, but you can't target the workloads that are actually expensive.

Good cost attribution requires tagging every LLM call with feature ID, user ID, and customer ID. Langfuse, Braintrust, and Helicone all support this, either as SDK metadata or as request headers. Build the tagging into your model gateway so it's automatic; relying on engineers to remember tags means 80% of calls will be untagged. See the cost modeling post for how to use this attribution.
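A minimal sketch of gateway-level tagging, assuming a single choke point for all model calls. Everything here is hypothetical: the `ModelGateway` class, the model names, and the prices in `PRICE_PER_1K` are placeholders, and rejecting untagged calls is just one way to make tagging non-optional.

```python
from collections import defaultdict

# Hypothetical (input, output) prices in USD per 1K tokens; use your provider's real rates.
PRICE_PER_1K = {"small-model": (0.0005, 0.0015), "large-model": (0.005, 0.015)}

REQUIRED_TAGS = ("feature_id", "user_id", "customer_id")


class ModelGateway:
    """Single entry point for model calls that refuses untagged requests."""

    def __init__(self, call_model):
        # call_model(model, prompt) must return (text, input_tokens, output_tokens).
        self._call_model = call_model
        self.ledger = []

    def complete(self, model, prompt, *, tags):
        missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
        if missing:
            raise ValueError(f"missing attribution tags: {missing}")
        text, input_tokens, output_tokens = self._call_model(model, prompt)
        in_price, out_price = PRICE_PER_1K[model]
        cost = input_tokens / 1000 * in_price + output_tokens / 1000 * out_price
        self.ledger.append({**tags, "model": model, "cost_usd": cost})
        return text

    def cost_by(self, tag):
        """Total spend grouped by one tag, e.g. 'feature_id' or 'customer_id'."""
        totals = defaultdict(float)
        for row in self.ledger:
            totals[row[tag]] += row["cost_usd"]
        return dict(totals)


def fake_model(model, prompt):
    """Offline stand-in returning (text, input_tokens, output_tokens)."""
    return "ok", 1000, 1000
```

With the ledger in place, `gateway.cost_by("feature_id")` or `gateway.cost_by("customer_id")` gives the per-dimension breakdown that a total spend figure hides.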

Trace retention strategy

Traces are expensive to store. A high-volume system can generate gigabytes daily. Retention strategy:

Full traces (input, output, intermediate steps): retain 7-30 days.
Summary records (metadata, cost, latency, outcome): retain 90-365 days.
Error traces: retain longer (90 days or more) — they're most useful for debugging.
Long-term sample: keep 10-20% of full traces past the normal window to support trend analysis.
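The retention tiers above can be expressed as a small policy function run by a cleanup job. This is a sketch under stated assumptions: the constants pick one point from each range given above, the function names are hypothetical, and hashing the trace ID is one possible way to make the long-term sample deterministic.

```python
import hashlib

FULL_TRACE_DAYS = 30      # upper end of the 7-30 day range
ERROR_TRACE_DAYS = 90     # error traces are kept in full for longer
SUMMARY_DAYS = 365        # upper end of the 90-365 day range
LONG_TERM_SAMPLE = 0.10   # lower end of the 10-20% sample


def in_long_term_sample(trace_id, rate=LONG_TERM_SAMPLE):
    """Deterministic sampling: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64


def retention_action(trace_id, age_days, is_error, sample_rate=LONG_TERM_SAMPLE):
    """Return 'keep_full', 'keep_summary', or 'delete' for a stored trace."""
    if in_long_term_sample(trace_id, sample_rate):
        return "keep_full"  # sampled traces are exempt from expiry
    full_limit = ERROR_TRACE_DAYS if is_error else FULL_TRACE_DAYS
    if age_days <= full_limit:
        return "keep_full"
    if age_days <= SUMMARY_DAYS:
        return "keep_summary"  # drop payloads, keep metadata, cost, latency, outcome
    return "delete"
```

Deterministic sampling matters here because the cleanup job runs repeatedly: a random draw on each run would eventually delete every trace, while a hash of the ID keeps the same subset every time.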

The alert rule that always pays off

Alert on cost per call trending up. Not cost per day — that scales with usage. Cost per call signals prompt bloat, context window growth, or a regression in routing. We catch real issues this way about twice a month per client.
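A minimal sketch of that rule, assuming you can export daily totals of spend and call count. The window sizes and the 15% threshold are illustrative assumptions, not recommended values, and the function names are hypothetical.

```python
from statistics import mean


def cost_per_call(daily):
    """daily: list of (total_cost_usd, call_count) tuples, oldest first."""
    return [cost / calls for cost, calls in daily if calls > 0]


def cost_per_call_alert(daily, baseline_days=7, recent_days=3, threshold=0.15):
    """True when recent cost per call exceeds the baseline by more than threshold."""
    series = cost_per_call(daily)
    if len(series) < baseline_days + recent_days:
        return False  # not enough history to compare
    baseline = mean(series[-(baseline_days + recent_days):-recent_days])
    recent = mean(series[-recent_days:])
    return recent > baseline * (1 + threshold)


# Usage doubles but cost per call is flat: no alert.
growth = [(10.0, 1000)] * 7 + [(20.0, 2000)] * 3
# Same usage, but each call got 30% more expensive: alert.
bloat = [(10.0, 1000)] * 7 + [(13.0, 1000)] * 3
```

The two sample series show why the ratio is the right signal: `growth` doubles daily spend without tripping the alert, while `bloat` trips it even though call volume never changed.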

Dashboards that matter

Three dashboards every LLM system needs:

Live operations: request rate, error rate, p50/p95/p99 latency, rolling eval pass rate. Check hourly.
Cost: spend today, trend vs last week, top features by cost, top users by cost. Check weekly.
Quality: eval pass rates by category, user feedback trends, escalation rates. Check weekly.

More than three becomes noise. Fewer and you're missing critical signals. Three, reviewed on cadence, catches 90% of issues before users do.
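As an illustration of what the live operations dashboard computes, here is a small sketch that turns a window of request records into the panel values. The record keys (`latency_ms`, `error`, `eval_passed`) and the function names are hypothetical, and the nearest-rank percentile is one common definition among several.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def live_ops_panel(requests):
    """requests: dicts with 'latency_ms', 'error' (bool), 'eval_passed' (bool or None)."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [r["latency_ms"] for r in requests]
    evaluated = [r for r in requests if r["eval_passed"] is not None]
    pass_rate = None
    if evaluated:
        pass_rate = sum(r["eval_passed"] for r in evaluated) / len(evaluated)
    return {
        "requests": len(requests),
        "error_rate": sum(r["error"] for r in requests) / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "eval_pass_rate": pass_rate,  # None when nothing in the window was evaluated
    }
```

Reporting percentiles rather than a mean keeps slow-tail requests visible; a handful of very slow calls barely moves an average but shows up immediately at p99.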

Observability as incident prevention

Good observability prevents incidents by catching drift before it hurts. Bad observability means you find out from a user. Our incident response post covers the downstream playbook — but the single biggest incident prevention lever is this: have dashboards, check them weekly, alert on the obvious signals, and review trends monthly. Invisibly boring. Extremely effective.

You ship the AI you can see. What you can't see breaks in the dark.

Closing

LLM observability is no longer a nice-to-have. It's the instrumentation every production AI system needs. Pick one of the options above (we'd pick Langfuse self-hosted for most), build cost attribution in, set up the three dashboards, and review them weekly. This is a 2-week investment that prevents a month of firefighting later. It's the single highest-ROI engineering investment in any AI system beyond eval infrastructure.
