AI agents had the best demos of 2024. GPT with tools, LangGraph workflows, multi-agent systems completing tasks end-to-end — the videos looked like the future. Two years later, a lot of those same teams are quietly pulling agents out of production and replacing them with simpler deterministic flows. What happened in between is what this post is about.

We've shipped and maintained agentic systems for a dozen clients. Some succeed wonderfully. Others fail in ways that are instructive. The difference between success and failure is never the model — it's how the team handled seven specific failure modes that only show up past the 10,000-calls-a-day mark. This is the playbook for handling each.

Architecture

A production agent is an orchestrator with guard nodes between the LLM and its tools, an explicit state store, and a budget, eval, and observability layer underneath. Remove any one of those components and you get the failure modes below.

Failure 1: Silent drift

Day one: the agent completes tasks at an 85% success rate on the eval set. Day 30: the success rate has dropped to 68%. Nobody noticed, because user complaints never spiked; the failures were spread thinly across many users. By the time the regression is visible, it is deep, and the root cause (a model update, a tool change, a prompt tweak weeks ago) is hard to reconstruct.

Fix: treat agents like any other production system. Comprehensive eval infrastructure running continuously. Per-step scoring, not just end-to-end. Alerts when pass rate drops more than 3 points week-over-week. This catches drift within a week instead of within a quarter.
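As a rough illustration of the week-over-week alert, here is a minimal sketch. The `WeeklyEval` record and `drift_alerts` function are hypothetical names, and the pass counts are made-up illustration data. The same check could be run per step rather than per run by feeding it per-step pass counts.

```python
from dataclasses import dataclass


@dataclass
class WeeklyEval:
    week: str    # any label for the eval window (hypothetical format)
    passed: int  # eval cases that passed
    total: int   # eval cases that ran

    @property
    def pass_rate(self) -> float:
        """Pass rate in percentage points (0 to 100)."""
        return 100.0 * self.passed / self.total if self.total else 0.0


def drift_alerts(history: list[WeeklyEval], max_drop_points: float = 3.0) -> list[str]:
    """Return one message per week whose pass rate fell by more than
    max_drop_points compared with the week before it."""
    alerts = []
    for prev, cur in zip(history, history[1:]):
        drop = prev.pass_rate - cur.pass_rate
        if drop > max_drop_points:
            alerts.append(
                f"{cur.week}: pass rate fell {drop:.1f} points "
                f"({prev.pass_rate:.1f}% -> {cur.pass_rate:.1f}%)"
            )
    return alerts


# Illustration data only: a 2-point dip stays quiet, a 5-point dip alerts.
history = [
    WeeklyEval("week-1", 85, 100),
    WeeklyEval("week-2", 83, 100),
    WeeklyEval("week-3", 78, 100),
]
for message in drift_alerts(history):
    print(message)
```

Comparing each week only to the previous one can miss a slow slide of two points a week, so in practice it may be worth also comparing against a fixed baseline.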

Failure 2: Infinite loops

The agent calls tool A, sees it didn't work, calls tool A again, sees it didn't work again, and so on. The model has no concept of 'I already tried this.' At 100K calls/day, a 1% loop rate means 1,000 loops/day, each burning tokens until it hits a retry limit or exhausts the context window.

Fix: hard budgets on iterations and cost. Every agent run gets a max-steps and max-cost budget enforced outside the LLM. If the budget is hit, the run terminates and logs an explicit 'budget exceeded' error. This keeps rogue runs from silently burning $50 of API calls before someone notices.
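A budget enforced outside the LLM might look something like the sketch below. `RunBudget`, `BudgetExceeded`, and `run_agent` are illustrative names, and `step_fn` stands in for whatever executes one LLM-plus-tool step and reports its cost; none of this is tied to a particular framework.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Raised by the orchestrator when a run is out of steps or money."""


@dataclass
class RunBudget:
    max_steps: int
    max_cost_usd: float
    steps: int = 0
    cost_usd: float = 0.0

    def check(self) -> None:
        """Called before every step; the model never sees or controls this."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"budget exceeded: step limit {self.max_steps} reached")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded(
                f"budget exceeded: ${self.cost_usd:.2f} spent, limit ${self.max_cost_usd:.2f}"
            )

    def record(self, step_cost_usd: float) -> None:
        """Called after every step with that step's measured cost."""
        self.steps += 1
        self.cost_usd += step_cost_usd


def run_agent(step_fn: Callable[[], Tuple[bool, float]], budget: RunBudget) -> str:
    """Run steps until the agent reports done or the budget stops it.
    step_fn is assumed to return (done, cost_in_usd) for one step."""
    while True:
        try:
            budget.check()
        except BudgetExceeded as exc:
            print(f"run terminated: {exc}")  # explicit, searchable log line
            return "budget_exceeded"
        done, cost = step_fn()
        budget.record(cost)
        if done:
            return "completed"


# A stuck agent that never finishes is cut off after five steps.
stuck = RunBudget(max_steps=5, max_cost_usd=1.00)
status = run_agent(lambda: (False, 0.02), stuck)
```

Because cost is only known after a step finishes, the cost limit in this sketch can be overshot by at most one step; the step limit is exact.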

Failure 3: Tool errors get ignored

Tool returns an error. The LLM doesn't always notice, and even when it does, it often makes up a plausible response rather than handling the error cleanly. The agent proceeds as if the tool succeeded, and downstream state becomes garbage.

Fix: structured tool results with explicit success/error branches. Your orchestration framework (LangGraph or a custom one) routes errors to an error-handler node rather than passing raw error text to the LLM. The LLM should see only the success path or a clean error message; it should not have to parse 'null reference exception at line 42.'
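A framework-agnostic sketch of that split follows. `ToolResult`, `call_tool`, `route`, and the node names `"llm"` and `"error_handler"` are assumptions made for illustration, not LangGraph APIs, and `lookup_order` is a made-up tool.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

log = logging.getLogger("agent.tools")


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error_code: Optional[str] = None  # short machine-readable code
    detail: Optional[str] = None      # raw error text, for logs only


def call_tool(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> ToolResult:
    """Wrap a tool so that failures become data instead of free text."""
    try:
        return ToolResult(ok=True, value=fn(*args, **kwargs))
    except Exception as exc:
        return ToolResult(ok=False, error_code=type(exc).__name__, detail=repr(exc))


def route(result: ToolResult) -> Tuple[str, str]:
    """Return (next_node, message_for_llm). Raw error detail goes to the
    log; the model only ever sees a clean, fixed message."""
    if result.ok:
        return "llm", f"Tool result: {result.value}"
    log.warning("tool failed: %s (%s)", result.error_code, result.detail)
    return "error_handler", "The tool call failed and returned no data. Do not assume it succeeded."


def lookup_order(order_id: str) -> dict:
    """Hypothetical tool used only to exercise both branches."""
    orders = {"A1": {"status": "shipped"}}
    return orders[order_id]  # raises KeyError for unknown ids


ok_node, ok_msg = route(call_tool(lookup_order, "A1"))
err_node, err_msg = route(call_tool(lookup_order, "ZZ"))
```

The branch is taken on `result.ok`, a boolean the orchestrator controls, rather than on the model's reading of an error string.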

Failure 4: State explosion

Agents accumulate state as they work — tool outputs, intermediate reasoning, user turns. At turn 20 of a long session, the context window contains 50K tokens of prior state. The model gets slower, more expensive, and worse at focusing on the current task.

Fix: explicit state management outside the LLM. Don't put everything in the context window. Use a summarization step to compact older state. Use structured memory (Redis, Postgres) to hold state between turns. The LLM sees a small, relevant window at each step, not the cumulative history.
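One way to sketch the compaction step is below. The in-memory `STORE` dict stands in for Redis or Postgres, and `summarize` is a crude placeholder that truncates text; a real system would presumably call an LLM there. All names and the `keep_recent` default are assumptions.

```python
from dataclasses import dataclass, field


def summarize(parts: list[str]) -> str:
    """Placeholder summarizer: keeps the first 40 characters of each part.
    A real implementation would likely be an LLM call."""
    return " | ".join(part[:40] for part in parts)


@dataclass
class SessionState:
    summary: str = ""                                # compacted older history
    recent: list[str] = field(default_factory=list)  # last few turns, verbatim


# Stand-in for an external store such as Redis or Postgres.
STORE: dict[str, SessionState] = {}


def add_turn(session_id: str, turn: str, keep_recent: int = 4) -> None:
    """Append a turn; fold anything older than keep_recent into the summary."""
    state = STORE.setdefault(session_id, SessionState())
    state.recent.append(turn)
    if len(state.recent) > keep_recent:
        overflow = state.recent[:-keep_recent]
        state.recent = state.recent[-keep_recent:]
        previous = [state.summary] if state.summary else []
        state.summary = summarize(previous + overflow)


def build_context(session_id: str) -> str:
    """What the LLM sees at this step: one summary line plus recent turns."""
    state = STORE[session_id]
    header = f"Summary of earlier turns: {state.summary}" if state.summary else ""
    return "\n".join(line for line in [header, *state.recent] if line)


for i in range(20):
    add_turn("session-1", f"turn {i}")
context = build_context("session-1")
```

After twenty turns the context built here is five lines long rather than twenty; the full history can still be kept in the store for audit or replay.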

Failure 5: Non-determinism breaks debugging

Agents are stochastic. Run the same input twice, get different outputs. When something goes wrong in production, you can't always reproduce it. Debugging becomes archaeology.

Fix: deterministic replay logging. Capture every LLM input, tool call, and result with timestamps and request IDs. Build a replay tool that can rerun any production trace with the exact same inputs. Use temperature 0 for replay runs, and compare against the recorded outputs, since temperature 0 alone does not guarantee identical results. This cuts bug reproduction from hours to minutes.
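A minimal sketch of the record side and one possible replay side follows. Note the swap: instead of re-calling the model at temperature 0, this `Replayer` serves the recorded outputs back in order, which is a stricter variant of the replay described above. `TraceRecorder`, `Replayer`, and the event fields are illustrative names, and inputs and outputs are assumed to be JSON-serializable.

```python
import json
import time


class TraceRecorder:
    """Appends one event per LLM call or tool call."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def record(self, request_id: str, kind: str, name: str, inputs: dict, output) -> None:
        self.events.append({
            "request_id": request_id,
            "ts": time.time(),
            "kind": kind,        # "llm" or "tool"
            "name": name,
            "inputs": inputs,
            "output": output,
        })

    def dump(self) -> str:
        """Serialize as JSON Lines, one event per line."""
        return "\n".join(json.dumps(event, sort_keys=True) for event in self.events)


class Replayer:
    """Serves recorded outputs in order instead of calling the real LLM or tools."""

    def __init__(self, jsonl: str) -> None:
        self.events = [json.loads(line) for line in jsonl.splitlines() if line]
        self.cursor = 0

    def call(self, kind: str, name: str, inputs: dict):
        if self.cursor >= len(self.events):
            raise RuntimeError("replay ran past the end of the recorded trace")
        event = self.events[self.cursor]
        if (event["kind"], event["name"], event["inputs"]) != (kind, name, inputs):
            # The code under test asked for something the production run did not.
            raise RuntimeError(f"replay diverged at step {self.cursor}: expected {event['name']}")
        self.cursor += 1
        return event["output"]


recorder = TraceRecorder()
recorder.record("req-1", "llm", "planner", {"prompt": "find order A1"}, "call lookup_order")
recorder.record("req-1", "tool", "lookup_order", {"order_id": "A1"}, {"status": "shipped"})
trace = recorder.dump()
```

A divergence error is itself useful: it pinpoints the first step where the current code behaves differently from the production run.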

Failure 6: Permission and auth drift

Agent has access to tools A, B, C. Over weeks, someone adds tool D to the toolkit. No one updates the permission model. Suddenly the agent is touching customer data in ways the security team didn't authorize.

Fix: explicit allowlists per agent, reviewed quarterly. Every tool added to an agent's kit requires a written authorization by an owner. Audit logs of all tool invocations. This feels bureaucratic at three tools; it's critical at twenty.
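In code, an allowlist with named approvals and an audit trail could be as small as the sketch below. `AgentPolicy`, `ToolGrant`, and the 90-day approximation of "quarterly" are assumptions for illustration; the agent, tool, and approver names are invented.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timezone
from typing import Any, Callable


class ToolNotAuthorized(Exception):
    pass


@dataclass(frozen=True)
class ToolGrant:
    tool: str
    approved_by: str    # the named owner who signed off
    approved_on: date


@dataclass
class AgentPolicy:
    agent: str
    grants: dict[str, ToolGrant] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def grant(self, tool: str, approved_by: str, approved_on: date) -> None:
        self.grants[tool] = ToolGrant(tool, approved_by, approved_on)

    def invoke(self, tool: str, fn: Callable[..., Any], *args: Any) -> Any:
        """Log every attempt, allowed or not, then enforce the allowlist."""
        allowed = tool in self.grants
        self.audit_log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent,
            "tool": tool,
            "allowed": allowed,
        })
        if not allowed:
            raise ToolNotAuthorized(f"{self.agent} has no grant for {tool}")
        return fn(*args)

    def grants_due_for_review(self, today: date, max_age_days: int = 90) -> list[str]:
        """Tools whose approval is older than roughly one quarter."""
        return [g.tool for g in self.grants.values()
                if (today - g.approved_on).days > max_age_days]


policy = AgentPolicy("support-agent")
policy.grant("search_kb", approved_by="alice", approved_on=date(2025, 1, 1))
answer = policy.invoke("search_kb", lambda q: f"results for {q}", "refund policy")
```

Adding a tool to the toolkit without a matching `grant` call fails at the first invocation, and the denied attempt still lands in the audit log.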

Failure 7: No clean escalation path

Agent hits a case it can't handle. Without an escalation path, it either: (a) hallucinates a plausible-sounding answer, (b) loops trying variations, or (c) returns an error that confuses the user. Pick any of the three — all are bad outcomes.

Fix: explicit 'don't know' or 'needs human' paths. The agent should be trained and prompted to recognize out-of-scope cases and hand off cleanly. For voice AI see our voice AI patterns post; the escalation patterns are similar for chat and async workflows.
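On the orchestration side, the hand-off can be made a first-class outcome rather than an error. The sketch below is one possible shape; the intent names, the 0.7 confidence threshold, and the two-attempt limit are all made-up values, and it assumes some upstream classifier supplies `intent` and `confidence`.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ANSWER = "answer"
    NEEDS_HUMAN = "needs_human"


@dataclass
class AgentReply:
    outcome: Outcome
    text: str
    reason: str = ""  # machine-readable reason, useful for escalation metrics


# Hypothetical scope for a scheduling agent.
IN_SCOPE_INTENTS = {"reschedule_meeting", "cancel_meeting"}

HANDOFF_TEXT = "I can't handle this one reliably, so I'm passing it to a person."


def decide(intent: str, confidence: float, attempts: int,
           min_confidence: float = 0.7, max_attempts: int = 2) -> AgentReply:
    """Choose between answering and a clean hand-off before the model
    gets a chance to guess, loop, or surface a raw error."""
    if intent not in IN_SCOPE_INTENTS:
        return AgentReply(Outcome.NEEDS_HUMAN, HANDOFF_TEXT, reason="out_of_scope")
    if confidence < min_confidence:
        return AgentReply(Outcome.NEEDS_HUMAN, HANDOFF_TEXT, reason="low_confidence")
    if attempts >= max_attempts:
        return AgentReply(Outcome.NEEDS_HUMAN, HANDOFF_TEXT, reason="retries_exhausted")
    return AgentReply(Outcome.ANSWER, f"Proceeding with {intent}.")


reply = decide("dispute_invoice", confidence=0.95, attempts=0)
```

Each of the three bad outcomes listed above maps to one of the `reason` values, which makes escalation rates easy to track per cause.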

When agents actually work well

The agentic pattern isn't dead. It's just not appropriate for everything. Agents shine when:

The task has a clear goal state ("book a meeting that satisfies these constraints").
The environment is bounded (a few tools, a few decisions).
There is a natural verification step (the outcome is checkable).
Latency tolerance is high (background processing, not live UX).

Agents are a bad fit for:

Single-turn simple tasks (use a direct LLM call).
Tight latency budgets (agents add round-trips).
Mission-critical deterministic flows (use code, not agents).
Unbounded environments with hundreds of tools (reliability degrades non-linearly with toolkit size).

Architecture patterns we deploy

A few specific patterns recur across successful agent deployments. These are covered in depth in our LangGraph patterns post, but briefly: state-based routing (decisions flow through typed state, not free-form messages), human-in-loop checkpoints (the agent pauses at risky steps for human approval), retry-with-reflection (on failure, the model reflects on the cause before retrying), parallel tool calls (fan-out for read-only tools), and guard nodes (structural validation before proceeding).
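To make three of those patterns concrete, here is a small framework-free sketch combining a guard node, a human-approval checkpoint, and parallel fan-out. It is not LangGraph code. `ToolCall`, `guard`, `plan_step`, `fan_out`, and the rule that every non-read-only call counts as "risky" are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str
    args: dict
    read_only: bool


def guard(call: ToolCall, schemas: dict[str, set[str]]) -> list[str]:
    """Structural validation: the tool is known and the argument names
    match exactly. Returns a list of problems (empty means valid)."""
    if call.name not in schemas:
        return [f"unknown tool {call.name!r}"]
    if set(call.args) != schemas[call.name]:
        return [f"{call.name}: expected args {sorted(schemas[call.name])}, got {sorted(call.args)}"]
    return []


def plan_step(calls: list[ToolCall], schemas: dict[str, set[str]]) -> dict[str, list]:
    """Sort proposed calls into: run in parallel, pause for approval, reject."""
    plan: dict[str, list] = {"parallel": [], "needs_approval": [], "rejected": []}
    for call in calls:
        problems = guard(call, schemas)
        if problems:
            plan["rejected"].append((call.name, problems))
        elif call.read_only:
            plan["parallel"].append(call)
        else:
            plan["needs_approval"].append(call)  # human-in-loop checkpoint
    return plan


def fan_out(calls: list[ToolCall], tools: dict[str, Callable[..., Any]]) -> list[Any]:
    """Run read-only calls concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda c: tools[c.name](**c.args), calls))


schemas = {"get_calendar": {"user"}, "send_invite": {"user", "slot"}}
proposed = [
    ToolCall("get_calendar", {"user": "ana"}, read_only=True),
    ToolCall("get_calendar", {"user": "bo"}, read_only=True),
    ToolCall("send_invite", {"user": "ana", "slot": "10:00"}, read_only=False),
    ToolCall("drop_table", {}, read_only=False),
]
plan = plan_step(proposed, schemas)
```

Only the read-only batch is safe to run without ordering guarantees, which is why the writes are held back for approval instead of being fanned out with the rest.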

The agent maturity test

Ask the team: 'How do you know an agent run succeeded?' If the answer is 'the user didn't complain,' the agent is not production-ready. Production agents have explicit success criteria, automated evaluation, and clear escalation paths.

Measurement and observability

Agent observability is harder than single-turn LLM observability. You're tracking a trajectory, not a single call. Good observability for agents captures: full trace of every step, duration and cost per step, tool call success rates, final outcome state, and user feedback if any. Langfuse and OpenInference do this well. We cover the full observability stack in a dedicated post.
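The shape of the data matters more than the tool that stores it. A minimal in-process sketch of a trajectory record is below; `Trace` and `Step` are illustrative names, not the Langfuse or OpenInference data model, and the costs are invented numbers.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    name: str
    kind: str             # "llm" or "tool"
    duration_s: float = 0.0
    cost_usd: float = 0.0
    ok: bool = True


@dataclass
class Trace:
    run_id: str
    steps: list[Step] = field(default_factory=list)
    outcome: str = "unknown"            # final outcome state
    user_feedback: Optional[str] = None

    @contextmanager
    def step(self, name: str, kind: str):
        """Time one step and record whether it raised."""
        current = Step(name, kind)
        start = time.perf_counter()
        try:
            yield current
        except Exception:
            current.ok = False
            raise
        finally:
            current.duration_s = time.perf_counter() - start
            self.steps.append(current)

    def summary(self) -> dict:
        tool_steps = [s for s in self.steps if s.kind == "tool"]
        return {
            "steps": len(self.steps),
            "total_cost_usd": round(sum(s.cost_usd for s in self.steps), 6),
            "total_duration_s": sum(s.duration_s for s in self.steps),
            "tool_success_rate": (sum(s.ok for s in tool_steps) / len(tool_steps)
                                  if tool_steps else None),
            "outcome": self.outcome,
        }


trace = Trace("run-1")
with trace.step("plan", "llm") as s:
    s.cost_usd = 0.01
with trace.step("search", "tool") as s:
    s.cost_usd = 0.002
try:
    with trace.step("book", "tool"):
        raise TimeoutError("upstream timeout")
except TimeoutError:
    trace.outcome = "failed"
report = trace.summary()
```

The failed step is still appended to the trace with its duration, so a run that dies halfway leaves a complete record up to the point of failure.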

Closing

Agentic systems in production are engineering problems more than research problems. The models are good enough. What separates successful deployments from failed ones is the handling of the seven failure modes above: eval discipline, hard budgets, structured error handling, state management, replay logging, permission discipline, and clean escalation. Get those right and agents work. Skip any one and they decay in ways that are maddening to debug.

Agents aren't hard because of intelligence. They're hard because production systems are unforgiving, and agents have more surface area to break.
