Reasoning models (o1, o3, DeepSeek-R1, Claude's extended thinking) deliver meaningful quality improvements on specific tasks, at a real cost in money and latency. The hype has swung from 'reasoning is the future' to 'reasoning is overhyped' and back twice in eighteen months. Production use has been more stable than the discourse: we've integrated reasoning models into several client systems, and consistent patterns have emerged for where they earn their cost.
What reasoning models actually do
Reasoning models spend inference-time compute on extended chains of thought before producing their final answer. Where a standard model generates an answer token-by-token directly, a reasoning model generates a longer internal sequence of reasoning steps, then produces the answer. From an API consumer's perspective this looks like longer latency and higher per-call cost; internally it looks like more tokens generated, most of which the user never sees.
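To make the cost side concrete, here is a minimal sketch of per-call cost when the hidden reasoning tokens are billed like output tokens. That billing rule is an assumption to check against your provider's pricing page, and the token counts and prices below are made-up placeholders, not measurements from our systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallUsage:
    """Token counts for one API call (all values here are hypothetical)."""
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int = 0  # hidden chain-of-thought tokens; 0 for a standard model


def call_cost_usd(usage: CallUsage, input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call, ASSUMING reasoning tokens are billed at the output rate."""
    billed_output = usage.visible_output_tokens + usage.reasoning_tokens
    total = usage.input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000  # prices are quoted per million tokens


# Same prompt and same visible answer; only the hidden reasoning differs.
standard = CallUsage(input_tokens=2_000, visible_output_tokens=500)
reasoning = CallUsage(input_tokens=2_000, visible_output_tokens=500, reasoning_tokens=6_000)

# Placeholder prices in USD per million tokens.
standard_cost = call_cost_usd(standard, input_price_per_m=1.0, output_price_per_m=4.0)
reasoning_cost = call_cost_usd(reasoning, input_price_per_m=1.0, output_price_per_m=4.0)
```

With these placeholder numbers the reasoning call costs several times the standard one even though the user sees an answer of the same length, which is the effect described above.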
The quality improvement varies by task. On math competition problems, ARC-AGI-style reasoning, hard coding challenges, and multi-step logic puzzles, reasoning models can score 20-50 points higher than the same-family non-reasoning model. On retrieval, simple chat, straightforward extraction, and summarization, reasoning adds little and often nothing.
Tasks where reasoning actually helps
Complex debugging and root-cause analysis
Reasoning models shine at 'here are logs, code, and a symptom — what's wrong?' tasks. The extended chain lets them consider multiple hypotheses, eliminate some, and converge on the likely cause. In our deployments, reasoning models catch subtle bugs that non-reasoning models miss.
Planning multi-step agent workflows
When an agent needs to plan before acting (see our function calling patterns post), a reasoning model produces better plans. The plan phase is a natural fit for extended thinking; the execution phase can use cheaper models for individual tool calls.
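The plan/execute split can be sketched as follows. This is an illustration under stated assumptions, not a real client: `Planner` and `Executor` are hypothetical callables standing in for a reasoning-tier call and a cheaper-tier call, and the stub functions exist only so the example runs.

```python
from typing import Callable, List

# Hypothetical stand-ins for real model clients.
Planner = Callable[[str], List[str]]  # reasoning tier: task -> ordered list of steps
Executor = Callable[[str], str]       # cheaper tier: one step -> result


def plan_then_execute(task: str, plan: Planner, execute: Executor) -> List[str]:
    """Call the expensive planner once, then run every step on the cheap executor."""
    steps = plan(task)
    return [execute(step) for step in steps]


# Stub implementations so the sketch runs without any API access.
def fake_planner(task: str) -> List[str]:
    return [f"look up {task}", f"summarize {task}"]


def fake_executor(step: str) -> str:
    return f"done: {step}"


results = plan_then_execute("refund policy", fake_planner, fake_executor)
```

The point of the shape is that the reasoning model is invoked once per task, while the number of cheap calls scales with the number of steps.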
Hard structured extraction
This means extracting information from documents where the extraction involves inference: not just 'find the date of signing' but 'what are the termination conditions, given the interaction of clauses 4, 7, and 12'. Reasoning models handle this cleanly where non-reasoning models make confident wrong guesses.
Tight constraint satisfaction
Scheduling, resource allocation, logic puzzles with multiple constraints. Reasoning models can actually check constraints iteratively in the chain; non-reasoning models satisfy 3 of 4 constraints and ship.
Tasks where reasoning is the wrong tool
Conversational chat: latency kills the UX. Retrieval-heavy RAG: reasoning doesn't help when the answer is already in the retrieved context. Classification: reasoning models overthink categorical decisions. Bulk processing: cost per call is 5-30x that of a standard model; reasoning on 10M documents is economic suicide.
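The bulk-processing arithmetic is worth spelling out. In this sketch the 5-30x multiplier and the 10M document count come from the text above, while the per-document cost of the standard model is a hypothetical placeholder you should replace with your own figure.

```python
def bulk_cost_usd(num_docs: int, standard_cost_per_doc: float, multiplier: float) -> float:
    """Total cost of processing num_docs at `multiplier` times the standard per-doc cost."""
    return num_docs * standard_cost_per_doc * multiplier


NUM_DOCS = 10_000_000
STANDARD_COST_PER_DOC = 0.002  # hypothetical: USD per document on a standard model

baseline = bulk_cost_usd(NUM_DOCS, STANDARD_COST_PER_DOC, multiplier=1)
reasoning_low = bulk_cost_usd(NUM_DOCS, STANDARD_COST_PER_DOC, multiplier=5)
reasoning_high = bulk_cost_usd(NUM_DOCS, STANDARD_COST_PER_DOC, multiplier=30)

print(f"standard:  ${baseline:,.0f}")
print(f"reasoning: ${reasoning_low:,.0f} to ${reasoning_high:,.0f}")
```

Under that assumed base cost, a job of about twenty thousand dollars becomes a job of roughly one hundred thousand to six hundred thousand dollars.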
Also, reasoning models sometimes reason themselves into confidently wrong answers. The extended chain can compound a bad assumption rather than correct it. For tasks where the non-reasoning model would have said 'I'm not sure' or retrieved an answer, a reasoning model can talk itself into a specific (wrong) conclusion.
The production pattern
Our default: reasoning models as an escalation path from the router. 90%+ of queries go to a standard model. Queries flagged as hard (by an explicit user request, a router classifier that identifies a multi-step task, or an initial attempt that triggers retry-with-reasoning) escalate to the reasoning tier. This keeps latency low for most traffic and reserves the expensive path for the queries where it pays for itself.
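A minimal sketch of that escalation decision, assuming the three signals have already been computed upstream. The names `Query`, `Tier`, and `choose_tier` are illustrative, not taken from any router library, and the classifier that would set `classified_multi_step` is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    STANDARD = "standard"
    REASONING = "reasoning"


@dataclass(frozen=True)
class Query:
    text: str
    user_requested_reasoning: bool = False  # explicit user request
    classified_multi_step: bool = False     # set by a (hypothetical) router classifier
    first_attempt_failed: bool = False      # initial attempt triggered retry-with-reasoning


def choose_tier(query: Query) -> Tier:
    """Escalate if any one of the three signals fires; otherwise stay on the standard tier."""
    if (
        query.user_requested_reasoning
        or query.classified_multi_step
        or query.first_attempt_failed
    ):
        return Tier.REASONING
    return Tier.STANDARD


simple = choose_tier(Query("What are your opening hours?"))
hard = choose_tier(Query("Reconcile these three schedules", classified_multi_step=True))
```

Keeping the decision a pure function of precomputed flags makes it easy to log which signal caused each escalation, which feeds the ratio metric.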
We track the ratio of reasoning escalations as a cost and UX metric. Too high (>10% of traffic) means either the router is over-escalating, or the core workflow genuinely needs reasoning and we should budget for it. Too low (<1%) means the escalation path isn't firing when it should and simple models are failing silently.
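The ratio check itself is simple enough to sketch. The 1% and 10% thresholds are the ones quoted above; the function name and the status strings are hypothetical, and a real system would compute the counts over a time window from its request logs.

```python
def escalation_status(escalated: int, total: int, low: float = 0.01, high: float = 0.10) -> str:
    """Classify the share of traffic sent to the reasoning tier against the two thresholds."""
    if total <= 0:
        raise ValueError("total must be positive")
    ratio = escalated / total
    if ratio > high:
        return "too_high"  # router over-escalating, or the workflow really needs reasoning
    if ratio < low:
        return "too_low"   # escalation path may not be firing when it should
    return "ok"


status_busy = escalation_status(escalated=150, total=1_000)
status_quiet = escalation_status(escalated=5, total=1_000)
status_normal = escalation_status(escalated=50, total=1_000)
```

A status of "too_high" or "too_low" is a prompt to investigate rather than a verdict, since each has two possible explanations.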