Every year around December, we write a post comparing what actually shipped in AI that year with what was hyped. The gap is always bigger than industry commentary suggests. 2025 was no exception: genuine progress in narrow areas, underwhelming results in others, and a few outright failures that never got honestly reported. This is our look at the year, from the perspective of a team that actually ships AI to production.
What actually shipped in 2025
Voice AI crossed the threshold
Early 2025 had voice AI that was impressive in demos and brittle in production. Late 2025 had voice AI good enough to ship to customer-facing support. Real-time STT-LLM-TTS (speech-to-text, LLM, text-to-speech) pipelines hit sub-second latency consistently. Several of our clients moved 20-40% of inbound call volume to voice AI with CSAT on par with human agents. See our voice AI post.
Document intelligence became genuinely solved
Extracting structured data from messy documents (invoices, contracts, forms) crossed the reliability threshold for most production use. It is not 100% perfect, but it is good enough to replace 80% of human data entry while keeping a 5% spot-check rate. This was quietly the biggest practical win of 2025.
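A spot-check at a fixed rate can be as simple as randomly sampling automated extractions for human review. The following is a minimal sketch under that assumption; `select_for_spot_check`, its parameters, and the document IDs are hypothetical names for illustration, not part of any specific tool.

```python
import math
import random


def select_for_spot_check(doc_ids, rate=0.05, seed=None):
    """Return a random subset of doc_ids to route to human review.

    rate is the fraction of documents to review (0.05 means 5%).
    seed is optional and only makes the sample reproducible.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    ids = list(doc_ids)
    rng = random.Random(seed)
    # Round up so that a small batch still gets at least one review
    # whenever rate > 0; never ask for more items than exist.
    k = min(len(ids), math.ceil(len(ids) * rate))
    return rng.sample(ids, k)


# Hypothetical batch of 1,000 extracted invoices.
batch = [f"invoice-{i}" for i in range(1000)]
to_review = select_for_spot_check(batch, rate=0.05, seed=42)
```

Uniform sampling is the simplest choice. A real pipeline might instead oversample low-confidence extractions, but that depends on having a confidence signal to sample on.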
Code copilots became table stakes
Every professional developer uses one now. 20-30% productivity uplift is real and measured. The open question is no longer 'do these work' but 'which is best' and 'how do we integrate them into team workflows.'
Internal search and knowledge bases
Not glamorous, but ubiquitous. Every mid-size company now has or is building an internal AI knowledge base. Most of these work fine. Most didn't make headlines. This is the largest category of deployments by count.
What didn't ship
Fully autonomous agents doing open-ended work
Still demoware. The announcements kept coming; the production deployments didn't. Bounded agents on narrow workflows work; open-ended agents in messy environments don't. See our agents post for the reliability failure modes.
AI replacing knowledge workers at scale
Augmentation? Yes. Replacement? No, at least not in the way the hype cycle predicted. Knowledge workers are using AI tools, producing more, and making different kinds of mistakes. Headcount hasn't collapsed. The 'AI will replace lawyers/doctors/analysts in 3 years' predictions from 2023 aged poorly.
"AI CRM" replacing Salesforce
A wave of AI-native CRM startups promised to disrupt Salesforce with AI-first UX. Most built cosmetic AI features on top of worse CRMs. Salesforce added AI features of its own. The disruption didn't materialize.
Multimodal breakthroughs touted at keynotes
Demo videos of models understanding images, video, audio simultaneously. In production, text remains 90%+ of all LLM usage. Multimodal is genuinely useful for narrow use cases (document intelligence, accessibility) but hasn't transformed mainstream applications.
Quiet wins
Things that shipped quietly but mattered:
- Long-context models became genuinely useful, not just demos. 100K+ token contexts are now table stakes.
- Open-source models (Llama 4, DeepSeek, Qwen) closed much of the gap with frontier models. For many use cases, open-source is now sufficient.
- Evaluation tooling matured. Langfuse, Braintrust, Promptfoo all got substantially better.
- Structured output support got reliable. JSON mode and schema validation became standard.
- Costs kept dropping. Cost per token fell 50-70% for equivalent capability compared to 2024.
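To make the structured-output item above concrete: schema validation means parsing the model's output as JSON and checking it against the expected fields before it reaches downstream code. The following is a minimal standard-library sketch. The invoice fields and the `parse_invoice` helper are hypothetical, and a production system would more likely use a dedicated schema library than a hand-rolled check.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
INVOICE_FIELDS = {
    "invoice_number": str,
    "total": (int, float),
    "currency": str,
}


def parse_invoice(raw: str) -> dict:
    """Parse model output and verify required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in INVOICE_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly;
        # otherwise `true` would pass as a numeric total.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {field}")
    return data


# Hypothetical model output.
invoice = parse_invoice('{"invoice_number": "A-17", "total": 120.5, "currency": "EUR"}')
```

Raising on the first problem keeps the sketch short. Collecting every error before failing is often more useful when the errors are fed back to the model for a retry.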
Quiet failures
- Several high-profile "AI-first" companies laid off meaningful percentages of their staff after their products underperformed expectations.
- AI writing detectors remained unreliable — detection accuracy barely improved and false-positive rates stayed uncomfortable.
- "Personal AI" products (meet-yourself-in-AI, AI relationship tools) failed to find sustainable usage.
- Generative video tools impressed in demos and frustrated in production. Control and consistency remained elusive.
Lessons from 2025
Narrow, well-evaluated AI in production beat broad, impressive AI in demos. Boring won. Teams that focused on eval infrastructure, operational excellence, and specific customer problems shipped. Teams that chased the next capability announcement mostly didn't. This is the persistent 2020s pattern: the AI that makes money is 20% less exciting than the AI that makes headlines.
Outlook for 2026
Our bets, stated as probabilities:
- Code: agentic coding (Devin-class) becomes genuinely useful for bounded tasks. 80%.
- Voice: voice AI continues to eat outbound and inbound calling workflows. 90%.
- Video gen: production-ready for advertising and short content. 50%.
- Autonomous agents: still demoware for most open-ended tasks. 75%.
- Benchmarks: new reasoning benchmarks continue to move faster than deployment reality. 95%.
- Cost: cost per token drops another 30-50%. 70%.
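Because the bets above carry probabilities, they can be scored once the outcomes are known. One common choice is the Brier score: the mean squared difference between each stated probability and the 0/1 outcome, where lower is better and always answering 50% scores 0.25. The following is a sketch only. The dictionary keys are shorthand labels we made up for the six bets, and no outcomes are filled in because none exist yet.

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.

    forecasts: iterable of (probability, happened) pairs, where
    probability is in [0, 1] and happened is True or False.
    """
    pairs = list(forecasts)
    if not pairs:
        raise ValueError("need at least one forecast")
    total = 0.0
    for probability, happened in pairs:
        outcome = 1.0 if happened else 0.0
        total += (probability - outcome) ** 2
    return total / len(pairs)


# Probabilities from the list above; keys are hypothetical shorthand labels.
BETS_2026 = {
    "code": 0.80,
    "voice": 0.90,
    "video_gen": 0.50,
    "agents": 0.75,
    "benchmarks": 0.95,
    "cost": 0.70,
}


def score_bets(bets, outcomes):
    """Score bets against a dict of label -> True/False outcomes."""
    return brier_score((bets[label], outcomes[label]) for label in bets)
```

Once each bet has resolved, `score_bets(BETS_2026, outcomes)` returns a single number that can be compared year over year.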
Closing
The industry keeps promising more than it delivers, and the actual deliveries matter more than the promises. For teams building production AI: focus on what's shipping, not what's demoed. Ignore the quarterly benchmark announcements unless they touch your use case. Invest in eval and observability. The teams doing the less-exciting work are the ones making money. We'll do this exercise again at the end of 2026 and see how the predictions above age.