Voice AI is the most unforgiving AI surface. Chat users tolerate a lot — typos, brief pauses, imperfect phrasing. Voice users don't. A 2-second pause feels like 10. A single misheard word and the caller hangs up. A slightly robotic tone and trust evaporates. Yet voice is also the most valuable surface: it meets users where they already are, doesn't require a screen, and can handle tasks chat still can't.
We've built voice AI systems across customer support, outbound sales, and internal workflows. This post is what we've learned about making voice AI that passes what we call the 'grandma test': a voice agent the average person would actually talk to without getting frustrated. It's a higher bar than most teams expect.
The latency problem
In text, 3 seconds to first token is fine. In voice, 1 second to first sound is the upper bound before users start feeling the delay. 500ms is ideal. The typical stack has six latency stages: audio capture, speech-to-text, LLM response generation, text-to-speech, network transport, audio playback. Each can steal 100-300ms. Stack them naively and you're at 2-3 seconds before the caller hears anything, which feels broken.
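To see how quickly the budget disappears, it helps to write the stages down and add them up. The sketch below is a minimal latency-budget check. The stage names follow the list above, but the per-stage millisecond values and the function names are illustrative assumptions, not measurements from a real deployment.

```python
# Per-turn latency budget check. Stage names come from the text above;
# the millisecond values are made-up placeholders, not measurements.
UPPER_BOUND_MS = 1000  # "1 second to first sound" ceiling
IDEAL_MS = 500         # ideal target


def time_to_first_sound(stage_ms: dict[str, float]) -> float:
    """Sum stage latencies, assuming stages run strictly one after another."""
    return sum(stage_ms.values())


def classify(total_ms: float) -> str:
    """Bucket a total against the ideal target and the upper bound."""
    if total_ms <= IDEAL_MS:
        return "ideal"
    if total_ms <= UPPER_BOUND_MS:
        return "acceptable"
    return "over budget"


# Hypothetical numbers for a naive, non-streaming pipeline.
naive = {
    "audio_capture": 100,
    "speech_to_text": 300,
    "llm_generation": 300,
    "text_to_speech": 300,
    "network_transport": 150,
    "audio_playback": 100,
}

total = time_to_first_sound(naive)
print(total, classify(total))
```

In practice you would feed this from per-stage timers in your own pipeline and alert when a turn lands in the "over budget" bucket.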
Latency optimization playbook
- Streaming everywhere. STT that streams partial transcripts. LLM that streams tokens. TTS that starts speaking while generation continues. No waiting for completion.
- Aggressive endpointing. Don't wait for a long silence to decide the user is done. Use ML-based endpointing that predicts end-of-turn from tone and phrasing (300-500ms faster than silence-based).
- Speculative TTS. Start synthesizing the first few words of a likely response before generation completes.
- Warm connections. Keep STT, LLM, and TTS connections warm between turns.
- Single-vendor stacks when possible. Inter-vendor handoff adds 50-200ms per step.
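As one concrete piece of the "streaming everywhere" item above, the sketch below buffers streamed LLM tokens and releases text to TTS at each sentence boundary instead of waiting for the full response. The token list stands in for a real LLM stream, and `sentence_chunks` is a hypothetical helper, not part of any vendor SDK.

```python
from typing import Iterable, Iterator

# Naive boundary rule (an assumption): a chunk ends at ., ! or ?
# Abbreviations and decimals would need smarter handling in production.
SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield text as soon as a sentence boundary appears in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without punctuation.
    if buffer.strip():
        yield buffer.strip()


# Stand-in for tokens arriving from a streaming LLM call.
fake_llm_stream = ["Let", " me", " check", " that", ".",
                   " Your", " order", " shipped", " today", "."]

chunks = list(sentence_chunks(fake_llm_stream))
for chunk in chunks:
    print("send to TTS:", chunk)
```

The first sentence can start playing while the model is still generating the second, which is where most of the perceived latency saving comes from.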
Speech recognition is harder than you think
'STT has been solved' is not true for production voice AI. Accents, names, domain jargon, background noise, and phone line compression each degrade accuracy. A system that achieves 95% word accuracy on benchmarks often drops to 80-85% on real customer calls. A 15-20% word error rate breaks every downstream step.
Fixes we deploy: domain-specific vocabulary hints (boost product names, common entities), phone-optimized models (Deepgram Nova-2-phonecall, OpenAI Whisper with phone-tuned variants), and post-processing confidence filtering (if confidence is low, the LLM asks clarifying questions instead of proceeding). Together, these raise real-world accuracy by 5-10 points, which is the difference between workable and unusable.
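The confidence-filtering step can be sketched as a small gate between STT and the LLM. This is a minimal illustration: the `Word` structure, the function name, and both thresholds are assumptions, and real STT engines report confidence in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # 0.0-1.0, as reported by the STT engine (assumed shape)


def needs_clarification(
    words: list[Word],
    min_word_conf: float = 0.6,   # assumed floor for a "trusted" word
    max_low_ratio: float = 0.2,   # assumed share of low-confidence words tolerated
) -> bool:
    """Return True when too many words fall below the confidence floor."""
    if not words:
        return True  # nothing recognized: ask the caller to repeat
    low = sum(1 for w in words if w.confidence < min_word_conf)
    return low / len(words) > max_low_ratio


clear = [Word("cancel", 0.95), Word("my", 0.90), Word("order", 0.92)]
noisy = [Word("cancel", 0.90), Word("my", 0.40), Word("order", 0.30)]

print(needs_clarification(clear))
print(needs_clarification(noisy))
```

When the gate returns True, the agent would ask a clarifying question rather than act on the transcript.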
Conversation design is a skill
Voice UX is not chat UX. Rules that matter more in voice:
- Keep responses under 40 seconds. Long responses lose listeners.
- Signal before acting. "Let me check that" before a pause feels natural; silent pauses feel broken.
- Confirm critical actions. "I heard you want to cancel order 12345 — is that right?" before acting on high-impact requests.
- Offer barge-in. Users must be able to interrupt; detect it and respond.
- Use natural turn-taking cues. Words like "anyway," "so," and "okay" work as transition markers.
For a deeper look at conversational UX across surfaces, see our conversational UX post.
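The "confirm critical actions" rule above can be sketched as a gate that reads the request back before anything irreversible happens. The intent names and the `confirmation_prompt` helper are hypothetical; a real system would plug in its own intent taxonomy.

```python
from typing import Optional

# Hypothetical intent names; a real system would define its own.
HIGH_IMPACT_INTENTS = {"cancel_order", "close_account"}


def confirmation_prompt(intent: str, entity: str, confirmed: bool) -> Optional[str]:
    """Return a read-back prompt for high-impact intents, or None to proceed."""
    if intent in HIGH_IMPACT_INTENTS and not confirmed:
        action = intent.replace("_", " ")
        return f"I heard you want to {action} {entity}. Is that right?"
    return None


prompt = confirmation_prompt("cancel_order", "12345", confirmed=False)
print(prompt)
```

Low-impact intents and already-confirmed requests return None, so the agent proceeds without an extra turn.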
When to hand off to a human
A voice AI that tries to handle everything ends up frustrating users. The high-quality move is: handle what you handle well, hand off cleanly for the rest. Signals for handoff:
- User explicitly requests a human ("speak to an agent").
- Model confidence drops below threshold for two turns.
- User emotion detection flags frustration or distress.
- Task requires authority the AI doesn't have (refunds above threshold, account changes).
- User has tried to resolve the same issue multiple times.
When handing off, pass full context to the human agent — transcript, user intent, what's been attempted. The worst voice AI handoff is the one where the human has to ask the user to explain everything again.
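The handoff signals and the context payload can be expressed as one small decision function. This is a sketch under assumptions: the `CallState` fields, the threshold defaults, and the reason labels are all hypothetical, and "two turns" is read here as the two most recent turns, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallState:
    transcript: list[str] = field(default_factory=list)
    intent: str = ""
    attempted: list[str] = field(default_factory=list)
    turn_confidences: list[float] = field(default_factory=list)  # one per turn
    asked_for_human: bool = False
    frustration_flagged: bool = False
    needs_authority: bool = False
    repeat_attempts: int = 0


def handoff_reason(
    state: CallState,
    conf_threshold: float = 0.5,  # assumed threshold
    max_repeats: int = 2,         # assumed limit on repeated attempts
) -> Optional[str]:
    """Return the first handoff signal that fires, or None to keep going."""
    if state.asked_for_human:
        return "user_requested"
    last_two = state.turn_confidences[-2:]
    if len(last_two) == 2 and all(c < conf_threshold for c in last_two):
        return "low_confidence"
    if state.frustration_flagged:
        return "frustration"
    if state.needs_authority:
        return "needs_authority"
    if state.repeat_attempts >= max_repeats:
        return "repeated_issue"
    return None


def handoff_payload(state: CallState, reason: str) -> dict:
    """Bundle the context a human agent needs so the caller need not repeat it."""
    return {
        "reason": reason,
        "transcript": state.transcript,
        "intent": state.intent,
        "attempted": state.attempted,
    }


state = CallState(
    transcript=["I want a refund", "Sorry, could you repeat that?"],
    intent="refund",
    attempted=["looked up order"],
    turn_confidences=[0.9, 0.4, 0.3],
)
reason = handoff_reason(state)
payload = handoff_payload(state, reason) if reason else None
print(reason, payload)
```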
TTS quality matters more than you think
Cheap TTS sounds like cheap TTS. Users notice within seconds and trust drops. The difference between a $0.015/minute TTS and a $0.030/minute TTS is usually noticeable and usually worth it. ElevenLabs, OpenAI TTS, and Deepgram Aura are the current top tier. Test with your actual content, not demos: some voices handle product names and numbers better than others.
Latency-quality tradeoff: the best-sounding voices are slowest. Use a streaming-capable TTS with reasonable quality for most turns; reserve premium voices for specific use cases.
Evaluating voice AI
Evaluation for voice combines four dimensions: task success rate (did the AI accomplish what the user wanted), conversation quality (was the flow natural), technical quality (STT accuracy, latency, TTS naturalness), and emotional quality (did the user feel heard). Automated evals cover task success and technical quality; human review is needed for conversation and emotional quality.
We recommend manually reviewing 50 random calls per week in early deployment, dropping to 20 per week once patterns stabilize. Score each call from 1 to 5 on each of the four dimensions and feed regressions back into the eval dataset. Voice AI that ships without this cadence regresses silently.
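The weekly review cadence can be supported with a few lines of tooling. The sketch below samples calls for manual review, averages reviewer scores per dimension, and flags dimensions that dropped week over week. The dimension keys, the 0.3-point tolerance, and the sample scores are assumptions made for illustration.

```python
import random
from statistics import mean
from typing import Optional

DIMENSIONS = ("task_success", "conversation", "technical", "emotional")


def sample_calls(call_ids: list[str], k: int = 50,
                 seed: Optional[int] = None) -> list[str]:
    """Pick up to k calls uniformly at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(call_ids, min(k, len(call_ids)))


def dimension_means(scores: list[dict[str, int]]) -> dict[str, float]:
    """Average the 1-5 reviewer scores per dimension."""
    return {d: float(mean(s[d] for s in scores)) for d in DIMENSIONS}


def regressions(current: dict[str, float], previous: dict[str, float],
                tolerance: float = 0.3) -> list[str]:
    """List dimensions whose mean fell by more than the tolerance."""
    return [d for d in DIMENSIONS if current[d] < previous[d] - tolerance]


# Made-up reviewer scores for two calls, plus last week's means.
this_week = [
    {"task_success": 5, "conversation": 4, "technical": 4, "emotional": 3},
    {"task_success": 3, "conversation": 4, "technical": 2, "emotional": 3},
]
last_week = {"task_success": 4.0, "conversation": 4.0,
             "technical": 3.5, "emotional": 3.0}

means = dimension_means(this_week)
flagged = regressions(means, last_week)
print(means, flagged)
```

Calls behind a flagged dimension are the natural candidates to add to the eval dataset.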
Before shipping any voice AI to production, call it and talk to it like your grandma would — no technical vocabulary, some mumbling, occasional confusion. If it falls apart, production users will too. This is a 30-minute test that saves months of debugging.
Real numbers from production
One voice deployment we shipped — a customer service agent for a mid-market SaaS — handles 8,000 calls/day. Details from our Hearthline case study:
- 82% first-call resolution without human handoff (baseline human agents: 78%).
- 800ms average time-to-first-sound after user finishes speaking.
- 94% STT accuracy on our test set (domain-tuned).
- 67% cost reduction versus pure human agents at comparable CSAT.
- Handoff rate: 18%, with 90% of handoffs happening cleanly (user and agent both informed, context passed).
Infrastructure choices
Vapi, Bland, and Retell all provide managed voice infrastructure that handles the STT → LLM → TTS orchestration. They're the fastest way to prototype. For production scale, many of our clients graduate to self-managed stacks using LiveKit Agents or similar, with more control over latency budgets and per-call costs.
Tradeoff: managed platforms cost 2-3x per minute at scale but save 2-3 months of infra engineering. For pilots and sub-million-minute deployments, managed wins. Past that, self-managed pays off.
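The tradeoff is a break-even calculation: the per-minute premium of a managed platform against the up-front cost of building your own stack. The sketch below shows only the arithmetic. The per-minute rate, the multiplier, and the engineering cost are placeholders chosen to make the math easy to follow, not figures from any vendor or deployment.

```python
def break_even_minutes(self_managed_rate: float,
                       managed_multiplier: float,
                       engineering_cost: float) -> float:
    """Cumulative call minutes at which the managed premium equals the build cost."""
    premium_per_minute = self_managed_rate * (managed_multiplier - 1)
    if premium_per_minute <= 0:
        raise ValueError("managed platform must cost more per minute than self-managed")
    return engineering_cost / premium_per_minute


# Placeholder inputs: $0.05/min self-managed, managed at 2.5x,
# and a hypothetical $75,000 for the infra engineering work.
minutes = break_even_minutes(0.05, 2.5, 75_000)
print(f"break-even at roughly {minutes:,.0f} minutes")
```

Plug in your own quotes and loaded engineering cost; the shape of the calculation matters more than these particular inputs.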
Compliance and recording
Voice AI often touches regulated data: healthcare (HIPAA), finance (PCI, SOX), EU data (GDPR). Get clear on: consent to record (announced at call start in most jurisdictions), PII handling (redaction in transcripts), retention policies, and data residency (where recordings live). Our security page covers how we approach this for regulated deployments.
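For the transcript-redaction item, a minimal sketch is a pattern-based pass that masks card-like digit runs before transcripts are stored. This is deliberately simplistic and is an assumption about one possible approach: spoken numbers often arrive spelled out ("four one one one"), and production redaction usually needs entity detection on top of regular expressions.

```python
import re

# Matches 13-16 digits, optionally separated by single spaces or dashes.
# Shorter numbers such as order IDs are left untouched.
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def redact(transcript: str) -> str:
    """Replace card-like digit sequences with a placeholder token."""
    return CARD_LIKE.sub("[REDACTED]", transcript)


print(redact("my card is 4111 1111 1111 1111 thanks"))
print(redact("cancel order 12345"))
```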
A voice AI system is a chat system with an SLA. The SLA is: feel like a person, or the caller hangs up.
Closing
Voice AI rewards obsessive attention to the parts chat AI gets away with neglecting: latency, recognition accuracy, conversation flow, and handoff quality. Teams that treat voice as 'chat with TTS wrapped around it' consistently ship systems users hate. Teams that treat voice as its own surface — with its own evaluation, its own UX rules, its own operational cadence — ship systems users actually use. Start with the grandma test.