Voice AI in Production

Voice interfaces have crossed the threshold from novelty to production-grade. Here's what it takes to build one that doesn't break on real calls.

The latency budget

In a voice conversation, a response time under 500ms is the threshold for natural interaction. Above 800ms, the caller notices the delay. Above 1200ms, they start talking over the system. Four stages consume your latency budget: speech-to-text (STT), LLM inference, text-to-speech (TTS), and network round trips. Measure and optimize each one independently, because a single slow stage can eat the whole budget. Streaming STT and TTS, where processing starts before the utterance is complete, can cut perceived latency by as much as half.
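A minimal sketch of per-stage budget tracking, assuming you already collect a timing for each stage of a turn. The thresholds come from the numbers above; the class name, field names, and the "borderline" label for the 500ms to 800ms range are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass

# Thresholds from the text: under 500ms feels natural, above 800ms the
# caller notices, above 1200ms they talk over the system.
NATURAL_MS = 500
NOTICEABLE_MS = 800
TALK_OVER_MS = 1200


@dataclass(frozen=True)
class TurnLatency:
    """Per-stage latency for one conversational turn, in milliseconds."""
    stt_ms: float
    llm_ms: float
    tts_ms: float
    network_ms: float

    @property
    def total_ms(self) -> float:
        return self.stt_ms + self.llm_ms + self.tts_ms + self.network_ms

    def slowest_stage(self) -> str:
        """Name the stage that consumes the most budget."""
        stages = {
            "stt": self.stt_ms,
            "llm": self.llm_ms,
            "tts": self.tts_ms,
            "network": self.network_ms,
        }
        return max(stages, key=stages.get)


def classify(total_ms: float) -> str:
    """Map a total response time onto the thresholds above.

    "borderline" (500ms to 800ms) is a label assumed for this sketch;
    the text does not name that range.
    """
    if total_ms < NATURAL_MS:
        return "natural"
    if total_ms <= NOTICEABLE_MS:
        return "borderline"
    if total_ms <= TALK_OVER_MS:
        return "noticeable"
    return "talk-over"


turn = TurnLatency(stt_ms=150, llm_ms=300, tts_ms=120, network_ms=80)
print(turn.total_ms, classify(turn.total_ms), turn.slowest_stage())
```

Logging the slowest stage per turn, rather than only the total, tells you which layer to optimize first.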

STT and TTS pipeline design

Deepgram, OpenAI's Whisper, and Azure dominate STT. ElevenLabs, Cartesia, and Play.ht lead TTS. The key architectural decision is whether to run these as separate services or as an integrated pipeline. Separate services give you the flexibility to swap providers. Integrated pipelines (like LiveKit + Deepgram + ElevenLabs) reduce latency at the cost of vendor lock-in. For production, we recommend running STT and TTS as separate, provider-abstracted services with a lightweight orchestration layer between them.
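One way to sketch the provider-abstracted design is with two small interfaces and an orchestrator that only knows about those interfaces. Everything here is hypothetical: `SpeechToText`, `TextToSpeech`, `VoicePipeline`, and the fake providers are illustrative names, and a real adapter would wrap a vendor SDK behind the same methods.

```python
from typing import Callable, Protocol


class SpeechToText(Protocol):
    """Any STT provider adapter must offer this method."""
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    """Any TTS provider adapter must offer this method."""
    def synthesize(self, text: str) -> bytes: ...


class VoicePipeline:
    """Lightweight orchestration layer: STT -> reply logic -> TTS."""

    def __init__(
        self,
        stt: SpeechToText,
        respond: Callable[[str], str],
        tts: TextToSpeech,
    ) -> None:
        self.stt = stt
        self.respond = respond
        self.tts = tts

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)
        reply = self.respond(transcript)
        return self.tts.synthesize(reply)


# Stand-in providers for local testing. They treat audio as UTF-8 text,
# which no real provider does; they exist only to exercise the wiring.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(FakeSTT(), lambda text: text.upper(), FakeTTS())
print(pipeline.handle_turn(b"hello"))
```

Swapping a provider then means writing one new adapter class, with no change to the orchestrator.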

Conversation state machines

A voice agent is not a chatbot with a microphone. Conversations have turn-taking, interruptions, pauses, and ambiguous endings. You need a state machine that tracks four things: whether the user is speaking, whether the agent is speaking, whether the user interrupted, and whether a silence is a pause or the end of a turn. The state machine drives the agent's behavior: when to listen, when to respond, when to clarify. LLMs are bad at turn-taking. Don't make the model manage the conversation flow. Keep that logic deterministic.
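A deterministic turn-taking machine can be as small as a transition table. The sketch below is one possible shape, not a complete design: the state names, event names, and the idea that a silence timeout marks the end of a turn are assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    USER_SPEAKING = auto()
    USER_PAUSED = auto()
    AGENT_THINKING = auto()
    AGENT_SPEAKING = auto()


class Event(Enum):
    SPEECH_START = auto()     # VAD reports the user started talking
    SPEECH_END = auto()       # VAD reports the user stopped talking
    SILENCE_TIMEOUT = auto()  # silence lasted long enough to end the turn
    REPLY_READY = auto()      # the reply is ready to be spoken
    PLAYBACK_DONE = auto()    # TTS playback finished


# (current state, event) -> next state. Pairs not listed are ignored.
TRANSITIONS = {
    (State.LISTENING, Event.SPEECH_START): State.USER_SPEAKING,
    (State.USER_SPEAKING, Event.SPEECH_END): State.USER_PAUSED,
    # Speech resumes before the timeout: it was a pause, not a turn end.
    (State.USER_PAUSED, Event.SPEECH_START): State.USER_SPEAKING,
    (State.USER_PAUSED, Event.SILENCE_TIMEOUT): State.AGENT_THINKING,
    (State.AGENT_THINKING, Event.REPLY_READY): State.AGENT_SPEAKING,
    (State.AGENT_THINKING, Event.SPEECH_START): State.USER_SPEAKING,
    # The user talks while the agent is speaking: an interruption.
    (State.AGENT_SPEAKING, Event.SPEECH_START): State.USER_SPEAKING,
    (State.AGENT_SPEAKING, Event.PLAYBACK_DONE): State.LISTENING,
}


class Conversation:
    def __init__(self) -> None:
        self.state = State.LISTENING
        self.interrupted = False

    def handle(self, event: Event) -> State:
        """Apply one event and return the resulting state."""
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return self.state  # event is not valid in this state
        self.interrupted = (
            self.state is State.AGENT_SPEAKING
            and event is Event.SPEECH_START
        )
        self.state = next_state
        return next_state
```

Because every transition lives in one table, you can unit-test the conversation flow without audio, a model, or a network.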

Handling interruptions and barge-in

Barge-in, when the user starts speaking while the agent is mid-response, is the hardest problem in voice AI. You need voice activity detection (VAD) that can distinguish between a real interruption and background noise. When barge-in is detected, you need to stop TTS immediately, process the new utterance, and respond. This requires tight integration between your audio pipeline and your conversation state machine. In our experience, most production failures in voice AI are barge-in edge cases.
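One common way to separate a real interruption from a noise blip is to require several consecutive voiced frames before acting. The sketch below assumes your VAD already emits a per-frame speech flag; `BargeInDetector`, `min_voiced_frames`, and the `on_barge_in` callback are hypothetical names, and the default of three frames is a placeholder to tune against real calls.

```python
from typing import Callable


class BargeInDetector:
    """Fire a callback once the user has clearly started talking over the agent."""

    def __init__(
        self,
        on_barge_in: Callable[[], None],
        min_voiced_frames: int = 3,
    ) -> None:
        self.on_barge_in = on_barge_in
        self.min_voiced_frames = min_voiced_frames
        self._voiced_run = 0  # consecutive voiced frames seen so far

    def feed(self, is_speech: bool, agent_speaking: bool) -> bool:
        """Process one VAD frame. Returns True on the frame that triggers barge-in."""
        if not agent_speaking or not is_speech:
            # Silence, or nothing to interrupt: reset the run.
            self._voiced_run = 0
            return False
        self._voiced_run += 1
        if self._voiced_run == self.min_voiced_frames:
            # Fires exactly once per run; the callback should stop TTS
            # playback and hand the new utterance to the pipeline.
            self.on_barge_in()
            return True
        return False


stopped = []
detector = BargeInDetector(on_barge_in=lambda: stopped.append("tts stopped"))
# A single voiced frame (a cough, a door) followed by silence is ignored.
for frame in [True, False, True, True, True]:
    detector.feed(is_speech=frame, agent_speaking=True)
print(stopped)
```

A higher `min_voiced_frames` rejects more noise but delays the stop, so it trades false triggers against how long the agent keeps talking over the caller.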

Provider selection and fallback

No single STT or TTS provider works for every use case. Deepgram is fast and cheap but can struggle with heavy accents. Whisper tends to be more accurate but slower. ElevenLabs sounds natural but costs more. Your pipeline should support provider fallback: if the primary STT provider times out, fall back to a secondary. If TTS quality degrades, switch providers mid-call. This is infrastructure work: boring but essential.
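Timeout-based fallback can be sketched with `asyncio.wait_for`, assuming each provider is wrapped as an async callable that takes audio bytes and returns a transcript. The function name, the one-second default timeout, and the two stub providers are assumptions for illustration; real adapters would call the vendor's API.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Transcriber = Callable[[bytes], Awaitable[str]]


async def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[Transcriber],
    timeout_s: float = 1.0,
) -> str:
    """Try providers in order; move on when one times out or fails to connect."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            # wait_for cancels the pending call if the timeout expires.
            return await asyncio.wait_for(provider(audio), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError("all STT providers failed") from last_error


# Stub providers standing in for real adapters.
async def slow_primary(audio: bytes) -> str:
    await asyncio.sleep(5)  # simulates a provider that hangs
    return "primary transcript"


async def fast_secondary(audio: bytes) -> str:
    return "secondary transcript"


if __name__ == "__main__":
    result = asyncio.run(
        transcribe_with_fallback(b"...", [slow_primary, fast_secondary], timeout_s=0.05)
    )
    print(result)
```

Keep the timeout well inside your latency budget: a fallback that waits a full second for the primary has already lost the turn.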

Observability for voice

Voice AI needs different metrics than text-based AI. Track time-to-first-token (TTFT), STT word error rate (WER), barge-in accuracy, call completion rate, and per-call cost. Record calls (with consent) for debugging. The ability to replay a call, hearing exactly what the user heard and seeing the agent's state at each turn, is the difference between fixing a bug in an hour and chasing it for a week.
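Of these metrics, WER is the one with a standard formula: substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. Here is a minimal sketch using word-level edit distance. Lowercasing and whitespace splitting are simplifying assumptions; production scoring usually also normalizes punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("book a call for tomorrow", "book a call tomorrow"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so treat it as an error ratio rather than a percentage capped at 100.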

Building a voice agent?

We've shipped voice AI that handles real calls. Let's scope yours.

See our Voice AI development service →

Book a call