Voice AI Latency: Why Sub-200ms Response Time Matters
Voice AI has a problem that text-based AI doesn’t: time pressure. When a human asks a voice agent a question, they expect a response in the same cadence as a human conversation. Not in 2 seconds. Not in 1 second. In the time it takes a human to inhale before responding — roughly 200-500ms.
Miss that window and the user perceives the system as broken, stupid, or both. Hit it consistently and the interaction feels magical.
Having built voice AI pipelines that operate within WebRTC infrastructure, I want to break down exactly where latency hides, why it compounds, and how to engineer it away.
The Psychoacoustic Threshold
Conversational turn-taking follows precise timing patterns that humans internalize from birth:
| Delay | Human Perception |
|---|---|
| 0-200ms | Natural. Feels like a quick human response. |
| 200-500ms | Noticeable pause. Tolerable for complex questions. |
| 500-1000ms | Awkward. User starts to wonder if the system heard them. |
| 1000-2000ms | Broken. User repeats themselves or gives up. |
| 2000ms+ | Abandoned. User hangs up or switches to text. |
Telecom research has shown this consistently: conversation quality degrades sharply above 400ms of total round-trip delay. For voice AI, you need to fit your entire pipeline (speech recognition, thinking, speech synthesis) inside that budget.
The Pipeline: Where Latency Lives
A voice AI interaction has four phases:
```
User speaks → [STT] → [LLM] → [TTS] → User hears response
              ~100ms   ~200ms   ~150ms
              ├──────────────────────┤
              Total: ~450ms (too slow)
```
That’s a simplified view. The real pipeline is worse:
```
User speaks
  → Audio capture (10-30ms buffer)
  → Network transit to server (20-80ms)
  → [STT] Speech-to-Text (100-500ms)
  → [VAD] Voice Activity Detection / Endpointing (200-800ms!)
  → [LLM] Language Model inference (200-2000ms)
  → [TTS] Text-to-Speech synthesis (100-500ms)
  → Network transit back (20-80ms)
  → Audio playback buffer (10-30ms)
  → User hears response

Real-world total: 700ms - 4000ms
```
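To make the compounding concrete, here's a toy budget that just sums the stage ranges from the diagram. The numbers are the illustrative figures above, not measurements from any particular stack:

```python
# Toy latency budget for the non-streaming pipeline above.
# Ranges are the illustrative figures from the diagram, not
# measurements from any specific provider or network.
PIPELINE_MS = {
    "audio_capture":    (10, 30),
    "network_uplink":   (20, 80),
    "stt":              (100, 500),
    "endpointing":      (200, 800),
    "llm":              (200, 2000),
    "tts":              (100, 500),
    "network_downlink": (20, 80),
    "playback_buffer":  (10, 30),
}

best = sum(lo for lo, _ in PIPELINE_MS.values())
worst = sum(hi for _, hi in PIPELINE_MS.values())
print(f"best case:  {best}ms")   # 660ms -- already over the 400ms budget
print(f"worst case: {worst}ms")  # 4020ms
```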
The two biggest culprits are endpointing and LLM inference. Let’s break each one down.
The Endpointing Problem
Endpointing is the hardest latency problem in voice AI, and most engineers don’t even think about it.
The question: How do you know the user has finished speaking?
The naive approach: wait for silence. If there’s 500ms of silence after speech, assume the user is done. Problem: that 500ms is pure added latency on every single interaction. Plus, humans pause mid-sentence — “I want to book a flight to… San Francisco” has a natural 300ms pause that isn’t the end of the utterance.
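For concreteness, here's the naive timeout as a minimal Python sketch. The frame-level VAD is passed in; only the timeout logic is shown:

```python
SILENCE_THRESHOLD_MS = 500  # the naive fixed timeout

def naive_endpoint(frames, is_speech, frame_ms=20):
    """Flag end-of-utterance after a fixed run of silent frames.

    `is_speech` is any frame-level VAD (energy threshold, webrtcvad,
    etc.); this sketch shows only the timeout logic.
    """
    silent_ms = 0
    for i, frame in enumerate(frames):
        if is_speech(frame):
            silent_ms = 0  # a mid-sentence pause resets the clock
        else:
            silent_ms += frame_ms
            if silent_ms >= SILENCE_THRESHOLD_MS:
                return i  # done -- but we just paid 500ms on every turn
    return None  # still speaking
```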
The better approach: predictive endpointing. Use a model that predicts whether the user is done speaking based on:
- Prosodic features (falling intonation signals end of utterance)
- Semantic completeness (is the sentence grammatically complete?)
- Timing patterns (how long has the user been speaking?)
Modern endpointing models can reduce the silence threshold to 150-200ms with high accuracy. That’s 300-600ms saved per turn.
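A deliberately simplified sketch of the idea. Real endpointers are trained models; `pitch_slope` and `looks_complete` here are hypothetical stand-ins for a pitch tracker and a syntactic-completeness model:

```python
def endpoint_probability(pitch_slope, transcript, speech_ms):
    """Toy scoring function combining the three signals above."""
    score = 0.0
    if pitch_slope < -20:  # falling intonation (Hz/s from a pitch tracker)
        score += 0.4
    if transcript.rstrip().endswith((".", "?", "!")) or looks_complete(transcript):
        score += 0.4       # semantically complete
    if speech_ms > 1500:   # long turns tend to be wrapping up
        score += 0.2
    return score

def looks_complete(transcript):
    # Stand-in for a real completeness model: a trailing function word
    # ("a flight to...") strongly suggests the user isn't done.
    return not transcript.rstrip().lower().endswith(("to", "and", "the", "a", "of"))

def silence_threshold_ms(prob):
    # Endpoint aggressively when confident, fall back to a long timeout when not.
    return 150 if prob >= 0.6 else 600
```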
The best approach: speculative execution. Start STT processing while the user is still speaking. Start LLM inference on partial transcripts. If the final transcript changes, discard and restart. You waste some compute, but you save the user’s time.
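In code, speculative execution is essentially "start, cancel, restart." A minimal asyncio sketch, assuming `transcripts` is an async iterator of partial transcripts and `start_inference` wraps your LLM call:

```python
import asyncio

async def speculative_llm(transcripts, start_inference):
    """Start LLM inference on each partial transcript; cancel and
    restart whenever the transcript changes. Cancels waste compute,
    but the user never waits for a restart.
    """
    task, last = None, None
    async for partial in transcripts:
        if partial == last:
            continue  # no change: let the in-flight inference keep running
        if task and not task.done():
            task.cancel()
        last = partial
        task = asyncio.create_task(start_inference(partial))
    return await task if task else None  # the final transcript's run wins
```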
Streaming: The Key Architecture Decision
The difference between a 2-second and a 200ms voice AI response is streaming at every layer.
Streaming STT
Batch STT: Wait for the user to finish → send complete audio → receive complete transcript. Adds the full STT processing time to latency.
Streaming STT: Send audio chunks every 20ms → receive partial transcripts in real-time → get the final transcript almost immediately after the user stops speaking.
Latency impact: Streaming STT reduces effective STT latency from 500ms+ to near-zero because processing happens while the user is speaking.
Providers that do this well: Deepgram (fastest), Google Cloud Speech (streaming mode), AssemblyAI (real-time).
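The shape of a streaming STT client, sketched with the `websockets` library. The URL and JSON message format are hypothetical; each provider's streaming protocol differs in detail, but all follow this pattern:

```python
import asyncio, json
import websockets  # pip install websockets

async def stream_stt(audio_chunks, url="wss://stt.example.com/stream"):
    """Send PCM chunks (e.g., 20ms each) as they arrive and yield
    (partial_transcript, is_final) tuples as the server responds.
    URL and message shape are placeholders for your provider's API.
    """
    async with websockets.connect(url) as ws:
        async def sender():
            async for chunk in audio_chunks:  # arrives in real time
                await ws.send(chunk)

        send_task = asyncio.create_task(sender())
        try:
            async for message in ws:
                result = json.loads(message)
                yield result["text"], result["is_final"]
                if result["is_final"]:
                    break
        finally:
            send_task.cancel()
```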
Streaming LLM
Batch LLM: Wait for complete prompt → generate complete response → return it. For GPT-4 class models, this is 1-3 seconds.
Streaming LLM: Send prompt → receive tokens as they’re generated → start TTS on the first sentence while the rest is still generating.
The nuance: You don’t need the complete LLM response to start speaking. You need the first sentence. Most LLMs emit the first 10-20 tokens in 100-200ms. That’s usually enough for “Sure, I can help with that” or “Your flight options are…”
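A small helper that turns the token stream into TTS-ready sentence chunks. Here `token_stream` is any async iterator of text deltas, whatever your streaming LLM client yields:

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s")

async def sentences(token_stream):
    """Accumulate streamed tokens and yield each sentence as soon as
    it completes, so TTS can start on the first one while the LLM is
    still generating the rest.
    """
    buffer = ""
    async for token in token_stream:
        buffer += token
        while (match := SENTENCE_END.search(buffer)):
            yield buffer[:match.end()].strip()  # ship to TTS now
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # trailing fragment
```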
Streaming TTS
Batch TTS: Wait for complete text → synthesize complete audio → send it. Adds 300-1000ms.
Streaming TTS: Receive text token-by-token → synthesize audio incrementally → start playing the first audio chunk while the rest is still being generated.
The magic number: With streaming TTS, the user hears the first syllable of the response 50-100ms after the first tokens arrive from the LLM. The rest streams in as it’s generated.
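Gluing the pieces together, a sketch that plays audio as soon as the first chunk exists. `tts_stream` is a hypothetical streaming TTS client that yields audio chunks; `play_chunk` writes PCM to the output device or WebRTC track:

```python
import time

async def speak(sentence_stream, tts_stream, play_chunk):
    """Synthesize and play incrementally; log time to first audio."""
    t0 = time.monotonic()
    first = True
    async for sentence in sentence_stream:
        async for chunk in tts_stream(sentence):
            if first:
                ms = (time.monotonic() - t0) * 1000
                print(f"first audio after {ms:.0f}ms")  # perceived latency
                first = False
            play_chunk(chunk)
```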
The Streaming Architecture
When you chain streaming STT → streaming LLM → streaming TTS, the math changes dramatically:
```
User stops speaking
  → Final transcript available: ~50ms (was processing in parallel)
  → First LLM tokens: ~150ms
  → First TTS audio chunk: ~200ms after user stops speaking
  → User hears first word of response: ~250ms total
```
The response starts playing while the LLM is still generating and the TTS is still synthesizing. The user perceives a ~250ms response time, even though the complete response takes 2-3 seconds to fully generate.
This is the fundamental insight: perceived latency is time-to-first-byte, not time-to-complete-response.
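Chaining the sketches from the previous sections makes this concrete: nothing below waits for a complete upstream result. `llm_stream` is a placeholder for a streaming LLM client:

```python
async def respond(audio_chunks, llm_stream, tts_stream, play_chunk):
    """End-to-end chain: streaming STT -> streaming LLM -> streaming
    TTS. Audio starts playing while the tail of the response is
    still being generated.
    """
    transcript = None
    async for text, is_final in stream_stt(audio_chunks):
        transcript = text
        if is_final:
            break  # available ~50ms after speech ends

    tokens = llm_stream(transcript)  # async iterator of text deltas
    await speak(sentences(tokens), tts_stream, play_chunk)
```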
WebRTC Integration: The Last Mile
If your voice AI pipeline connects to users via WebRTC, you have additional latency considerations:
Jitter Buffer
WebRTC uses a jitter buffer (typically 20-60ms) to smooth out network jitter. For voice AI, reduce this to the minimum your network conditions allow. Every millisecond in the jitter buffer is a millisecond of latency.
Codec Selection
Opus at 20ms frame size is the sweet spot. Lower frame sizes (10ms) reduce latency by 10ms but increase CPU usage. Higher frame sizes (40ms, 60ms) save bandwidth but add unacceptable latency for conversational AI.
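The frame size sets a hard floor: a full frame must be captured before it can be encoded, and the jitter buffer typically holds at least one frame. A back-of-the-envelope comparison (simplified; real buffering varies with network conditions):

```python
# Rough buffering floor per Opus frame size. Assumes a full frame is
# captured before encoding and a jitter buffer of at least one frame,
# with a 20ms minimum -- illustrative, not a precise model.
for frame_ms in (10, 20, 40, 60):
    jitter_ms = max(20, frame_ms)
    print(f"{frame_ms}ms frames: >= {frame_ms + jitter_ms}ms of buffering")
```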
Direct Media Path
Route audio directly from the user’s browser to the STT service. Don’t relay through a media server if it isn’t necessary. Each hop adds 10-30ms.
Echo Cancellation
If the voice AI response plays through the user’s speakers, the microphone picks it up. The echo canceller needs time to adapt. Design for headphones/earbuds when possible, or implement server-side echo cancellation to avoid the client-side processing delay.
Infrastructure Choices That Matter
Edge Deployment
Deploy STT and TTS models at the edge, close to your users. A user in Mumbai connecting to an STT service in Virginia adds 150ms of network latency each way. Deploy in ap-south-1 and that drops to 10-20ms.
GPU Selection
For self-hosted models: A100s are fast but expensive. T4s are cheap but slow for TTS. The sweet spot for voice AI inference is often L4 or A10G — good inference speed at reasonable cost.
Model Size vs Latency
Smaller models respond faster. For voice AI, a fine-tuned 7B parameter model that responds in 100ms is almost always better than a 70B model that responds in 800ms. The user doesn’t care about marginal quality improvements if the conversation feels laggy.
Measuring What Matters
The metrics that define voice AI quality:
- **Time to First Byte (TTFB):** Time from the user finishing speaking to the first audio byte of the response. Target: <300ms.
- **Interruption handling:** Can the user interrupt the AI mid-response? How quickly does the system stop speaking and start listening? Target: <100ms.
- **End-to-end latency (P99):** 99th percentile total round-trip. Target: <500ms for the first word of the response.
- **Word Error Rate (WER):** STT accuracy. It matters because misrecognition causes wasted turns, which is the worst latency of all: having to repeat yourself.
- **Turn completion rate:** What percentage of turns complete successfully without the user abandoning or repeating? Target: >95%.
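A minimal way to track the first of these, assuming you timestamp when endpointing fires and when the first audio packet goes out, and record one TTFB sample per turn:

```python
import statistics

def latency_report(ttfb_samples_ms):
    """Summarize per-turn time-to-first-byte samples (ms), measured
    from end-of-speech to first outbound audio.
    """
    samples = sorted(ttfb_samples_ms)
    p99_idx = min(len(samples) - 1, int(len(samples) * 0.99))
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[p99_idx],
        "pct_within_300ms": sum(s <= 300 for s in samples) / len(samples),
    }

# e.g. latency_report([180, 210, 240, 260, 900])
# -> the p99 surfaces the outlier turn a mean would hide
```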
The Bottom Line
Voice AI latency isn’t a single number to optimize — it’s a pipeline of numbers that compound. The engineering challenge is:
- Stream everything. Never wait for a complete result when a partial result is sufficient.
- Speculate aggressively. Start processing before you’re sure the user is done speaking.
- Deploy at the edge. Network latency is the one thing you can’t optimize away with better algorithms.
- Measure perceived latency, not pipeline latency. Time-to-first-audio-byte is what the user experiences.
The teams building the best voice AI products in 2026 aren’t the ones with the best models. They’re the ones with the best latency engineering.
Related reading: Building Real-Time AI Pipelines goes deeper on the full STT/LLM/TTS architecture. For the WebRTC transport layer, see WebRTC: Revolutionizing Real-Time Communication. For edge deployment, see Edge Computing.
I write about the hard problems in real-time communication, AI, and cloud infrastructure. If you're working on something in this space, I'd enjoy hearing about it.
Get in touch →