Real-Time Voice AI Pipelines: STT, LLM, and TTS
The most interesting AI systems in 2026 aren’t chatbots. They’re voice agents: AI that listens, thinks, and speaks in real time within live communication flows. Building one that actually works in production requires solving problems at the intersection of WebRTC, speech processing, and language models.
This is the architecture guide I wish existed when I started building these systems.
The Reference Architecture
A production real-time AI voice pipeline has five layers:
┌─────────────────────────────────────────────────────┐
│                 User (Browser/App)                  │
│  Microphone → WebRTC → [Audio Stream Out]           │
│  Speaker    ← WebRTC ← [Audio Stream In]            │
└─────────────────┬──────────────┬────────────────────┘
                  │              ▲
                  ▼              │
┌─────────────────────────────────────────────────────┐
│                 Media Server (SFU)                  │
│  Extracts audio track → forwards to AI pipeline     │
│  Receives AI audio → injects into session           │
└─────────────────┬──────────────┬────────────────────┘
                  │              ▲
                  ▼              │
┌─────────────────────────────────────────────────────┐
│                  AI Voice Pipeline                  │
│                                                     │
│  Audio In → [VAD] → [STT] → [LLM] → [TTS] → Audio   │
│             20ms    stream  stream  stream  Out     │
└─────────────────────────────────────────────────────┘
Each component in the pipeline must stream. Batch processing at any stage adds hundreds of milliseconds that compound into an unusable experience.
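To make that granularity concrete, here is a minimal framing helper (an illustrative sketch, not from any particular library) that slices a raw PCM byte stream into the 20ms frames the rest of the pipeline consumes. It assumes 16 kHz mono 16-bit audio; adjust the constants for your format.

```python
# 16 kHz mono, 16-bit samples: 16000 * 0.020 * 2 = 640 bytes per 20ms frame.
SAMPLE_RATE = 16000       # Hz, assumed pipeline sample rate
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 640

def frames_20ms(pcm: bytes):
    """Yield complete 20ms frames; a trailing partial frame is held back
    until more audio arrives (the caller keeps the remainder)."""
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[off:off + FRAME_BYTES]
```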
Layer 1: Audio Extraction from WebRTC
The AI pipeline needs raw audio from the WebRTC session. There are three approaches:
Server-Side Audio Tap
The SFU/media server extracts the audio track and forwards it to the AI pipeline over a local connection (WebSocket, gRPC, or shared memory).
Pros: Lowest latency (~5ms). No additional network hop. Access to all participants’ audio.
Cons: Tight coupling between media server and AI pipeline. Requires SFU customization.
Client-Side Audio Fork
The browser captures audio from getUserMedia() and sends it to both the WebRTC peer connection and a separate WebSocket to the AI service.
Pros: Works with any SFU. Simple to implement.
Cons: Double bandwidth from the client. Additional connection to manage. Mobile battery drain.
Recording Bot Participant
A headless browser or bot joins the call as a participant, receives all audio via WebRTC, and feeds it to the AI pipeline.
Pros: No SFU modification needed. Works with any WebRTC service.
Cons: Highest latency (full WebRTC round-trip). Counts as a participant. Scaling requires one bot per session.
Recommendation: Server-side audio tap for production systems. Client-side fork for prototyping. Bot participant for third-party integrations where you can’t modify the SFU.
Layer 2: Voice Activity Detection (VAD)
Before sending audio to STT, you need to know when someone is actually speaking. This sounds trivial. It isn’t.
Why VAD is critical
- Cost: STT services charge per second of audio processed. Sending silence and background noise wastes money.
- Quality: Background noise degrades STT accuracy. Sending only voice segments can cut the word error rate by 15-30%.
- Endpointing: VAD determines when the user has finished speaking, which triggers the LLM.
The Endpointing Dilemma
Simple energy-based VAD (is the audio amplitude above a threshold?) works for basic cases but fails in noisy environments. Model-based detectors, such as the Silero VAD neural network or WebRTC’s built-in GMM-based VAD, are more robust but add 10-20ms of processing latency.
The real challenge is endpointing — deciding when the user has finished their thought. Too aggressive (short silence threshold) and you cut the user off mid-sentence. Too conservative (long silence threshold) and you add hundreds of milliseconds of unnecessary delay.
Production settings that work:
- Speech onset threshold: 250ms of voice activity to trigger “speaking” state
- Silence threshold for endpointing: 400-600ms (tuned per use case)
- Minimum speech duration: 500ms (filters out coughs, “um”, clicks)
- Use semantic endpointing from partial STT results when possible
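These settings can be wired into a small per-frame state machine. The sketch below assumes 20ms frames and a boolean `is_speech` verdict per frame from whatever VAD you use; the threshold defaults mirror the production settings listed above (silence set to 500ms, the middle of the 400-600ms range).

```python
FRAME_MS = 20  # assumed frame duration

class Endpointer:
    """Tracks speaking state from per-frame VAD verdicts and emits
    'start' / 'end' events using onset, silence, and min-speech thresholds."""

    def __init__(self, onset_ms=250, silence_ms=500, min_speech_ms=500):
        self.onset_frames = onset_ms // FRAME_MS
        self.silence_frames = silence_ms // FRAME_MS
        self.min_speech_frames = min_speech_ms // FRAME_MS
        self.speaking = False
        self.voice_run = 0      # consecutive voiced frames
        self.silence_run = 0    # consecutive silent frames while speaking
        self.speech_frames = 0  # voiced frames in the current utterance

    def push(self, is_speech):
        """Feed one 20ms frame; returns 'start', 'end', or None."""
        if is_speech:
            self.voice_run += 1
            self.silence_run = 0
            if self.speaking:
                self.speech_frames += 1
            elif self.voice_run >= self.onset_frames:
                self.speaking = True
                self.speech_frames = self.voice_run
                return "start"
        else:
            self.voice_run = 0
            if self.speaking:
                self.silence_run += 1
                if self.silence_run >= self.silence_frames:
                    self.speaking = False
                    long_enough = self.speech_frames >= self.min_speech_frames
                    self.speech_frames = 0
                    self.silence_run = 0
                    # Short bursts (coughs, clicks) end without an endpoint.
                    return "end" if long_enough else None
        return None
```

An utterance only triggers an endpoint if it cleared both the onset and the minimum-duration bars, so a 300ms cough produces no LLM call.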
Layer 3: Speech-to-Text (STT)
Choosing a Provider
| Provider | Streaming | Latency (P50) | WER (English) | Best For |
|---|---|---|---|---|
| Deepgram Nova-2 | Yes | 100-200ms | 8-10% | Lowest latency |
| Google Chirp | Yes | 200-400ms | 7-9% | Multi-language |
| AssemblyAI Universal-2 | Yes | 150-300ms | 6-8% | Best accuracy |
| Whisper (self-hosted) | No* | 500ms-2s | 5-7% | Privacy/cost at scale |
| Azure Speech | Yes | 200-400ms | 8-10% | Enterprise/compliance |
*Whisper can be made streaming with chunked processing, but it’s not natively streaming.
The Streaming STT Pattern
Audio chunks (20ms each) ──> STT Service
                                 │
                                 ▼
Partial results (every 100-300ms):

  "I want to"
  "I want to book"
  "I want to book a flight"
  "I want to book a flight to"                 ← partial
  "I want to book a flight to San Francisco"   ← final
Key insight: Start LLM processing on the partial result “I want to book a flight to” before the final result arrives. If the final result matches, you’ve saved 200-400ms. If it differs (rare for the beginning of an utterance), discard and restart.
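A sketch of that speculative pattern, where `start_llm` is a hypothetical hook that launches a cancellable streaming inference run. Restarting speculation on every partial is shown for clarity; a real system would throttle to avoid burning tokens.

```python
def _norm(text):
    """Case/whitespace-insensitive compare, ignoring trailing punctuation,
    so 'to San Francisco.' still matches 'to San Francisco'."""
    return " ".join(text.lower().split()).rstrip(".!?,")

class SpeculativeLLM:
    def __init__(self, start_llm):
        self.start_llm = start_llm   # launches a cancellable streaming run
        self.speculated_on = None    # partial text the current run was started for
        self.run = None

    def on_partial(self, text):
        # Restart speculation whenever a new partial arrives.
        if self.run:
            self.run.cancel()
        self.speculated_on = text
        self.run = self.start_llm(text)

    def on_final(self, text):
        """Return an LLM run for the final transcript, reusing the
        speculative run when the final matches what we speculated on."""
        if self.run and _norm(text) == _norm(self.speculated_on):
            run = self.run               # speculation paid off: ~200-400ms saved
        else:
            if self.run:
                self.run.cancel()        # final diverged: discard and restart
            run = self.start_llm(text)
        self.speculated_on = None
        self.run = None
        return run
```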
Layer 4: LLM Inference
Model Selection for Voice AI
Voice AI has different requirements than chat:
- Speed over depth. A response in 200ms matters more than a response that considers 10 more edge cases.
- Short responses. Voice responses should be 1-3 sentences, not paragraphs. Configure max_tokens accordingly (50-150 tokens).
- Conversational tone. System prompts should instruct the model to speak naturally, not write formally.
Streaming Token Generation
Use streaming inference (SSE or WebSocket). The first 5-10 tokens typically arrive within 100-200ms. That’s usually a complete first sentence like “Sure, let me check that for you” — enough to start TTS while the rest generates.
Function Calling for Actions
Voice agents often need to perform actions (check a database, book an appointment, transfer a call). Use the LLM’s function calling/tool use capability, but design for speed:
- Pre-load frequently needed data in the system prompt context
- Use async function execution — start the function call, continue generating a “holding” response (“Let me look that up…”), and inject the function result when ready
- Set strict timeouts on function calls (2 seconds max). If the backend is slow, the voice agent should say so rather than going silent
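The async pattern can be sketched with asyncio. Here `speak` and `call_backend` are hypothetical pipeline hooks (speak enqueues text for TTS; call_backend is the actual tool), and the 2-second cap matches the guideline above.

```python
import asyncio

async def run_tool_call(speak, call_backend, timeout_s=2.0):
    """Speak a holding line immediately, run the backend call concurrently,
    and give up (audibly) if it exceeds the timeout."""
    await speak("Let me look that up...")          # holding response
    try:
        return await asyncio.wait_for(call_backend(), timeout_s)
    except asyncio.TimeoutError:
        # Say so rather than going silent.
        await speak("That's taking longer than expected. One moment.")
        return None
```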
Layer 5: Text-to-Speech (TTS)
The Streaming TTS Pattern
LLM tokens arrive: "Sure," → "I" → "can" → "help" → "with" → "that."
        │
        ▼
TTS processes first sentence as soon as period/comma detected:
        │
        ▼
Audio chunks stream to WebRTC session:
  [audio: "Sure, I can help with that."]   ← plays while LLM still generating
Sentence Boundary Detection
Don’t send individual tokens to TTS. Buffer until you have a sentence or clause boundary (period, comma, question mark, or a natural pause point). Sending fragments produces choppy, unnatural speech.
Buffer strategy:
- Accumulate tokens until a sentence-ending punctuation is detected
- If 15+ tokens accumulate without punctuation, flush at the next comma or space (fallback)
- Never buffer more than 30 tokens — latency matters more than perfect prosody
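The buffer strategy above, as a sketch: the token counts and punctuation rules are the ones listed, while a production version would also handle abbreviations, numbers, and decimal points.

```python
SENTENCE_END = (".", "!", "?")
SOFT_BREAK = (",", ";", ":")

class TTSBuffer:
    """Accumulates streamed LLM tokens and flushes a chunk to TTS at
    sentence boundaries, with clause-break and hard-cap fallbacks."""

    def __init__(self, soft_limit=15, hard_limit=30):
        self.tokens = []
        self.soft_limit = soft_limit   # flush at next clause break past this
        self.hard_limit = hard_limit   # never buffer more than this

    def push(self, token):
        """Add one token; return a text chunk to synthesize, or None."""
        self.tokens.append(token)
        t = token.rstrip()
        if t.endswith(SENTENCE_END):
            return self._flush()                  # sentence boundary
        if len(self.tokens) >= self.soft_limit and t.endswith(SOFT_BREAK):
            return self._flush()                  # fallback: clause break
        if len(self.tokens) >= self.hard_limit:
            return self._flush()                  # hard cap: latency wins
        return None

    def _flush(self):
        chunk = "".join(self.tokens).strip()
        self.tokens = []
        return chunk
```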
Voice Selection
For production voice AI, use neural/cloned voices (ElevenLabs, PlayHT, Cartesia), not concatenative synthesis. The quality difference is immediately apparent to users and affects trust.
Latency consideration: Neural TTS models have higher first-byte latency (100-200ms) compared to older systems (20-50ms). The quality improvement is worth the trade-off for most applications.
Putting It Together: The Event Flow
A complete voice AI interaction, end to end:
T+0ms User stops speaking
T+50ms VAD detects silence, triggers endpoint
T+80ms Final STT transcript received
T+100ms LLM prompt constructed and sent
T+250ms First LLM tokens arrive ("Sure, I can help...")
T+280ms First sentence buffered, sent to TTS
T+380ms First TTS audio chunk generated
T+400ms Audio injected into WebRTC session
T+420ms User hears first word of AI response
↓
(rest streams in over next 1-3 seconds)
Total time-to-first-word: ~420ms. Below the 500ms threshold for natural conversation.
Error Handling in Real-Time
You can’t show an error dialog in a voice conversation. Every failure mode needs a graceful audio response:
- STT fails/times out: “I’m sorry, I didn’t catch that. Could you repeat?”
- LLM fails/times out: “Let me think about that for a moment.” (then retry)
- TTS fails: Fall back to a simpler TTS engine or pre-recorded audio for common phrases
- Complete pipeline failure: “I’m having some technical difficulties. Let me connect you to a human agent.”
Pre-synthesize these fallback responses at startup. When things break, the fallback audio is already in memory and can play instantly.
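Pre-synthesis can be as simple as a startup loop over the fallback lines. In this sketch, `synthesize` stands in for a blocking TTS call that returns raw audio bytes; the event names are illustrative.

```python
FALLBACK_LINES = {
    "stt_failed": "I'm sorry, I didn't catch that. Could you repeat?",
    "llm_failed": "Let me think about that for a moment.",
    "pipeline_down": "I'm having some technical difficulties. "
                     "Let me connect you to a human agent.",
}

def preload_fallbacks(synthesize):
    """Run once at startup; returns {event: audio_bytes} held in memory."""
    return {event: synthesize(text) for event, text in FALLBACK_LINES.items()}

def fallback_audio(cache, event):
    # Last resort: unknown failure modes reuse the generic failure line.
    return cache.get(event, cache["pipeline_down"])
```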
Scaling Considerations
Session Isolation
Each voice AI session needs its own pipeline instance with its own conversation state. Don’t share LLM context across sessions. Use session IDs to route audio to the correct pipeline instance.
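A sketch of that routing, with `PipelineInstance` standing in for the real per-session pipeline object:

```python
class PipelineInstance:
    """One per session; all conversation state lives here and nowhere else."""
    def __init__(self, session_id):
        self.session_id = session_id
        self.history = []   # this session's LLM context only

class SessionRouter:
    def __init__(self):
        self._pipelines = {}

    def route(self, session_id):
        """Get (or lazily create) the pipeline for this session ID."""
        if session_id not in self._pipelines:
            self._pipelines[session_id] = PipelineInstance(session_id)
        return self._pipelines[session_id]

    def close(self, session_id):
        # Drop state when the call ends so contexts never leak across calls.
        self._pipelines.pop(session_id, None)
```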
Horizontal Scaling
- STT and TTS are stateless — scale horizontally behind a load balancer
- LLM inference benefits from GPU sharing — use batched inference or a model serving platform (vLLM, TensorRT-LLM)
- The media server (SFU) and AI pipeline can scale independently
Cost at Scale
At 1000 concurrent voice AI sessions:
- STT: ~$0.006/min x 1000 sessions x 60 min = $360/hour
- LLM: ~$0.002/request x 10 requests/min x 60 min x 1000 = $1,200/hour
- TTS: ~$0.015/1K chars x ~200 chars/response x 10 responses/min x 60 min x 1000 = $1,800/hour
Total: ~$3,360/hour, or roughly $2.4M per month at full utilization. The TTS and LLM components dominate. This is why self-hosting models at scale often makes economic sense despite the infrastructure complexity.
What I’d Build Today
If I were starting a voice AI product from scratch:
- Deepgram for STT (fastest streaming, good accuracy)
- Claude or GPT-4o-mini for LLM (fast, good at conversation)
- Cartesia or ElevenLabs for TTS (low latency, natural voice)
- LiveKit or custom SFU for media transport
- Everything streaming, everything at the edge
The companies winning in voice AI aren’t building better models. They’re building better pipelines.
Related reading: Voice AI Latency covers the psychoacoustic thresholds and why sub-200ms matters. For the WebRTC transport layer, see WebRTC: Revolutionizing Real-Time Communication. For deploying these pipelines, see Cloud Architecture for Real-Time Media.
I write about the hard problems in real-time communication, AI, and cloud infrastructure. If you're working on something in this space, I'd enjoy hearing about it.
Get in touch →