Yash Chudasama

Real-Time Voice AI Pipelines: STT, LLM, and TTS

8 min read

The most interesting AI systems in 2026 aren’t chatbots. They’re voice agents — AI that listens, thinks, and speaks in real time within live communication flows. Building one that actually works in production requires solving problems at the intersection of WebRTC, speech processing, and language models.

This is the architecture guide I wish existed when I started building these systems.

The Reference Architecture

A production real-time AI voice pipeline has five layers:

┌─────────────────────────────────────────────────────┐
│                    User (Browser/App)                │
│  Microphone → WebRTC → [Audio Stream Out]           │
│  Speaker   ← WebRTC ← [Audio Stream In]            │
└─────────────────┬──────────────┬────────────────────┘
                  │              ▲
                  ▼              │
┌─────────────────────────────────────────────────────┐
│                  Media Server (SFU)                  │
│  Extracts audio track → forwards to AI pipeline     │
│  Receives AI audio   → injects into session         │
└─────────────────┬──────────────┬────────────────────┘
                  │              ▲
                  ▼              │
┌─────────────────────────────────────────────────────┐
│               AI Voice Pipeline                      │
│                                                      │
│  Audio In → [VAD] → [STT] → [LLM] → [TTS] → Audio │
│              20ms    stream   stream   stream  Out   │
└─────────────────────────────────────────────────────┘

Each component in the pipeline must stream. Batch processing at any stage adds hundreds of milliseconds that compound into an unusable experience.

Layer 1: Audio Extraction from WebRTC

The AI pipeline needs raw audio from the WebRTC session. There are three approaches:

Server-Side Audio Tap

The SFU/media server extracts the audio track and forwards it to the AI pipeline over a local connection (WebSocket, gRPC, or shared memory).

Pros: Lowest latency (~5ms). No additional network hop. Access to all participants’ audio.
Cons: Tight coupling between media server and AI pipeline. Requires SFU customization.
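
To make the tap concrete, here’s a minimal forwarding sketch in Python, assuming the SFU hands you decoded 20ms PCM frames as an async iterator of bytes; the pipeline endpoint and the forward_audio helper are hypothetical, and how you obtain the frames is SFU-specific:

```python
import websockets  # pip install websockets

# Hypothetical local endpoint the AI pipeline listens on.
PIPELINE_WS = "ws://127.0.0.1:8080/audio"

async def forward_audio(track_frames, session_id: str):
    """Forward raw 20ms PCM frames from the SFU's audio tap to the
    AI pipeline over a local WebSocket connection."""
    async with websockets.connect(f"{PIPELINE_WS}?session={session_id}") as ws:
        async for frame in track_frames:
            await ws.send(frame)  # one binary frame per message, no batching
```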

Client-Side Audio Fork

The browser captures audio from getUserMedia() and sends it to both the WebRTC peer connection and a separate WebSocket to the AI service.

Pros: Works with any SFU. Simple to implement.
Cons: Double bandwidth from the client. Additional connection to manage. Mobile battery drain.

Recording Bot Participant

A headless browser or bot joins the call as a participant, receives all audio via WebRTC, and feeds it to the AI pipeline.

Pros: No SFU modification needed. Works with any WebRTC service.
Cons: Highest latency (full WebRTC round-trip). Counts as a participant. Scaling requires one bot per session.

Recommendation: Server-side audio tap for production systems. Client-side fork for prototyping. Bot participant for third-party integrations where you can’t modify the SFU.

Layer 2: Voice Activity Detection (VAD)

Before sending audio to STT, you need to know when someone is actually speaking. This sounds trivial. It isn’t.

Why VAD is critical

  • Cost: STT services charge per second of audio processed. Sending silence and background noise wastes money.
  • Quality: Background noise degrades STT accuracy. Sending only voice segments reduces word error rate by 15-30%.
  • Endpointing: VAD determines when the user has finished speaking, which triggers the LLM.

The Endpointing Dilemma

Simple energy-based VAD (is the audio amplitude above a threshold?) works for basic cases but fails in noisy environments. WebRTC’s built-in VAD (a lightweight GMM-based classifier) is more robust, and neural models like Silero VAD are more robust still, at the cost of 10-20ms of processing latency.

The real challenge is endpointing — deciding when the user has finished their thought. Too aggressive (short silence threshold) and you cut the user off mid-sentence. Too conservative (long silence threshold) and you add hundreds of milliseconds of unnecessary delay.

Production settings that work (a minimal state machine implementing the timing thresholds follows this list):

  • Speech onset threshold: 250ms of voice activity to trigger “speaking” state
  • Silence threshold for endpointing: 400-600ms (tuned per use case)
  • Minimum speech duration: 500ms (filters out coughs, “um”, clicks)
  • Use semantic endpointing from partial STT results when possible
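
Here’s a sketch of those thresholds as a frame-level state machine, using the py-webrtcvad package. The Endpointer class and exact constants are illustrative; it assumes 16kHz, 16-bit mono PCM arriving in 20ms frames:

```python
import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20
ONSET_FRAMES = 250 // FRAME_MS        # ~250ms of voice -> "speaking"
ENDPOINT_FRAMES = 500 // FRAME_MS     # ~500ms of silence -> endpoint
MIN_SPEECH_FRAMES = 500 // FRAME_MS   # drop bursts shorter than ~500ms

class Endpointer:
    """Feeds 20ms PCM frames to WebRTC VAD and emits 'start'/'end' events."""

    def __init__(self, aggressiveness: int = 2):
        self.vad = webrtcvad.Vad(aggressiveness)  # 0 (lenient) to 3 (strict)
        self.speaking = False
        self.voiced = 0   # voiced frames (consecutive while idle)
        self.silent = 0   # consecutive silent frames while speaking

    def process(self, frame: bytes) -> str | None:
        """Returns 'start', 'end', or None for each 20ms frame."""
        is_speech = self.vad.is_speech(frame, SAMPLE_RATE)
        if not self.speaking:
            self.voiced = self.voiced + 1 if is_speech else 0
            if self.voiced >= ONSET_FRAMES:
                self.speaking, self.silent = True, 0
                return "start"
        else:
            if is_speech:
                self.voiced += 1
                self.silent = 0
            else:
                self.silent += 1
            if self.silent >= ENDPOINT_FRAMES:
                self.speaking = False
                utterance_ok = self.voiced >= MIN_SPEECH_FRAMES
                self.voiced = 0
                return "end" if utterance_ok else None  # None filters coughs
        return None
```

Swapping in Silero VAD only changes the is_speech call; the endpointing logic stays the same.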

Layer 3: Speech-to-Text (STT)

Choosing a Provider

Provider                  Streaming   Latency (P50)    WER (English)    Best For
Deepgram Nova-2           Yes         100-200ms        8-10%            Lowest latency
Google Chirp              Yes         200-400ms        7-9%             Multi-language
AssemblyAI Universal-2    Yes         150-300ms        6-8%             Best accuracy
Whisper (self-hosted)     No*         500ms-2s         5-7%             Privacy/cost at scale
Azure Speech              Yes         200-400ms        8-10%            Enterprise/compliance

*Whisper can be made streaming with chunked processing, but it’s not natively streaming.

The Streaming STT Pattern

Audio chunks (20ms each) ──> STT Service
                              │
                         Partial results (every 100-300ms):
                              │
                         "I want to"
                         "I want to book"
                         "I want to book a flight"
                         "I want to book a flight to" ← partial
                         "I want to book a flight to San Francisco" ← final

Key insight: Start LLM processing on the partial result “I want to book a flight to” before the final result arrives. If the final result matches, you’ve saved 200-400ms. If it differs (rare for the beginning of an utterance), discard and restart.
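
A minimal asyncio sketch of that speculative start; call_llm is a placeholder for your LLM request, and “matches” is treated as strict equality here (a looser prefix check would need a re-prompt strategy for the missing tail):

```python
import asyncio

class SpeculativeTurn:
    """Starts the LLM on a stable partial transcript and keeps the
    result only if the final transcript confirms it."""

    def __init__(self, call_llm):
        self.call_llm = call_llm  # async fn: transcript -> response text
        self.task: asyncio.Task | None = None
        self.partial = ""

    def on_partial(self, text: str) -> None:
        # Speculate once the partial looks stable (here: 4+ words).
        if self.task is None and len(text.split()) >= 4:
            self.partial = text
            self.task = asyncio.create_task(self.call_llm(text))

    async def on_final(self, text: str) -> str:
        task, self.task = self.task, None   # reset for the next turn
        if task and text.strip() == self.partial.strip():
            return await task               # speculation saved 200-400ms
        if task:
            task.cancel()                   # transcript diverged: redo
        return await self.call_llm(text)
```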

Layer 4: LLM Inference

Model Selection for Voice AI

Voice AI has different requirements than chat:

  1. Speed over depth. A response in 200ms matters more than a response that considers 10 more edge cases.
  2. Short responses. Voice responses should be 1-3 sentences, not paragraphs. Configure max_tokens accordingly (50-150 tokens).
  3. Conversational tone. System prompts should instruct the model to speak naturally, not write formally.

Streaming Token Generation

Use streaming inference (SSE or WebSocket). The first 5-10 tokens typically arrive within 100-200ms. That’s usually a complete first sentence like “Sure, let me check that for you” — enough to start TTS while the rest generates.
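
As a concrete sketch, here’s a streaming request using the openai Python SDK’s conventions; the model name and prompts are illustrative, and in production each token would feed the TTS sentence buffer described below rather than stdout:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative; any fast chat model works
    max_tokens=120,        # keep spoken replies to 1-3 sentences
    stream=True,           # tokens arrive as they are generated
    messages=[
        {"role": "system",
         "content": "You are a voice assistant. Answer in one to three "
                    "short, conversational sentences. No lists, no markdown."},
        {"role": "user", "content": "I want to book a flight to San Francisco"},
    ],
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # In production this feeds the TTS sentence buffer, not stdout.
        print(chunk.choices[0].delta.content, end="", flush=True)
```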

Function Calling for Actions

Voice agents often need to perform actions (check a database, book an appointment, transfer a call). Use the LLM’s function calling/tool use capability, but design for speed (a timeout sketch follows this list):

  • Pre-load frequently needed data in the system prompt context
  • Use async function execution — start the function call, continue generating a “holding” response (“Let me look that up…”), and inject the function result when ready
  • Set strict timeouts on function calls (2 seconds max). If the backend is slow, the voice agent should say so rather than going silent
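
Here’s one way to implement the holding-response pattern with asyncio timeouts; tool_coro and speak are placeholders for your function execution and TTS injection, and the 300ms grace period is an assumption:

```python
import asyncio

HOLDING_PHRASE = "Let me look that up..."
GRACE_S = 0.3         # fast tools finish before any filler is spoken
TOOL_TIMEOUT_S = 2.0  # hard ceiling from the list above

async def run_tool(tool_coro, speak):
    """Run a tool call, speaking a holding phrase if it's slow and
    giving up past the hard timeout."""
    task = asyncio.ensure_future(tool_coro)
    try:
        # shield() keeps the tool running even if this short wait times out.
        return await asyncio.wait_for(asyncio.shield(task), timeout=GRACE_S)
    except asyncio.TimeoutError:
        await speak(HOLDING_PHRASE)  # keep the conversation alive
    try:
        return await asyncio.wait_for(task, timeout=TOOL_TIMEOUT_S)
    except asyncio.TimeoutError:
        task.cancel()
        await speak("That's taking longer than expected.")
        return None
```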

Layer 5: Text-to-Speech (TTS)

The Streaming TTS Pattern

LLM tokens arrive:  "Sure," → "I" → "can" → "help" → "with" → "that."
                      │
                      ▼
TTS processes first sentence as soon as period/comma detected:
                      │
                      ▼
Audio chunks stream to WebRTC session:
[audio: "Sure, I can help with that."]  ← plays while LLM still generating

Sentence Boundary Detection

Don’t send individual tokens to TTS. Buffer until you have a sentence or clause boundary (period, comma, question mark, or a natural pause point). Sending fragments produces choppy, unnatural speech.

Buffer strategy (sketched in code after this list):

  • Accumulate tokens until a sentence-ending punctuation is detected
  • If 15+ tokens accumulate without punctuation, flush at the next comma or space (fallback)
  • Never buffer more than 30 tokens — latency matters more than perfect prosody
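
A sketch of that buffering policy; the class name is illustrative, and the hard limit doubles as the “flush at the next space” fallback:

```python
class SentenceBuffer:
    """Buffers streamed LLM tokens and releases TTS-ready chunks at
    natural boundaries, per the 15/30-token strategy above."""

    SOFT_LIMIT = 15   # after this many tokens, flush at the next comma
    HARD_LIMIT = 30   # never hold more than this many tokens

    def __init__(self):
        self.tokens: list[str] = []

    def push(self, token: str) -> str | None:
        """Returns a chunk to synthesize, or None to keep buffering."""
        self.tokens.append(token)
        text = "".join(self.tokens)
        if text.rstrip().endswith((".", "!", "?")):
            return self._flush()  # sentence boundary
        if len(self.tokens) >= self.SOFT_LIMIT and text.rstrip().endswith(","):
            return self._flush()  # clause boundary fallback
        if len(self.tokens) >= self.HARD_LIMIT:
            return self._flush()  # latency beats prosody
        return None

    def _flush(self) -> str:
        chunk, self.tokens = "".join(self.tokens).strip(), []
        return chunk
```

Feed push() from the streaming loop in Layer 4; each non-None result goes straight to the TTS service.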

Voice Selection

For production voice AI, use neural/cloned voices (ElevenLabs, PlayHT, Cartesia), not concatenative synthesis. The quality difference is immediately apparent to users and affects trust.

Latency consideration: Neural TTS models have higher first-byte latency (100-200ms) compared to older systems (20-50ms). The quality improvement is worth the trade-off for most applications.

Putting It Together: The Event Flow

A complete voice AI interaction, end to end:

T+0ms      User stops speaking
T+50ms     VAD detects silence, triggers endpoint
T+80ms     Final STT transcript received
T+100ms    LLM prompt constructed and sent
T+250ms    First LLM tokens arrive ("Sure, I can help...")
T+280ms    First sentence buffered, sent to TTS
T+380ms    First TTS audio chunk generated
T+400ms    Audio injected into WebRTC session
T+420ms    User hears first word of AI response
           ↓
           (rest streams in over next 1-3 seconds)

Total time-to-first-word: ~420ms. Below the 500ms threshold for natural conversation.

Error Handling in Real-Time

You can’t show an error dialog in a voice conversation. Every failure mode needs a graceful audio response:

  • STT fails/times out: “I’m sorry, I didn’t catch that. Could you repeat?”
  • LLM fails/times out: “Let me think about that for a moment.” (then retry)
  • TTS fails: Fall back to a simpler TTS engine or pre-recorded audio for common phrases
  • Complete pipeline failure: “I’m having some technical difficulties. Let me connect you to a human agent.”

Pre-synthesize these fallback responses at startup. When things break, the fallback audio is already in memory and can play instantly.
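
A sketch of that pre-synthesis step; synthesize stands in for whatever TTS client you use (text in, audio bytes out), and the keys are illustrative:

```python
FALLBACK_PHRASES = {
    "stt_failed": "I'm sorry, I didn't catch that. Could you repeat?",
    "llm_failed": "Let me think about that for a moment.",
    "pipeline_failed": "I'm having some technical difficulties. "
                       "Let me connect you to a human agent.",
}

class FallbackAudio:
    """Synthesizes every fallback phrase once at startup so failure
    audio can play instantly from memory."""

    def __init__(self, synthesize):
        self.audio = {k: synthesize(t) for k, t in FALLBACK_PHRASES.items()}

    def get(self, key: str) -> bytes:
        return self.audio[key]
```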

Scaling Considerations

Session Isolation

Each voice AI session needs its own pipeline instance with its own conversation state. Don’t share LLM context across sessions. Use session IDs to route audio to the correct pipeline instance.
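
A minimal routing sketch, where pipeline_factory is a placeholder that builds the five-layer pipeline above with fresh per-session conversation state:

```python
class SessionRouter:
    """Routes incoming audio frames to per-session pipeline instances."""

    def __init__(self, pipeline_factory):
        self.pipeline_factory = pipeline_factory
        self.sessions = {}

    def route(self, session_id: str, frame: bytes) -> None:
        pipeline = self.sessions.get(session_id)
        if pipeline is None:
            # First frame for this session: create isolated state.
            pipeline = self.sessions[session_id] = self.pipeline_factory()
        pipeline.process(frame)

    def close(self, session_id: str) -> None:
        self.sessions.pop(session_id, None)
```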

Horizontal Scaling

  • STT and TTS are stateless — scale horizontally behind a load balancer
  • LLM inference benefits from GPU sharing — use batched inference or a model serving platform (vLLM, TensorRT-LLM)
  • The media server (SFU) and AI pipeline can scale independently

Cost at Scale

At 1000 concurrent voice AI sessions:

  • STT: ~$0.006/min x 1000 = $360/hour
  • LLM: ~$0.002/request x 10 requests/min x 1000 = $1,200/hour
  • TTS: ~$0.015/1K chars x ~200 chars/response x 10/min x 1000 = $1,800/hour

Total: ~$3,360/hour, or roughly $2.4M/month at full utilization. The TTS and LLM components dominate. This is why self-hosting models at scale often makes economic sense despite the infrastructure complexity.

What I’d Build Today

If I were starting a voice AI product from scratch:

  1. Deepgram for STT (fastest streaming, good accuracy)
  2. Claude or GPT-4o-mini for LLM (fast, good at conversation)
  3. Cartesia or ElevenLabs for TTS (low latency, natural voice)
  4. LiveKit or custom SFU for media transport
  5. Everything streaming, everything at the edge

The companies winning in voice AI aren’t building better models. They’re building better pipelines.


Related reading: Voice AI Latency covers the psychoacoustic thresholds and why sub-200ms matters. For the WebRTC transport layer, see WebRTC: Revolutionizing Real-Time Communication. For deploying these pipelines, see Cloud Architecture for Real-Time Media.

Yash Chudasama

Real-Time Communication · AI · Cloud — About me

I write about the hard problems in real-time communication, AI, and cloud infrastructure. If you're working on something in this space, I'd enjoy hearing about it.

Get in touch →