Yash Chudasama

Cloud Architecture for Real-Time Media Workloads


Cloud providers are optimized for request-response workloads. Real-time media — WebRTC, voice AI, live streaming — breaks their assumptions in fundamental ways.

Standard cloud architecture patterns (load balancers, auto-scaling groups, stateless microservices) don’t work for media. Here’s what does, and why the default advice from cloud providers will cost you latency, money, or both.

Why Media Workloads Are Different

A typical web request hits a load balancer, gets routed to any available instance, processes for 50-200ms, and returns. The instance holds no state. If it dies, the next request goes elsewhere.

A media session is the opposite:

  • Long-lived: A video call lasts 15-60 minutes, not 200ms
  • Stateful: The SFU holds media routing state, DTLS sessions, and simulcast layer subscriptions
  • Latency-critical: 50ms extra latency is perceptible; 200ms is unacceptable
  • Bandwidth-heavy: A single 720p stream is 1.5 Mbps sustained, not a 2KB JSON payload
  • UDP-based: WebRTC media runs over UDP, which load balancers handle very differently from TCP

Instance Selection: The Non-Obvious Choices

AWS

What you’d expect: c5.xlarge (compute-optimized, 4 vCPU, 8GB RAM). Reasonable for CPU-bound workloads.

What actually works for media:

c5n.xlarge — The n variant has enhanced networking with up to 25 Gbps bandwidth and lower network jitter. For media servers that forward hundreds of streams, the network interface is the bottleneck, not the CPU. The c5n costs 10-15% more but handles 40-60% more concurrent streams.

c6in.xlarge — Latest generation with even better network performance. Up to 50 Gbps. Worth the premium for high-density SFU deployments.

Avoid: t3 instances (burstable). Media is sustained, not bursty. You’ll burn through CPU credits in minutes and get throttled mid-call.

GCP

What works: c2-standard-4 (compute-optimized) or c2d-standard-4. GCP’s network is naturally lower-latency between regions thanks to its private backbone, which matters for cascaded SFU deployments.

GCP advantage: Sole-tenant nodes let you dedicate physical hosts, eliminating noisy neighbor effects on network throughput. Worth it for predictable media quality at scale.

Azure

What works: Fsv2 series (compute-optimized). Enable Accelerated Networking — it’s off by default, and leaving it off costs you an extra 2-5ms of latency.

Azure advantage: Azure Communication Services is a native WebRTC offering. If your architecture fits their model, it eliminates SFU infrastructure management entirely.

The Load Balancer Problem

Standard cloud load balancers (ALB, NLB, Cloud Load Balancing) are designed for HTTP/TCP. WebRTC media uses UDP. This creates three problems:

Problem 1: No UDP load balancing (ALB)

Application Load Balancers don’t support UDP at all. You cannot put an ALB in front of your SFU fleet.

Problem 2: NLB UDP support is limited

Network Load Balancers support UDP but do simple 5-tuple hashing. They don’t understand WebRTC session affinity. A client that reconnects (ICE restart, network change) might get routed to a different SFU instance, losing its session.
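To see why plain 5-tuple hashing breaks session affinity, here is a toy model of flow-hash routing — the hash function and backend names are illustrative, not any real NLB implementation. When a client switches networks (Wi-Fi to LTE, say) and triggers an ICE restart, its source IP and port change, the tuple changes, and the flow can land on a different backend:

```python
# Toy model of 5-tuple flow hashing: a UDP "flow" is identified by
# (src_ip, src_port, dst_ip, dst_port, proto), and the load balancer
# hashes that tuple to pick a backend.
import hashlib

BACKENDS = ["sfu-1", "sfu-2", "sfu-3"]

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto="udp"):
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return BACKENDS[digest % len(BACKENDS)]

# The original flow:
first = pick_backend("203.0.113.7", 51000, "198.51.100.1", 3478)

# After a network change + ICE restart, the client has a new source
# IP/port, so the tuple — and possibly the chosen backend — changes:
second = pick_backend("198.18.4.9", 40211, "198.51.100.1", 3478)

print(first, second)  # may differ, but the session state lives on `first`
```

The load balancer isn’t wrong — it’s doing exactly what flow hashing promises. The problem is that WebRTC session affinity is a property of the signaling layer, which the 5-tuple can’t see.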

Problem 3: Signaling vs Media split

WebRTC uses two separate connections: signaling (WebSocket over TCP) and media (UDP). The signaling server and media server might be different instances. The load balancer can route signaling, but the client connects to the media server directly.

The Solution: DNS-Based or Application-Level Routing

Skip the load balancer for media. Instead:

  1. Signaling goes through a load balancer (ALB/NLB, standard)
  2. The signaling server assigns a specific media server based on geography, load, and capacity
  3. The client connects directly to the media server’s IP via ICE candidates
  4. Health checks and routing logic live in your signaling layer, not in the cloud load balancer

This is how every production WebRTC platform works. The cloud load balancer is only for signaling — never for media.
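The assignment step (step 2) can be sketched in a few lines. This is a minimal illustration, not a production scheduler — the field names, hostnames, and fallback policy are assumptions:

```python
# Application-level media routing: the signaling layer picks a specific
# SFU per session based on region and load, and hands its address to the
# client for direct connection via ICE candidates.
from dataclasses import dataclass

@dataclass
class MediaServer:
    host: str
    region: str
    sessions: int
    capacity: int
    healthy: bool = True

    @property
    def load(self) -> float:
        return self.sessions / self.capacity

def assign_media_server(fleet, client_region):
    # Prefer healthy, non-full servers in the client's region,
    # least-loaded first; fall back to any healthy server otherwise.
    candidates = [s for s in fleet if s.healthy and s.load < 1.0]
    local = [s for s in candidates if s.region == client_region]
    pool = local or candidates
    if not pool:
        raise RuntimeError("no media capacity available")
    return min(pool, key=lambda s: s.load)

fleet = [
    MediaServer("sfu-nrt-1.example.com", "ap-northeast-1", 80, 100),
    MediaServer("sfu-nrt-2.example.com", "ap-northeast-1", 20, 100),
    MediaServer("sfu-iad-1.example.com", "us-east-1", 10, 100),
]
print(assign_media_server(fleet, "ap-northeast-1").host)  # sfu-nrt-2.example.com
```

Because this logic lives in your signaling service, you can make it as WebRTC-aware as you like — pin reconnecting clients back to their existing SFU, weight by available bandwidth rather than session count, and so on.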

Auto-Scaling: The Trap

Standard auto-scaling (add instances when CPU > 70%, remove when < 30%) will destroy your users’ calls.

The Problem: Scale-Down Kills Sessions

When an auto-scaler removes an instance, every active call on that instance drops. Unlike a failed HTTP request, which the client can transparently retry, a dropped call is a permanently ruined user experience.

The Solution: Drain-Based Scaling

  1. Scale up normally based on capacity metrics (concurrent sessions per instance, not CPU)
  2. Never terminate instances with active sessions
  3. When scaling down, mark instances as draining — stop routing new sessions to them, but let existing sessions complete naturally
  4. Only terminate an instance when its active session count reaches 0

Instance Lifecycle:

  ACTIVE ──(scale-down signal)──> DRAINING ──(0 sessions)──> TERMINATED
  new sessions: yes                new sessions: no

The metric to scale on: Not CPU. Not memory. Concurrent sessions per instance relative to maximum capacity. SFU capacity is bounded by network throughput and CPU, and the ratio depends on your codec/resolution settings.
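The drain lifecycle above is simple enough to model directly. This is an illustrative in-memory sketch — real orchestrators would drive provider APIs and read session counts from the SFUs themselves:

```python
# Drain-based scale-down: instances move ACTIVE -> DRAINING -> TERMINATED,
# and an instance is only removed once its session count reaches zero.
class Instance:
    def __init__(self, name, sessions):
        self.name = name
        self.sessions = sessions
        self.state = "ACTIVE"

    def accepts_new_sessions(self):
        return self.state == "ACTIVE"

def scale_down(instances, excess):
    """Mark the `excess` least-loaded instances as draining."""
    for inst in sorted(instances, key=lambda i: i.sessions)[:excess]:
        inst.state = "DRAINING"  # stop routing new sessions here

def reap(instances):
    """Terminate drained instances only once their sessions hit zero."""
    for inst in instances:
        if inst.state == "DRAINING" and inst.sessions == 0:
            inst.state = "TERMINATED"
    return [i for i in instances if i.state != "TERMINATED"]

fleet = [Instance("sfu-1", 40), Instance("sfu-2", 0), Instance("sfu-3", 12)]
scale_down(fleet, excess=1)     # drains sfu-2 (fewest active sessions)
fleet = reap(fleet)             # sfu-2 has 0 sessions, so it's terminated
print([i.name for i in fleet])  # ['sfu-1', 'sfu-3']
```

Note the asymmetry: scale-up can be as aggressive as you like, but scale-down is gated entirely on session count, so a draining instance with one stubborn 4-hour call simply keeps running until it empties.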

Multi-Region: Where It Gets Expensive

For global real-time communication, you need media servers in every region where your users are. A user in Tokyo connecting to an SFU in Virginia adds ~150ms of network latency — enough to make conversations uncomfortable.

Region Strategy

Tier             Regions                                    Coverage
Minimum Viable   us-east-1, eu-west-1, ap-southeast-1       US, Europe, SEA
Production       + us-west-2, ap-south-1, ap-northeast-1    + West Coast, India, Japan
Global           + sa-east-1, me-south-1, af-south-1        + South America, Middle East, Africa

Cascaded SFU Architecture

When participants span multiple regions, you cascade SFUs:

User A (Tokyo) ──> SFU-Tokyo ──cascade──> SFU-Virginia <── User B (NYC)
                      │                       │
                  Local latency: 10ms     Local latency: 10ms
                  Cascade latency: ~120ms

Each participant connects to their nearest SFU. SFUs exchange streams between each other. The cascade adds latency for cross-region participants but keeps intra-region latency low.

The cost trap: Multi-region media means cross-region data transfer. AWS charges $0.02/GB for cross-region traffic. A 1-hour 720p cascade between two regions moves about 0.675 GB, so it costs roughly $0.013 per participant per direction. At scale, this is significant.
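The arithmetic is worth making explicit, since it scales linearly with bitrate, duration, and participant count. Using the figures from this article (1.5 Mbps sustained, $0.02/GB — verify current pricing for your regions):

```python
# Back-of-envelope cross-region transfer cost for one cascaded stream.
def cascade_cost_usd(mbps=1.5, hours=1.0, usd_per_gb=0.02):
    gigabytes = mbps / 8 * 3600 * hours / 1000  # Mbps -> GB transferred
    return gigabytes * usd_per_gb

cost = cascade_cost_usd()
print(f"${cost:.4f}")  # $0.0135 per participant per direction per hour
```

A cent and a half per participant-hour sounds trivial until you multiply by two directions, several cascaded regions, and millions of participant-hours per month.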

GCP advantage: GCP doesn’t charge for ingress and has lower cross-region transfer costs on their premium tier. For global media platforms, GCP is often 20-30% cheaper than AWS for network costs.

Kubernetes for Media: Proceed with Caution

Kubernetes is excellent for stateless microservices. It’s hostile to stateful, long-lived, UDP-heavy workloads like media servers.

Problems

  1. kube-proxy and iptables add latency to UDP. Use hostNetwork: true or DPDK to bypass.
  2. Pod eviction during node scaling kills active calls. Same drain problem as auto-scaling.
  3. Service mesh (Istio, Linkerd) doesn’t handle UDP well. Exclude media pods from the mesh.
  4. Resource limits interact badly with real-time workloads. A CPU limit that triggers throttling during a packet burst causes audio glitches.

When Kubernetes Works for Media

  • Your signaling, API, and orchestration layers (stateless, TCP) — absolutely use K8s
  • Media server pods with hostNetwork: true, no resource limits, and custom drain logic — works but requires careful tuning
  • GPU-based workloads (MCU transcoding, AI inference) — K8s GPU scheduling is mature
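If you do run media pods on K8s, the settings above translate into a pod spec along these lines. This is a hedged sketch, not a recommended manifest — the image, drain endpoint, grace period, and resource numbers are all placeholders you’d tune for your own SFU:

```yaml
# Illustrative media-server pod: host networking, requests without limits,
# and a long grace period so a preStop drain hook can wait out active calls.
apiVersion: v1
kind: Pod
metadata:
  name: sfu
spec:
  hostNetwork: true                    # bypass kube-proxy/iptables for UDP
  terminationGracePeriodSeconds: 3600  # let long-lived calls drain
  containers:
    - name: sfu
      image: example.com/sfu:latest    # placeholder image
      resources:
        requests:                      # schedule by capacity...
          cpu: "3"
          memory: 6Gi
        # ...but no limits: CPU throttling mid-burst causes audio glitches
      lifecycle:
        preStop:
          exec:
            # placeholder drain hook: stop accepting sessions, then wait
            command: ["/bin/sh", "-c", "curl -s localhost:8080/drain && sleep 3500"]
```

The preStop hook is what implements the drain logic from the auto-scaling section inside Kubernetes’ own termination flow.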

When to Skip Kubernetes

  • Raw SFU instances that need maximum network performance — use bare EC2/GCE with custom orchestration
  • TURN servers — these are simple, long-lived, and benefit from dedicated instances with predictable networking

The Multi-Cloud Reality

Most production media platforms end up multi-cloud, not by choice but by necessity:

  • AWS for core infrastructure (broadest region coverage)
  • GCP for cross-region media relay (cheapest network)
  • Cloudflare/Fastly for TURN (edge presence)
  • Specialized providers (Vultr, Hetzner) for regions where hyperscalers are expensive or absent

The architecture that handles this: cloud-agnostic SFU binaries that run the same on any Linux VM, with a central orchestrator that provisions instances across providers based on demand, cost, and latency.

What I’d Build Today

  1. SFU: Deploy on c6in.xlarge (AWS) or c2d-standard-4 (GCP) with hostNetwork if on K8s
  2. Routing: Application-level, DNS-based. No cloud LB for media.
  3. Scaling: Custom orchestrator with drain logic. Scale on session count, not CPU.
  4. Regions: Start with 3, expand based on user distribution data.
  5. Network: Use GCP’s premium tier for cross-region cascades. Use AWS for everything else.
  6. TURN: Globally distributed, TURN/TLS on 443, sized for 10-15% of traffic.
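The TURN sizing in step 6 reduces to simple arithmetic. A hedged sketch, assuming the 10-15% relay fraction above and the 1.5 Mbps 720p figure from earlier in the article:

```python
# Rough TURN capacity estimate: bandwidth the relays must carry for a
# given number of concurrent streams, if 10-15% of them need relaying.
def turn_bandwidth_mbps(concurrent_streams, relay_fraction=0.15, mbps_per_stream=1.5):
    return concurrent_streams * relay_fraction * mbps_per_stream

print(turn_bandwidth_mbps(10_000))  # relayed Mbps for 10k streams at 15%
```

For 10,000 concurrent 720p streams, that’s on the order of 2.25 Gbps of relayed media — enough to justify dedicating well-connected instances to TURN rather than co-locating it with the SFUs.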

The cloud was built for web. Media is not web. Respect the difference and your users will thank you.


Related reading: SFU vs MCU for the media server architecture decisions. WebRTC Debugging in Production for the network-level issues you’ll encounter. Edge Computing for the edge deployment strategy.
