SFU vs MCU: Choosing the Right Video Architecture
The first architecture decision in any real-time video system is the one that defines everything after it: SFU or MCU? Get it wrong and you’re rebuilding six months later. Get it right and your system scales naturally.
Having built media infrastructure that handles thousands of concurrent sessions, I want to share the decision framework I use — not the textbook version, but the one shaped by production pain.
The Fundamentals
Selective Forwarding Unit (SFU)
An SFU receives media streams from each participant and forwards them to everyone else. No transcoding. No mixing. It’s a smart router for media packets.
Participant A ──video──> ┌─────┐ ──video A──> Participant B
Participant B ──video──> │ SFU │ ──video B──> Participant A
Participant C ──video──> └─────┘ ──video C──> Participant A & B
What the SFU does:
- Receives each participant’s stream once
- Selectively forwards streams to other participants
- Handles Simulcast — participants send multiple quality layers (e.g., 720p, 360p, 180p) and the SFU picks the right one for each receiver based on bandwidth
- Manages subscription logic — who sees whom
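The simulcast selection step reduces to a bitrate fit per receiver. A minimal sketch; the layer bitrates below are illustrative ballpark figures, not a spec:

```typescript
// Illustrative simulcast layers; bitrates are assumed ballpark values.
interface SimulcastLayer {
  resolution: string;
  bitrateKbps: number;
}

const LAYERS: SimulcastLayer[] = [
  { resolution: "720p", bitrateKbps: 1500 },
  { resolution: "360p", bitrateKbps: 600 },
  { resolution: "180p", bitrateKbps: 150 },
];

// Pick the highest-quality layer that fits the receiver's estimated bandwidth.
// Falls back to the lowest layer so every receiver gets something.
function pickLayer(estimatedKbps: number, layers: SimulcastLayer[]): SimulcastLayer {
  const sorted = [...layers].sort((a, b) => b.bitrateKbps - a.bitrateKbps);
  return sorted.find((l) => l.bitrateKbps <= estimatedKbps) ?? sorted[sorted.length - 1];
}
```

A receiver estimating 800 kbps would be switched to the 360p layer; one below 150 kbps still gets 180p rather than nothing.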
What the SFU doesn’t do:
- Decode or re-encode video
- Mix audio
- Create composite layouts
Multipoint Control Unit (MCU)
An MCU receives all streams, decodes them, mixes them into a single composite stream, re-encodes, and sends one stream to each participant.
Participant A ──video──> ┌─────┐
Participant B ──video──> │ MCU │ ──single composite──> All Participants
Participant C ──video──> └─────┘
What the MCU does:
- Decodes every incoming stream
- Mixes audio into a single track
- Composites video into a grid/layout
- Re-encodes the result
- Sends one stream per participant
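Of those steps, audio mixing is the conceptually simple one: sum the decoded samples and clamp. A minimal PCM sketch, assuming 16-bit samples and equal-length, time-aligned frames (a real mixer also resamples and aligns timing):

```typescript
// Mix several decoded 16-bit PCM frames into one track by summing samples.
// Assumes frames are equal-length and already time-aligned.
function mixPcm(frames: number[][]): number[] {
  if (frames.length === 0) return [];
  const out = new Array<number>(frames[0].length).fill(0);
  for (const frame of frames) {
    for (let i = 0; i < out.length; i++) {
      out[i] += frame[i];
    }
  }
  // Clamp to the signed 16-bit range to avoid wrap-around distortion.
  return out.map((s) => Math.max(-32768, Math.min(32767, s)));
}
```

The video compositing and re-encode stages are where the real CPU cost lives; this mixing loop is cheap by comparison.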
The Real Trade-offs
1. Server-Side CPU Cost
This is the trade-off most people underestimate.
SFU: Minimal CPU. It’s forwarding packets, not processing media. A single SFU instance on a c5.xlarge (4 vCPU) can handle 200-400+ concurrent streams depending on bitrate.
MCU: Massive CPU. Decoding + compositing + re-encoding for every room is computationally brutal. The same c5.xlarge might handle 5-10 rooms of 4 participants each. At scale, your compute bill becomes the conversation.
The nuance nobody mentions: MCU cost scales with participants x rooms. SFU cost scales with streams x subscribers, which is worse in theory (O(n^2) per room) but better in practice because you’re not burning CPU cycles on transcoding.
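The "worse in theory, better in practice" point can be made concrete with per-room stream counts. Forwarding a stream is cheap packet I/O, while each MCU unit of work is a full decode or encode; the shared-layout assumption below (one composite encode per room) is mine:

```typescript
// An SFU forwards each of n uploaded streams to the other n - 1 participants.
function sfuForwardedStreams(n: number): number {
  return n * (n - 1); // O(n^2) in room size, but each unit is cheap forwarding
}

// An MCU performs n decodes plus one composite encode (shared layout assumed).
function mcuTranscodes(n: number): number {
  return n + 1; // linear, but each unit is a full transcode
}
```

For a 10-person room that is 90 cheap forwards versus 11 expensive transcodes, which is why the quadratic curve still wins at typical room sizes.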
2. Client-Side Bandwidth
SFU: Each client downloads n-1 streams (one per other participant). In a 10-person call, each client downloads 9 streams. With Simulcast, the SFU sends lower-quality layers to clients on constrained connections, but the total downstream bandwidth grows linearly with participants.
MCU: Each client downloads exactly 1 composite stream, regardless of participant count. A 10-person call and a 100-person call have the same client bandwidth requirement.
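The downstream math is simple enough to state directly; the 500 kbps per-stream figure in the usage note is an assumed average, not a measurement:

```typescript
// Downstream bandwidth per client, in kbps, under each architecture.
// perStreamKbps is an assumed average received bitrate per video stream.
function sfuDownstreamKbps(participants: number, perStreamKbps: number): number {
  return (participants - 1) * perStreamKbps; // one stream per other participant
}

function mcuDownstreamKbps(compositeKbps: number): number {
  return compositeKbps; // one composite stream, independent of room size
}
```

At an assumed 500 kbps per stream, each client in a 10-person SFU call pulls 4.5 Mbps downstream, while an MCU client pays only the composite bitrate no matter how large the room grows.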
When this matters: Mobile networks, emerging markets, IoT endpoints, PSTN gateways. If your users are on 3G in rural India, MCU wins. If they’re on fiber in San Francisco, SFU wins.
3. Latency
SFU: Lower latency. Packets are forwarded, not processed. End-to-end latency is typically 100-300ms depending on network conditions.
MCU: Higher latency. The decode → mix → encode pipeline adds 200-500ms. For conversational applications, this pushes total latency above the 400ms threshold where conversations feel awkward.
The 400ms rule: Research consistently shows that conversational quality degrades sharply above 400ms one-way delay. MCU latency alone can eat half this budget before network latency is even considered.
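To make the budget arithmetic explicit, a sketch with assumed component delays; the figures are taken from the ranges quoted above, not measured:

```typescript
// One-way latency budget check against the ~400ms conversational threshold.
const CONVERSATIONAL_THRESHOLD_MS = 400;

function oneWayLatencyMs(networkMs: number, serverProcessingMs: number): number {
  return networkMs + serverProcessingMs;
}

// Assumed 150ms of network delay in both cases; SFU forwarding adds ~10ms,
// while a mid-range MCU decode/mix/encode pipeline adds ~300ms.
const sfuTotal = oneWayLatencyMs(150, 10);
const mcuTotal = oneWayLatencyMs(150, 300);
```

Under these assumptions the SFU path stays comfortably inside the 400ms budget while the MCU path has already blown it before any client-side buffering is counted.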
4. Flexibility and Layout Control
SFU: The client controls the layout. Each participant’s stream arrives separately, and the client-side application decides how to render them. Participants can pin speakers, resize tiles, or apply custom UI.
MCU: The server controls the layout. Every participant sees the same composite grid. Changing the layout means re-compositing on the server. Custom layouts require server-side development, not just CSS.
5. Recording
SFU: Recording requires a server-side process that subscribes to all streams and composites them — essentially running an MCU just for recording. Or you record individual tracks and composite later (cheaper but delayed).
MCU: Recording is trivial. The composite stream already exists. Tap into it and write to disk.
The Decision Framework
Choose SFU when:
- Participant count per room is < 50
- You need low latency (voice AI, gaming, live communication)
- You want client-side layout flexibility
- You’re building for modern browsers/devices on decent networks
- Cost efficiency at scale matters
- You need Simulcast for adaptive quality
Choose MCU when:
- You need to support low-bandwidth clients (mobile, IoT, PSTN)
- Every participant must see the same layout (broadcasting, webinars)
- You need dead-simple recording
- Participant count per room is very high (100+)
- Clients are resource-constrained (can only decode one stream)
Choose Hybrid when:
- Different room types have different needs (1:1 calls use SFU, large webinars use MCU)
- You need SFU for active speakers and MCU for passive viewers
- Recording needs to happen in real-time but communication must stay low-latency
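The three lists above collapse into a first-pass heuristic. This is a deliberate oversimplification for illustration; the inputs and thresholds mirror the bullets, and a real decision weighs more factors than three booleans:

```typescript
type Architecture = "SFU" | "MCU" | "Hybrid";

interface RoomProfile {
  participants: number;
  lowBandwidthClients: boolean; // mobile, IoT, or PSTN endpoints present
  latencySensitive: boolean;    // conversational or voice-AI use case
}

// First-pass heuristic mirroring the decision lists; not a substitute for load testing.
function chooseArchitecture(room: RoomProfile): Architecture {
  if (room.lowBandwidthClients && room.latencySensitive) return "Hybrid";
  if (room.lowBandwidthClients || room.participants >= 100) return "MCU";
  return "SFU";
}
```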
The Hybrid Pattern: What Production Systems Actually Do
Most production systems at scale don’t choose one or the other. They use both:
- Small rooms (2-20 participants): Pure SFU. Low latency, low cost, full flexibility.
- Large rooms (20-100): SFU with Simulcast + last-n. Only forward streams for the N most recent speakers. Viewers who aren't speaking don't send video.
- Massive rooms (100+): SFU for active participants, MCU or HLS/DASH for passive viewers.
- Recording: Dedicated MCU process that subscribes to the SFU, composites, and writes to storage. Decoupled from the live session.
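The last-n policy in the large-room tier is just a sort-and-slice over speaker activity. A minimal sketch, assuming the SFU's audio-level detection supplies the timestamps:

```typescript
interface ActiveParticipant {
  id: string;
  lastSpokeAtMs: number; // assumed to come from the SFU's audio-level detection
}

// Forward video only for the n most recently active speakers.
function lastN(participants: ActiveParticipant[], n: number): string[] {
  return [...participants]
    .sort((a, b) => b.lastSpokeAtMs - a.lastSpokeAtMs)
    .slice(0, n)
    .map((p) => p.id);
}
```

Production implementations add hysteresis so tiles don't flicker when two speakers trade places around the cutoff.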
Scaling Considerations
SFU Scaling
SFUs scale horizontally by cascading — connecting SFU instances across regions. Participant A connects to SFU-US, Participant B to SFU-EU, and the SFUs exchange streams between each other.
The challenge: cascade latency. Each hop adds 50-100ms. A three-hop cascade can push latency above the conversation threshold.
The solution: Smart routing that minimizes hops, and geographic awareness in room assignment.
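The hop budget is easy to account for. A sketch, with the per-hop cost assumed at the pessimistic end of the 50-100ms range quoted above:

```typescript
// Added one-way latency for an SFU cascade: each inter-SFU hop costs 50-100ms.
function cascadeLatencyMs(hops: number, perHopMs: number): number {
  return hops * perHopMs;
}

// A three-hop cascade at the pessimistic end spends 300ms of the ~400ms
// conversational budget before client access networks are even counted.
const worstCaseThreeHop = cascadeLatencyMs(3, 100);
```

This is why cascade-aware routing treats hop count, not just geographic distance, as the quantity to minimize.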
MCU Scaling
MCUs scale vertically — you need bigger machines. Horizontal scaling requires splitting rooms across instances and handling failover, which is complex because the MCU holds decoded state for every stream.
GPU-accelerated MCUs (using NVENC/NVDEC) can dramatically improve density, but add infrastructure complexity and cost.
Cost Analysis (Real Numbers)
For a 1000-concurrent-user platform with average 4 participants per room (250 rooms):
SFU (AWS):
- 3-5x c5.xlarge instances at ~$0.17/hr = ~$370-620/month
- Bandwidth: ~2-4 TB/month at $0.09/GB = ~$180-360/month
- Total: ~$550-980/month
MCU (AWS):
- 25-50x c5.xlarge instances = ~$3,060-6,120/month
- Bandwidth: ~0.5-1 TB/month (single composite) = ~$45-90/month
- Total: ~$3,100-6,200/month
The SFU is 5-10x cheaper at this scale. The gap widens as you grow.
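These estimates come out of a simple model. The instance counts, the ~$0.17/hr on-demand price, and the $0.09/GB egress rate are the assumptions stated above; a 30-day month is assumed:

```typescript
// Monthly cost model behind the estimates above (30-day month assumed).
const HOURS_PER_MONTH = 24 * 30;
const C5_XLARGE_HOURLY = 0.17; // USD, on-demand, assumed from the text
const EGRESS_PER_GB = 0.09;    // USD, assumed from the text

function monthlyCostUsd(instances: number, egressTb: number): number {
  const compute = instances * C5_XLARGE_HOURLY * HOURS_PER_MONTH;
  const bandwidth = egressTb * 1000 * EGRESS_PER_GB;
  return Math.round(compute + bandwidth);
}

// SFU: 3-5 instances, 2-4 TB egress
const sfuLow = monthlyCostUsd(3, 2);
const sfuHigh = monthlyCostUsd(5, 4);
// MCU: 25-50 instances, 0.5-1 TB egress
const mcuLow = monthlyCostUsd(25, 0.5);
const mcuHigh = monthlyCostUsd(50, 1);
```

Plugging in the assumptions reproduces the ranges above: roughly $547-972/month for the SFU versus $3,105-6,210/month for the MCU.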
What I’d Recommend
For most real-time communication products in 2026: start with SFU, add MCU capabilities surgically.
SFU gives you the lowest latency, lowest cost, and most flexibility. Layer in Simulcast from day one — it’s not optional, it’s fundamental. Add MCU only for specific use cases: recording, broadcasting to low-bandwidth viewers, or PSTN integration.
The mistake I see most teams make: choosing MCU because it feels simpler. It is simpler — until you need to scale. Then it becomes your most expensive line item and the bottleneck you can’t architect around.
Build for the architecture you’ll need at 10x your current scale, not the one that’s easiest to prototype.
Related reading: WebRTC: Revolutionizing Real-Time Communication covers the protocol fundamentals. For production pitfalls, see WebRTC Debugging in Production. For cloud deployment, see Cloud Architecture for Real-Time Media.
I write about the hard problems in real-time communication, AI, and cloud infrastructure. If you're working on something in this space, I'd enjoy hearing about it.
Get in touch →