SFU vs MCU: Choosing the Right Video Architecture
The first architecture decision in any real-time video system is the one that defines everything after it: SFU or MCU? Get it wrong and you’re rebuilding six months later. Get it right and your system scales naturally.
Having built media infrastructure that handles thousands of concurrent sessions, I want to share the decision framework I use — not the textbook version, but the one shaped by production pain.
The Fundamentals
Selective Forwarding Unit (SFU)
An SFU receives media streams from each participant and forwards them to everyone else. No transcoding. No mixing. It’s a smart router for media packets.
Participant A ──video──> ┌─────┐ ──video A──> Participant B
Participant B ──video──> │ SFU │ ──video B──> Participant A
Participant C ──video──> └─────┘ ──video C──> Participant A & B
What the SFU does:
- Receives each participant’s stream once
- Selectively forwards streams to other participants
- Handles Simulcast — participants send multiple quality layers (e.g., 720p, 360p, 180p) and the SFU picks the right one for each receiver based on bandwidth
- Manages subscription logic — who sees whom
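The simulcast selection step reduces to a bitrate fit per receiver. A minimal sketch; the layer bitrates below are illustrative ballpark figures, not a spec:

```typescript
// Illustrative simulcast layers; bitrates are assumed ballpark values.
interface SimulcastLayer {
  resolution: string;
  bitrateKbps: number;
}

const LAYERS: SimulcastLayer[] = [
  { resolution: "720p", bitrateKbps: 1500 },
  { resolution: "360p", bitrateKbps: 600 },
  { resolution: "180p", bitrateKbps: 150 },
];

// Pick the highest-quality layer that fits the receiver's estimated bandwidth.
// Falls back to the lowest layer so every receiver gets something.
function pickLayer(estimatedKbps: number, layers: SimulcastLayer[]): SimulcastLayer {
  const sorted = [...layers].sort((a, b) => b.bitrateKbps - a.bitrateKbps);
  return sorted.find((l) => l.bitrateKbps <= estimatedKbps) ?? sorted[sorted.length - 1];
}
```

A receiver estimating 800 kbps would be switched to the 360p layer; one below 150 kbps still gets 180p rather than nothing.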
What the SFU doesn’t do:
- Decode or re-encode video
- Mix audio
- Create composite layouts
Multipoint Control Unit (MCU)
An MCU receives all streams, decodes them, mixes them into a single composite stream, re-encodes, and sends one stream to each participant.
Participant A ──video──> ┌─────┐
Participant B ──video──> │ MCU │ ──single composite──> All Participants
Participant C ──video──> └─────┘
What the MCU does:
- Decodes every incoming stream
- Mixes audio into a single track
- Composites video into a grid/layout
- Re-encodes the result
- Sends one stream per participant
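Of those steps, audio mixing is the conceptually simple one: sum the decoded samples and clamp. A minimal PCM sketch, assuming 16-bit samples and equal-length, time-aligned frames (a real mixer also resamples and aligns timing):

```typescript
// Mix several decoded 16-bit PCM frames into one track by summing samples.
// Assumes frames are equal-length and already time-aligned.
function mixPcm(frames: number[][]): number[] {
  if (frames.length === 0) return [];
  const out = new Array<number>(frames[0].length).fill(0);
  for (const frame of frames) {
    for (let i = 0; i < out.length; i++) {
      out[i] += frame[i];
    }
  }
  // Clamp to the signed 16-bit range to avoid wrap-around distortion.
  return out.map((s) => Math.max(-32768, Math.min(32767, s)));
}
```

The video compositing and re-encode stages are where the real CPU cost lives; this mixing loop is cheap by comparison.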
The Real Trade-offs
1. Server-Side CPU Cost
This is the trade-off most people underestimate.
SFU: Minimal CPU. It’s forwarding packets, not processing media. A single SFU instance on a c5.xlarge (4 vCPU) can handle 200-400+ concurrent streams depending on bitrate.
MCU: Massive CPU. Decoding + compositing + re-encoding for every room is computationally brutal. The same c5.xlarge might handle 5-10 rooms of 4 participants each. At scale, your compute bill becomes the conversation.
The nuance nobody mentions: MCU cost scales with participants x rooms. SFU cost scales with streams x subscribers, which is worse in theory (O(n^2) per room) but better in practice because you’re not burning CPU cycles on transcoding.
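The "worse in theory, better in practice" point can be made concrete with per-room stream counts. Forwarding a stream is cheap packet I/O, while each MCU unit of work is a full decode or encode; the shared-layout assumption below (one composite encode per room) is mine:

```typescript
// An SFU forwards each of n uploaded streams to the other n - 1 participants.
function sfuForwardedStreams(n: number): number {
  return n * (n - 1); // O(n^2) in room size, but each unit is cheap forwarding
}

// An MCU performs n decodes plus one composite encode (shared layout assumed).
function mcuTranscodes(n: number): number {
  return n + 1; // linear, but each unit is a full transcode
}
```

For a 10-person room that is 90 cheap forwards versus 11 expensive transcodes, which is why the quadratic curve still wins at typical room sizes.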
2. Client-Side Bandwidth
SFU: Each client downloads n-1 streams (one per other participant). In a 10-person call, each client downloads 9 streams. With Simulcast, the SFU sends lower-quality layers to clients on constrained connections, but the total downstream bandwidth grows linearly with participants.
MCU: Each client downloads exactly 1 composite stream, regardless of participant count. A 10-person call and a 100-person call have the same client bandwidth requirement.
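The downstream math is simple enough to state directly; the 500 kbps per-stream figure in the usage note is an assumed average, not a measurement:

```typescript
// Downstream bandwidth per client, in kbps, under each architecture.
// perStreamKbps is an assumed average received bitrate per video stream.
function sfuDownstreamKbps(participants: number, perStreamKbps: number): number {
  return (participants - 1) * perStreamKbps; // one stream per other participant
}

function mcuDownstreamKbps(compositeKbps: number): number {
  return compositeKbps; // one composite stream, independent of room size
}
```

At an assumed 500 kbps per stream, each client in a 10-person SFU call pulls 4.5 Mbps downstream, while an MCU client pays only the composite bitrate no matter how large the room grows.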
When this matters: Mobile networks, emerging markets, IoT endpoints, PSTN gateways. If your users are on 3G in rural India, MCU wins. If they’re on fiber in San Francisco, SFU wins.
3. Latency
SFU: Lower latency. Packets are forwarded, not processed. End-to-end latency is typically 100-300ms depending on network conditions.
MCU: Higher latency. The decode → mix → encode pipeline adds 200-500ms. For conversational applications, this pushes total latency above the 400ms threshold where conversations feel awkward.
The 400ms rule: Research consistently shows that conversational quality degrades sharply above 400ms one-way delay. MCU latency alone can eat half this budget before network latency is even considered.
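To make the budget arithmetic explicit, a sketch with assumed component delays; the figures are taken from the ranges quoted above, not measured:

```typescript
// One-way latency budget check against the ~400ms conversational threshold.
const CONVERSATIONAL_THRESHOLD_MS = 400;

function oneWayLatencyMs(networkMs: number, serverProcessingMs: number): number {
  return networkMs + serverProcessingMs;
}

// Assumed 150ms of network delay in both cases; SFU forwarding adds ~10ms,
// while a mid-range MCU decode/mix/encode pipeline adds ~300ms.
const sfuTotal = oneWayLatencyMs(150, 10);
const mcuTotal = oneWayLatencyMs(150, 300);
```

Under these assumptions the SFU path stays comfortably inside the 400ms budget while the MCU path has already blown it before any client-side buffering is counted.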
4. Flexibility and Layout Control
SFU: The client controls the layout. Each participant’s stream arrives separately, and the client-side application decides how to render them. Participants can pin speakers, resize tiles, or apply custom UI.
MCU: The server controls the layout. Every participant sees the same composite grid. Changing the layout means re-compositing on the server. Custom layouts require server-side development, not just CSS.
5. Recording
SFU: Recording requires a server-side process that subscribes to all streams and composites them — essentially running an MCU just for recording. Or you record individual tracks and composite later (cheaper but delayed).
MCU: Recording is trivial. The composite stream already exists. Tap into it and write to disk.
The Decision Framework
Choose SFU when:
- Participant count per room is < 50
- You need low latency (voice AI, gaming, live communication)
- You want client-side layout flexibility
- You’re building for modern browsers/devices on decent networks
- Cost efficiency at scale matters
- You need Simulcast for adaptive quality
Choose MCU when:
- You need to support low-bandwidth clients (mobile, IoT, PSTN)
- Every participant must see the same layout (broadcasting, webinars)
- You need dead-simple recording
- Participant count per room is very high (100+)
- Clients are resource-constrained (can only decode one stream)
Choose Hybrid when:
- Different room types have different needs (1:1 calls use SFU, large webinars use MCU)
- You need SFU for active speakers and MCU for passive viewers
- Recording needs to happen in real-time but communication must stay low-latency
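The three lists above collapse into a first-pass heuristic. This is a deliberate oversimplification for illustration; the inputs and thresholds mirror the bullets, and a real decision weighs more factors than three booleans:

```typescript
type Architecture = "SFU" | "MCU" | "Hybrid";

interface RoomProfile {
  participants: number;
  lowBandwidthClients: boolean; // mobile, IoT, or PSTN endpoints present
  latencySensitive: boolean;    // conversational or voice-AI use case
}

// First-pass heuristic mirroring the decision lists; not a substitute for load testing.
function chooseArchitecture(room: RoomProfile): Architecture {
  if (room.lowBandwidthClients && room.latencySensitive) return "Hybrid";
  if (room.lowBandwidthClients || room.participants >= 100) return "MCU";
  return "SFU";
}
```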
The Hybrid Pattern: What Production Systems Actually Do
Most production systems at scale don’t choose one or the other. They use both:
- Small rooms (2-20 participants): Pure SFU. Low latency, low cost, full flexibility.
- Large rooms (20-100): SFU with Simulcast + last-n. Only forward streams for the N most recent speakers. Viewers who aren't speaking don't send video.
- Massive rooms (100+): SFU for active participants, MCU or HLS/DASH for passive viewers.
- Recording: Dedicated MCU process that subscribes to the SFU, composites, and writes to storage. Decoupled from the live session.
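The last-n policy in the large-room tier is just a sort-and-slice over speaker activity. A minimal sketch, assuming the SFU's audio-level detection supplies the timestamps:

```typescript
interface ActiveParticipant {
  id: string;
  lastSpokeAtMs: number; // assumed to come from the SFU's audio-level detection
}

// Forward video only for the n most recently active speakers.
function lastN(participants: ActiveParticipant[], n: number): string[] {
  return [...participants]
    .sort((a, b) => b.lastSpokeAtMs - a.lastSpokeAtMs)
    .slice(0, n)
    .map((p) => p.id);
}
```

Production implementations add hysteresis so tiles don't flicker when two speakers trade places around the cutoff.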
Scaling Considerations
SFU Scaling
SFUs scale horizontally by cascading — connecting SFU instances across regions. Participant A connects to SFU-US, Participant B to SFU-EU, and the SFUs exchange streams between each other.
The challenge: cascade latency. Each hop adds 50-100ms. A three-hop cascade can push latency above the conversation threshold.
The solution: Smart routing that minimizes hops, and geographic awareness in room assignment.
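The hop budget is easy to account for. A sketch, with the per-hop cost assumed at the pessimistic end of the 50-100ms range quoted above:

```typescript
// Added one-way latency for an SFU cascade: each inter-SFU hop costs 50-100ms.
function cascadeLatencyMs(hops: number, perHopMs: number): number {
  return hops * perHopMs;
}

// A three-hop cascade at the pessimistic end spends 300ms of the ~400ms
// conversational budget before client access networks are even counted.
const worstCaseThreeHop = cascadeLatencyMs(3, 100);
```

This is why cascade-aware routing treats hop count, not just geographic distance, as the quantity to minimize.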
MCU Scaling
MCUs scale vertically — you need bigger machines. Horizontal scaling requires splitting rooms across instances and handling failover, which is complex because the MCU holds decoded state for every stream.
GPU-accelerated MCUs (using NVENC/NVDEC) can dramatically improve density, but add infrastructure complexity and cost.
Cost Analysis (Real Numbers)
For a 1000-concurrent-user platform with average 4 participants per room (250 rooms):
SFU (AWS):
- 3-5x c5.xlarge instances at ~$0.17/hr = ~$370-620/month
- Bandwidth: ~2-4 TB/month at $0.09/GB = ~$180-360/month
- Total: ~$550-980/month
MCU (AWS):
- 25-50x c5.xlarge instances = ~$3,060-6,120/month
- Bandwidth: ~0.5-1 TB/month (single composite) = ~$45-90/month
- Total: ~$3,100-6,200/month
The SFU is 5-10x cheaper at this scale. The gap widens as you grow.
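These estimates come out of a simple model. The instance counts, the ~$0.17/hr on-demand price, and the $0.09/GB egress rate are the assumptions stated above; a 30-day month is assumed:

```typescript
// Monthly cost model behind the estimates above (30-day month assumed).
const HOURS_PER_MONTH = 24 * 30;
const C5_XLARGE_HOURLY = 0.17; // USD, on-demand, assumed from the text
const EGRESS_PER_GB = 0.09;    // USD, assumed from the text

function monthlyCostUsd(instances: number, egressTb: number): number {
  const compute = instances * C5_XLARGE_HOURLY * HOURS_PER_MONTH;
  const bandwidth = egressTb * 1000 * EGRESS_PER_GB;
  return Math.round(compute + bandwidth);
}

// SFU: 3-5 instances, 2-4 TB egress
const sfuLow = monthlyCostUsd(3, 2);
const sfuHigh = monthlyCostUsd(5, 4);
// MCU: 25-50 instances, 0.5-1 TB egress
const mcuLow = monthlyCostUsd(25, 0.5);
const mcuHigh = monthlyCostUsd(50, 1);
```

Plugging in the assumptions reproduces the ranges above: roughly $547-972/month for the SFU versus $3,105-6,210/month for the MCU.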
What I’d Recommend
For most real-time communication products in 2026: start with SFU, add MCU capabilities surgically.
SFU gives you the lowest latency, lowest cost, and most flexibility. Layer in Simulcast from day one — it’s not optional, it’s fundamental. Add MCU only for specific use cases: recording, broadcasting to low-bandwidth viewers, or PSTN integration.
The mistake I see most teams make: choosing MCU because it feels simpler. It is simpler — until you need to scale. Then it becomes your most expensive line item and the bottleneck you can’t architect around.
Build for the architecture you’ll need at 10x your current scale, not the one that’s easiest to prototype.
Related reading: WebRTC: Revolutionizing Real-Time Communication covers the protocol fundamentals. For production pitfalls, see WebRTC Debugging in Production. For cloud deployment, see Cloud Architecture for Real-Time Media.
I write about the hard problems in real-time communication, AI, and cloud infrastructure. If you're working on something in this space, I'd enjoy hearing about it.
Get in touch →