WebRTC Debugging: Production Problems and Fixes
WebRTC documentation teaches you the happy path. Production teaches you everything else.
After years of building real-time communication infrastructure, I’ve compiled the debugging scenarios that consumed the most engineering hours — the ones that don’t appear in tutorials but appear in your error logs at 3 AM.
1. The Symmetric NAT Problem
What the docs say
“Use STUN to discover your public IP, then use it to establish a peer connection.”
What production teaches you
Roughly 8-15% of users are behind symmetric NATs — typically corporate networks, university campuses, and certain mobile carriers. A symmetric NAT assigns a different external port for each destination, which means STUN-discovered addresses don’t work for peer-to-peer connections.
The symptom: Connection works fine in your office, in testing, on your phone. Fails silently for a subset of users. No error in the console. The ICE state just sits at “checking” until it times out.
The diagnosis: Check RTCPeerConnection.getStats() for ICE candidate pair states. If you see only srflx (server reflexive) candidates failing and no relay candidates, TURN isn’t configured or isn’t reachable.
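If you want that check in code rather than in chrome://webrtc-internals, here’s a minimal diagnostic sketch (the logging choices are mine):

```typescript
// Dump ICE candidate pairs and local candidate types from getStats().
// If no "relay" candidates ever appear, TURN was never reachable.
async function auditIce(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report: any) => {
    if (report.type === "candidate-pair") {
      console.log("pair:", report.state, "nominated:", report.nominated);
    } else if (report.type === "local-candidate") {
      console.log("local candidate:", report.candidateType); // host | srflx | relay
    }
  });
}
```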
The fix: TURN is not optional. It’s the fallback that makes WebRTC work everywhere. Budget for TURN infrastructure from day one.
The cost reality: TURN relays all media through your server, consuming bandwidth. For a 720p video call, that’s ~1.5 Mbps per direction per participant. At scale, TURN bandwidth becomes a significant cost center. The trick: use TURN/TLS on port 443 — it punches through virtually every firewall and proxy, and it looks like HTTPS traffic to network middleboxes.
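Configuration-wise, that looks something like this (hostnames and credentials are placeholders for your own deployment):

```typescript
// TURN/TLS on 443 alongside plain STUN; the "turns:" scheme plus
// transport=tcp is the combination that traverses strict firewalls.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.example.com:3478" },
    {
      urls: "turns:turn.example.com:443?transport=tcp",
      username: "user",      // placeholder — use short-lived credentials
      credential: "secret",  // in production (e.g. a TURN REST API)
    },
  ],
});
```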
2. The One-Way Audio Problem
What happens
User A can hear User B, but User B can’t hear User A. Or video works but audio doesn’t. Or it works for 30 seconds then stops.
Why it’s deceptive
One-way media is almost always a network issue, not a code issue. But it presents as a code bug because “it works sometimes.”
Common causes
Asymmetric NAT/firewall rules: The NAT allows outbound UDP but drops inbound. The ICE connection establishes (because signaling works over TCP/WebSocket), but media packets in one direction get dropped.
Browser tab throttling: Chrome aggressively throttles background tabs. If User A switches to another tab, WebRTC timers fire less frequently, keepalives and ICE consent checks get delayed, and the connection degrades or drops. This is especially brutal on mobile browsers.
SDP munging gone wrong: If you modify the SDP (Session Description Protocol) — which many applications do for codec selection, bandwidth control, or Simulcast — a malformed SDP silently drops media lines. The connection “succeeds” but certain media types are missing.
The debugging toolkit
- chrome://webrtc-internals — Your best friend. Shows ICE candidates, DTLS state, packet counts per direction, codec negotiation, and bandwidth estimates. If inbound packets show 0 while outbound shows activity, it’s a network-level drop.
- getStats() API — Programmatic access to the same data. Log bytesReceived and bytesSent per track every 5 seconds. If one direction flatlines, you’ve confirmed one-way media (a sketch follows this list).
- TURN server logs — If both clients are relaying through TURN, check whether packets arrive at TURN but aren’t forwarded. That indicates a TURN allocation issue, not a client issue.
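A sketch of that getStats() polling (the interval and log format are illustrative):

```typescript
// Poll getStats() every 5 seconds and log bytes per direction.
// A counter that stops increasing across polls means one-way media.
function watchMedia(pc: RTCPeerConnection): number {
  return window.setInterval(async () => {
    const stats = await pc.getStats();
    stats.forEach((report: any) => {
      if (report.type === "inbound-rtp") {
        console.log(report.kind, "bytesReceived:", report.bytesReceived);
      } else if (report.type === "outbound-rtp") {
        console.log(report.kind, "bytesSent:", report.bytesSent);
      }
    });
  }, 5000);
}
```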
3. The Safari Tax
Safari’s WebRTC implementation is perpetually 2 years behind Chrome. Here are the landmines:
VP9 is missing. Safari shipped for years without VP9 support, and older versions still lack it entirely. If your SDP offers VP9 first (which Chrome does by default), Safari will negotiate H.264 instead. If your SFU assumes VP9, it breaks. Always include H.264 as a fallback codec.
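Since the negotiated codec order is controllable from JavaScript, here’s a hedged sketch using setCodecPreferences to prefer VP9 while keeping H.264 in the list (the helper name and preference order are mine):

```typescript
// Reorder video codecs so VP9 is preferred but H.264 remains a fallback.
// Must run before createOffer().
function preferCodecs(pc: RTCPeerConnection, preferred: string[]): void {
  const caps = RTCRtpSender.getCapabilities("video");
  if (!caps) return; // capability query unsupported on this browser
  const rank = (mime: string) => {
    const i = preferred.indexOf(mime);
    return i === -1 ? preferred.length : i;
  };
  const ordered = [...caps.codecs].sort(
    (a, b) => rank(a.mimeType) - rank(b.mimeType)
  );
  for (const t of pc.getTransceivers()) {
    if (t.sender.track?.kind === "video") t.setCodecPreferences(ordered);
  }
}

preferCodecs(pc, ["video/VP9", "video/H264"]);
```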
Unified Plan vs Plan B. Safari adopted Unified Plan later than Chrome. Older Safari versions (pre-15.4) still use Plan B for multi-stream handling. If you’re not handling both, multi-party calls break on older iPads and iPhones.
getDisplayMedia quirks. Screen sharing on Safari has different constraints, different resolution limits, and occasionally drops frames in ways Chrome doesn’t.
The debugging approach: Maintain a Safari-specific test suite. Not because Safari is wrong — because it’s different, and your users don’t care whose fault it is.
4. Bandwidth Estimation Oscillation
WebRTC’s bandwidth estimation (BWE) algorithm — GCC (Google Congestion Control) — is reactive. It probes for available bandwidth, overshoots, gets packet loss, backs off, probes again. This creates oscillation:
Bandwidth
2 Mbps ───┐    ┌────┐    ┌──
          │    │    │    │
1 Mbps    └────┘    └────┘
                            Time →
The user experience: Video quality swings between crisp and blurry every 5-10 seconds. Users describe it as “the video keeps getting pixelated.”
The fix:
- Set realistic bitrate bounds: maxBitrate and minBitrate in the sender’s encoding parameters (sketches of the first two fixes follow this list)
- Use simulcast — let the SFU switch between quality layers based on receiver bandwidth, rather than the sender adjusting encoding on the fly
- Implement server-side bandwidth estimation (REMB or Transport-CC) rather than relying solely on client-side GCC
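Sketches of the first two fixes, assuming a video track in hand. The rids and bitrates are illustrative; note that minBitrate isn’t in the standard setParameters API (Chrome honors it via SDP munging of x-google-min-bitrate), so this caps only the maximum:

```typescript
// Cap a sender's encoding bitrate via setParameters().
async function capBitrate(sender: RTCRtpSender, maxBps: number): Promise<void> {
  const params = sender.getParameters();
  if (!params.encodings || params.encodings.length === 0) {
    params.encodings = [{}];
  }
  params.encodings[0].maxBitrate = maxBps; // e.g. 1_500_000 for ~1.5 Mbps
  await sender.setParameters(params);
}

// Simulcast: offer three quality layers; the SFU picks per receiver.
function addSimulcastVideo(pc: RTCPeerConnection, track: MediaStreamTrack): void {
  pc.addTransceiver(track, {
    direction: "sendonly",
    sendEncodings: [
      { rid: "q", scaleResolutionDownBy: 4, maxBitrate: 150_000 },
      { rid: "h", scaleResolutionDownBy: 2, maxBitrate: 500_000 },
      { rid: "f", maxBitrate: 1_500_000 },
    ],
  });
}
```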
5. The Memory Leak Nobody Finds
WebRTC objects — RTCPeerConnection, MediaStream, MediaStreamTrack — don’t garbage-collect automatically when you’re done with them. If you don’t explicitly call .close() on peer connections and .stop() on tracks, they leak.
The symptom: Your application works fine for 30-minute calls. In a 4-hour call center session, the browser tab consumes 2GB of RAM and eventually crashes.
The debugging approach:
- Open Chrome DevTools, go to the Memory tab, and take a heap snapshot
- Search for RTCPeerConnection instances
- If you see more instances than active connections, you have a leak
- Check that every call to peerConnection.close() is paired with a .stop() on every track (a teardown sketch follows this list)
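A minimal teardown sketch along those lines:

```typescript
// Stop every track on both sides of the connection, then close it.
// Also drop any references you hold (maps, closures) so the GC can
// actually reclaim the objects.
function teardown(pc: RTCPeerConnection): void {
  pc.getSenders().forEach((s) => s.track?.stop());
  pc.getReceivers().forEach((r) => r.track?.stop());
  pc.close();
}
```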
The lesson: WebRTC lifecycle management is your responsibility. The browser won’t clean up after you.
6. ICE Restart vs Reconnection
When a connection drops (user switches from Wi-Fi to cellular, laptop wakes from sleep), you have two options:
ICE Restart: Call peerConnection.restartIce(). This renegotiates the network path without tearing down the DTLS session. Fast (~500ms) but doesn’t work if the underlying network change broke the DTLS association.
Full reconnection: Create a new RTCPeerConnection, redo the offer/answer exchange, re-establish DTLS. Slower (~2-3 seconds) but reliable.
The production pattern: Try ICE restart first. If media doesn’t resume within 3 seconds, fall back to full reconnection. Log which method succeeded — the ratio tells you about your users’ network conditions.
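A sketch of that pattern, where fullReconnect stands in for your own re-signaling routine:

```typescript
// Try restartIce() first; fall back to a full reconnect if the connection
// doesn't recover within 3 seconds. Log which path won.
function recover(pc: RTCPeerConnection, fullReconnect: () => void): void {
  pc.restartIce(); // fires negotiationneeded; send a fresh offer from there
  const fallback = window.setTimeout(() => {
    console.log("recovery: full reconnection");
    fullReconnect();
  }, 3000);
  pc.addEventListener("connectionstatechange", () => {
    if (pc.connectionState === "connected") {
      console.log("recovery: ICE restart succeeded");
      window.clearTimeout(fallback);
    }
  });
}
```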
The Meta-Lesson
WebRTC debugging is fundamentally different from debugging web applications because:
- It’s real-time. You can’t add a breakpoint in the middle of a video call.
- It’s network-dependent. The bug might not reproduce on your network.
- It’s multi-party. The problem might only occur with 3+ participants.
- It’s browser-dependent. Chrome, Firefox, and Safari have different implementations.
The only reliable strategy: instrument everything. Log ICE state changes, track events, connection stats, and error events. Send them to your analytics backend. When a user reports “the call dropped,” you need enough data to reconstruct what happened without asking them to reproduce it.
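A minimal instrumentation sketch — sendToAnalytics is a placeholder for your own reporting pipeline:

```typescript
// Forward every connection-state transition to the analytics backend,
// timestamped, so a dropped call can be reconstructed after the fact.
function instrument(
  pc: RTCPeerConnection,
  sendToAnalytics: (event: object) => void
): void {
  const events = [
    "iceconnectionstatechange",
    "connectionstatechange",
    "icegatheringstatechange",
    "signalingstatechange",
  ];
  for (const name of events) {
    pc.addEventListener(name, () =>
      sendToAnalytics({
        event: name,
        iceState: pc.iceConnectionState,
        connState: pc.connectionState,
        ts: Date.now(),
      })
    );
  }
}
```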
The bugs that cost the most engineering time are never the ones in your code. They’re in the space between your code and the network.
Related reading: SFU vs MCU covers the architecture decisions that prevent many of these issues. For cloud-specific challenges, see Cloud Architecture for Real-Time Media.
I write about the hard problems in real-time communication, AI, and cloud infrastructure. If you're working on something in this space, I'd enjoy hearing about it.