A daily LLM-powered health check flagged that 8 out of 10 cameras had crash counts in the hundreds. The root cause turned out to be two baby monitor cameras, a go2rtc reconnect window, and a vaapi cascade failure — none of which were directly obvious. Here’s how we found it and fixed it.
How the Issue Was Found
I’ve been building a daily home health agent — a scheduled script that queries all home services (Frigate, Home Assistant, Paperless, the arr stack) and passes the data to a local LLM for analysis. The idea: instead of manually checking dashboards, get a morning summary that flags anything unusual.
The Frigate check queries `/api/stats` and looks at per-camera crash counts. One morning, the report came back showing crash counts in the hundreds for 8 of the 10 cameras.
Without the health check, I wouldn’t have noticed — Frigate itself never restarted, the web UI still showed cameras as “online,” and there were no obvious alerts.
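The flagging step of such a check can be sketched in a few lines. This is a minimal illustration, not the actual health agent: the threshold, the camera names, and the counts are invented, and the sketch assumes the per-camera crash counts have already been extracted into a plain dictionary.

```python
from typing import Dict, List

# Hypothetical threshold: cameras above this many ffmpeg crashes get flagged.
CRASH_THRESHOLD = 50


def flag_crashing_cameras(
    crash_counts: Dict[str, int], threshold: int = CRASH_THRESHOLD
) -> List[str]:
    """Return camera names whose crash count exceeds the threshold, worst first."""
    flagged = [name for name, count in crash_counts.items() if count > threshold]
    return sorted(flagged, key=lambda name: crash_counts[name], reverse=True)


# Invented sample data, only to show the shape of the check.
sample = {"front_door": 240, "garage": 0, "nursery": 1900}
print(flag_crashing_cameras(sample))
```

The sorted list is what would be handed to the LLM (or printed directly) so the worst offenders appear first in the morning summary.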
Root Cause: A Crash Cascade
Tracing it back, the chain of events every morning at 9am:
- Cron stops the `nanit` container (baby monitors only needed overnight)
- The nanit RTMP publisher disconnects from go2rtc
- go2rtc holds the nanit RTSP streams alive for ~37 minutes (its internal reconnect window)
- At ~09:37, go2rtc gives up and returns `404 Not Found` for both nanit streams
- Frigate’s ffmpeg crashes on the 404; the watchdog immediately restarts it → tight ~10s crash loop
- Both nanit ffmpeg processes crash-looping simultaneously overwhelm the Intel iGPU vaapi context
- All other cameras, which use the same vaapi device, crash with `Failed to sync surface` errors
- The non-nanit cameras self-recover over 5–6 hours as their ffmpeg processes restart one by one
Frigate itself had 0 restarts. Everything was happening inside the ffmpeg subprocesses. The crash counts were accurate — 2228 crashes is roughly one every ~13 seconds over 8 hours, which matches a 10-second watchdog cycle.
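The crash-rate figure is easy to check. The snippet below only redoes the arithmetic from the numbers quoted above (2228 crashes, an 8-hour window):

```python
crashes = 2228              # crash count quoted in the report
window_seconds = 8 * 3600   # the 8-hour window mentioned above

seconds_per_crash = window_seconds / crashes
print(f"{seconds_per_crash:.1f} s per crash")
```

A 10-second watchdog cycle plus a few seconds for ffmpeg to start and fail lands in the same range.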
The RTMP Architecture
The nanit baby monitors use a third-party container (`ghcr.io/gregory-m/nanit`) that authenticates with the Nanit cloud and runs an RTMP server on host port 1935. go2rtc (embedded in Frigate) pulls from it as a client and relays the streams over RTSP to Frigate’s ffmpeg processes.
The fix needed to ensure go2rtc always had a valid RTMP stream to pull from, even when the nanit container was stopped.
Failed Approaches
go2rtc ffmpeg: fallback source
go2rtc supports multiple sources per stream — primary, then fallback if the primary fails. The plan: add a fallback that generates a black screen via ffmpeg.
Problem 1: Frigate preprocesses `{...}` as env var templates, so a literal `{output}` failed with `Invalid substitution found`. Doubling the braces (`{{output}}`) escapes it.
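The double-brace escape behaves the same way as Python's `str.format`. The sketch below assumes the config preprocessing works like `str.format` (an assumption, not something confirmed here); the `env_vars` mapping and the template strings are hypothetical.

```python
# Hypothetical environment mapping; the real variable set is not shown in this post.
env_vars = {"FRIGATE_RTSP_PASSWORD": "secret"}


def substitute(template: str) -> str:
    """Apply str.format-style substitution as a stand-in for the config preprocessing."""
    return template.format(**env_vars)


# Doubled braces survive substitution as a single literal pair.
print(substitute("output placeholder: {{output}}"))

# A single pair is treated as a variable lookup and fails for unknown names.
try:
    substitute("output placeholder: {output}")
except KeyError as exc:
    print(f"unknown placeholder: {exc}")
```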
Problem 2: go2rtc’s `ffmpeg:` source doesn’t shell-split arguments. The entire string after `ffmpeg:` is passed as a single token. Result: `Error opening input file -re`.
Neither the `{output}` nor the `#video` shorthand variants worked.
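The difference between a single token and shell-split arguments is easy to see with Python's `shlex`. This only illustrates the failure mode described in Problem 2; the argument string is hypothetical, since the original source line is not shown.

```python
import shlex

# Hypothetical argument string, for illustration only.
raw = "-re -f lavfi -i color=c=black:s=640x360:r=10"

as_single_token = ["ffmpeg", raw]               # unsplit: one giant argument
as_split_args = ["ffmpeg", *shlex.split(raw)]   # what a shell would have produced

print(as_single_token)
print(as_split_args)
```

With the unsplit form, ffmpeg receives one argument it cannot interpret as a set of options, which is consistent with it complaining about an input file instead of reading `-re` as a flag.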
exec:ffmpeg with GO2RTC_ALLOW_ARBITRARY_EXEC=true
go2rtc has an exec: source type for running arbitrary commands. With the env var set and correct syntax in config, the exec: line appeared correctly in /dev/shm/go2rtc.yaml — but the ffmpeg process never actually started. No error, no process in ps aux. Cause unclear; possibly a silent startup failure.
Placeholder container on a separate port
I created a `nanit-placeholder` container with two `linuxserver/ffmpeg` instances pushing black frames to go2rtc at a different path. The idea was to use go2rtc’s fallback: primary on port 1935, backup on port 1936.
Problem: go2rtc doesn’t auto-switch back to the primary source once it has latched onto a fallback. It stays on the backup until the session drops. Forcing a switch back requires restarting go2rtc, which interrupts all other cameras for 2–3 seconds. Not acceptable.
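The latching behavior can be captured in a toy model. This is not go2rtc's code, only a sketch of the behavior observed above: a source is chosen when a session starts and is kept until that session drops. The class and source names are hypothetical.

```python
class StickySourcePicker:
    """Toy model: pick the first working source, then stay on it until the session drops."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.current = None

    def pick(self, is_up):
        # A source is only chosen when no session is active.
        if self.current is None:
            self.current = next((s for s in self.sources if is_up(s)), None)
        return self.current

    def drop_session(self):
        self.current = None


up = {"primary": False, "backup": True}
picker = StickySourcePicker(["primary", "backup"])

first = picker.pick(up.get)    # primary is down, so the backup is chosen
up["primary"] = True
second = picker.pick(up.get)   # primary is back, but the session stays on backup
picker.drop_session()
third = picker.pick(up.get)    # only a fresh session returns to primary
print(first, second, third)
```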
Placeholder container pushing to go2rtc RTMP
The fallback container tried to push to rtmp://frigate:1935/... over the frigate_default Docker bridge network. Got Connection refused.
go2rtc’s RTMP port is only reachable from network_mode: host containers (like the real nanit container), not from bridge network containers. The mechanism: go2rtc’s RTMP port is bound to the host interface, not the Docker bridge.
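A quick way to confirm this kind of reachability problem is a plain TCP probe run from inside each container. The helper below is a generic sketch using only the standard library; the function name is made up for this example.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `port_is_reachable("frigate", 1935)` from a bridge-network container and `port_is_reachable("127.0.0.1", 1935)` from a host-network container would show whether the difference is in the network path.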
The Fix: Same-Port Swap via mediamtx
The key insight: if the placeholder takes over the same port (1935), go2rtc never notices the swap. No fallback logic needed, no reconnect triggers, no disruption to other cameras.
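The same-port idea can be demonstrated with plain TCP sockets. This is a toy sketch, not the real setup: two local listeners stand in for the nanit container and the placeholder, and a client that always dials the same address stands in for go2rtc's single-source stream.

```python
import socket


def listen_on(port: int) -> socket.socket:
    """Open a TCP listener on localhost at the given port (0 picks a free one)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    return srv


def can_connect(port: int) -> bool:
    """Client side: always try the same address, whoever is behind it."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=1.0):
            return True
    except OSError:
        return False


def swap_demo() -> list:
    real = listen_on(0)                  # stands in for the real publisher
    port = real.getsockname()[1]
    results = [can_connect(port)]        # real publisher is up
    real.close()
    results.append(can_connect(port))    # brief gap: nothing is listening
    placeholder = listen_on(port)        # stand-in takes over the same port
    results.append(can_connect(port))    # reachable again at the same address
    placeholder.close()
    return results


print(swap_demo())
```

The client never changes its target, so nothing on its side needs fallback logic; it only sees a short outage between the two listeners.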
The placeholder uses mediamtx (`bluenviron/mediamtx:latest-ffmpeg`) with ffmpeg generating black H264 frames. It is defined by two files, `docker-compose.yml` and `mediamtx.yml`.
A few ffmpeg parameters that matter:
| Flag | Reason |
|---|---|
| `-profile:v baseline -level 3.0` | Maximum H264 compatibility; prevents vaapi decode errors |
| `-g 1` | Keyframe every frame; ensures clean stream start, no “Invalid data” at RTSP relay |
| no `-tune stillimage` | This tune produces non-standard H264 structure that breaks RTSP relay after ~40 seconds |
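Put together, the flags in the table fit into an ffmpeg invocation along these lines. This is a sketch under assumptions rather than the original command: the resolution, frame rate, pixel format, and publish URL are hypothetical, and only the flags from the table come from the setup described here.

```python
from typing import List


def black_frame_command(
    publish_url: str, size: str = "640x360", fps: int = 10
) -> List[str]:
    """Build an ffmpeg argv that publishes black H264 frames to publish_url."""
    return [
        "ffmpeg",
        "-re",                                     # read input at its native frame rate
        "-f", "lavfi",
        "-i", f"color=c=black:s={size}:r={fps}",   # synthetic black video source
        "-c:v", "libx264",
        "-profile:v", "baseline", "-level", "3.0", # broad decoder compatibility
        "-g", "1",                                 # every frame is a keyframe
        "-pix_fmt", "yuv420p",
        "-f", "flv",                               # RTMP carries FLV
        publish_url,                               # note: no -tune stillimage anywhere
    ]


# Hypothetical stream path, for illustration only.
print(" ".join(black_frame_command("rtmp://localhost:1935/live/example")))
```

Building the command as a list keeps each argument a separate token, which avoids the quoting problems that come with assembling one long string.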
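Cron on yang@debian.lan performs the swap, stopping one container and starting the other so that port 1935 always has a publisher.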
There’s a ~2–3 second gap during the swap while go2rtc reconnects. Not a problem — Frigate recovers immediately with no crash loop because it’s only a momentary disconnect, not a sustained 404.
Result
After the fix, crash counts dropped to zero across all cameras. The non-nanit cameras that had been spending hours recovering from vaapi cascade failures are now stable all day.
The go2rtc config stayed minimal: a single source per stream, no fallback.
Takeaway
The health check made this visible. Without per-camera crash counts in the daily summary, this would have continued silently — Frigate showing green while the camera ffmpeg processes churned through thousands of crash/restart cycles every morning, and the non-nanit cameras spending half the day recovering.
The fix itself is straightforward once the root cause is clear. The hard part was getting there.