
Fixing a Camera Crash Cascade: How an LLM Health Check Found a Hidden Frigate Bug

Author
Yang Hu

A daily LLM-powered health check flagged that 8 out of 10 cameras had crash counts in the hundreds or thousands. The root cause turned out to be a combination of two baby monitor cameras, a go2rtc reconnect window, and a vaapi cascade failure, none of which was obvious on its own. Here’s how I found it and fixed it.

How the Issue Was Found

I’ve been building a daily home health agent — a scheduled script that queries all home services (Frigate, Home Assistant, Paperless, the arr stack) and passes the data to a local LLM for analysis. The idea: instead of manually checking dashboards, get a morning summary that flags anything unusual.

The Frigate check queries /api/stats and looks at per-camera crash counts. One morning, the report came back with this:

nanit_cam1:     2228 crashes today
nanit_cam2:     2228 crashes today
backyard:        847 crashes today
front_door:      391 crashes today
side_a:          203 crashes today
...

Without the health check, I wouldn’t have noticed — Frigate itself never restarted, the web UI still showed cameras as “online,” and there were no obvious alerts.
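A minimal sketch of the flagging step, assuming the per-camera crash counts have already been extracted into a plain dictionary. The post doesn't show how the counts are read out of the /api/stats response, so that part is left out; the function name `flag_crashing_cameras` and the threshold of 100 are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical threshold: any camera above this many crashes per day is reported.
CRASH_THRESHOLD = 100


def flag_crashing_cameras(
    crash_counts: Dict[str, int], threshold: int = CRASH_THRESHOLD
) -> List[Tuple[str, int]]:
    """Return (camera, count) pairs above the threshold, worst first."""
    flagged = [(name, n) for name, n in crash_counts.items() if n > threshold]
    # Sort by count descending, then by name so ties come out in a stable order.
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Counts taken from the morning report; "garage" is a made-up healthy camera.
    counts = {
        "nanit_cam1": 2228,
        "nanit_cam2": 2228,
        "backyard": 847,
        "front_door": 391,
        "side_a": 203,
        "garage": 0,
    }
    for name, n in flag_crashing_cameras(counts):
        print(f"{name}: {n} crashes today")
```

The flagged list is what would be handed to the LLM alongside the other service data, so the summary can call out only the cameras that need attention.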

Root Cause: A Crash Cascade

Tracing it back, this is the chain of events that started every morning at 9am:

  1. Cron stops the nanit container (baby monitors only needed overnight)
  2. The nanit RTMP publisher disconnects from go2rtc
  3. go2rtc keeps the nanit RTSP streams alive for ~37 minutes (its internal reconnect window)
  4. At ~09:37, go2rtc gives up and returns 404 Not Found for both nanit streams
  5. Frigate’s ffmpeg crashes on the 404; the watchdog immediately restarts it → tight ~10s crash loop
  6. Both nanit ffmpeg processes crash-looping simultaneously overwhelm the Intel iGPU vaapi context
  7. All other cameras — which use the same vaapi device — crash with Failed to sync surface errors
  8. The non-nanit cameras self-recover over 5–6 hours as their ffmpeg processes restart one by one

Frigate itself had 0 restarts. Everything was happening inside the ffmpeg subprocesses. The crash counts were plausible: 2228 crashes over 8 hours is about one every 13 seconds, which is consistent with a watchdog cycle of roughly 10 seconds.
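The arithmetic behind that sanity check, as a small sketch. The helper name is hypothetical; the inputs are the crash count and the 8-hour window quoted above.

```python
def mean_seconds_between_crashes(crash_count: int, window_hours: float) -> float:
    """Average spacing between crashes over the given window, in seconds."""
    if crash_count <= 0:
        raise ValueError("crash_count must be positive")
    return window_hours * 3600 / crash_count


if __name__ == "__main__":
    interval = mean_seconds_between_crashes(2228, 8)
    print(f"{interval:.1f} s between crashes")  # prints "12.9 s between crashes"
```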

The RTMP Architecture

The nanit baby monitors use a third-party container (ghcr.io/gregory-m/nanit) that authenticates with the Nanit cloud and runs an RTMP server on host port 1935. go2rtc (embedded in Frigate) pulls from it as a client:

Nanit camera hardware
      │  (proprietary protocol → cloud auth)
      ▼
 nanit container      ← RTMP server on :1935
      │  RTMP
      ▼
   go2rtc             ← pulls rtmp://10.0.10.11:1935/local/<uid>
      │  RTSP
      ▼
Frigate ffmpeg

The fix needed to ensure go2rtc always had a valid RTMP stream to pull from, even when the nanit container was stopped.

Failed Approaches

go2rtc ffmpeg: fallback source

go2rtc supports multiple sources per stream — primary, then fallback if the primary fails. The plan: add a fallback that generates a black screen via ffmpeg.

nanit_cam1:
  - rtmp://10.0.10.11:1935/local/<stream-uid-1>
  - "ffmpeg:-re -f lavfi -i color=black:size=640x480:rate=1 -c:v libx264 -preset ultrafast {output}"

Problem 1: Frigate preprocesses {...} as env var templates, so {output} fails with Invalid substitution found. Fixed with {{output}} (double braces escape the substitution).
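The double-brace escape is the same convention Python's `str.format` uses. The sketch below illustrates it with plain `str.format`. That Frigate's preprocessing behaves the same way is an assumption based on the error and the fix described here, and `substitute` and the `FRIGATE_RTSP_PASSWORD` entry are stand-ins for illustration.

```python
from typing import Dict


def substitute(template: str, env: Dict[str, str]) -> str:
    """Fill {NAME} placeholders from env; {{ and }} produce literal braces."""
    return template.format(**env)


if __name__ == "__main__":
    # Hypothetical environment mapping used for substitution.
    env = {"FRIGATE_RTSP_PASSWORD": "secret"}

    try:
        # {output} is treated as a placeholder, but no such variable exists.
        substitute("ffmpeg:... {output}", env)
    except KeyError as exc:
        print(f"unknown placeholder: {exc}")

    # Doubled braces survive substitution as a literal {output}.
    print(substitute("ffmpeg:... {{output}}", env))
```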

Problem 2: go2rtc’s ffmpeg: source doesn’t shell-split arguments. The entire string after ffmpeg: is passed as a single token. Result: Error opening input file -re.

Neither the {output} form nor the #video shorthand variants worked.
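To make the difference in Problem 2 concrete, here is a small illustration of one unsplit token versus shell-style splitting, using Python's `shlex`. It shows the general distinction only and is not a reproduction of go2rtc's internals.

```python
import shlex

# The flags from the attempted fallback source, minus the output placeholder.
command = "-re -f lavfi -i color=black:size=640x480:rate=1 -c:v libx264 -preset ultrafast"

# Passed through unsplit: the program receives one long argument.
as_single_token = [command]

# Shell-style splitting: each flag and value becomes its own argument.
as_shell_split = shlex.split(command)

if __name__ == "__main__":
    print(len(as_single_token))  # 1
    print(len(as_shell_split))   # 9
    print(as_shell_split[:4])    # ['-re', '-f', 'lavfi', '-i']
```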

exec:ffmpeg with GO2RTC_ALLOW_ARBITRARY_EXEC=true

go2rtc has an exec: source type for running arbitrary commands. With the env var set and correct syntax in config, the exec: line appeared correctly in /dev/shm/go2rtc.yaml — but the ffmpeg process never actually started. No error, no process in ps aux. Cause unclear; possibly a silent startup failure.

Placeholder container on a separate port

I created a nanit-placeholder container with two linuxserver/ffmpeg instances pushing black frames to go2rtc at a different path. The plan was to use go2rtc’s fallback: primary on port 1935, backup on port 1936.

Problem: go2rtc doesn’t auto-switch back to the primary source once it has latched onto a fallback. It stays on the backup until the session drops. Forcing a switch back requires restarting go2rtc, which interrupts all other cameras for 2–3 seconds. Not acceptable.

Placeholder container pushing to go2rtc RTMP

The fallback container tried to push to rtmp://frigate:1935/... over the frigate_default Docker bridge network. Got Connection refused.

go2rtc’s RTMP port is only reachable from network_mode: host containers (like the real nanit container), not from containers on a bridge network. The reason is that the port is bound to the host interface, not the Docker bridge.
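A quick way to confirm this kind of reachability problem is a plain TCP connect test run from inside the container in question. This is a minimal sketch; `can_connect` is a hypothetical helper, and it checks only that the port accepts connections, not that RTMP works.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `can_connect("frigate", 1935)` from a bridge-network container and `can_connect("10.0.10.11", 1935)` from a host-network container would show whether the two see the port differently.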

The Fix: Same-Port Swap via mediamtx

The key insight: if the placeholder takes over the same port (1935), go2rtc never notices the swap. No fallback logic needed, no reconnect triggers, no disruption to other cameras.

nanit container         ← 19:00–09:00, RTMP server on host :1935
      or
nanit-placeholder       ← 09:00–19:00, mediamtx RTMP server on host :1935
      │
      ▼
   go2rtc               ← always pulls rtmp://10.0.10.11:1935/local/<uid>
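The schedule in the diagram reduces to a single time check. A minimal sketch, assuming the 09:00 and 19:00 boundaries shown above; `active_rtmp_server` is a hypothetical helper, not part of the actual setup, which relies on cron alone.

```python
from datetime import time


def active_rtmp_server(now: time) -> str:
    """Return which container should own host port 1935 at the given time."""
    day_start, day_end = time(9, 0), time(19, 0)
    if day_start <= now < day_end:
        # Daytime: the real monitors are off, so the placeholder serves black frames.
        return "nanit-placeholder"
    # Evening and overnight: the real nanit container serves the live streams.
    return "nanit"


if __name__ == "__main__":
    for t in (time(8, 59), time(9, 0), time(18, 59), time(19, 0)):
        print(t.strftime("%H:%M"), active_rtmp_server(t))
```

A check like this could be added to the daily health report to verify that the expected container is the one running.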

The placeholder uses mediamtx (bluenviron/mediamtx:latest-ffmpeg) with ffmpeg generating black H264 frames:

docker-compose.yml:

services:
  nanit-placeholder:
    image: bluenviron/mediamtx:latest-ffmpeg
    container_name: nanit-placeholder
    restart: "no"
    network_mode: host
    volumes:
      - ./mediamtx.yml:/mediamtx.yml

mediamtx.yml:

rtsp: no
rtmp: yes
rtmpAddress: :1935

paths:
  "local/<stream-uid-1>":
    runOnInit: ffmpeg -re -f lavfi -i color=c=black:s=640x480:r=1
      -c:v libx264 -preset ultrafast -profile:v baseline -level 3.0
      -g 1 -pix_fmt yuv420p -f flv rtmp://localhost:1935/local/<stream-uid-1>
    runOnInitRestart: yes

  "local/<stream-uid-2>":
    runOnInit: ffmpeg -re -f lavfi -i color=c=black:s=640x480:r=1
      -c:v libx264 -preset ultrafast -profile:v baseline -level 3.0
      -g 1 -pix_fmt yuv420p -f flv rtmp://localhost:1935/local/<stream-uid-2>
    runOnInitRestart: yes

A few ffmpeg parameters that matter:

| Flag | Reason |
| --- | --- |
| -profile:v baseline -level 3.0 | Maximum H264 compatibility; prevents vaapi decode errors |
| -g 1 | Keyframe every frame; ensures clean stream start, no “Invalid data” at RTSP relay |
| no -tune stillimage | This tune produces non-standard H264 structure that breaks RTSP relay after ~40 seconds |
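The two path entries in mediamtx.yml differ only in the stream UID, so the command can be built from one template. The sketch below assembles the same ffmpeg arguments as a list. `black_frame_ffmpeg_args` is a hypothetical helper, and the flags are copied from the config above, not independently validated.

```python
from typing import List


def black_frame_ffmpeg_args(
    stream_uid: str, host: str = "localhost", port: int = 1935
) -> List[str]:
    """Build the ffmpeg argv that publishes 1 fps black frames over RTMP."""
    return [
        "ffmpeg", "-re",
        "-f", "lavfi", "-i", "color=c=black:s=640x480:r=1",
        "-c:v", "libx264", "-preset", "ultrafast",
        "-profile:v", "baseline", "-level", "3.0",  # broad decoder compatibility
        "-g", "1",                                  # every frame is a keyframe
        "-pix_fmt", "yuv420p",
        # Deliberately no "-tune stillimage" (see the table above).
        "-f", "flv", f"rtmp://{host}:{port}/local/{stream_uid}",
    ]


if __name__ == "__main__":
    for uid in ("<stream-uid-1>", "<stream-uid-2>"):
        print(" ".join(black_frame_ffmpeg_args(uid)))
```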

Cron on yang@debian.lan:

# Each crontab entry must stay on a single line; cron does not support backslash continuation.
0  9 * * *  docker compose -f /home/yang/docker/nanit/docker-compose.yml stop && docker compose -f /home/yang/docker/nanit-placeholder/docker-compose.yml up -d

0 19 * * *  docker compose -f /home/yang/docker/nanit-placeholder/docker-compose.yml down && docker compose -f /home/yang/docker/nanit/docker-compose.yml up -d

There’s a ~2–3 second gap during the swap while go2rtc reconnects. Not a problem — Frigate recovers immediately with no crash loop because it’s only a momentary disconnect, not a sustained 404.
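The cron entries chain the two commands with &&, so the second container is only started if the first one was stopped cleanly. The same sequencing can be sketched in Python, which makes it easier to add logging or a health check later. The compose file paths come from the crontab above; the function names and the injectable `run` parameter are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

NANIT = "/home/yang/docker/nanit/docker-compose.yml"
PLACEHOLDER = "/home/yang/docker/nanit-placeholder/docker-compose.yml"

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def swap(stop_cmd: Sequence[str], start_cmd: Sequence[str], run: Runner = _run) -> bool:
    """Free port 1935 first, then start the other server; stop on first failure."""
    if run(stop_cmd) != 0:
        return False
    return run(start_cmd) == 0


def morning_swap(run: Runner = _run) -> bool:
    """09:00: stop the real nanit container, start the placeholder."""
    return swap(
        ["docker", "compose", "-f", NANIT, "stop"],
        ["docker", "compose", "-f", PLACEHOLDER, "up", "-d"],
        run,
    )


def evening_swap(run: Runner = _run) -> bool:
    """19:00: take the placeholder down, start the real nanit container."""
    return swap(
        ["docker", "compose", "-f", PLACEHOLDER, "down"],
        ["docker", "compose", "-f", NANIT, "up", "-d"],
        run,
    )
```

Passing a fake `run` function makes the ordering testable without touching Docker.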

Result

After the fix, crash counts dropped to zero across all cameras. The non-nanit cameras that had been spending hours recovering from vaapi cascade failures are now stable all day.

The go2rtc config stayed minimal — single source per stream, no fallback:

go2rtc:
  streams:
    nanit_cam1:
      - rtmp://10.0.10.11:1935/local/<stream-uid-1>
    nanit_cam2:
      - rtmp://10.0.10.11:1935/local/<stream-uid-2>

Takeaway

The health check made this visible. Without per-camera crash counts in the daily summary, this would have continued silently — Frigate showing green while the camera ffmpeg processes churned through thousands of crash/restart cycles every morning, and the non-nanit cameras spending half the day recovering.

The fix itself is straightforward once the root cause is clear. The hard part was getting there.