automate handling of hanging/dead/stuck loaders
When they get OOM-killed because the cgroup runs out of memory, the git loaders have a tendency to not come back.
Unfortunately, there are several different symptoms when these workers stop processing tasks.
- some workers still respond to `celery inspect active`, but there's a task that doesn't go away
- some workers stop responding completely, even to `celery ping`s.
I'm not sure whether the workers still answer pings when one of their processes is stuck, but I'll probably start there, because pinging is the most generic way we can monitor these processes externally.
We could enable the systemd watchdog on the processes, with a timeout of n minutes, and have the ping job run every n-1 minutes to reset the watchdog clock.
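A minimal sketch of the watchdog reset, in Python. It assumes the unit sets `WatchdogSec=` and that the check runs somewhere systemd accepts notifications from (the main process by default, or any process in the service with `NotifyAccess=all`); a fully external job would not inherit `NOTIFY_SOCKET`. The names `notify_watchdog`, `ping_and_notify` and the `check` callable are hypothetical; `check` stands in for whatever ping we end up using.

```python
import os
import socket


def notify_watchdog(message: bytes = b"WATCHDOG=1") -> bool:
    """Send a message to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is not set, i.e. when we are not
    running under a unit that expects notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True


def ping_and_notify(check) -> bool:
    """Reset the watchdog clock only if the health check succeeds.

    `check` is a hypothetical callable returning True when the worker
    answered the ping. If it fails we send nothing, so the watchdog
    timeout expires and systemd handles the service as failed.
    """
    if not check():
        return False
    return notify_watchdog()
```

With a timeout of n minutes, this would be called every n-1 minutes, so that a single failed check is enough to let the watchdog expire.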
Another possibility would be to set systemd's OOM policy to kill the whole service, instead of just the single process, when one of its processes gets OOM-killed. That way the service will be able to autorestart properly.
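As a sketch, the OOM policy option could be a drop-in like the one below. It assumes a systemd version that supports `OOMPolicy=`; the drop-in path and unit name are placeholders, not the actual names of our units.

```ini
# Hypothetical drop-in: /etc/systemd/system/<loader unit>.service.d/oom.conf
[Service]
# Stop the whole unit when the kernel OOM killer kills any of its processes.
OOMPolicy=kill
# An OOM kill counts as a failure, so this brings the service back.
Restart=on-failure
```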
Migrated from T2335 (view on Phabricator)