When they get OOM-killed because the cgroup runs out of memory, the git loaders have a tendency not to come back.
Unfortunately, there are several symptoms when these workers stop processing tasks:

- some workers still respond to `celery inspect active`, but have a task that never goes away;
- some workers stop responding completely, even to celery pings.
I'm not sure whether the workers still answer pings when one of the processes is stuck, but I'll probably start that way, because that's the most generic way we can monitor these processes externally.
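For reference, such an external probe could look roughly like the snippet below (a sketch: the node name is just an example, and the app module is the same one the inspection snippet further down in this thread imports).

```sh
# Minimal external liveness check (sketch): ping one git loader node through the
# celery remote control; celery exits non-zero when no node replies in time.
# The node name here is an example, not necessarily a real one.
celery -A swh.scheduler.celery_backend.config inspect ping \
    -d celery@loader_git.worker01 --timeout 5
```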
We could enable the systemd watchdog on the processes, with a timeout of n minutes, and have the ping job run every n-1 minutes to reset the watchdog clock.
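As a sketch, that could be a drop-in along the following lines. The path, the 10-minute timeout and the restart policy are assumptions, and above all it presumes the workers (or the ping job, thanks to `NotifyAccess=all`) can be made to send `WATCHDOG=1` notifications, which celery does not do out of the box.

```ini
# /etc/systemd/system/swh-worker@.service.d/watchdog.conf  (hypothetical drop-in)
[Service]
# Kill and restart the unit if no WATCHDOG=1 notification arrives within the timeout.
WatchdogSec=10min
# Accept notifications from any process in the unit's cgroup, not only the main one.
NotifyAccess=all
Restart=on-failure
```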
Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
More generally, if we do this to all workers, that would also clean up the systemd-allocated temporary folder that some workers use, which could be a good thing as well.
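As a sketch, the OOM-policy variant boils down to a drop-in like this (the drop-in path and the restart setting are assumptions; `OOMPolicy=kill` needs systemd 243 or later):

```ini
# /etc/systemd/system/swh-worker@.service.d/oom.conf  (hypothetical drop-in)
[Service]
# If the kernel OOM-kills any process of the unit, take the whole unit down...
OOMPolicy=kill
# ...and let systemd bring it back up.
Restart=on-failure
```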
! In #2335 (closed), @ardumont wrote:
Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
More generally, if we do this to all workers, that would also clean up the systemd-allocated temporary folder that some workers use, which could be a good thing as well.
AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.
```python
import time
from ast import literal_eval

from swh.scheduler.celery_backend.config import app

destination = ["celery@loader_git.worker%02d" % i for i in range(1, 17)]
inspect = app.control.inspect(destination=destination, timeout=3)

long_running_tasks_by_worker = {
    w: [
        (
            task["worker_pid"],
            literal_eval(task["kwargs"])["url"],
            time.time() - task["time_start"],
        )
        for task in tasks
        if time.time() - task["time_start"] > 3600
    ]
    for w, tasks in sorted(inspect.active().items())
}

dead_workers = ','.join(
    w.rsplit('.', 1)[-1]
    for w in sorted(set(destination) - set(long_running_tasks_by_worker))
)
```
This lists all workers that reply to the celery remote control within 3 seconds (`timeout=3` in the inspect call), and, for each of these workers, the tasks that have been running for more than an hour (`> 3600` in the list comprehension).
The `dead_workers` are the ones not answering the celery remote control; they probably got OOM-killed and stopped working.
I then inspect the long-running tasks: most tasks that have been running for more than a few hours are just stuck, so these workers get added to the dead workers list as well.
I then run `clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git'` on pergamon to restart these git loaders. Doing a "proper" restart won't work when some of the worker processes are hung, so we have to kill them hard.
! In #2335 (closed), @olasd wrote:
I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders
As I understand this, you're hard-killing all processes of the given service here. Hence, pivoting back to the original proposals, this:
Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
looks like the most sensible one to try next?
It should be easier to try out than the other options, and it will also let us verify that these hangs only happen because of OOM kills.
I'll look at upgrading/rebooting the workers once that's pulled in.
JSYK, I've upgraded the staging nodes with those changes, and they are fine afterwards. I did not reboot them, though, just restarted some systemd services.
I no longer see the warning about OOMPolicy, so win, I think ;)
The deployed cron invocation was buggy (fixed via rSPSITE3317ea30).
What the current cron does is ping the worker (via celery) and restart it if it doesn't respond after a few attempts. This catches the case where the top-level celery process doesn't respond.
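Roughly (the deployed invocation is not reproduced here; the retry count, timeout and node name below are made up for illustration), the check amounts to something like:

```sh
# Sketch of the ping-then-hard-restart logic; retry count and timeout are assumptions.
for attempt in 1 2 3; do
    if celery -A swh.scheduler.celery_backend.config inspect ping \
           -d "celery@loader_git.$(hostname)" --timeout 10; then
        exit 0  # the top-level celery process answered, nothing to do
    fi
    sleep 60
done
# No reply after several attempts: hard-kill the whole unit and let systemd restart it.
systemctl kill --kill-who all --signal 9 swh-worker@loader_git
```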
However, some workers are currently stuck after a MemoryError while the celery process itself is still processing messages and responding to them, so the current cron doesn't help much.
We need to add an actual activity check on workers, e.g. "restart if nothing was processed in the last three hours".
Something like `journalctl -u swh-worker@loader_git --since '6 hours ago' -o json` showing no output would be a fair sign that something has gone wrong.
```sh
if test $(journalctl -u 'swh-worker@loader_git' -o json --since '3 hours ago' | wc -l) -eq 0; then
    systemctl kill --signal 9 --kill-who all swh-worker@loader_git
fi
```