When they get OOM-killed because the cgroup runs out of memory, the git loaders have a tendency not to come back.
Unfortunately, there are several symptoms when these workers stop processing tasks:

- some workers still respond to `celery inspect active`, but have a task that never goes away;
- some workers stop responding completely, even to celery pings.
I'm not sure whether the workers still answer pings when one of the processes is stuck, but I'll probably start that way, because that's the most generic way we can monitor these processes externally.
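For reference, such an external probe could look roughly like the snippet below (a sketch: the node name is just an example, and the app module is the same one the inspection snippet further down in this thread imports).

```sh
# Minimal external liveness check (sketch): ping one git loader node through the
# celery remote control; celery exits non-zero when no node replies in time.
# The node name here is an example, not necessarily a real one.
celery -A swh.scheduler.celery_backend.config inspect ping \
    -d celery@loader_git.worker01 --timeout 5
```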
We could enable the systemd watchdog on the processes, with a timeout of n minutes, and have the ping job run every n-1 minutes to reset the watchdog clock.
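As a sketch, that could be a drop-in along the following lines. The path, the 10-minute timeout and the restart policy are assumptions, and above all it presumes the workers (or the ping job, thanks to `NotifyAccess=all`) can be made to send `WATCHDOG=1` notifications, which celery does not do out of the box.

```ini
# /etc/systemd/system/swh-worker@.service.d/watchdog.conf  (hypothetical drop-in)
[Service]
# Kill and restart the unit if no WATCHDOG=1 notification arrives within the timeout.
WatchdogSec=10min
# Accept notifications from any process in the unit's cgroup, not only the main one.
NotifyAccess=all
Restart=on-failure
```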
Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
More generally, if we do this to all workers, that would also clean up the systemd-allocated temporary folder that some workers use, which could be a good thing as well.
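As a sketch, the OOM-policy variant boils down to a drop-in like this (the drop-in path and the restart setting are assumptions; `OOMPolicy=kill` needs systemd 243 or later):

```ini
# /etc/systemd/system/swh-worker@.service.d/oom.conf  (hypothetical drop-in)
[Service]
# If the kernel OOM-kills any process of the unit, take the whole unit down...
OOMPolicy=kill
# ...and let systemd bring it back up.
Restart=on-failure
```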
! In #2335 (closed), @ardumont wrote:
Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
More generally, if we do this to all workers, that would also clean up the systemd-allocated temporary folder that some workers use, which could be a good thing as well.
AFAIK, private temporary directories survive across service restarts, so I don't think that's correct.
```python
import time
from ast import literal_eval

from swh.scheduler.celery_backend.config import app

destination = ["celery@loader_git.worker%02d" % i for i in range(1, 17)]
inspect = app.control.inspect(destination=destination, timeout=3)

long_running_tasks_by_worker = {
    w: [
        (
            task["worker_pid"],
            literal_eval(task["kwargs"])["url"],
            time.time() - task["time_start"],
        )
        for task in tasks
        if time.time() - task["time_start"] > 3600
    ]
    for w, tasks in sorted(inspect.active().items())
}

dead_workers = ','.join(
    w.rsplit('.', 1)[-1]
    for w in sorted(set(destination) - set(long_running_tasks_by_worker))
)
```
This lists all workers that reply to the celery remote control within 3 seconds (`timeout=3` in the inspect call), and, for each of these workers, the tasks that have been running for more than an hour (`> 3600` in the list comprehension).
The `dead_workers` are the ones not answering the celery remote control; they probably got OOM-killed and stopped working.
I then inspect the long-running tasks: most tasks that have been running for more than a few hours are just stuck, so these workers get added to the dead workers list as well.
I then run `clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git'` on pergamon to restart these git loaders. Doing a "proper" restart won't work when some of the worker processes are hung, so we have to kill them hard.
! In #2335 (closed), @olasd wrote:
I then run clush -w <dead_workers> 'systemctl kill --kill-who all --signal 9 swh-worker@loader_git' on pergamon to restart these git loaders
As I understand this, you're hard-killing all processes of the given service here. Hence, pivoting back to the original proposals, this:
Another possibility would be to set systemd's OOM policy to kill the whole service instead of just the single process when one of the threads gets oom-killed. That way the service will be able to autorestart properly.
looks like the most sensible one to try next?
It should be easier to try out than the other options, and it will also let us verify that these hangs only happen because of OOM kills.
I'll look at upgrading/rebooting the workers once that's pulled in.
JSYK, I've upgraded the staging nodes with those changes, and they are fine afterwards. I did not reboot them, though, just restarted some systemd services.
I no longer see the warning about OOMPolicy, so win, I think ;)
The deployed cron invocation was buggy (fixed via rSPSITE3317ea30).
What the current cron does is ping the worker (via celery) and restart it if it doesn't respond after a few attempts. This catches the case where the top-level celery process doesn't respond.
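Roughly (the deployed invocation is not reproduced here; the retry count, timeout and node name below are made up for illustration), the check amounts to something like:

```sh
# Sketch of the ping-then-hard-restart logic; retry count and timeout are assumptions.
for attempt in 1 2 3; do
    if celery -A swh.scheduler.celery_backend.config inspect ping \
           -d "celery@loader_git.$(hostname)" --timeout 10; then
        exit 0  # the top-level celery process answered, nothing to do
    fi
    sleep 60
done
# No reply after several attempts: hard-kill the whole unit and let systemd restart it.
systemctl kill --kill-who all --signal 9 swh-worker@loader_git
```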
However, some workers are currently stuck after a MemoryError while the celery process itself is still processing messages and responding to them, so the current cron doesn't help much.
We need to add an actual activity check on workers, e.g. "restart if nothing was processed in the last three hours".
Something like `journalctl -u swh-worker@loader_git --since '6 hours ago' -o json` showing no output would be a fair sign that something has gone wrong.
```sh
if test $(journalctl -u 'swh-worker@loader_git' -o json --since '3 hours ago' | wc -l) -eq 0; then
    systemctl kill --signal 9 --kill-who all swh-worker@loader_git
fi
```