I'm not sure what happened when you restarted the production services, but they all got restarted by the crontab rather than by (or in addition to?) your `systemctl` command (noticed as a flood of mails to root by @guillaume).
Did you stop the services manually before the upgrade?
No, I did not.
I've "restart"ed the systemd services after the package upgrade.
It stuck because some loading were still happening. And it finally restarted after the 15min timeout.
I've "restart"ed the systemd services after the package upgrade. It stuck because some loading were still happening. And it finally restarted after the 15min timeout.
OK, then I think what's happening is that, while the service is stopping:
- `systemctl is-enabled foo.service` returns true
- `systemctl is-active foo.service` returns false (so the cron triggers a `systemctl start foo.service`, and complains)
- but in practice the start command is deferred until the service's stop times out (and gets coalesced with the start operation that was triggered by the admin-provided `restart` command)
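To make the race concrete, here's a minimal sketch of what such a cron-side check could look like (I haven't read the actual check-restart script, so this is an assumption for illustration; `monitored_units` and the mail subject are made up):

```sh
# hypothetical reconstruction of a racy periodic check, for illustration only
for unit in $monitored_units; do
    if systemctl is-enabled --quiet "$unit" && ! systemctl is-active --quiet "$unit"; then
        # during a slow stop the unit is "deactivating", so is-active already
        # fails here, even though a start job is queued right behind the stop
        systemctl start "$unit"
        echo "restarted $unit" | mail -s "check-restart: $unit" root
    fi
done
```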
Now, given that, I still don't know how to avoid the spammy situation.
It does the cron/email thing each time I upgrade a package with an associated long-running task/service.
It's a bug in the check-restart script: it should avoid doing anything if there's a restart job queued.
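Something along these lines could work as the missing guard, assuming the script loops over units in a `$unit` variable (untested sketch):

```sh
# skip units that already have a systemd job (start/stop/restart) queued;
# `systemctl list-jobs` prints one line per pending job
if systemctl list-jobs | grep -qF "$unit"; then
    continue
fi
```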
To avoid spamming on upgrades, I've now grown accustomed to using the following (pretty silly, I'll admit) procedure, sketched as a script after the list:
Before the upgrade:
- `puppet agent --disable "upgrading $service"` (this prevents the services from being re-enabled behind my back)
- `systemctl disable` the relevant services, without stopping them (the only side effect of this operation is disabling the ping-restart cron)
- use the celery remote control to stop consuming the queues
- wait a sensible amount of time (10-15 minutes?)
- do the upgrade with apt

After the upgrade:
- `systemctl stop --no-block` the services (this returns immediately)
- wait a bit for the services that were already done to stop
- `systemctl kill --signal=9 --kill-who=all` the (remaining) services
- run `puppet agent --enable && puppet agent -t; systemctl default` (this re-enables the services, which re-enables the cron, and `systemctl default` restarts them)
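For reference, the whole dance transcribed as a rough zsh sketch (the unit names and the sleep duration are made up; the celery part is the separate loop below):

```sh
services=(swh-worker@loader_git.service)  # hypothetical unit names, adapt as needed

# before the upgrade
puppet agent --disable "upgrading $services"  # keep puppet from re-enabling things
systemctl disable $services                   # only side effect: disables the ping-restart cron
# ... run the celery cancel_consumer loop (below), then wait 10-15 minutes ...
apt upgrade                                   # or whatever apt operation applies

# after the upgrade
systemctl stop --no-block $services           # queue the stops, return immediately
sleep 120                                     # give the already-idle services time to stop
systemctl kill --signal=9 --kill-who=all $services  # kill off the stragglers
puppet agent --enable && puppet agent -t      # re-enables the units and the cron
systemctl default                             # and brings the services back up
```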
The celery remote control is an ungodly command that only really lives in my zsh history (it uses some zsh-specific join syntax):
```sh
# adapt this for the right hosts
hosts=(worker{01..16}.internal.softwareheritage.org)
# and the right worker instances
workers=(checker_deposit lister loader_{archive,bzr,cran,cvs,debian,deposit,git,mercurial,nixguix,npm,oneshot,opam,pypi,svn})
# these are generic
all_instances=(${^workers}@${^hosts})

for queue in `SWH_CONFIG_FILENAME=~/work/swh-environment/scratch/scheduler.yml celery -A swh.scheduler.celery_backend.config.app inspect -d ${(j:,:)all_instances} active_queues -j | head -n +1 | jq -r '[. | to_entries | .[].value | flatten | .[].name] | unique | .[]' | grep -v oneshot3`; do
    SWH_CONFIG_FILENAME=~/work/swh-environment/scratch/scheduler.yml celery -A swh.scheduler.celery_backend.config.app control -d ${(j:,:)all_instances} cancel_consumer $queue
done
```
`scheduler.yml` contains a `celery: task_broker: amqp:///` entry with credentials to write remote control commands to the (production or staging) RabbitMQ broker.
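Something along these lines, with placeholders standing in for the actual credentials and broker host:

```yaml
celery:
  task_broker: amqp://<user>:<password>@<rabbitmq-host>//
```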
Clearly not really tractable for what should be a simple deployment task...