Make opam shared root initialization more robust
The way we initialize the opam shared root on workers, through multiple timer units, is pretty brittle. Many of the services fail to run, either because of a race condition or because opam reports that no switch is set:
Jan 03 00:00:00 worker01 systemd[1]: Started Software Heritage Manage OPAM shared state (coq.inria.fr).
Jan 03 00:00:01 worker01 opam-manage-shared-state.sh[1475377]: [WARNING] No switch is currently set, perhaps you meant '--set-default'?
Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [coq.inria.fr] Initialised
Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch
Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Main process exited, code=exited, status=50/n/a
Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Failed with result 'exit-code'.
Instead of having separate timer units, we should probably have a single one running a script that first creates the default root, then runs snippets that update that root for each separate instance.
- use set -e in the main script if it's not already set
- make the update script update all instances at once, instead of a single one separately: write a main script which handles the creation of the opam root and the update of the main instance, and run snippets generated for each other instance, using run-parts
- make sure the worker only starts after a successful run of the main service
- make sure the timer unit runs at different times, rather than all at midnight
- consider moving the shared root to /var/tmp to avoid it being blown away by reboots (probably not needed if we make sure the service dependencies are correct)
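A rough sketch of what the single main script could look like, to make the first two items concrete. This is only an illustration: the function names, the OPAM_ROOT variable, the snippet directory and the snippet naming are all assumptions, and the exact opam invocation would have to be taken from the current per-instance script.

```shell
#!/bin/sh
# Sketch only: every path, variable and snippet name here is hypothetical.
set -eu   # abort on the first failing command or unset variable

# Run the given initialisation command only when the root has no config
# file yet, so that repeated timer runs stay idempotent.
ensure_root() {
    root=$1
    shift
    if [ -f "$root/config" ]; then
        return 0
    fi
    "$@"
}

# Run every per-instance snippet in lexical order, stopping at the first
# failure. Note: Debian's run-parts skips file names containing dots by
# default, so a snippet would be named e.g. 20-coq, not coq.inria.fr.
update_instances() {
    run-parts --exit-on-error "$1"
}

# Intended call sequence in the main script (left commented out here):
#   ensure_root "$OPAM_ROOT" opam init --root "$OPAM_ROOT" --no-setup
#   update_instances /etc/opam-manage-shared-state.d
```

Because the root is created before any snippet runs and the script stops on the first error, a failed initialisation would make the whole service fail, which is what the worker's service dependency would need in order to refuse to start.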
Migrated from T3826 (view on Phabricator)
Edited by Antoine R. Dumont