opam: Automate the opam repository initialization
Previously, the loader relied on a flag passed along to its constructor to determine whether the 'opam init' step should happen or not. As this flag was never activated (neither passed along by the lister nor set in the various loader configurations), it is now a source of issues.
With this change, the loader itself determines whether it needs to initialize the opam repository.
mentioned in issue swh/infra/sysadm-environment#4971 (closed)
Jenkins job DLDBASE/gitlab-builds #143 succeeded .
See Console Output and Coverage Report for more details.

Heh, it's true and I had forgotten about it. So I'll have another closer look there.
Either way, that's a very specific solution for the static workers which is not available in the new worker. So that might be worth considering nonetheless?
The initial issue (which may be different, idk) is a suspicion of being out-of-sync.
When trying to reproduce that issue, I may have been side-tracked by another one (I'm not sure whether it's the same issue or not) that I encountered both in production and staging: missing directories for packages whose ingestion fails. As it accounts for ~75% of the cases I've triggered (14/20 or so), I've set my mind on fixing that first with this MR (well, I suppose this can fix it; I'm still unsure).
That's what we can infer from the trail of messages in the related task.
But at least I can see that the production workers do have an opam repository set up and kept up to date every day.
and yes, i see it too now that you've reminded me.
```
root@worker08:~# systemctl list-timers | grep -i opam
Thu 2023-06-29 11:44:10 UTC  1h 48min left  Wed 2023-06-28 11:44:12 UTC  22h ago  opam-manage-shared-state.timer  opam-manage-shared-state.service
root@worker08:~# systemctl status opam-manage-shared-state.service
● opam-manage-shared-state.service - Software Heritage Manage OPAM shared state
     Loaded: loaded (/etc/systemd/system/opam-manage-shared-state.service; disabled; vendor preset: enabled)
     Active: inactive (dead) since Wed 2023-06-28 11:44:45 UTC; 22h ago
TriggeredBy: ● opam-manage-shared-state.timer
    Process: 608119 ExecStart=/usr/local/bin/opam-manage-shared-state.sh (code=exited, status=0/SUCCESS)
   Main PID: 608119 (code=exited, status=0/SUCCESS)

Jun 28 11:44:12 worker08 systemd[1]: Starting Software Heritage Manage OPAM shared state...
Jun 28 11:44:13 worker08 opam-manage-shared-state.sh[608122]: <><> Updating package repositories ><><><><><><><><><><><><><><><><><><><><><><>
Jun 28 11:44:17 worker08 opam-manage-shared-state.sh[608122]: [opam.ocaml.org] synchronised from https://opam.ocaml.org
Jun 28 11:44:24 worker08 opam-manage-shared-state.sh[608122]: Now run 'opam upgrade' to apply any package updates.
Jun 28 11:44:25 worker08 opam-manage-shared-state.sh[608147]: <><> Updating package repositories ><><><><><><><><><><><><><><><><><><><><><><>
Jun 28 11:44:28 worker08 opam-manage-shared-state.sh[608147]: [opam.ocaml.org] no changes from https://opam.ocaml.org
Jun 28 11:44:36 worker08 opam-manage-shared-state.sh[608169]: <><> Updating package repositories ><><><><><><><><><><><><><><><><><><><><><><>
Jun 28 11:44:39 worker08 opam-manage-shared-state.sh[608169]: [opam.ocaml.org] no changes from https://opam.ocaml.org
Jun 28 11:44:45 worker08 systemd[1]: opam-manage-shared-state.service: Succeeded.
Jun 28 11:44:45 worker08 systemd[1]: Finished Software Heritage Manage OPAM shared state.
```
Now, I wouldn't be too surprised if the shared repository could end up out of sync somehow. We should probably reinitialize it from scratch every week or so.
For the transient workers it might make sense to have it on a volume as well, so we don't pull the full index from opam.ocaml.org at every init?
These are the currently listed opam instances:

```
softwareheritage-scheduler=> select instance_name from listers where name='opam';
 instance_name
----------------
 opam.ocaml.org
 coq.inria.fr
(2 rows)
```
FYI, when calling the opam command used by the opam lister to get all packages, we get the following output:

```
anlambert@worker08:~$ opam list --all --no-switch --safe --repos opam.ocaml.org --root /tmp/opam_root --normalise --short
[ERROR] Opam has not been initialised, please run `opam init'
```
If we look at the opam_init function from the lister, we can see that opam is not initialized when the opam root folder already exists.
Also the lister does not check the return code of the opam listing command so I suspect that each lister run does not list any packages at all.
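The consequence is easy to demonstrate outside of opam. A minimal sketch, using a stand-in failing command rather than the real opam invocation:

```python
# Minimal illustration (not the actual swh code) of why checking the
# return code matters: without check=True, a failing subprocess goes
# unnoticed and the caller happily parses empty output.
import subprocess
import sys

# A stand-in for a failing opam invocation: exits non-zero, prints nothing.
failing = [sys.executable, "-c", "import sys; sys.exit(50)"]

# Without check=True, the failure is silent:
res = subprocess.run(failing, capture_output=True)
print(res.returncode)  # 50, but no exception was raised

# With check=True, the failure surfaces as CalledProcessError:
try:
    subprocess.run(failing, check=True, capture_output=True)
    raised = False
except subprocess.CalledProcessError:
    raised = True
print(raised)  # True
```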
> Now, I wouldn't be too surprised if the shared repository could end up out of sync somehow. We should probably reinitialize it from scratch every week or so.
Yes. What would be a sensible approach for that, adding another systemd timer which scratches it weekly?
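Something like the following could work; this is a sketch only, where the unit names, the shared root path, and re-using the existing opam-manage-shared-state.sh script are all assumptions:

```ini
# opam-scratch-shared-state.service (hypothetical unit)
[Unit]
Description=Reinitialize the shared opam repository from scratch

[Service]
Type=oneshot
# Assumed location of the shared opam root; adjust to the real path.
ExecStart=/bin/rm -rf /srv/opam-shared-state
# Re-create it with the existing management script.
ExecStart=/usr/local/bin/opam-manage-shared-state.sh

# opam-scratch-shared-state.timer (hypothetical unit)
[Unit]
Description=Weekly scratch of the shared opam repository

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```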
> ... Also the lister does not check the return code of the opam listing command so I suspect that each lister run does not list any packages at all.
When I triggered the listing, I missed that, so thanks.
So much stuff to fix...
To sum up, is the following a viable path towards some fixes?
- Ensure the loader & lister check the return code of their initialization commands.
- static workers: regularly scratch the shared opam repository
- short-term: elastic workers: activate the configuration initialize_opam_root: True for the loader
- mid-term: elastic workers: use a shared volume with the opam repository
- Drop this MR.
- Deploy and check whether issues subside (including the out-of-sync state)
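For reference, a hypothetical configuration fragment for the short-term elastic-worker item; only the initialize_opam_root key comes from the discussion above, the surrounding structure is assumed:

```yaml
# Hypothetical loader configuration sketch -- the nesting is an
# assumption, only the 'initialize_opam_root' key is from the MR.
loader:
  initialize_opam_root: true
```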
Note that I'm also a bit annoyed by the mostly duplicated initialization code between the lister and the loader. Once this settles, can we try to push it into swh.core (as an optional dependency, swh.core[opam] or something)?
> - short-term: elastic workers: activate the configuration initialize_opam_root: True for the loader
It's there already [1] (and I'm the author... ;p). So it seems not to be enough?
> Ensure the loader & lister checks their initialization command return code.
Done in swh-lister!481 (merged) for the opam lister.
And this is what I obtain now after triggering a second opam listing in the docker environment:
```
docker-swh-lister-1  | [WARNING] No switch is currently set, perhaps you meant '--set-default'?
docker-swh-lister-1  | [opam.ocaml.org] no changes from https://opam.ocaml.org
docker-swh-lister-1  | [ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch
docker-swh-lister-1  | [2023-06-29 14:15:29,850: ERROR/ForkPoolWorker-1] Task swh.lister.opam.tasks.OpamListerTask[423d61c5-e715-45c1-9293-88c976a71b99] raised unexpected: CalledProcessError(50, ['/usr/bin/opam', 'repository', 'add', '--root', '/tmp/opam/', 'opam.ocaml.org', 'https://opam.ocaml.org'])
docker-swh-lister-1  | Traceback (most recent call last):
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 451, in trace_task
docker-swh-lister-1  |     R = retval = fun(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/scheduler/task.py", line 61, in __call__
docker-swh-lister-1  |     result = super().__call__(*args, **kwargs)
docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 734, in __protected_call__
docker-swh-lister-1  |     return self.run(*args, **kwargs)
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/opam/tasks.py", line 13, in list_opam
docker-swh-lister-1  |     return OpamLister.from_configfile(**lister_args).run().dict()
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/pattern.py", line 216, in run
docker-swh-lister-1  |     for page in self.get_pages():
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/opam/lister.py", line 82, in get_pages
docker-swh-lister-1  |     opam_init(self.opam_root, self.instance, self.url, self.env)
docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/opam/lister.py", line 167, in opam_init
docker-swh-lister-1  |     run(command, env=env, check=True)
docker-swh-lister-1  |   File "/usr/local/lib/python3.7/subprocess.py", line 512, in run
docker-swh-lister-1  |     output=stdout, stderr=stderr)
docker-swh-lister-1  | subprocess.CalledProcessError: Command '['/usr/bin/opam', 'repository', 'add', '--root', '/tmp/opam/', 'opam.ocaml.org', 'https://opam.ocaml.org']' returned non-zero exit status 50.
```
The opam error is:

```
[ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch
```

OK, so the first listing is fine (it triggers an initialization from scratch or something), but the second one fails.
OK, so it is the 'else' branch of the lister's opam_init function that fails [1]. It seems to be missing another call to execute (an 'opam switch' instruction).
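A sketch of what that branch might need to do. The helper below is hypothetical, not the actual swh.lister code; the opam subcommands and flags are real, but whether an empty 'default' switch is the right setup here is an assumption:

```python
# Hypothetical sketch: the commands the lister's opam_init 'else' branch
# would need on an already-existing opam root. Building them as data
# makes the apparently missing step explicit.
def opam_commands_for_existing_root(opam_root, instance, url):
    return [
        # Re-add the repository (this is the call that currently fails):
        ["opam", "repository", "add", "--set-default",
         "--root", opam_root, instance, url],
        # The apparently missing step: ensure a switch is set, so that
        # later 'opam list' calls do not fail with
        # "No switch is currently set".
        ["opam", "switch", "create", "--root", opam_root,
         "default", "--empty"],
    ]

for cmd in opam_commands_for_existing_root(
    "/tmp/opam/", "opam.ocaml.org", "https://opam.ocaml.org"
):
    print(" ".join(cmd))
```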