
opam: Automate the opam repository initialization

2 unresolved threads

Previously, the loader relied on a flag passed to its constructor to determine whether the 'opam init' step should happen. As this flag was never activated (neither passed along by the lister nor set in the various loader configurations), it has become a source of issues.

With this change, the loader itself determines whether it needs to initialize the opam repository.

Refs. swh/infra/sysadm-environment#4971 (closed)
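The loader-side logic could look roughly like this (a minimal sketch; the helper names and the "config file exists" heuristic are assumptions, not the actual swh loader code):

```python
import subprocess
from pathlib import Path


def opam_root_initialized(opam_root: str) -> bool:
    # Heuristic: an initialized opam root contains a top-level 'config'
    # file; a bare or missing directory does not.
    return (Path(opam_root) / "config").exists()


def maybe_opam_init(opam_root: str, instance: str, url: str) -> None:
    # Only initialize when the root is not already usable, instead of
    # relying on a constructor flag that nobody passes.
    if opam_root_initialized(opam_root):
        return
    subprocess.run(
        ["opam", "init", "--bare", "--no-setup", "--root", opam_root,
         instance, url],
        check=True,  # surface failures instead of silently ignoring them
    )
```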

Activity

  • Huh? We have a systemd timer to manage the state of the opam repository.

  • Jenkins job DLDBASE/gitlab-builds #143 succeeded.
    See Console Output and Coverage Report for more details.

  • It's not clear to me what "that" issue is. But at least I can see that the production workers do have an opam repository set up and kept up to date every day.

    For workers with transient storage, the flag to manage a transient opam repository should definitely be set, of course.

  • Author Maintainer

    The initial issue (which may or may not be the same, I don't know) is a suspected out-of-sync state of the opam repository.

    While trying to reproduce that issue, I may have been side-tracked by another one (I'm not sure whether it's the same issue or not), which I encountered both in production and staging: missing directories for packages, whose ingestion then fails. As that covers about ~75% of the cases I've triggered (14/20 or so), I've set my mind on fixing it first with this MR (well, I suppose this can fix it; I'm still unsure).

    That's what we can infer from the trail of messages in the related task.

    Edited by Antoine R. Dumont
  • Author Maintainer

    For workers with transient storage, the flag to manage a transient opam repository should definitely be set, of course.

    Ok, I'll do that soon.

  • Author Maintainer

    But at least I can see that the production workers do have an opam repository set up and kept up to date every day.

    And yes, I see it too now that you've reminded me.

    root@worker08:~# systemctl list-timers | grep -i opam
    Thu 2023-06-29 11:44:10 UTC 1h 48min left Wed 2023-06-28 11:44:12 UTC 22h ago     opam-manage-shared-state.timer opam-manage-shared-state.service
    root@worker08:~# systemctl status opam-manage-shared-state.service
    ● opam-manage-shared-state.service - Software Heritage Manage OPAM shared state
         Loaded: loaded (/etc/systemd/system/opam-manage-shared-state.service; disabled; vendor preset: enabled)
         Active: inactive (dead) since Wed 2023-06-28 11:44:45 UTC; 22h ago
    TriggeredBy: ● opam-manage-shared-state.timer
        Process: 608119 ExecStart=/usr/local/bin/opam-manage-shared-state.sh (code=exited, status=0/SUCCESS)
       Main PID: 608119 (code=exited, status=0/SUCCESS)
    
    Jun 28 11:44:12 worker08 systemd[1]: Starting Software Heritage Manage OPAM shared state...
    Jun 28 11:44:13 worker08 opam-manage-shared-state.sh[608122]: <><> Updating package repositories ><><><><><><><><><><><><><><><><><><><><><><>
    Jun 28 11:44:17 worker08 opam-manage-shared-state.sh[608122]: [opam.ocaml.org] synchronised from https://opam.ocaml.org
    Jun 28 11:44:24 worker08 opam-manage-shared-state.sh[608122]: Now run 'opam upgrade' to apply any package updates.
    Jun 28 11:44:25 worker08 opam-manage-shared-state.sh[608147]: <><> Updating package repositories ><><><><><><><><><><><><><><><><><><><><><><>
    Jun 28 11:44:28 worker08 opam-manage-shared-state.sh[608147]: [opam.ocaml.org] no changes from https://opam.ocaml.org
    Jun 28 11:44:36 worker08 opam-manage-shared-state.sh[608169]: <><> Updating package repositories ><><><><><><><><><><><><><><><><><><><><><><>
    Jun 28 11:44:39 worker08 opam-manage-shared-state.sh[608169]: [opam.ocaml.org] no changes from https://opam.ocaml.org
    Jun 28 11:44:45 worker08 systemd[1]: opam-manage-shared-state.service: Succeeded.
    Jun 28 11:44:45 worker08 systemd[1]: Finished Software Heritage Manage OPAM shared state.
  • Now, I wouldn't be too surprised if the shared repository could end up out of sync somehow. We should probably reinitialize it from scratch every week or so.

    For the transient workers it might make sense to have it on a volume as well, so we don't pull the full index from opam.ocaml.org at every init?

  • Those are the currently listed opam instances.

    softwareheritage-scheduler=> select instance_name from listers where name='opam';
     instance_name  
    ----------------
     opam.ocaml.org
     coq.inria.fr
    (2 rows)

    FYI, when calling the opam command used by the opam lister to get all packages, we get the following output.

    anlambert@worker08:~$ opam list --all --no-switch --safe --repos opam.ocaml.org --root /tmp/opam_root --normalise --short
    [ERROR] Opam has not been initialised, please run `opam init'

    If we look at the opam_init function from the lister, we can see that opam is not initialized when the opam root folder exists.

    Also, the lister does not check the return code of the opam listing command, so I suspect that each lister run does not list any packages at all.
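The difference matters: without check=True, subprocess.run swallows a non-zero exit status, so a failed opam call looks like a successful empty listing. A small illustration, using a stand-in command rather than opam itself:

```python
import subprocess
import sys

# Stand-in for a failing opam invocation: exits with status 50,
# like the 'opam repository add' failure shown further down.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(50)"]

# Without check=True the failure is silent; the caller must remember
# to inspect returncode itself.
result = subprocess.run(failing_cmd)
print(result.returncode)  # 50, but nothing was raised

# With check=True the failure surfaces immediately.
try:
    subprocess.run(failing_cmd, check=True)
except subprocess.CalledProcessError as exc:
    print(f"command failed with status {exc.returncode}")
```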

  • Author Maintainer

    Now, I wouldn't be too surprised if the shared repository could end up out of sync somehow. We should probably reinitialize it from scratch every week or so.

    Yes. What would be a sensible approach for that, adding another systemd timer which scratches it weekly?
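One possible shape for that, as a sketch only (the unit name and schedule are hypothetical, modeled on the existing opam-manage-shared-state units):

```ini
# /etc/systemd/system/opam-scratch-shared-state.timer (hypothetical name)
[Unit]
Description=Weekly re-initialization of the shared opam repository

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

The matching .service would remove the shared opam root before re-running /usr/local/bin/opam-manage-shared-state.sh, so the state is rebuilt from scratch.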

    ... Also the lister does not check the return code of the opam listing command so I suspect that each lister run does not list any packages at all.

    When i triggered the listing, i missed that, so thanks.


    So much stuff to fix...

    To sum up, is the following a viable path towards some fixes?

    1. Ensure the loader & lister check the return code of their initialization commands.
    2. Static workers: regularly scratch the shared opam repository.
    3. Short-term: elastic workers: activate the configuration initialize_opam_root: True for the loader.
    4. Middle-term: elastic workers: use a shared volume with the opam repository.
    5. Drop this MR.
    6. Deploy and check whether the issues subside (including the out-of-sync state).

    Note that I'm also a bit annoyed by the mostly duplicated initialization code between the lister and the loader. Once this is settled, can we try and push it to swh.core (as an optional dependency, swh.core[opam] or something)?

    Edited by Antoine R. Dumont
    • Author Maintainer
      1. short-term: elastic workers: activate the configuration initialize_opam_root: True for the loader

      It's there already [1] (and I'm the author... ;p). So it seems not to be enough?

      [1] https://gitlab.softwareheritage.org/swh/infra/ci-cd/swh-charts/-/blob/production/swh/values/staging.yaml#L268-270

    • Ensure the loader & lister checks their initialization command return code.

      Done in swh-lister!481 (merged) for the opam lister.

    • And this is what I obtain now after triggering a second opam listing in the docker environment:

      docker-swh-lister-1  | [WARNING] No switch is currently set, perhaps you meant '--set-default'?
      docker-swh-lister-1  | [opam.ocaml.org] no changes from https://opam.ocaml.org
      docker-swh-lister-1  | [ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch
      docker-swh-lister-1  | [2023-06-29 14:15:29,850: ERROR/ForkPoolWorker-1] Task swh.lister.opam.tasks.OpamListerTask[423d61c5-e715-45c1-9293-88c976a71b99] raised unexpected: CalledProcessError(50, ['/usr/bin/opam', 'repository', 'add', '--root', '/tmp/opam/', 'opam.ocaml.org', 'https://opam.ocaml.org'])
      docker-swh-lister-1  | Traceback (most recent call last):
      docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 451, in trace_task
      docker-swh-lister-1  |     R = retval = fun(*args, **kwargs)
      docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/scheduler/task.py", line 61, in __call__
      docker-swh-lister-1  |     result = super().__call__(*args, **kwargs)
      docker-swh-lister-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 734, in __protected_call__
      docker-swh-lister-1  |     return self.run(*args, **kwargs)
      docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/opam/tasks.py", line 13, in list_opam
      docker-swh-lister-1  |     return OpamLister.from_configfile(**lister_args).run().dict()
      docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/pattern.py", line 216, in run
      docker-swh-lister-1  |     for page in self.get_pages():
      docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/opam/lister.py", line 82, in get_pages
      docker-swh-lister-1  |     opam_init(self.opam_root, self.instance, self.url, self.env)
      docker-swh-lister-1  |   File "/src/swh-lister/swh/lister/opam/lister.py", line 167, in opam_init
      docker-swh-lister-1  |     run(command, env=env, check=True)
      docker-swh-lister-1  |   File "/usr/local/lib/python3.7/subprocess.py", line 512, in run
      docker-swh-lister-1  |     output=stdout, stderr=stderr)
      docker-swh-lister-1  | subprocess.CalledProcessError: Command '['/usr/bin/opam', 'repository', 'add', '--root', '/tmp/opam/', 'opam.ocaml.org', 'https://opam.ocaml.org']' returned non-zero exit status 50.

      The opam error is [ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch.

    • Author Maintainer

      Ok, so the first listing is fine (it triggers an initialization from scratch or something), but then the second one fails.

      So the 'else' branch of the opam_init function (of the lister) fails [1]. It now seems to be missing another call (the 'opam switch' invocation).

      [1] https://gitlab.softwareheritage.org/swh/devel/swh-lister/-/blob/b9815ed577618a0db5d6894cd8c788193bbb781b/swh/lister/opam/lister.py#L152-162

    • FYI, passing the --set-default parameter to opam init for the second listing seems to fix the issue.
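Concretely, that would mean appending --set-default to the repository registration command seen in the traceback above (a sketch; the helper function is hypothetical):

```python
def opam_repository_add_cmd(opam_bin, opam_root, instance, url,
                            set_default=True):
    # Mirrors the command from the traceback:
    #   opam repository add --root <root> <instance> <url>
    # With --set-default, the repository is registered for all newly
    # created switches, which avoids the "No switch is currently set"
    # error on a second listing.
    cmd = [opam_bin, "repository", "add", "--root", opam_root, instance, url]
    if set_default:
        cmd.append("--set-default")
    return cmd
```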
