When we save an unknown origin due to a Save code now request, we schedule a one-shot task for the ingestion, but don't add the origin for future crawling. It might make sense to do both.
It is possibly also the only reasonable place where we can have heuristics to de-duplicate URLs that point to the same repo, e.g., non-canonical GitHub repo URLs.
> It is possibly also the only reasonable place where we can have heuristics to
> de-duplicate URLs that point to the same repo, e.g., non-canonical GitHub repo URLs.
That concern will be kept out of this task for now. It can be dealt with alongside [1].
> When we save an unknown origin due to a Save code now request, we schedule a one-shot
> task for the ingestion, but don't add the origin for future crawling. It might make
> sense to do both.
It makes sense.
Implementation-wise, "save code now" is treated as a Lister: it records origins in the
scheduler model, from which the "next-gen" scheduler will schedule recurrent visits.
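A minimal sketch of that idea. All names here (`ListedOrigin`, `SchedulerBackend`, `record_listed_origins`, `on_save_code_now_success`) are illustrative stand-ins, not the actual swh.scheduler API: on a successful save code now request, the origin is also upserted into the scheduler's listed-origins store under a pseudo-lister named "save-code-now".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple

# Hypothetical stand-in for the scheduler's listed-origins model: each
# successful "save code now" request is recorded as an origin listed by
# a pseudo-lister named "save-code-now".

@dataclass(frozen=True)
class ListedOrigin:
    lister_name: str
    url: str
    visit_type: str
    last_seen: datetime

class SchedulerBackend:
    """Toy in-memory scheduler: stores listed origins keyed by (lister, url)."""

    def __init__(self) -> None:
        self._origins: Dict[Tuple[str, str], ListedOrigin] = {}

    def record_listed_origins(self, origins: List[ListedOrigin]) -> None:
        # Upsert: re-saving the same origin just refreshes last_seen.
        for o in origins:
            self._origins[(o.lister_name, o.url)] = o

    def count_for_lister(self, lister_name: str) -> int:
        return sum(1 for (name, _url) in self._origins if name == lister_name)

def on_save_code_now_success(scheduler: SchedulerBackend, url: str,
                             visit_type: str = "git") -> None:
    """Besides the one-shot loading task, register the origin for future crawling."""
    scheduler.record_listed_origins([
        ListedOrigin("save-code-now", url, visit_type,
                     datetime.now(timezone.utc)),
    ])

scheduler = SchedulerBackend()
on_save_code_now_success(scheduler, "https://github.com/example/repo")
on_save_code_now_success(scheduler, "https://github.com/example/repo")  # idempotent
print(scheduler.count_for_lister("save-code-now"))  # → 1
```

The upsert keyed on (lister, url) makes repeated save requests for the same origin idempotent, mirroring how re-listing an origin should only refresh it, not duplicate it.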
The first part has been deployed (modifications in the webapp routines to update the
save code now statuses).
The webapp now records successful save code now origins in the listed_origins model of
the scheduler. [1]
What remains is deploying the runner that actually consumes those origins regularly (or maybe that is already the case; I need to check that part).
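As a rough illustration of what that runner would do (the real policy lives in the "next-gen" scheduler; the names and the fixed revisit interval below are assumptions for the sketch): periodically select the listed origins whose last visit is old enough and schedule a visit for each.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# Toy sketch of a recurrent-visit runner: given (url, last_visit) pairs,
# return the URLs that are due for a new visit. A fixed interval is assumed
# here; a real policy would adapt per origin.

REVISIT_INTERVAL = timedelta(days=1)

def origins_due(origins: List[Tuple[str, datetime]],
                now: datetime) -> List[str]:
    """Return URLs whose last visit is older than REVISIT_INTERVAL."""
    return [url for url, last_visit in origins
            if now - last_visit >= REVISIT_INTERVAL]

now = datetime(2021, 6, 16, tzinfo=timezone.utc)
listed = [
    ("https://example.org/repo1", now - timedelta(days=2)),   # due
    ("https://example.org/repo2", now - timedelta(hours=3)),  # visited recently
]
print(origins_due(listed, now))  # → ['https://example.org/repo1']
```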
[1]
softwareheritage-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where name='save-code-now';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-16 06:36:44.269692+00 |    22 |
+-------------------------------+-------+
(1 row)

Time: 27.663 ms