packagist: Make lister more resilient & actually list origins

5 commits:

    1. packagist: Randomize the packages list

This avoids always listing the same origins when issues occurred during the previous listing.

    2. packagist: Ensure to continue listing even if github hangs up on us (when requiring canonical urls [1])

Currently, the packagist listing takes a long time, and at some point GitHub just hangs up on us, which makes the whole listing fail.
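To illustrate the idea, here is a minimal sketch (not the actual lister code) of resolving canonical urls while tolerating a remote that hangs up. The names `canonical_urls` and `resolve` are hypothetical, and falling back to the url as provided is an assumption. A real HTTP client may raise its own exception type instead of the builtin `ConnectionError`.

```python
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)


def canonical_urls(
    urls: Iterable[str], resolve: Callable[[str], str]
) -> Iterator[str]:
    """Yield one url per input url, tolerating connection failures.

    `resolve` is a hypothetical callable asking the forge for the canonical
    url; it may raise ConnectionError when the remote hangs up.
    """
    for url in urls:
        try:
            yield resolve(url)
        except ConnectionError as exc:
            # Assumption: keep the url as provided rather than abort the listing.
            logger.warning("Could not resolve %s (%s), keeping it as is", url, exc)
            yield url
```

With this shape, one failing request only affects the package at hand, and the listing moves on to the next one.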

    3. packagist: Allow to override the record_batch_size from constructor

This is for CLI testing, to avoid waiting for the hard-coded batch of 1000 origins, which can take some time with packagist.

    4. packagist: Skip package if unable to parse the last update date

Some dates are apparently not parseable, which breaks the process. Such errors are now caught and the package is skipped (the same behavior as the current one when no last update date is provided).
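A minimal sketch of that skip logic, assuming ISO 8601 date strings parsed with the standard library (the actual lister may use another parser). `parse_last_update` and `packages_with_dates` are hypothetical names used only for illustration.

```python
from datetime import datetime
from typing import Iterable, Iterator, Optional, Tuple


def parse_last_update(value: Optional[str]) -> Optional[datetime]:
    """Return the parsed date, or None when it is missing or unparseable."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        # An unparseable date is treated like a missing one.
        return None


def packages_with_dates(
    packages: Iterable[Tuple[str, Optional[str]]]
) -> Iterator[Tuple[str, datetime]]:
    """Yield (name, last_update) pairs, skipping packages without a usable date."""
    for name, raw_date in packages:
        last_update = parse_last_update(raw_date)
        if last_update is None:
            continue  # skip the package instead of breaking the listing
        yield name, last_update
```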

    5. packagist: Yield pages of origins to regularly record origins

This is the most important one for packagist. It allows origins to be flushed regularly to the scheduler db during listing. So now, instead of one huge page of origins, this yields pages of records that are flushed along the way (we currently lose everything at each failed attempt, and we have only had failed attempts so far, hence the bunch of MRs since the beginning of the week).
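The paging idea can be sketched as a small generator that cuts a stream of records into fixed-size pages, so that each page can be recorded as soon as it is complete. This is only an illustration: `pages` and `page_size` are hypothetical names, not the lister's API.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def pages(items: Iterable[T], page_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `page_size` items."""
    iterator = iter(items)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        # The consumer can flush this page before the next one is built.
        yield page
```

If the listing fails after a few pages, the pages already consumed have been recorded, so a failed attempt no longer loses everything.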

nth docker run check ongoing (so far so good).

$ swh-doco exec swh-lister swh lister run -l packagist record_batch_size=10
+ cd /home/tony/work/inria/repo/swh/swh-environment/docker
+ docker compose -f docker-compose.yml -f docker-compose.override.yml exec swh-lister swh lister run -l packagist record_batch_size=10
+ ...
 'record_batch_size': 10,
 'scheduler': {'cls': 'remote', 'url': 'http://swh-scheduler:5008/'}}
INFO:swh.lister.pattern:record-batch-size: 10
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
WARNING:swh.lister.pattern:Skipping invalid origin: git@gitlab.com:lighty/installers.git
WARNING:swh.lister.pattern:Skipping invalid origin:
INFO:swh.lister.pattern:Record valid 8 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
INFO:swh.lister.pattern:Record valid 10 origins in the scheduler
...

2023-08-03 16:18:08 swh-scheduler@localhost:5433 λ select now(), instance_name, visit_type, count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where lister_id in (select id from listers where name='Packagist') group by instance_name, visit_type order by count asc;
+-------------------------------+---------------+------------+-------+
|              now              | instance_name | visit_type | count |
+-------------------------------+---------------+------------+-------+
| 2023-08-03 14:18:13.235783+00 | packagist     | hg         |     1 |
| 2023-08-03 14:18:13.235783+00 | packagist     | svn        |     2 |
| 2023-08-03 14:18:13.235783+00 | packagist     | git        | 23516 |
+-------------------------------+---------------+------------+-------+
(3 rows)

Time: 28.171 ms
2023-08-03 16:18:13 swh-scheduler@localhost:5433 λ \watch 60
             Thu 03 Aug 2023 04:18:19 PM CEST (every 60s)

+-------------------------------+---------------+------------+-------+
|              now              | instance_name | visit_type | count |
+-------------------------------+---------------+------------+-------+
| 2023-08-03 14:18:19.755392+00 | packagist     | hg         |     1 |
| 2023-08-03 14:18:19.755392+00 | packagist     | svn        |     2 |
| 2023-08-03 14:18:19.755392+00 | packagist     | git        | 23533 |
+-------------------------------+---------------+------------+-------+
(3 rows)

Time: 17.521 ms

Refs. swh/meta#5001 (closed)

[1] http://kibana0.internal.softwareheritage.org:5601/goto/84b652d3f37e6c1e6b08607557e9ecf1

Edited by Antoine R. Dumont
