Even though the XML-RPC API for PyPI is "on the way out", it's still the recommended way of subscribing to package changes.
Following the instructions at https://warehouse.pypa.io/api-reference/feeds.html, it should be possible for the PyPI lister to populate a "last update" field for most listed origins. This will help us to schedule the origin visits more effectively, and will reduce the loader thrashing on origins that haven't been updated since the last visit.
From a quick test, it looks like the "Project and release activity details" feed can go back multiple years without any issue, allowing us to backfill the data for all known origins, before adding the incremental behavior to the lister.
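As a rough illustration of the idea above, here is a minimal sketch of how changelog events could be folded into a per-origin "last update" value plus a serial to resume from. It assumes the `changelog_since_serial` XML-RPC method, which returns `(name, version, timestamp, action, serial)` entries. The helper names `fold_changelog` and `fetch_changelog` are hypothetical, and building the origin URL directly from the project name (without name normalization) is a simplifying assumption, not necessarily what the lister does.

```python
import xmlrpc.client
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional, Tuple

# One changelog entry: (name, version, timestamp, action, serial).
# `version` may be None for project-level events.
ChangelogEvent = Tuple[str, Optional[str], int, str, int]


def fold_changelog(
    events: Iterable[ChangelogEvent],
    last_updates: Optional[Dict[str, datetime]] = None,
    last_serial: int = 0,
) -> Tuple[Dict[str, datetime], int]:
    """Hypothetical helper: keep the most recent event date per origin URL
    and the highest serial seen, so a later run can resume from there."""
    updates = dict(last_updates or {})
    for name, _version, timestamp, _action, serial in events:
        when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        # Assumption: the origin URL is derived from the raw project name.
        url = f"https://pypi.org/project/{name}/"
        if url not in updates or when > updates[url]:
            updates[url] = when
        last_serial = max(last_serial, serial)
    return updates, last_serial


def fetch_changelog(since_serial: int, endpoint: str = "https://pypi.org/pypi"):
    """Hypothetical helper: fetch all events newer than `since_serial`.
    This performs a network call; it is not executed in this sketch."""
    client = xmlrpc.client.ServerProxy(endpoint)
    return client.changelog_since_serial(since_serial)
```

A backfill would start from a low serial and feed the result of `fetch_changelog` into `fold_changelog`. An incremental run would pass the stored serial and the previously known dates instead.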
There are now only 8 origins without last_update in staging:
$ psql service=staging-swh-scheduler
14:01:49 swh-scheduler@db1:5432=> select count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where l.name='pypi' and last_update is null;
+-------+
| count |
+-------+
|     8 |
+-------+
(1 row)

Time: 270.575 ms
[1]
14:49:52 swh-scheduler@db1:5432=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='pypi' and lo.last_update is null order by url;
+-------------------------------------------------+
|                       url                       |
+-------------------------------------------------+
| https://pypi.org/project/f-luhn/                |
| https://pypi.org/project/int-hash-int-hash-lib/ |
| https://pypi.org/project/linkedin-user-scraper/ |
| https://pypi.org/project/lm-decoder/            |
| https://pypi.org/project/lyra2rec0ban-hash/     |
| https://pypi.org/project/micro-api-ext/         |
| https://pypi.org/project/pokemon-yeet/          |
| https://pypi.org/project/rasa-print/            |
+-------------------------------------------------+
(8 rows)
The new implementation actually deals with the backfilling.
Deployed in production as well and triggered a run:
14:46:29 softwareheritage-scheduler@belvedere:5432=> select now(), count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='pypi' and lo.last_update is null;
+------------------------------+-------+
|             now              | count |
+------------------------------+-------+
| 2021-07-09 12:46:48.01577+00 |     8 |
+------------------------------+-------+
(1 row)

Time: 8643.946 ms (00:08.644)
14:47:30 softwareheritage-scheduler@belvedere:5432=> select * from listers where name='pypi';
+--------------------------------------+------+---------------+-------------------------------+---------------------------+-------------------------------+
|                  id                  | name | instance_name |            created            |       current_state       |            updated            |
+--------------------------------------+------+---------------+-------------------------------+---------------------------+-------------------------------+
| 29c69bc1-e815-4f5a-b009-c6854697fec7 | pypi | pypi          | 2021-04-30 11:14:03.440526+00 | {"last_serial": 10864686} | 2021-07-09 12:46:07.863475+00 |
+--------------------------------------+------+---------------+-------------------------------+---------------------------+-------------------------------+
(1 row)

Time: 9.949 ms
Production now also shows only 8 origins without a last_update [1], few enough that they are not worth chasing.
Given that we started with 316958 origins without a last_update and are now down to 8, I'd say that's enough of a win.
[1]
14:50:02 softwareheritage-scheduler@belvedere:5432=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='pypi' and lo.last_update is null order by url;
+-------------------------------------------------+
|                       url                       |
+-------------------------------------------------+
| https://pypi.org/project/f-luhn/                |
| https://pypi.org/project/int-hash-int-hash-lib/ |
| https://pypi.org/project/linkedin-user-scraper/ |
| https://pypi.org/project/lm-decoder/            |
| https://pypi.org/project/lyra2rec0ban-hash/     |
| https://pypi.org/project/micro-api-ext/         |
| https://pypi.org/project/pokemon-yeet/          |
| https://pypi.org/project/rasa-print/            |
+-------------------------------------------------+
(8 rows)

Time: 7172.285 ms (00:07.172)