Make PyPI lister incremental and complete with regard to last_update
This rewrites the current implementation to use PyPI's XML-RPC API, which makes incremental listing possible. The API also exposes the last release date of each package, which makes it possible to fill in the "last_update" entry in the ListedOrigin model.
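For readers unfamiliar with the serial-based changelog, here is a minimal sketch of the idea, not the lister's actual code. It assumes PyPI's XML-RPC method `changelog_since_serial`, which returns events shaped as (name, version, timestamp, action, serial). The helper names `fetch_events` and `summarize` are hypothetical, and taking the newest event timestamp as the last update is a simplification made for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional, Tuple
from xmlrpc.client import ServerProxy

# One changelog event: (package name, version, unix timestamp, action, serial)
Event = Tuple[str, Optional[str], int, str, int]


def fetch_events(since_serial: int, url: str = "https://pypi.org/pypi") -> List[Event]:
    """Ask PyPI for every changelog event after ``since_serial`` (network call)."""
    with ServerProxy(url) as client:
        return [tuple(event) for event in client.changelog_since_serial(since_serial)]


def summarize(
    events: Iterable[Event], since_serial: int
) -> Tuple[Dict[str, datetime], int]:
    """Return ({origin url: newest event date}, newest serial seen)."""
    last_updates: Dict[str, datetime] = {}
    last_serial = since_serial
    for name, _version, timestamp, _action, serial in events:
        origin_url = f"https://pypi.org/project/{name}/"
        when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        # Keep only the most recent date per origin.
        if origin_url not in last_updates or when > last_updates[origin_url]:
            last_updates[origin_url] = when
        # The highest serial becomes the starting point of the next run.
        last_serial = max(last_serial, serial)
    return last_updates, last_serial
```

A run would call `summarize(fetch_events(state_serial), state_serial)` and store the returned serial as the new lister state, so that the next run only asks for events newer than that serial.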
Related to #3399 (closed)
Test Plan
tox
Run it within docker:
$ swh-doco exec swh-lister swh lister run --lister pypi
+ cd /home/tony/work/inria/repo/swh/swh-environment/docker
+ docker-compose -f docker-compose.yml -f docker-compose.override.yml exec swh-lister swh lister run --lister pypi
WARNING:swh.lister.pypi.lister:Retrying swh.lister.pypi.lister.PyPILister._changelog_since_serial in 1.0 seconds as it raised Fault: <Fault -32500: 'HTTPTooManyRequests: The action could not be performed because there were too many requests by the client. Limit may reset in 1 seconds.'>.
...
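The warning above shows the XML-RPC call being retried after PyPI answers with an HTTPTooManyRequests fault. The retry mechanism used by the lister is not shown here, so the following is only a stdlib sketch of the behaviour. The names `is_throttled` and `call_with_retry`, the attempt limit, and the fixed one-second delay are all assumptions.

```python
import time
from xmlrpc.client import Fault


def is_throttled(exc: BaseException) -> bool:
    """True when the XML-RPC fault reports rate limiting by the server."""
    return isinstance(exc, Fault) and "HTTPTooManyRequests" in str(exc.faultString)


def call_with_retry(func, *args, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call ``func(*args)``, sleeping and retrying while the server throttles us.

    Any other fault, or a throttling fault on the last attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Fault as exc:
            if not is_throttled(exc) or attempt == max_attempts:
                raise
            sleep(delay)
```

The `sleep` parameter is injectable only so that the waiting can be replaced in tests.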
Within the db scheduler:
$ psql service=swh-scheduler-dev
11:08:51 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
| id | name | instance_name | created | current_state | updated |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {} | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi | pypi | 2021-07-09 08:34:56.26739+00 | {} | 2021-07-09 08:34:56.26739+00 |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
(2 rows)
Time: 0.578 ms
11:10:27 swh-scheduler@localhost:5433=# select * from listed_origins;
...
+-[ RECORD 12678 ]-------+----------------------------------------------------------------------------------------------------------------------+
| lister_id | aab1eb6b-1f3d-4227-ae7a-24113ed8c22e |
| url | https://pypi.org/project/message/ |
| visit_type | pypi |
| extra_loader_arguments | {} |
| enabled | t |
| first_seen | 2021-07-09 09:08:50.61344+00 |
| last_seen | 2021-07-09 09:08:50.61344+00 |
| last_update | 2011-01-13 11:43:17+00 |
+-[ RECORD 12679 ]-------+----------------------------------------------------------------------------------------------------------------------+
| lister_id | aab1eb6b-1f3d-4227-ae7a-24113ed8c22e |
| url | https://pypi.org/project/pytest-pep8/ |
| visit_type | pypi |
| extra_loader_arguments | {} |
| enabled | t |
| first_seen | 2021-07-09 09:08:50.61344+00 |
| last_seen | 2021-07-09 09:08:50.61344+00 |
| last_update | 2010-12-06 15:38:50+00 |
+------------------------+----------------------------------------------------------------------------------------------------------------------+
The run is now complete and last_update is filled in for every origin:
11:39:30 swh-scheduler@localhost:5433=# select count(*) from listed_origins ;
+--------+
| count |
+--------+
| 390619 |
+--------+
(1 row)
Time: 42.307 ms
11:39:41 swh-scheduler@localhost:5433=# select count(*) from listed_origins where last_update is null;
+-------+
| count |
+-------+
| 0 |
+-------+
(1 row)
11:41:03 swh-scheduler@localhost:5433=# select count(distinct url) from listed_origins ;
+--------+
| count |
+--------+
| 390619 |
+--------+
(1 row)
Time: 3489.885 ms (00:03.490) -- ^ just in case
The state got updated:
11:45:35 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| id | name | instance_name | created | current_state | updated |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {} | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi | pypi | 2021-07-09 08:34:56.26739+00 | {"last_serial": 10863341} | 2021-07-09 09:38:01.877813+00 |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
(2 rows)
Time: 1.554 ms
A second run shows that the lister is indeed incremental (the lister state was updated and 1 new origin was found):
11:45:43 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| id | name | instance_name | created | current_state | updated |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {} | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi | pypi | 2021-07-09 08:34:56.26739+00 | {"last_serial": 10863632} | 2021-07-09 09:46:48.406973+00 |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
(2 rows)
Time: 2.386 ms
11:46:52 swh-scheduler@localhost:5433=# select count(distinct url) from listed_origins ;
+--------+
| count |
+--------+
| 390620 |
+--------+
(1 row)
Time: 3504.825 ms (00:03.505)
Migrated from D5977 (view on Phabricator)