
Make PyPI lister incremental and complete with regard to last_update

This rewrites the current implementation to use PyPI's XML-RPC API, which makes incremental listing possible. The API also allows fetching the last release date of each package, which in turn makes it possible to fill in the "last_update" entry in the ListedOrigin model.
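As a rough illustration of the incremental idea, here is a minimal sketch, not the lister's actual code. It assumes changelog events shaped like the tuples PyPI's XML-RPC `changelog_since_serial` call returns (name, version, timestamp, action, serial). The helper name `fold_changelog`, the package names and the serial values are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional, Tuple

# Assumed event shape: (package name, version, unix timestamp, action, serial)
Event = Tuple[str, Optional[str], int, str, int]


def fold_changelog(
    events: Iterable[Event], last_serial: int
) -> Tuple[Dict[str, datetime], int]:
    """Reduce changelog events to {origin url: latest update} plus the new serial."""
    last_updates: Dict[str, datetime] = {}
    for name, _version, timestamp, _action, serial in events:
        url = f"https://pypi.org/project/{name}/"
        when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        # Keep only the most recent event seen for each origin
        if url not in last_updates or when > last_updates[url]:
            last_updates[url] = when
        # The highest serial seen becomes the lister state for the next run
        last_serial = max(last_serial, serial)
    return last_updates, last_serial


# Hypothetical events, for illustration only
events = [
    ("example-a", "0.1", 1294918997, "new release", 101),
    ("example-a", "0.2", 1294919997, "new release", 103),
    ("example-b", "0.7", 1291649930, "new release", 102),
]
updates, serial = fold_changelog(events, last_serial=100)
```

In a real run the events would presumably come from something like `xmlrpc.client.ServerProxy("https://pypi.org/pypi").changelog_since_serial(last_serial)`, and the returned serial would be stored as the lister state so that the next run only asks for newer events.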

Related to #3399 (closed)

Test Plan

tox

Then run it within docker:

$ swh-doco exec swh-lister swh lister run --lister pypi
+ cd /home/tony/work/inria/repo/swh/swh-environment/docker
+ docker-compose -f docker-compose.yml -f docker-compose.override.yml exec swh-lister swh lister run --lister pypi
WARNING:swh.lister.pypi.lister:Retrying swh.lister.pypi.lister.PyPILister._changelog_since_serial in 1.0 seconds as it raised Fault: <Fault -32500: 'HTTPTooManyRequests: The action could not be performed because there were too many requests by the client. Limit may reset in 1 seconds.'>.
...
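
The warning above shows a call being retried after a rate-limit fault. A minimal sketch of that behaviour follows, assuming a fixed delay and a cap on attempts. The helper `call_with_retry` and its parameters are hypothetical and are not the retry mechanism the lister actually uses.

```python
import time
from typing import Callable, TypeVar
from xmlrpc.client import Fault

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying only on rate-limit faults (max_attempts must be >= 1)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Fault as fault:
            rate_limited = "HTTPTooManyRequests" in fault.faultString
            # Re-raise unrelated faults, or give up once attempts are exhausted
            if not rate_limited or attempt == max_attempts:
                raise
            sleep(delay)
    raise RuntimeError("max_attempts must be >= 1")
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.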

Within the db scheduler:

$ psql service=swh-scheduler-dev
11:08:51 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
|                  id                  |     name      |            instance_name            |            created            | current_state |            updated            |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {}            | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi          | pypi                                | 2021-07-09 08:34:56.26739+00  | {}            | 2021-07-09 08:34:56.26739+00  |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------+-------------------------------+
(2 rows)

Time: 0.578 ms
11:10:27 swh-scheduler@localhost:5433=# select * from listed_origins;
...
+-[ RECORD 12678 ]-------+----------------------------------------------------------------------------------------------------------------------+
| lister_id              | aab1eb6b-1f3d-4227-ae7a-24113ed8c22e                                                                                 |
| url                    | https://pypi.org/project/message/                                                                                    |
| visit_type             | pypi                                                                                                                 |
| extra_loader_arguments | {}                                                                                                                   |
| enabled                | t                                                                                                                    |
| first_seen             | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_seen              | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_update            | 2011-01-13 11:43:17+00                                                                                               |
+-[ RECORD 12679 ]-------+----------------------------------------------------------------------------------------------------------------------+
| lister_id              | aab1eb6b-1f3d-4227-ae7a-24113ed8c22e                                                                                 |
| url                    | https://pypi.org/project/pytest-pep8/                                                                                |
| visit_type             | pypi                                                                                                                 |
| extra_loader_arguments | {}                                                                                                                   |
| enabled                | t                                                                                                                    |
| first_seen             | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_seen              | 2021-07-09 09:08:50.61344+00                                                                                         |
| last_update            | 2010-12-06 15:38:50+00                                                                                               |
+------------------------+----------------------------------------------------------------------------------------------------------------------+

The run is now complete and last_update is filled in for every origin:

11:39:30 swh-scheduler@localhost:5433=# select count(*) from listed_origins ;
+--------+
| count  |
+--------+
| 390619 |
+--------+
(1 row)

Time: 42.307 ms
11:39:41 swh-scheduler@localhost:5433=# select count(*) from listed_origins where last_update is null;
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)

11:41:03 swh-scheduler@localhost:5433=# select count(distinct url) from listed_origins ;
+--------+
| count  |
+--------+
| 390619 |
+--------+
(1 row)

Time: 3489.885 ms (00:03.490)  -- ^ just in case

The state got updated:

11:45:35 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
|                  id                  |     name      |            instance_name            |            created            |       current_state       |            updated            |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {}                        | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi          | pypi                                | 2021-07-09 08:34:56.26739+00  | {"last_serial": 10863341} | 2021-07-09 09:38:01.877813+00 |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
(2 rows)

Time: 1.554 ms

A second run shows that the lister is indeed incremental (its state was updated and 1 new origin was found):

11:45:43 swh-scheduler@localhost:5433=# select * from listers;
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
|                  id                  |     name      |            instance_name            |            created            |       current_state       |            updated            |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
| 815d8f84-804c-4e9a-ab2e-9b08c1dae02d | save-code-now | archive-docker.softwareheritage.org | 2021-07-09 08:26:27.221771+00 | {}                        | 2021-07-09 08:26:27.221771+00 |
| aab1eb6b-1f3d-4227-ae7a-24113ed8c22e | pypi          | pypi                                | 2021-07-09 08:34:56.26739+00  | {"last_serial": 10863632} | 2021-07-09 09:46:48.406973+00 |
+--------------------------------------+---------------+-------------------------------------+-------------------------------+---------------------------+-------------------------------+
(2 rows)

Time: 2.386 ms
11:46:52 swh-scheduler@localhost:5433=# select count(distinct url) from listed_origins ;
+--------+
| count  |
+--------+
| 390620 |
+--------+
(1 row)

Time: 3504.825 ms (00:03.505)

Migrated from D5977 (view on Phabricator)
