
rubygems: Use gems database dump to improve listing output

Instead of using an undocumented rubygems HTTP endpoint that only gives us the names of the gems, use the daily PostgreSQL dump of the rubygems.org database.

This makes it possible to list not only all gems but also all versions of a gem and their release artifacts. For each release artifact, the following information is extracted: version, download URL, sha256 checksum, and release date, plus a couple of extra metadata fields.

The lister now sets the list of artifacts and the list of metadata as extra loader arguments when sending a listed origin to the scheduler database. A last_update date is also computed, which should ensure that loading tasks for rubygems are scheduled only when new releases have become available since the last loading.
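As a rough illustration of the paragraph above, here is a minimal sketch of how per-release rows could be turned into extra loader arguments plus a last_update date. The function name `build_loader_arguments`, the input field names, and the output keys `artifacts` and `metadata` are assumptions made for this example; they are not taken from the actual lister code.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple


def build_loader_arguments(
    releases: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], Optional[datetime]]:
    """Build extra loader arguments and a last_update date for one gem.

    Each item of `releases` is assumed (hypothetically) to carry the keys
    "version", "url", "sha256" and "date" (a timezone-aware datetime).
    """
    artifacts = []
    metadata = []
    last_update: Optional[datetime] = None

    for release in releases:
        # One artifact entry per released .gem file.
        artifacts.append(
            {
                "version": release["version"],
                "url": release["url"],
                "checksums": {"sha256": release["sha256"]},
            }
        )
        # Matching metadata entry, kept in the same order as the artifacts.
        metadata.append(
            {
                "version": release["version"],
                "date": release["date"].isoformat(),
            }
        )
        # last_update is the most recent release date seen so far.
        if last_update is None or release["date"] > last_update:
            last_update = release["date"]

    return {"artifacts": artifacts, "metadata": metadata}, last_update
```

With a last_update computed this way, a scheduler can compare it with the date of the previous visit and skip origins that have no newer release.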

Note that the lister spawns a temporary PostgreSQL instance, so the initdb executable from a PostgreSQL server installation must be available in the execution environment.
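Since the lister depends on initdb being reachable, a deployment could check for it up front. The sketch below only uses `shutil.which` from the standard library; the helper name `find_initdb` and its `search_path` parameter are hypothetical and not part of the lister.

```python
import shutil
from typing import Optional


def find_initdb(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the initdb executable, or None if not found.

    `search_path` is an optional os.pathsep-separated list of directories;
    when omitted, the PATH of the current environment is searched.
    """
    return shutil.which("initdb", path=search_path)


if __name__ == "__main__":
    initdb = find_initdb()
    if initdb is None:
        # Some PostgreSQL packages install initdb outside the default PATH,
        # in which case the directory has to be added to PATH explicitly.
        print("initdb not found; the rubygems lister cannot start postgres")
    else:
        print(f"initdb found at {initdb}")
```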

Related to #1777

This implements the proposal of @nahimilega in #1777.

This is what I obtained when testing the lister in docker: 186993 origins listed and processed in about 26 minutes.

docker-swh-lister-1  | [2022-10-06 20:40:11,169: INFO/ForkPoolWorker-1] Task swh.lister.rubygems.tasks.RubyGemsListerTask[21911097-b9e7-48f8-a47c-ada105f4725a] succeeded in 1549.4076449069835s: {'pages': 186993, 'origins': 186993}

Migrated from D8639 (view on Phabricator)
