packagist: Reimplement lister using new Lister API
The previous implementation generated tasks for a Packagist loader that was never implemented.
The new implementation extracts the source repository URL, VCS type and last update date of each package referenced by Packagist and sends that information to the scheduler.
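As a rough illustration of the extraction step described above, here is a minimal sketch. It is not the code from this diff: the `ListedPackage` type, the `extract_origin` helper and the shape of the per-version metadata (a `source` dict with `url` and `type`, plus an ISO 8601 `time` string) are assumptions made for the example.

```python
from datetime import datetime
from typing import Any, Dict, NamedTuple, Optional


class ListedPackage(NamedTuple):
    """Hypothetical container for what gets sent to the scheduler."""

    url: str
    visit_type: str
    last_update: datetime


def extract_origin(versions: Dict[str, Dict[str, Any]]) -> Optional[ListedPackage]:
    """Return source URL, VCS type and most recent update date of a package.

    `versions` maps a version string to its metadata (assumed shape).
    Returns None when the URL, the VCS type or every date is missing.
    """
    origin_url = None
    visit_type = None
    last_update = None
    for version_info in versions.values():
        source = version_info.get("source") or {}
        # Keep the first URL and VCS type found across versions.
        origin_url = origin_url or source.get("url")
        visit_type = visit_type or source.get("type")
        time_str = version_info.get("time")
        if time_str:
            # Assumes all dates carry a UTC offset so they stay comparable.
            updated = datetime.fromisoformat(time_str)
            if last_update is None or updated > last_update:
                last_update = updated
    if not (origin_url and visit_type and last_update):
        return None
    return ListedPackage(origin_url, visit_type, last_update)


# Made-up sample data, not taken from Packagist.
sample = {
    "1.0.0": {
        "source": {"type": "git", "url": "https://example.org/acme/demo.git"},
        "time": "2020-06-01T12:00:00+00:00",
    },
    "1.1.0": {
        "source": {"type": "git", "url": "https://example.org/acme/demo.git"},
        "time": "2021-01-15T08:30:00+00:00",
    },
}
listed = extract_origin(sample)
```

Returning None for incomplete entries mirrors the idea of skipping packages that lack a source URL or a date, which is an assumption about how such packages are handled.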
Package metadata is retrieved using Packagist API endpoints whose
responses are served from static files, which are guaranteed to be
efficient on the Packagist side (no dynamic queries).
Furthermore, subsequent listings send the If-Modified-Since HTTP
header to retrieve only package metadata updated since the previous
listing operation, which saves bandwidth and returns only origins
that might have newly released versions.
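The conditional request described above could be sketched as follows. This is only an illustration using the Python standard library: the helper names `conditional_headers` and `has_new_metadata` are hypothetical, and the lister in this diff may build the header differently.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(last_listing: Optional[datetime]) -> Dict[str, str]:
    """Build request headers for an incremental listing.

    `last_listing` is expected to be a timezone-aware datetime of the
    previous listing, or None for a first (full) listing.
    """
    if last_listing is None:
        return {}
    # HTTP dates are expressed in GMT, so convert to UTC before formatting.
    http_date = format_datetime(last_listing.astimezone(timezone.utc), usegmt=True)
    return {"If-Modified-Since": http_date}


def has_new_metadata(status_code: int) -> bool:
    """A 304 Not Modified response means there is nothing new to process."""
    return status_code != 304


headers = conditional_headers(datetime(2021, 2, 1, 16, 34, 10, tzinfo=timezone.utc))
```

The resulting dict can be passed to any HTTP client; a 304 response carries no body, which is where the bandwidth saving comes from.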
I tested the lister intensively yesterday and it worked without any
issues each time I executed it. The first execution took around 90 minutes
and listed 286510 origins with three different visit types: git, hg and
svn. Subsequent calls took less time thanks to the If-Modified-Since
HTTP header and only returned packages modified since the last listing.
Closes #2991 (closed)
Migrated from D4990 (view on Phabricator)
Activity
Build is green
Patch application report for D4990 (id=17798)
Rebasing onto 8e4dd178...
Current branch diff-target is up to date.
Changes applied before test
commit 478081c1513b240f85c78cc66e9a3109eff91608
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Feb 1 17:34:10 2021 +0100

    packagist: Reimplement lister using new Lister API

    The previous implementation was generating tasks for a non implemented
    Packagist loader.

    The new implementation extracts source repository URL, VCS type and last
    update date for each package referenced by Packagist and send those info
    to the scheduler.

    Packages metadata are retrieved using Packagist API endpoints whose
    responses are served from static files, which are guaranteed to be
    efficient on the Packagist side (no dymamic queries).

    Furthermore, subsequent listing will send the "If-Modified-Since" HTTP
    header to only retrieve packages metadata updated since the previous
    listing operation in order to save bandwidth and return only origins
    which might have new released versions.

    Closes #2991
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/234/ for more details.
! In !204 (closed), @ardumont wrote: lgtm
But it's missing some coverage on conditionals (according to Jenkins).
Maybe simply enrich the current test dataset with some of the skipped packages from the new dataset you added (one Bitbucket entry, another with a missing origin_url, another with a missing time, etc.)
Ack, will improve coverage then.
Some references in the commit message have been migrated:
- T2991 is now #2991 (closed)
Build is green
Patch application report for D4990 (id=17810)
Rebasing onto 82ab96ad...
Current branch diff-target is up to date.
Changes applied before test
commit ff05191b7db7b217c8682e9888338b8813e2df6a
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Feb 1 17:34:10 2021 +0100

    packagist: Reimplement lister using new Lister API

    The previous implementation was generating tasks for a non implemented
    Packagist loader.

    The new implementation extracts source repository URL, VCS type and last
    update date for each package referenced by Packagist and send those info
    to the scheduler.

    Packages metadata are retrieved using Packagist API endpoints whose
    responses are served from static files, which are guaranteed to be
    efficient on the Packagist side (no dymamic queries).

    Furthermore, subsequent listing will send the "If-Modified-Since" HTTP
    header to only retrieve packages metadata updated since the previous
    listing operation in order to save bandwidth and return only origins
    which might have new released versions.

    Closes #2991
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/238/ for more details.
mentioned in merge request !357 (closed)