lister/loader: Ingest archived artifacts from cran mirror
The keyword is "archived": as of now, we only ingest the main artifact.
There are 2 sides to that coin (which can be done independently and in any order we choose):
Lister
algo:
- drop the R cran script
- parse the listing page [1] instead (as in simple_lister; check how the cgit lister does it)
- for each package found there, send the origin url [2] to the loader (as a recurring task)
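
A minimal sketch of the listing step above, under assumptions: package links on the listing page [1] are assumed to end in `web/packages/<name>/index.html`, and `ORIGIN_URL_TEMPLATE` is a hypothetical placeholder for the origin url pattern [2], which is not spelled out here. `ListingParser` and `list_origin_urls` are illustrative names, not existing lister code.

```python
import re
from html.parser import HTMLParser

# Assumption: package links on the listing page look like
# ".../web/packages/<name>/index.html".
PACKAGE_HREF = re.compile(r"web/packages/([^/]+)/index\.html$")

# Hypothetical placeholder for the origin url pattern ([2]).
ORIGIN_URL_TEMPLATE = "https://cran.example.org/package={name}"


class ListingParser(HTMLParser):
    """Collect package names from the anchors of the listing page."""

    def __init__(self):
        super().__init__()
        self.packages = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = PACKAGE_HREF.search(href)
        if match and match.group(1) not in self._seen:
            # Keep first-seen order, skip duplicates.
            self._seen.add(match.group(1))
            self.packages.append(match.group(1))


def list_origin_urls(listing_html, template=ORIGIN_URL_TEMPLATE):
    """Return one origin url per package found in the listing page."""
    parser = ListingParser()
    parser.feed(listing_html)
    return [template.format(name=name) for name in parser.packages]
```

Each returned url would then be sent to the loader as a recurring task.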
schema adaptations:
- make the tasks output by the lister recurring (currently oneshot)
- adapt the uid field to be the origin_url's value
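
The two adaptations above could look roughly like this. It is only a sketch: the task type `load-cran`, the `url` keyword argument and the dict layouts are assumptions for illustration, not the actual scheduler or lister model.

```python
def make_loader_task(origin_url, task_type="load-cran"):
    """Build a task description with a recurring policy (hypothetical layout)."""
    return {
        "type": task_type,
        "policy": "recurring",  # previously "oneshot"
        "arguments": {"args": [], "kwargs": {"url": origin_url}},
    }


def make_listed_record(origin_url):
    """Build a listed record whose uid is the origin url itself."""
    return {"uid": origin_url, "origin_url": origin_url}
```

Using the origin url as uid keeps one record per package, whatever the number of versions.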
migration plan:
- truncate cran_repo table
- trigger back a full listing
Loader
algo:
- Improve the loader so it scrapes that origin url [2] page.
- It then determines by itself which artifact urls it needs to ingest.
- In the [2] page, there is an archive link, "Old source", which lists the previous artifact versions.
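
A rough sketch of that scraping logic, under assumptions: artifacts are assumed to be `.tar.gz` links, the archive link is assumed to contain `/Archive/`, and `fetch` is a hypothetical callable returning the HTML of a url (so the sketch does no network access itself). None of these names come from the existing loader.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute urls of all anchors in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def artifact_urls(links, suffix=".tar.gz"):
    # Assumption: source artifacts are the links ending in ".tar.gz".
    return [link for link in links if link.endswith(suffix)]


def archive_urls(links, marker="/Archive/"):
    # Assumption: the archive ("Old source") link contains "/Archive/".
    return [
        link for link in links
        if marker in link and not link.endswith(".tar.gz")
    ]


def collect_artifact_urls(origin_url, fetch):
    """Return the main artifact urls, then the archived ones."""
    links = extract_links(fetch(origin_url), origin_url)
    found = artifact_urls(links)
    for archive_url in archive_urls(links):
        archived = extract_links(fetch(archive_url), archive_url)
        found.extend(artifact_urls(archived))
    return found
```

Passing `fetch` in keeps the url discovery testable against canned pages; a real loader would plug in its own HTTP client.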
[1] https://cran.r-project.org/web/packages/available_packages_by_date.html

This can be subject to discussion with the cran community, to ask for a better api endpoint (if it's not too much hassle for them to adapt and provide ;)
Related to #2029 (closed)
Migrated from T2241 (view on Phabricator)
Edited by Phabricator Migration user