Replace the Nixguix loader with a lister
Currently, Nix and Guix are each loaded as a single origin with a huge snapshot in which each branch name is a URL. This is wrong. We need to replace the Nixguix loader with a lister that creates one origin for each origin referenced by the Nix and Guix public manifests. This would be closer to what we do with Debian/Ubuntu.
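To make the intent concrete, here is a minimal sketch of what "one origin per manifest entry" could look like. It is not the lister implementation: the manifest layout (`sources`, `urls`, `integrity`), the suffix list, and the names `guess_visit_type` and `list_origins` are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List
from urllib.parse import urlparse

# Hypothetical suffix list, for illustration only.
TARBALL_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def guess_visit_type(url: str) -> str:
    """Classify a URL as a git repository, a tarball ('directory') or a file ('content')."""
    path = urlparse(url).path.lower()
    if path.endswith(".git"):
        return "git"
    if path.endswith(TARBALL_SUFFIXES):
        return "directory"
    return "content"


def list_origins(manifest: Dict) -> Iterator[Dict]:
    """Yield one origin per manifest entry, instead of one snapshot branch per URL."""
    for source in manifest.get("sources", []):
        urls: List[str] = source.get("urls", [])
        if not urls:
            continue  # nothing to ingest for this entry
        yield {
            "url": urls[0],
            "visit_type": guess_visit_type(urls[0]),
            "fallback_urls": urls[1:],  # mirrors, kept for the loader
            "integrity": source.get("integrity"),  # may be missing
        }
```

Each yielded dictionary would then be scheduled with the loader matching its visit type, much like the Debian/Ubuntu listing produces one origin per package.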
Define the following (see the HedgeDoc pad [1], which details a proposal):
- target structure sketch of the data in the archive
- define origin URLs
- what kind of extrinsic metadata and/or extids are we storing
- what kind of snapshots we're generating
Plan:
- swh/devel/swh-lister!427 (closed): Implement lister
- [ ] swh/devel/swh-loader-core!446 (closed), ...: Adapt archive loader (package loader) to accept tarball from nixguix manifests (cannot work [2])
- swh/devel/swh-loader-core!447 (closed): Implement ContentLoader (possibly as a package[2] core loader) to deal with content file with intrinsic metadata (out of nixguix manifests)
- swh/devel/swh-loader-core!436 (closed): Implement DirectoryLoader (possibly as a package[2] core loader) to deal with tarball with intrinsic metadata (out of nixguix manifests)
- swh/devel/swh-loader-core!437 (closed): Update implementations ^ dealing with unsupported integrity hash (sha512)
- #3781: lister run through docker
- swh/devel/swh-loader-core!438 (closed), #3781: loaders run through docker (directory ok, contents ok too but they are creating mismatches due to faulty manifest integrity references)
- swh/devel/swh-lister!428 (closed): lister: Randomize origins order to ingest
- swh/devel/swh-lister!429 (closed): lister: Deal with mistyped origins
- swh/devel/swh-lister!430 (closed): lister: Fix expired ssl certificate
- swh/devel/swh-lister!432 (closed): lister: Fix connection error
- swh/devel/swh-lister!433 (closed): lister: Deal with pseudo url with missing schema
- swh/devel/swh-lister!435 (closed): lister: Deal with exotic urls so tarballs are recognized
- swh/devel/swh-lister!436 (closed): lister: Deal with misplaced git urls
- swh/devel/swh-lister!437 (closed): nixguix: Improve content type detection (those with charset were off)
- swh/devel/swh-core!332 (closed): swh.core.tarball: Add missing mimetype application/x-gzip
- swh/devel/swh-lister!438 (closed): lister: Refactor to simplify some computations
- swh/infra/ci-cd/swh-jenkins-dockerfiles!49 (closed): Make Jenkins build with nix-store inside so future builds that need it run correctly
- #3781: Fix mismatched computations for nixpkgs manifests -> nar hash support (impacts both lister and loader)
- swh/devel/swh-lister!434 (closed): lister adaptation to provide the correct information to the loaders
- swh/devel/swh-loader-core!440 (closed): {Content|Directory}Loader adaptation to be able to check this ^
- swh/devel/swh-loader-core!441 (closed): Adapt standard/nar hash mismatch computation behavior (so they fail loading)
- swh/devel/swh-loader-core!439 (closed): Content "nar" checksum computation (files with "recursive" hashOutputMode exist)
- [ ] #3781: $1477: $1478: hash mismatch edge cases (so far) we cannot do anything about (yet?!), see next point
- #4608: swh/devel/swh-lister!448 (closed): lister: Exclude faulty origins
- #4608: Notify upstream nixpkgs community about the missing information on "faulty" origins
- #4609: Notify upstream nixpkgs community about the misqualified "git" repositories as urls
- $1470: ContentLoader run in docker
- $1471: DirectoryLoader run in docker
- swh/devel/swh-environment!248 (closed), swh/devel/swh-environment!216 (closed): Deploy in docker
- $1474: Fix misqualified repositories detected as file (see pastes)
- $1475: Contents
- $1476: Directories
- swh/devel/swh-lister!449 (closed): Add support for more tarball/zip extensions
- swh/devel/swh-core!333 (closed): swh.core: Wire war support (and check other tarballs are already supported)
- swh/devel/swh-lister!450 (closed): Harden tarball support test dataset
- swh/devel/swh-lister!451 (closed): lister: Add another diff to filter out irrelevant origins (.iso, .bin, ...)
- #3781: Status -> further fixes (/me sighs)
- swh/devel/swh-lister!441 (closed): nixguix: Deal with edge case url with version instead of extension
- swh/devel/swh-lister!442 (closed): Use content-disposition
- infra/sysadm-environment#4655: Deploy in staging
- #4979 (closed): Store NAR hashes in ExtID mapping while loading
- Drop no longer relevant nixguix loader
- Call for public review
- Deploy in production when ok ^
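For readers unfamiliar with the "standard" versus "nar" hash distinction in the plan above, the following sketch shows why the same file can match or mismatch a manifest integrity value depending on the hashing mode. It assumes the integrity field is an SRI string (`<algo>-<base64 digest>`) and only covers a single regular file. The function names are hypothetical and this is not the loader's actual code.

```python
import base64
import hashlib
from typing import Tuple


def parse_sri(integrity: str) -> Tuple[str, bytes]:
    """Split an SRI string such as 'sha256-<base64>' into (algorithm, raw digest)."""
    algo, _, b64_digest = integrity.partition("-")
    return algo, base64.b64decode(b64_digest)


def _nar_str(data: bytes) -> bytes:
    """NAR string: 64-bit little-endian length, the bytes, zero padding up to 8 bytes."""
    padding = (8 - len(data) % 8) % 8
    return len(data).to_bytes(8, "little") + data + b"\0" * padding


def nar_serialize_file(content: bytes, executable: bool = False) -> bytes:
    """Serialize one regular file in the NAR format (directories are not handled)."""
    parts = [b"nix-archive-1", b"(", b"type", b"regular"]
    if executable:
        parts += [b"executable", b""]
    parts += [b"contents", content, b")"]
    return b"".join(_nar_str(part) for part in parts)


def check_integrity(content: bytes, integrity: str, recursive: bool) -> bool:
    """Compare against the manifest value, hashing raw bytes or the NAR serialization."""
    algo, expected = parse_sri(integrity)
    data = nar_serialize_file(content) if recursive else content
    return hashlib.new(algo, data).digest() == expected
```

A file declared with the "recursive" mode must be checked with `recursive=True`. Checking its raw bytes instead produces a digest that never matches the manifest value, which surfaces as a hash mismatch.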
[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw

[2] It cannot. We may not receive any version information, and the package loader currently relies on that data for its main ingestion algorithm.
Migrated from T3781 (view on Phabricator)