Replace the Nixguix loader with a lister
Currently, we load Nix and Guix as single origins, each with a huge snapshot whose branch names are URLs. This is wrong. We need to replace the Nixguix loader with a lister that creates one origin per source referenced by the Nix and Guix public manifests. This would be closer to what we do with Debian/Ubuntu.
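To make the "one origin per manifest entry" idea concrete, here is a minimal sketch. It is not the actual swh-lister code. It assumes a manifest shaped like `{"sources": [{"urls": [...], "integrity": "..."}]}`. The `ListedOrigin` class, the helper names, the extension lists, the `https://` default scheme and the visit type names are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Optional
from urllib.parse import urlparse

# Hypothetical extension lists, for illustration only.
TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip", ".war")
IGNORED_EXTENSIONS = (".iso", ".bin")


@dataclass(frozen=True)
class ListedOrigin:
    """Hypothetical record for one origin emitted by the lister."""
    url: str
    visit_type: str  # "directory" (tarball), "content" (single file) or "git"
    integrity: Optional[str] = None


def normalize_url(url: str) -> str:
    """Add a scheme to pseudo URLs lacking one (assumption: default to https)."""
    return url if "://" in url else f"https://{url}"


def guess_visit_type(url: str) -> Optional[str]:
    """Guess a visit type from the URL path; None means 'skip this origin'."""
    path = urlparse(url).path.lower()
    if path.endswith(".git"):
        return "git"
    if path.endswith(IGNORED_EXTENSIONS):
        return None
    if path.endswith(TARBALL_EXTENSIONS):
        return "directory"
    return "content"


def list_origins(manifest: dict) -> Iterator[ListedOrigin]:
    """Yield one origin per manifest source instead of one giant snapshot."""
    for source in manifest.get("sources", []):
        urls = source.get("urls") or []
        if not urls:
            continue  # nothing to ingest for this entry
        url = normalize_url(urls[0])
        visit_type = guess_visit_type(url)
        if visit_type is None:
            continue  # irrelevant origin (.iso, .bin, ...)
        yield ListedOrigin(url, visit_type, source.get("integrity"))
```

Each yielded origin can then be scheduled for the matching loader, instead of a single Nix or Guix origin carrying every URL as a branch.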
Define the following (see the hedgedoc [1], which details a proposal):
- target structure sketch of the data in the archive
- origin urls
- what kind of extrinsic metadata and/or extids we are storing
- what kind of snapshots we're generating
Plan:
- swh/devel/swh-lister!427 (closed): Implement lister
- [ ] swh/devel/swh-loader-core!446 (closed), ...: Adapt archive loader (package loader) to accept tarball from nixguix manifests (cannot work [2])
- swh/devel/swh-loader-core!447 (closed): Implement ContentLoader (possibly as a package[2] core loader) to deal with content file with intrinsic metadata (out of nixguix manifests)
- swh/devel/swh-loader-core!436 (closed): Implement DirectoryLoader (possibly as a package[2] core loader) to deal with tarball with intrinsic metadata (out of nixguix manifests)
- swh/devel/swh-loader-core!437 (closed): Update implementations ^ dealing with unsupported integrity hash (sha512)
- #3781 (closed): lister run through docker
- swh/devel/swh-loader-core!438 (closed), #3781 (closed): loaders run through docker (directory ok, contents ok too but they are creating mismatches due to faulty manifest integrity references)
- swh/devel/swh-lister!428 (closed): lister: Randomize origins order to ingest
- swh/devel/swh-lister!429 (closed): lister: Deal with mistyped origins
- swh/devel/swh-lister!430 (closed): lister: Fix expired ssl certificate
- swh/devel/swh-lister!432 (closed): lister: Fix connection error
- swh/devel/swh-lister!433 (closed): lister: Deal with pseudo url with missing schema
- swh/devel/swh-lister!435 (closed): lister: Deal with exotic urls so tarballs are recognized
- swh/devel/swh-lister!436 (closed): lister: Deal with misplaced git urls
- swh/devel/swh-lister!437 (closed): nixguix: Improve content type detection (those with charset were off)
- swh/devel/swh-core!332 (closed): swh.core.tarball: Add missing mimetype application/x-gzip
- swh/devel/swh-lister!438 (closed): lister: Refactor to simplify some computations
- swh/infra/ci-cd/swh-jenkins-dockerfiles!49 (closed): Make jenkins build with nix-store inside so future builds that need it run correctly
- #3781 (closed): Fix mismatched computations for nixpkgs manifests -> nar hash support (impacts both lister and loader)
- swh/devel/swh-lister!434 (closed): lister adaptation to provide the correct information to the loaders
- swh/devel/swh-loader-core!440 (closed): {Content|Directory}Loader adaptation to be able to check this ^
- swh/devel/swh-loader-core!441 (closed): Adapt standard/nar hash mismatch computation behavior (so they fail loading)
- swh/devel/swh-loader-core!439 (closed): Content "nar" checksum computation. Files with "recursive" hashOutputMode exist
- [ ] #3781 (closed): $1477: $1478: hash mismatch edge cases we cannot do anything about so far (yet?!), see next point
- #4608 (closed): swh/devel/swh-lister!448 (closed): lister: Exclude faulty origins
- #4608 (closed): Notify upstream nixpkgs community about the missing information on "faulty" origins
- #4609 (closed): Notify upstream nixpkgs community about the misqualified "git" repositories as urls
- $1470: ContentLoader run in docker
- $1471: DirectoryLoader run in docker
- swh/devel/swh-environment!248 (closed), swh/devel/swh-environment!216 (closed): Deploy in docker
- $1474: Fix misqualified repositories detected as files (see pastes)
- $1475: Contents
- $1476: Directories
- swh/devel/swh-lister!449 (closed): Add support for more tarball/zip extensions
- swh/devel/swh-core!333 (closed): swh.core: Wire war support (and check other tarballs are already supported)
- swh/devel/swh-lister!450 (closed): Harden tarball support test dataset
- swh/devel/swh-lister!451 (closed): lister: Add another diff to filter out irrelevant origins (.iso, .bin, ...)
- #3781 (closed): Status -> further fixes (/me sighs)
- swh/devel/swh-lister!441 (closed): nixguix: Deal with edge case url with version instead of extension
- swh/devel/swh-lister!442 (closed): Use content-disposition
- infra/sysadm-environment#4655: Deploy in staging
- #4979 (closed): Store NAR hashes in ExtID mapping while loading
- Call for public review
- swh/infra/sysadm-environment#5223 (closed): Deploy in production when ok ^
- swh/devel/swh-loader-core!518 (merged), swh/devel/docker!8 (merged): Drop no longer relevant nixguix loader
- swh/devel/swh-loader-core#4749 (closed): Document nixguix lister & loader
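
For reference, the standard vs "nar" hash checks mentioned in the plan can be sketched as below. This is an illustration, not the swh-loader-core implementation. It assumes the manifest integrity field is an SRI-style string (`<algo>-<base64 digest>`), and it only serializes a single regular file in the NAR layout (length-prefixed, 8-byte padded strings), not directories. All function names are hypothetical.

```python
import base64
import hashlib
import struct
from typing import Tuple


def parse_sri(integrity: str) -> Tuple[str, str]:
    """Split an SRI-style string into (algorithm, hex digest)."""
    algo, _, b64 = integrity.partition("-")
    if algo not in ("sha256", "sha512") or not b64:
        raise ValueError(f"unsupported integrity field: {integrity!r}")
    return algo, base64.b64decode(b64).hex()


def _nar_str(data: bytes) -> bytes:
    """NAR string: 64-bit little-endian length, then data zero-padded to 8 bytes."""
    padding = (8 - len(data) % 8) % 8
    return struct.pack("<Q", len(data)) + data + b"\0" * padding


def nar_serialize_file(contents: bytes, executable: bool = False) -> bytes:
    """Serialize one regular file as a NAR archive (directories not handled)."""
    parts = [_nar_str(b"nix-archive-1"), _nar_str(b"("),
             _nar_str(b"type"), _nar_str(b"regular")]
    if executable:
        parts += [_nar_str(b"executable"), _nar_str(b"")]
    parts += [_nar_str(b"contents"), _nar_str(contents), _nar_str(b")")]
    return b"".join(parts)


def check_integrity(contents: bytes, integrity: str, recursive: bool = False) -> bool:
    """Compare the expected digest with the flat or NAR ("recursive") hash."""
    algo, expected = parse_sri(integrity)
    data = nar_serialize_file(contents) if recursive else contents
    return hashlib.new(algo, data).hexdigest() == expected
```

A file whose manifest entry uses the "recursive" mode fails the flat check and only passes the NAR one, which is the kind of mismatch reported above.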
[1] Draft pad: https://hedgedoc.softwareheritage.org/2AQFbVB0S-OrOtkJV2yNJw
[2] It cannot. We may not receive any version information, and the package loader currently relies on that data for its main ingestion algorithm.
Migrated from T3781 (view on Phabricator)