  1. Jan 13, 2023
    • vlorentz's avatar
      Move py.typed from swh/loader/{package,core}/ to swh/loader/ · fed8fc3e
      vlorentz authored
      There is code in swh/loader/cli.py, and swh-loader-metadata will need
      to import cli.py, causing mypy to complain when py.typed is missing.
      fed8fc3e
    • Antoine R. Dumont's avatar
      Allow partial snapshot creation during ingestion · fc1adf07
      Antoine R. Dumont authored
      This adds a `create_partial_snapshot` parameter to the base loader constructor.
      When it is enabled, each call to the `store_data` method creates a partial snapshot
      (and an associated visit status) if there is more data to fetch.
      
      The final loop iteration behaves as before: it creates the last visit status, 'full',
      targeting the final snapshot.
      
      The main difference between the two behaviors is that an ingestion with that parameter
      enabled records more origin_visit_status entries. This, in turn, allows subsequent
      visits of the same origin to be incremental. This may be especially useful when
      loading fails because of resource issues (e.g. when ingesting large svn or git
      repositories).
      
      Related to T3625
      fc1adf07
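The behavior described in the commit above can be pictured with a minimal sketch. Only `create_partial_snapshot` and `store_data` are named in the commit message; the `SketchLoader` class, its `fetch_data` generator and the `visit_statuses` list are hypothetical stand-ins, not the actual base loader.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SketchLoader:
    """Toy loader; every name except create_partial_snapshot is hypothetical."""

    batches: List[List[str]]
    create_partial_snapshot: bool = False
    visit_statuses: List[Tuple[str, Tuple[str, ...]]] = field(default_factory=list)

    def fetch_data(self) -> Iterator[Tuple[List[str], bool]]:
        # Yield each batch with a flag telling whether more data remains.
        last = len(self.batches) - 1
        for index, batch in enumerate(self.batches):
            yield batch, index < last

    def load(self) -> str:
        loaded: List[str] = []
        for batch, more_data in self.fetch_data():
            loaded.extend(batch)  # stands in for the store_data call
            if more_data and self.create_partial_snapshot:
                # Record intermediate progress so a later visit can build on it.
                self.visit_statuses.append(("partial", tuple(loaded)))
        # The last iteration is the same with or without the parameter.
        self.visit_statuses.append(("full", tuple(loaded)))
        return "full"
```

With three batches and the parameter enabled, this sketch records two 'partial' statuses followed by one 'full' status; with the parameter disabled it records only the 'full' one.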
  2. Dec 20, 2022
    • Antoine Lambert's avatar
      conda: Fix versions sorting and update release names · a63b39e5
      Antoine Lambert authored
      Release 22.0 of the packaging module can no longer parse invalid Python version
      numbers; an exception is now raised instead.
      
      The Conda loader sorted versions using the keys of the packages dict, which are
      of the form "<arch>/<version>-<build>", but those can no longer be parsed.
      
      So extract the intrinsic version numbers of packages instead and use them to
      sort the list of versions.
      
      Also update snapshot release names to "<version>-<build>-<arch>" as each
      release for a given architecture targets a different directory.
      a63b39e5
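As an illustration of the fix above, here is a minimal sketch that sorts on the version alone and builds release names of the form "<version>-<build>-<arch>". It makes two simplifying assumptions: the version is recovered by splitting the key (the commit takes it from the package data instead), and `version_sort_key` is a dotted-integer stand-in for the packaging module's parser. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_key(key: str) -> Tuple[str, str, str]:
    """Split "<arch>/<version>-<build>" into (arch, version, build)."""
    arch, rest = key.split("/", 1)
    version, build = rest.split("-", 1)
    return arch, version, build


def version_sort_key(version: str) -> Tuple[int, ...]:
    # Simplified stand-in for a real version parser: dotted integers only.
    return tuple(int(part) for part in version.split("."))


def sorted_release_names(packages: Dict[str, dict]) -> List[str]:
    parsed = [split_key(key) for key in packages]
    # Sort on the intrinsic version first, then build and arch for a stable order.
    parsed.sort(key=lambda p: (version_sort_key(p[1]), p[2], p[0]))
    return [f"{version}-{build}-{arch}" for arch, version, build in parsed]
```

Sorting on the parsed version rather than on the raw key keeps "0.10.0" after "0.2.1", which a plain string sort would get wrong.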
    • Antoine Lambert's avatar
      rpm: Fix package versions sorting · b6231045
      Antoine Lambert authored
      Release 22.0 of the packaging module can no longer parse invalid Python version
      numbers; an exception is now raised instead.
      
      The RPM loader sorted versions using the keys of the packages dict, which are
      of the form "<distribution>/<edition>/<package_version_number>", but those
      can no longer be parsed.
      
      So use the intrinsic version numbers of packages instead to sort the list of
      versions.
      b6231045
    • vlorentz's avatar
      e7ac7a34
  3. Dec 19, 2022
  4. Nov 21, 2022
  5. Nov 16, 2022
  6. Nov 15, 2022
  7. Nov 14, 2022
  8. Nov 03, 2022
  9. Nov 02, 2022
  10. Oct 27, 2022
  11. Oct 25, 2022
  12. Oct 21, 2022
    • Franck Bret's avatar
      Conda: Anaconda packages archive loader · e7ba6316
      Franck Bret authored
      For each origin, it takes advantage of the 'artifacts' data sent through
      'extra_loader_arguments' by the conda lister, which provides versions,
      archive url, checksum, etc.
      The author is extracted from intrinsic metadata.
      
      Related to T4579
      e7ba6316
  13. Oct 18, 2022
  14. Oct 17, 2022
  15. Oct 11, 2022
  16. Oct 07, 2022
  17. Oct 05, 2022
    • Antoine R. Dumont's avatar
      {Cnt|Dir}Loader: Fix standard/nar hash mismatch behavior to fail loading · 8aa6dab7
      Antoine R. Dumont authored
      Prior to this commit, hash mismatches were handled differently depending on whether
      checksums were computed the "standard" or the "nar" way. This commit closes that gap.
      
      When a hash mismatch occurs, whether "nar" or "standard", the error is caught and the
      next mirror url is tried. Once all urls have been tried, if nothing was loaded and
      errors were recorded, an error is raised, which fails the visit.
      
      This also adds the missing tests.
      
      Related to T3781
      8aa6dab7
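The fallback behavior described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the loader's code: `load_first_valid`, `HashMismatch` and the injected `fetch` callable are hypothetical names, and a plain SHA-256 check stands in for both the "standard" and "nar" computations.

```python
import hashlib
from typing import Callable, List


class HashMismatch(ValueError):
    """Raised when downloaded bytes do not match the expected checksum."""


def load_first_valid(
    urls: List[str],
    expected_sha256: str,
    fetch: Callable[[str], bytes],
) -> bytes:
    errors: List[Exception] = []
    for url in urls:
        try:
            data = fetch(url)
            actual = hashlib.sha256(data).hexdigest()
            if actual != expected_sha256:
                raise HashMismatch(f"{url}: expected {expected_sha256}, got {actual}")
            return data  # the first url with matching content wins
        except (HashMismatch, OSError) as exc:
            errors.append(exc)  # remember the failure and try the next mirror
    # Nothing was loaded: surface the collected errors so the visit fails.
    raise RuntimeError(f"all {len(urls)} url(s) failed: {errors}")
```

Passing `fetch` as a parameter keeps the sketch free of network access; a real loader would download from each url instead.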
    • Antoine R. Dumont's avatar
      DirectoryLoader: Check nar hashes when provided · 4d51ad99
      Antoine R. Dumont authored
      The lister now provides a "checksums_computation" value. It is either "standard" (the
      most common case: plain checksums of the retrieved object) or "nar" for some edge
      cases. In the latter case, the computation is delegated to the "nix-store" command
      (which must be available on the system running the loader).
      
      This adapts the directory loader to deal with this case.
      
      No work has been done for the ContentLoader yet, other than making it fail when it is
      called with "nar" checksums.
      
      Related to T3781
      4d51ad99
  18. Oct 04, 2022
  19. Oct 03, 2022
  20. Sep 30, 2022
    • Antoine Lambert's avatar
      Use tarball checksum to check download integrity in package loaders · 5482a48e
      Antoine Lambert authored
      When one or more tarball checksums are available, either from lister output
      or from Web API calls performed by some loaders, use them to check the
      integrity of downloaded tarballs.
      5482a48e
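A minimal sketch of such an integrity check, assuming the checksums arrive as a mapping from algorithm name to hexadecimal digest. The `check_integrity` function and the mapping shape are assumptions for illustration; the commit message does not describe the actual data structure.

```python
import hashlib
from typing import Dict


def check_integrity(data: bytes, checksums: Dict[str, str]) -> None:
    """Raise ValueError if any provided checksum does not match data."""
    for algo, expected in checksums.items():
        # hashlib.new accepts any algorithm name known to the local build.
        actual = hashlib.new(algo, data).hexdigest()
        if actual != expected.lower():
            raise ValueError(f"{algo} mismatch: expected {expected}, got {actual}")
```

Every provided checksum is verified, so a tarball passes only if all of the declared digests agree with the downloaded bytes.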
    • Antoine R. Dumont's avatar
      Add Content Loader to ingest raw content file · f774aba5
      Antoine R. Dumont authored
      In some marginal listing cases (Nix or Guix for now), we can receive raw files to ingest.
      This commit adds a loader to ingest those. The output of the ingestion is a snapshot
      with a single HEAD branch targeting the ingested file content.
      
      This expects to receive a mandatory 'integrity' field, which is used to check that the
      content matches the declaration.
      
      This can also optionally receive a list of mirror urls in case the main origin url is no
      longer available. Those mirror urls are solely used as fallback to retrieve the content.
      
      Related to T3781
      f774aba5
  21. Sep 29, 2022
    • Franck Bret's avatar
      Puppet: The puppet loader loads origins from https://forge.puppet.com · 6299c091
      Franck Bret authored
      For each origin, it takes advantage of the 'artifacts' data sent through
      'extra_loader_arguments' by the Puppet lister, which provides versions,
      archive url, last_update, filename.
      The author and description are extracted from intrinsic metadata.
      
      Related to T4580
      6299c091
    • Franck Bret's avatar
      Cpan: Cpan loader loads Perl modules from cpan.org · 2db1a754
      Franck Bret authored
      For each origin, it calls an HTTP API endpoint to retrieve extrinsic
      metadata for each version of a module.
      The author and package description are extracted from intrinsic metadata
      by parsing META.json or META.yml at the root of the archive.
      
      Related to T2833
      2db1a754
  22. Sep 28, 2022
    • Antoine Lambert's avatar
      discovery: Fix compatibility with storage RPC API · 7375a83c
      Antoine Lambert authored
      The Software Heritage homemade RPC layer does not know how to serialize
      set objects, so we need to pass lists as parameters of the *_missing
      methods of the storage API.
      7375a83c
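The problem fixed above can be reproduced with any serializer that has no encoding for sets. In this sketch the standard `json` module is used purely as a stand-in for the RPC layer's serializer (an assumption), and `to_rpc_argument` is a hypothetical helper.

```python
import json
from typing import Iterable, List


def to_rpc_argument(ids: Iterable[str]) -> List[str]:
    # Sorting gives a deterministic payload; a plain list(ids) would also serialize.
    return sorted(ids)


unknown_ids = {"dir-2", "dir-1"}

try:
    json.dumps(unknown_ids)  # the json encoder rejects set objects
    set_serializable = True
except TypeError:
    set_serializable = False

payload = json.dumps(to_rpc_argument(unknown_ids))
```

Converting at the call site keeps sets usable inside the discovery code while only lists cross the RPC boundary.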
    • Raphaël Gomès's avatar
      Setup async interface for discovery module · 1facea3c
      Raphaël Gomès authored
      This will allow us to use this interface in async code like ``swh-scanner``.
      
      Unfortunately, this means calling ``asyncio.run`` for sync code, but the
      performance impact should be negligible.
      
      The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
      requires a lot of boilerplate code. This should be addressed in a
      follow-up.
      1facea3c
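The trade-off mentioned in the commit above, an async interface that sync code reaches through `asyncio.run`, can be sketched like this. `InMemoryArchive` and `directory_missing_sync` are hypothetical names, and the in-memory set stands in for a real storage backend.

```python
import asyncio
from typing import Iterable, List, Set


class InMemoryArchive:
    """Hypothetical async client standing in for a storage-backed interface."""

    def __init__(self, known: Set[str]) -> None:
        self.known = known

    async def directory_missing(self, ids: Iterable[str]) -> List[str]:
        await asyncio.sleep(0)  # placeholder for real asynchronous I/O
        return [i for i in ids if i not in self.known]


def directory_missing_sync(archive: InMemoryArchive, ids: Iterable[str]) -> List[str]:
    # Synchronous callers start one event loop per call through asyncio.run.
    return asyncio.run(archive.directory_missing(ids))
```

Async callers can await `directory_missing` directly, while the wrapper serves sync code; note that `asyncio.run` cannot be called from a thread that already has a running event loop.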
  23. Sep 26, 2022
    • Raphaël Gomès's avatar
      Use a Merkle discovery algorithm with archives · 798f749e
      Raphaël Gomès authored
      "Discovery" is the term for finding out the differences between two
      Merkle graphs. Using such an algorithm is useful in that it drastically
      reduces the amount of data that needs to be transferred.
      
      This commit introduces an efficient but simple algorithm that is a good
      starting point for improved performance: random sampling of directories,
      the details of which are explained in the docstrings.
      
      Mercurial uses a more sophisticated algorithm for its discovery, but it
      is quite a bit more involved and would introduce too much complexity at
      once. Also, the constraints for speed that Mercurial has (in the order
      of milliseconds) don't apply as obviously to this context without
      further investigation.
      
      Benchmarks
      ==========
      
      Setup
      -----
      - With a local postgresql storage (so no network overhead), a local
        tmpfs objstorage on a fast NVMe SSD, all of which should make this
        improvement look smaller than it will be in production
      - With a tarball of the linux kernel at commit
        d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
      - Loading a tarball of 20 commits earlier
        (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
      - Only taking into account the loading (not the downloading of the
        tarball, or its decompression)
      
      Result
      ------
      
      before: ~30s
      after: ~17s
      
      Reproduced 4 times.
      798f749e
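The commit above defers the details of its random-sampling discovery to the docstrings, so the following is only a sketch of the general idea, not the commit's implementation. It relies on two Merkle properties: if the archive knows a directory, it knows all of its descendants, and if it lacks a directory, it lacks all of its ancestors. The `discover_unknown` function, the `children` mapping (which must have one key per directory, leaves included) and the `missing` callable are hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, List, Set


def discover_unknown(
    children: Dict[str, List[str]],
    missing: Callable[[List[str]], Iterable[str]],
    sample_size: int = 2,
    seed: int = 0,
) -> Set[str]:
    """Return the directory ids that the archive does not have yet."""
    parents: Dict[str, List[str]] = {node: [] for node in children}
    for node, kids in children.items():
        for kid in kids:
            parents[kid].append(node)

    def closure(start: str, edges: Dict[str, List[str]]) -> Set[str]:
        # Collect start plus everything reachable from it through edges.
        seen, stack = {start}, [start]
        while stack:
            for nxt in edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    rng = random.Random(seed)
    undecided: Set[str] = set(children)
    unknown: Set[str] = set()
    while undecided:
        sample = rng.sample(sorted(undecided), min(sample_size, len(undecided)))
        missing_ids = set(missing(sample))
        for node in sample:
            if node in missing_ids:
                # An unknown directory makes every ancestor unknown too.
                newly_unknown = closure(node, parents)
                unknown |= newly_unknown
                undecided -= newly_unknown
            else:
                # A known directory implies all of its descendants are known.
                undecided -= closure(node, children)
    return unknown
```

Each query settles the sampled directories plus everything that can be inferred from them, which is why the number of ids sent to the archive can be much smaller than the number of directories in the tree.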
    • Antoine Lambert's avatar
      26fe954b