  1. Nov 04, 2022
  2. Oct 28, 2022
  3. Oct 26, 2022
  4. Oct 25, 2022
  5. Oct 21, 2022
    • gogs/lister: Allow public gogs instance listing · 8a82bbf9
      Antoine R. Dumont authored
      Prior to this commit, the lister assumed authentication was required. However, some
      public gogs instances do not require it.
      
      This also updates the documentation to mention the usual API location, which is useful
      when people want to trigger a listing as a pre-flight check.
      
      This also drops a repetitive instruction in the gitea lister.
      
      Co-authored with Antoine Lambert (@anlambert) <anlambert@softwareheritage.org>.
      
      Related to infra/sysadm-environment#4644
      v4.0.1
      8a82bbf9
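A minimal sketch of how optional authentication could look, assuming the token is sent in an `Authorization: token ...` header. The function name `build_headers` and the header layout are assumptions for illustration, not the lister's actual code.

```python
from typing import Dict, Optional


def build_headers(api_token: Optional[str] = None) -> Dict[str, str]:
    """Build request headers, adding credentials only when a token is given.

    Hypothetical helper: public instances are queried without any
    Authorization header, private ones with the provided token.
    """
    headers = {"Accept": "application/json"}
    if api_token:
        # Assumed header format for token-based authentication.
        headers["Authorization"] = f"token {api_token}"
    return headers
```

Calling `build_headers()` with no argument yields headers suitable for a public instance, so the same listing code path can serve both cases.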
  6. Oct 19, 2022
  7. Oct 18, 2022
  8. Oct 13, 2022
  9. Oct 11, 2022
    • cpan: Fix module version extraction for some edge cases · 05cd1de1
      Antoine Lambert authored
      The CPAN API can return versions that are not of str type: they can be
      int or float.
      
      When the version equals 0, it means that CPAN failed to parse it,
      so in that case we try to extract it from the release name.
      
      Otherwise we make sure the version is converted to str.
      
      Related to T2833
      05cd1de1
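The rule described above can be sketched as follows. The function name `normalize_version` and the regular expression used to read a version from the release name are assumptions for illustration; the real extraction logic may differ.

```python
import re
from typing import Union


def normalize_version(version: Union[str, int, float], release_name: str) -> str:
    """Return the version as a str, recovering it from the release name
    when the API reports 0 (meaning the version could not be parsed)."""
    if version == 0:
        # Hypothetical pattern: a trailing "-<version>" in the release name,
        # e.g. "Foo-Bar-1.23".
        match = re.search(r"-v?(\d[\w.]*)$", release_name)
        if match:
            return match.group(1)
    # int and float versions are converted to str; str values pass through.
    return str(version)
```

For example, `normalize_version(0, "Foo-Bar-1.23")` gives `"1.23"`, while `normalize_version(1.5, "Foo-Bar-1.5")` gives `"1.5"`.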
    • cpan: Improve listing process by querying the metacpan release endpoint · f57b8f3a
      Antoine Lambert authored
      Instead of querying the metacpan distribution endpoint to list origins,
      use the release endpoint, which makes it possible to list all artifacts
      associated with CPAN packages by scrolling through results.
      
      Compared to the previous implementation, this makes it possible to compute
      a last_update date for all CPAN packages and also to obtain artifact sha256
      checksums that the CPAN loader will use to check download integrity.
      
      As the multiple versions of a module are spread across multiple pages
      from the CPAN API, origins are sent to the scheduler only once all pages
      have been processed. It is also faster to proceed that way.
      
      Related to T2833
      f57b8f3a
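A rough sketch of the aggregation step described above: releases are collected across all pages before anything is emitted, so every version of a distribution ends up in one origin. The field names (`distribution`, `version`, `download_url`, `checksum_sha256`, `date`) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, Iterable, List


def aggregate_releases(
    pages: Iterable[List[Dict[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Group releases by distribution across all pages and compute the
    most recent release date for each distribution."""
    artifacts: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    last_update: Dict[str, datetime] = {}
    for page in pages:
        for release in page:
            name = release["distribution"]
            date = datetime.fromisoformat(release["date"])
            artifacts[name].append(
                {
                    "version": release["version"],
                    "url": release["download_url"],
                    "sha256": release["checksum_sha256"],
                }
            )
            # Keep the latest date seen so far for this distribution.
            if name not in last_update or date > last_update[name]:
                last_update[name] = date
    # Only now, with all pages consumed, is each origin complete.
    return {
        name: {"artifacts": artifacts[name], "last_update": last_update[name]}
        for name in artifacts
    }
```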
  10. Oct 07, 2022
    • rubygems: Use gems database dump to improve listing output · 108816f2
      Antoine Lambert authored
      Instead of using an undocumented rubygems HTTP endpoint that only
      gives us the names of the gems, use the daily PostgreSQL dump of the
      rubygems.org database.
      
      This makes it possible to list all gems, as well as all versions of a gem
      and their release artifacts. For each release artifact, the following
      information is extracted: version, download URL, sha256 checksum, release
      date, plus a couple of extra metadata fields.
      
      The lister now sets the list of artifacts and the list of metadata as extra
      loader arguments when sending a listed origin to the scheduler database.
      A last_update date is also computed, which should ensure that loading tasks
      for rubygems are scheduled only when new releases have become available
      since the last loading.
      
      Note that the lister spawns a temporary postgres instance, so this
      requires the initdb executable from a postgres server installation to be
      available in the execution environment.
      
      Related to T1777
      108816f2
    • nixguix: Exclude faulty "recursive" file origins from listing · c22f41a6
      Antoine R. Dumont authored
      For now, those can be faulty as the manifest is missing 'critical' information about how
      to recompute the hash (e.g. fs layout, executable bit, ...).
      
      Related to T4608
      Related to T3781
      c22f41a6
  11. Oct 05, 2022
    • nixguix: Refactor by renaming success or failure the different datasets · 5a53243b
      Antoine R. Dumont authored
      It's more explicit that way.
      
      Related to T3781
      5a53243b
    • Crates.io: Add last_update for each version of a crate · 4a09f660
      Franck Bret authored
      In order to reduce the number of HTTP API calls made by the loader, download
      a crates.io database dump and parse its CSV files to get a last_update
      value for each version of a crate.
      Those values are sent to the loader through the extra_loader_arguments
      'crates_metadata'.
      
      'artifacts' and 'crates_metadata' now use "version" as key.
      
      Related to T4104, D8171
      4a09f660
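As an illustration of the CSV parsing step, here is a small sketch that maps each version of one crate to its last update value. The column names (`crate_id`, `num`, `updated_at`) and the function name are assumptions about the dump layout, not something stated in the commit.

```python
import csv
import io
from typing import Dict


def last_update_by_version(csv_text: str, crate_id: str) -> Dict[str, str]:
    """Map each version of the given crate to its last update value.

    Assumed columns: crate_id, num (the version string), updated_at.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["num"]: row["updated_at"]
        for row in reader
        if row["crate_id"] == crate_id
    }
```

Keying the result by version matches the note above that 'artifacts' and 'crates_metadata' use "version" as key.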
    • nixguix: Deal with manifest entries without an integrity field · 2e6e282d
      Antoine R. Dumont authored
      In that case, this falls back to the "outputHash" field, which is equivalent to the
      integrity field except that it is used for the "recursive" outputHashMode. This adds
      the necessary assertions around this case so correct data is sent to loaders as well.
      
      Related to T3781
      2e6e282d
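The fallback can be sketched as below. The function name `entry_checksum` and the dict-based entry shape are assumptions for illustration; only the `integrity` and `outputHash` field names come from the commit message.

```python
from typing import Any, Dict, Optional


def entry_checksum(entry: Dict[str, Any]) -> Optional[str]:
    """Return the checksum of a manifest entry, preferring "integrity"
    and falling back to "outputHash" when "integrity" is absent."""
    integrity = entry.get("integrity")
    if integrity is not None:
        return integrity
    # Fallback for entries without an integrity field.
    return entry.get("outputHash")
```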
    • nixguix: Improve is_tarball detection pattern · f2377c28
      Antoine R. Dumont authored
      This now treats all query parameter values as paths to check. When a path has an
      extension, it is pattern-matched against known tarball extensions. When no extension is
      detected, the behavior is as before: it falls back to a HEAD request on the URL to get
      more information on the file.
      
      Prior to this commit, this only looked over a hard-coded list of values (for hard-coded
      keys: file, f, name, url) detected through docker runs. The new approach should
      reduce future misdetections (when new unknown "keys" show up in the wild).
      
      Related to T3781
      f2377c28
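A minimal sketch of the detection order described above, not the actual is_tarball implementation. The extension list, the function name `looks_like_tarball`, and the choice to return `None` when a HEAD request would be needed are all assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import parse_qsl, urlparse

# Hypothetical list of extensions treated as tarballs.
TARBALL_EXTENSIONS = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def looks_like_tarball(url: str) -> Optional[bool]:
    """Return True/False when an extension allows a decision, or None
    when no candidate path has an extension (the caller would then fall
    back to a HEAD request on the URL)."""
    parsed = urlparse(url)
    # The URL path plus every query parameter value are candidate paths.
    candidates = [parsed.path] + [value for _, value in parse_qsl(parsed.query)]
    with_extension = [c for c in candidates if PurePosixPath(c).suffix]
    if not with_extension:
        return None
    return any(c.lower().endswith(TARBALL_EXTENSIONS) for c in with_extension)
```

Because every query parameter value is inspected regardless of its key, a tarball passed under a previously unseen key is still picked up.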
    • nixguix: Improve further tarball detection · 2ee103e2
      Antoine R. Dumont authored
      The content type detection was slightly off, mostly for content types that include a
      charset parameter. This commit fixes it.
      
      Related to T3781
      2ee103e2
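To illustrate the kind of issue involved: a header value such as `application/x-gzip; charset=UTF-8` does not compare equal to `application/x-gzip` unless its parameters are stripped first. The sketch below shows one way to do that; the function name `media_type` is an assumption, and this is not necessarily how the commit fixes it.

```python
def media_type(content_type: str) -> str:
    """Strip parameters such as "; charset=..." from a Content-Type
    header value and normalize the remaining media type."""
    return content_type.split(";", 1)[0].strip().lower()
```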
    • nixguix: Improve git origins detection · ff80a91f
      Antoine R. Dumont authored
      Without this, some git repositories are detected as files (partly due to upstream
      misqualification). This makes an extra effort to detect those, to avoid sending
      noise to loaders.
      
      This also refactors some common code to build vcs artifacts to avoid duplication.
      
      Related to T3781
      ff80a91f
    • nixguix: Improve tarball detection · 2fbd6677
      Antoine R. Dumont authored
      Without this, some tarballs hidden within query parameters are not detected. This makes
      an extra effort to detect those, to avoid sending noise to loaders.
      
      Related to T3781
      2fbd6677
  12. Oct 04, 2022
  13. Oct 03, 2022