  1. Mar 21, 2025
  2. Feb 10, 2025
    • maven: Update test that is now failing since beautifulsoup4 4.13 · a3d66736
      Antoine Lambert authored
      The latest beautifulsoup4 release (4.13) seems to have fixed issues
      related to unexpected encodings in XML files, so a test that was
      passing previously is now failing.

      Update that test to check that the origin URL and visit type can be
      successfully extracted from a POM file with an unexpected encoding.
  3. Sep 04, 2024
    • Add save-bulk lister to check origins prior to their insertion in database · af24960b
      Antoine Lambert authored
      This new, special-purpose lister verifies a list of origins to archive
      provided by users (for instance through the Web API).
      
      Its purpose is to avoid polluting the scheduler database with origins that
      cannot be loaded into the archive.
      
      Each origin is identified by a URL and a visit type. For a given visit type,
      the lister checks whether the origin URL can be found and whether the visit
      type is valid.

      The supported visit types are those for VCS (bzr, cvs, hg, git and svn) plus
      the one for loading tarball content into the archive.
      
      Accepted origins are inserted or upserted in the scheduler database.
      
      Rejected origins are stored in the lister state.
      
      Related to #4709
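
The accept/reject flow described above could look roughly like the following. This is a simplified sketch, not the lister's actual code: `Origin`, `check_origins`, the injected `url_exists` callable and the `"tarball"` identifier are hypothetical names chosen for illustration, and the real network checks are stubbed out.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple, Tuple

# Visit types named in the commit message. "tarball" is a placeholder
# identifier (assumption), not necessarily the one used by the scheduler.
SUPPORTED_VISIT_TYPES = {"bzr", "cvs", "hg", "git", "svn", "tarball"}


class Origin(NamedTuple):
    """Hypothetical container: an origin is a URL plus a visit type."""

    url: str
    visit_type: str


def check_origins(
    origins: Iterable[Origin],
    url_exists: Callable[[str], bool],
) -> Tuple[List[Origin], Dict[str, str]]:
    """Split origins into accepted ones and rejected ones with a reason."""
    accepted: List[Origin] = []
    rejected: Dict[str, str] = {}
    for origin in origins:
        if origin.visit_type not in SUPPORTED_VISIT_TYPES:
            rejected[origin.url] = "unsupported visit type"
        elif not url_exists(origin.url):
            rejected[origin.url] = "origin URL not found"
        else:
            accepted.append(origin)
    return accepted, rejected
```

In this sketch the accepted list stands for what would be sent to the scheduler database, and the rejected mapping for what would be kept in the lister state.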
  4. Aug 27, 2024
    • crates: Use looseversion.LooseVersion2 to parse crate versions · aafaebd5
      Antoine Lambert authored
      packaging.version.parse is dedicated to parsing Python package version
      numbers, but crate versions do not necessarily follow Python version
      number conventions, so some crate versions cannot be parsed.

      Prefer looseversion.LooseVersion2 instead, which is a drop-in
      replacement for the deprecated distutils.version.LooseVersion and can
      parse all kinds of version numbers.
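
To illustrate what "loose" version parsing means, here is a minimal standard-library sketch. It is not the looseversion implementation: `loose_key` is a hypothetical helper, and the rule that numeric components sort before alphabetic ones is an arbitrary choice made so that mixed components stay comparable.

```python
import re
from typing import List, Tuple

# Split a version string into runs of digits, runs of letters, and dots.
_COMPONENT_RE = re.compile(r"(\d+|[a-z]+|\.)", re.IGNORECASE)


def loose_key(version: str) -> List[Tuple[int, int, str]]:
    """Build a sort key that never fails, whatever the version looks like."""
    parts = [p for p in _COMPONENT_RE.split(version) if p and p != "."]
    key = []
    for part in parts:
        if part.isdecimal():
            key.append((0, int(part), ""))  # numeric component
        else:
            key.append((1, 0, part))  # alphabetic or separator component
    return key


versions = ["0.10.0", "0.9.1", "0.9.0-rc1"]
print(sorted(versions, key=loose_key))
```

Such a key compares numeric components as numbers, so 0.10.0 sorts after 0.9.1, but it makes no attempt to follow semantic-versioning rules for pre-releases.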
  5. Jun 28, 2024
  6. Nov 14, 2023
    • cran: Use pyreadr instead of rpy2 to read an RDS file from Python · 4aee4da7
      Antoine Lambert authored
      The CRAN lister improvements introduced in 91e4e33d originally used pyreadr
      to read an RDS file from Python instead of rpy2.

      As swh-lister was still packaged for Debian at the time, rpy2 was chosen
      instead, as a Debian package is available for it while none is available
      for pyreadr.

      Now that Debian packaging has been dropped for swh-lister, we can reinstate
      the pyreadr-based implementation, which has the advantages of being faster
      and not depending on the R language runtime.
      
      Related to swh/meta#1709.
      v6.3.0
  7. Oct 09, 2023
    • Add Julia Lister for listing Julia Packages · f8cfa05f
      Franck Bret authored
      This module introduces the Julia lister.
      It retrieves Julia package origins from the Julia General Registry, a Git
      repository made of per-package directories with TOML definition files.
  8. Aug 21, 2023
    • cran: Improve listing of R packages · 91e4e33d
      Antoine Lambert authored
      Previously, the lister relied on the CRANtools R module, which has the
      drawback of listing only the latest version of each package registered
      in the CRAN registry.

      In order to get all available versions of each CRAN package, prefer to exploit
      the content of the weekly dump of the CRAN database in RDS format.

      To read the content of the RDS file from Python, the rpy2 package is used, as
      it has the advantage of being packaged in Debian.
      
      Related to swh/meta#1709.
      v5.9.7
  9. Aug 17, 2023
  10. Jul 10, 2023
    • Add Gitweb lister · 573958ce
      Antoine R. Dumont authored
      Some instances need specific heuristics. Some instances:
      - have summary pages which do not list metadata_url (so some
        computation happens to list git:// origins which are cloneable)
      - have summary pages which reference metadata_url as multiple comma-separated URLs
      - list relative URLs of the repository, so we need to join them with the main instance URL
        to get complete cloneable origins (or summary pages)
      - list "down" http origins (cloning those won't work), so those are listed as cloneable https
        ones (when the main URL is behind https).
      
      Refs. #1800
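
Three of the heuristics listed above can be sketched with the standard library alone. The function names and example URLs below are hypothetical and only illustrate the idea; they are not taken from the Gitweb lister code.

```python
from typing import List
from urllib.parse import urljoin, urlsplit, urlunsplit


def split_metadata_urls(metadata_url: str) -> List[str]:
    """Split a comma-separated metadata_url value into individual URLs."""
    return [url.strip() for url in metadata_url.split(",") if url.strip()]


def absolutize(instance_url: str, repo_url: str) -> str:
    """Join a possibly relative repository URL with the instance URL."""
    return urljoin(instance_url, repo_url)


def prefer_https(instance_url: str, origin_url: str) -> str:
    """Rewrite an http origin to https when the instance is served over https."""
    parts = urlsplit(origin_url)
    if urlsplit(instance_url).scheme == "https" and parts.scheme == "http":
        return urlunsplit(parts._replace(scheme="https"))
    return origin_url
```

For example, `absolutize("https://git.example.org/", "project.git")` yields `https://git.example.org/project.git`, while an already absolute URL is returned unchanged.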
      v5.8.0
  11. Nov 15, 2022
  12. Oct 07, 2022
    • rubygems: Use gems database dump to improve listing output · 108816f2
      Antoine Lambert authored
      Instead of using an undocumented rubygems HTTP endpoint that only
      gives us the names of the gems, prefer to exploit the daily PostgreSQL
      dump of the rubygems.org database.
      
      This makes it possible to list all gems, but also all versions of a gem
      and its release artifacts. For each release artifact, the following
      information is extracted: version, download URL, sha256 checksum, release
      date plus a couple of extra metadata.

      The lister will now set the list of artifacts and the list of metadata as
      extra loader arguments when sending a listed origin to the scheduler database.
      A last_update date is also computed, which should ensure loading tasks
      for rubygems are scheduled only when new releases have become available
      since the last loading.

      Note that the lister spawns a temporary postgres instance, so this
      requires the initdb executable from a postgres server installation to be
      available in the execution environment.
      
      Related to T1777
  13. Aug 09, 2022
  14. Aug 05, 2022
  15. Apr 21, 2022
  16. Dec 08, 2021
  17. Nov 29, 2021
    • lister: Add new maven lister · 8991c625
      Boris Baldassari authored
      The Maven lister retrieves the Maven Central indexes, exports them in a
      convenient text format, and parses them to identify all src archives and
      pom files in the Maven repository. Then the pom files are downloaded and
      analysed to find and yield any scm reference.
      
      Note: This is a new version of the maven lister diff D6133 which takes
      into account the initial round of reviews.
      
      Related to T1724
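
The last step, finding an scm reference in a pom file, might be sketched as follows. This uses only the standard library and is not the lister's own parser; `extract_scm_url` is a hypothetical helper, the order in which the scm child elements are tried is an arbitrary choice, and the sketch assumes the pom declares the usual POM 4.0.0 namespace.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard namespace declared by Maven POM 4.0.0 files.
POM_NS = {"pom": "http://maven.apache.org/POM/4.0.0"}


def extract_scm_url(pom_text: str) -> Optional[str]:
    """Return the first non-empty scm reference found in a pom, if any."""
    root = ET.fromstring(pom_text)
    for tag in ("connection", "developerConnection", "url"):
        node = root.find(f"pom:scm/pom:{tag}", POM_NS)
        if node is not None and node.text and node.text.strip():
            return node.text.strip()
    return None


example_pom = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <scm>
    <connection>scm:git:https://example.org/demo.git</connection>
  </scm>
</project>"""

print(extract_scm_url(example_pom))
```

A pom without an scm element simply yields `None`, so no origin would be produced for it.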
  18. Feb 05, 2021
  19. Feb 02, 2021
    • Remove no longer used legacy Lister API and update CLI options · 89335445
      Antoine Lambert authored
      Legacy Lister classes from the swh.lister.core module are no longer
      used in the swh-lister codebase, so it is time to remove them.
      
      Also remove lister CLI options related to legacy Lister API.
      
      As a consequence, the following requirements are no longer needed:
      arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.
      
      Closes T2442
    • gnu: Remove dependency on pytz · 82ab96ad
      Antoine Lambert authored
      The UTC timezone can be obtained from the datetime.timezone class in
      the Python standard library, so remove the dependency on the external
      pytz module.
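
As a small illustration of the standard-library replacement, the snippet below builds timezone-aware UTC datetimes without pytz. `utc_from_timestamp` is a hypothetical helper name, not a function from the lister.

```python
from datetime import datetime, timezone


def utc_from_timestamp(timestamp: float) -> datetime:
    """Convert a POSIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# datetime.timezone.utc plays the role previously filled by pytz.utc.
epoch = utc_from_timestamp(0)
print(epoch.isoformat())
```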
  20. Jan 18, 2021
    • lister: Add utility decorator to ease HTTP requests rate limit handling · d1fbccd9
      Antoine Lambert authored
      Add the swh.lister.utils.throttling_retry decorator, which retries a
      function performing an HTTP request that can return a 429 status code.

      The implementation is based on the tenacity module and it is assumed
      that the requests library is used when querying a URL.

      The default wait strategy is based on exponential backoff.

      The default maximum number of attempts is 5; the HTTPError exception
      is then reraised.
      
      All tenacity.retry parameters can also be overridden in client code.
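
The retry behaviour described above can be approximated with the standard library. The sketch below is not the throttling_retry implementation, which relies on tenacity and requests: `retry_on_429`, `TooManyRequests` and the `base_delay` and `sleep` parameters are hypothetical names used only to show exponential backoff with a capped number of attempts.

```python
import functools
import time
from typing import Callable


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP error carrying a 429 status code."""


def retry_on_429(
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry the decorated function on TooManyRequests, doubling the wait each time."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TooManyRequests:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: reraise the last error
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a function decorated with `@retry_on_429()` would wait 1, 2, 4 and 8 seconds between its five attempts.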
  21. Apr 11, 2020
  22. Nov 04, 2019
  23. Jun 28, 2019
    • swh.lister.cgit · b972a2a8
      Archit Agrawal authored
      Implemented a lister to list the repos for a given CGit instance.
      
      Closes T1659
  24. Feb 01, 2019
  25. Oct 30, 2017
  26. Sep 05, 2017
  27. Apr 12, 2017
  28. Mar 06, 2017
    • Refactor lister code · 68d77fd4
      Avi Kelman authored
      Streamline production of new listers by aggressively moving core
      functionality into progressively inherited (A->B->C) base classes
      with the transport layer abstracted.
      This should make common individual forge listers straightforward to
      produce with minimal customization. Github and Bitbucket listers
      can be used as examples of the indexing type.
  29. Feb 09, 2017
  30. Dec 15, 2016
  31. Oct 20, 2016
  32. Oct 19, 2016
  33. Sep 13, 2016
  34. Mar 17, 2016
  35. Mar 09, 2016
  36. Sep 21, 2015