- Jan 13, 2023
-
-
vlorentz authored
There is code in swh/loader/cli.py, and swh-loader-metadata will need to import cli.py; without a py.typed marker file, mypy complains about that import.
-
Antoine R. Dumont authored
This introduces a `create_partial_snapshot` parameter to the base loader constructor. When activated, each call to the `store_data` method will, if there is more data to fetch, create a partial snapshot (and an associated visit status). The final loop behaves as before, creating the last visit with status 'full' targeting the final snapshot. The main difference between the two behaviors is that an ingestion with that parameter on is more verbose in terms of origin_visit_status. This, in turn, allows subsequent visits of the same origin to be incremental. This may be especially interesting for cases where loading fails due to resource exhaustion (e.g. large svn or git repositories). Related to T3625
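A minimal sketch of the loop described above, with illustrative names (`build_partial_snapshot`, `write_visit_status` and the driver function are assumptions, not the actual swh-loader-core API)::

    def load(loader):
        while loader.fetch_data():  # more data left to fetch?
            loader.store_data()
            if loader.create_partial_snapshot:
                # persist an intermediate snapshot plus a visit status
                # with state "partial" so a later visit can resume
                # incrementally instead of restarting from scratch
                snapshot = loader.build_partial_snapshot()
                loader.storage.snapshot_add([snapshot])
                loader.write_visit_status(snapshot, status="partial")
        # final iteration behaves as before: one last visit with
        # status "full" targeting the final snapshot
        snapshot = loader.build_snapshot()
        loader.storage.snapshot_add([snapshot])
        loader.write_visit_status(snapshot, status="full")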
-
- Dec 20, 2022
-
-
Antoine Lambert authored
Release 22.0 of the packaging module can no longer parse invalid Python version numbers; an exception is now raised. The Conda loader used the keys of the packages dict, which are in the form "<arch>/<version>-<build>", as version numbers to sort, but those cannot be parsed anymore. So extract the intrinsic version numbers of packages instead to sort the list of versions. Also update snapshot release names to "<version>-<build>-<arch>" as each release for a given architecture targets a different directory.
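A short sketch of the failure mode and the fix; the packages dict below is illustrative, not the loader's exact data shape::

    from packaging.version import InvalidVersion, Version

    # Keys of the form "<arch>/<version>-<build>" are not valid version
    # strings anymore: packaging >= 22.0 raises instead of falling back
    # to a legacy lenient parser.
    try:
        Version("linux-64/0.1.1-h390f0a6_1")
    except InvalidVersion:
        pass

    # Sort on the intrinsic version number extracted from the package
    # metadata instead:
    packages = {
        "linux-64/0.1.1-h390f0a6_1": {"version": "0.1.1", "build": "h390f0a6_1"},
        "linux-64/0.12.0-h390f0a6_0": {"version": "0.12.0", "build": "h390f0a6_0"},
    }
    ordered = sorted(packages.values(), key=lambda p: Version(p["version"]))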
-
Antoine Lambert authored
Release 22.0 of the packaging module can no longer parse invalid Python version numbers; an exception is now raised. The RPM loader used the keys of the packages dict, which are in the form "<distribution>/<edition>/<package_version_number>", as version numbers to sort, but those cannot be parsed anymore. So use the intrinsic version numbers of packages instead to sort the list of versions.
-
vlorentz authored
-
- Dec 19, 2022
-
-
Antoine Lambert authored
In order to remove warnings about /apidoc/*.rst files being included multiple times in the toc when building the full swh documentation, prefer to include module indices only when building the standalone package documentation. Also include them in the proper Sphinx way. Related to T4496
-
- Nov 21, 2022
-
-
Franck Bret authored
The loader makes an HTTP API call to retrieve a package's related versions. It then downloads a tar.gz archive for each version.
-
- Nov 16, 2022
-
-
Kumar Shivendu authored
-
- Nov 15, 2022
-
-
Antoine R. Dumont authored
This got migrated into swh-loader-git, the sole module using it. Related to D7868
-
- Nov 14, 2022
-
-
Antoine Lambert authored
Some maven artifacts do not have any sha1 sums computed but rather md5 ones, so handle these edge cases to still check the download integrity of jar files.
-
Antoine Lambert authored
Use mocked network requests to get jar and pom files instead of reading them from the datadir directory.
-
- Nov 03, 2022
-
-
Antoine Lambert authored
It avoids downloading and processing a release archive for a CPAN module when it has already been archived by Software Heritage. Related to T2833
-
- Nov 02, 2022
-
-
Franck Bret authored
Use artifacts and rubygems_metadata, provided by the lister's extra_loader_arguments, to get the list of versions, artifact checksums and extrinsic metadata URLs. Add an EXTID manifest. Set metadata from extrinsic metadata.
-
- Oct 25, 2022
-
-
Franck Bret authored
As a follow-up of the Puppet lister evolution (D8762), manage artifacts as lists. Remove the description from the release message. Related to T4580
-
- Oct 21, 2022
-
-
Franck Bret authored
For each origin it takes advantage of the 'artifacts' data sent through the 'extra_loader_arguments' of the conda lister, providing versions, archive url, checksum, etc. The author is extracted from intrinsic metadata. Related to T4579
-
- Oct 18, 2022
-
-
David Douard authored
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4.

Also freeze flake8 dependencies. Also change flake8's repo config to github (the gitlab mirror being outdated).
-
- Oct 17, 2022
-
-
Antoine Lambert authored
Fetch extrinsic metadata by computing URLs from the metadata provided by the lister and store them as release extrinsic metadata. Related to T2833
-
- Oct 11, 2022
-
-
Antoine Lambert authored
Parsing perl module metadata files triggers a lot of errors due to badly formatted JSON or YAML, and module author info is already provided by the cpan lister as extra loader arguments, so remove that no longer needed metadata parsing step. Related to T2833
-
Antoine Lambert authored
Artifacts info for a package is now provided as loader arguments, so there is no need to query the metacpan Web API anymore to get the list of versions and their related info. Related to T2833
-
Antoine Lambert authored
Module description is not related to a particular release, so we should not add it to the release message.
-
Franck Bret authored
The loader gets enough information from extrinsic metadata to build a release object; checking intrinsic metadata was more error-prone than useful. This should fix some Sentry-reported errors. Remove 'information' and adapt the release message. Adapt the loader specifications documentation. Related to T4465, T4530, T4583
-
- Oct 07, 2022
-
-
Antoine R. Dumont authored
"nar" computation checks can happen on files too. This also deduplicate tests code on content and directory ones. Related to T3781
-
- Oct 05, 2022
-
-
Antoine R. Dumont authored
Prior to this commit, hash mismatches were handled differently for "standard" and "nar" computations. This commit closes the gap between the two. When a hash mismatch occurs, whether "nar" or "standard", the issue is caught and the next mirror url is checked. At the end, if nothing was loaded and errors occurred, an error is raised, which fails the visit. This also adds the missing tests. Related to T3781
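A hedged sketch of the fallback behavior, using only the standard library; the function names and exception type are illustrative, not the loader's actual code::

    import hashlib
    import urllib.request

    def fetch_and_check(url: str, checksums: dict) -> bytes:
        # download, then verify every "standard" checksum provided
        data = urllib.request.urlopen(url).read()
        for algo, expected in checksums.items():
            actual = hashlib.new(algo, data).hexdigest()
            if actual != expected:
                raise ValueError(f"{algo} mismatch for {url}")
        return data

    def fetch_artifact(urls, checksums):
        errors = []
        for url in urls:  # origin url first, then the mirror urls
            try:
                return fetch_and_check(url, checksums)
            except ValueError as exc:  # mismatch: check the next mirror
                errors.append(exc)
        # nothing was loaded and errors exist: raise, failing the visit
        raise ValueError(f"no valid artifact found: {errors}")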
-
Antoine R. Dumont authored
The lister now provides a "checksums_computation" field. This is either "standard" (for most cases, i.e. bare checksums on the retrieved object) or "nar" for some edge cases. In the latter case the computation is delegated to the "nix-store" command (which must be present on the system running the loading). This adapts the directory loader to deal with this case. No work has been done on the ContentLoader yet besides failing the visit if it is called with such a computation. Related to T3781
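A sketch of the dispatch this implies; the exact Nix command line is an assumption (nix-hash hashes the NAR serialization of a path by default), not necessarily what swh-loader-core runs::

    import hashlib
    import subprocess

    def compute_checksum(path: str, computation: str, algo: str = "sha256") -> str:
        if computation == "standard":
            # bare checksum over the retrieved object
            with open(path, "rb") as f:
                return hashlib.new(algo, f.read()).hexdigest()
        if computation == "nar":
            # delegated to the Nix toolchain, which must be installed
            # on the system running the loading
            return subprocess.run(
                ["nix-hash", "--type", algo, path],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
        raise ValueError(f"unknown checksums_computation: {computation!r}")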
-
- Oct 04, 2022
-
-
Antoine Lambert authored
Add a dedicated fixture implementing the loader task creation check for a given lister and listed origin, and use it in the tasks tests of the available loaders. Also remove redundant tests performing the same checks as that new fixture.
-
- Oct 03, 2022
-
-
Antoine R. Dumont authored
Related to T3781
-
Antoine Lambert authored
The previous regexp does not seem to work anymore, so use a simpler one.
-
Antoine Lambert authored
Also fix a debug log template.
-
Antoine Lambert authored
This function downloads a file and computes hashes on it; there is no archive extraction step.
-
Antoine R. Dumont authored
This adapts the content/directory loader implementations to directly use a checksums dict, which is now sent by the listers. This improves the loaders to check those checksums when retrieving the artifact (content or tarball). Thanks to a bump in the swh.model version, this can now deal with sha512 checksum checks as well. This also aligns with the current package loaders, which now also check the integrity of the tarballs they ingest. Related to T3781
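A minimal sketch of such a check, assuming swh.model's MultiHash helper; the checksums dict shape is what the commit describes, the helper usage is an assumption::

    from swh.model.hashutil import MultiHash

    def check_artifact(data: bytes, checksums: dict) -> None:
        # checksums comes from the lister,
        # e.g. {"sha256": "...", "sha512": "..."}
        actual = MultiHash.from_data(data, hash_names=list(checksums)).hexdigest()
        bad = [algo for algo, expected in checksums.items() if actual[algo] != expected]
        if bad:
            raise ValueError("checksum mismatch for: " + ", ".join(bad))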
-
Antoine R. Dumont authored
In some marginal listing cases (Nix or Guix for now), we can receive raw tarballs to ingest. This commit adds a loader to ingest those. The output of the ingestion is a snapshot with one branch, a HEAD branch targeting the ingested directory (contained within the tarball). The loader expects a mandatory 'integrity' field, used to check the tarball retrieved from the origin. It can also optionally receive a list of mirror urls in case the main origin url is no longer available; those mirror urls are solely used as fallbacks to retrieve the tarball. Related to T3781
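The resulting snapshot shape, sketched with swh.model objects; the placeholder directory id is illustrative::

    from swh.model.model import Snapshot, SnapshotBranch, TargetType

    directory_id = b"\x00" * 20  # placeholder sha1_git of the ingested directory

    snapshot = Snapshot(
        branches={
            b"HEAD": SnapshotBranch(
                target=directory_id,
                target_type=TargetType.DIRECTORY,
            ),
        }
    )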
-
- Sep 30, 2022
-
-
Antoine Lambert authored
When one or multiple tarball checksums are available, either from listers output or from Web API calls performed by some loaders, use them to check the integrity of downloaded tarballs.
-
Antoine R. Dumont authored
In some marginal listing cases (Nix or Guix for now), we can receive raw files to ingest. This commit adds a loader to ingest those. The output of the ingestion is a snapshot with one branch, a HEAD branch targeting the ingested file content. The loader expects a mandatory 'integrity' field, used to check that the content matches its declaration. It can also optionally receive a list of mirror urls in case the main origin url is no longer available; those mirror urls are solely used as fallbacks to retrieve the content. Related to T3781
-
- Sep 29, 2022
-
-
Franck Bret authored
For each origin from https://forge.puppet.com, it takes advantage of the 'artifacts' data sent through the 'extra_loader_arguments' of the Puppet lister, providing versions, archive url, last_update and filename. Author and description are extracted from intrinsic metadata. Related to T4580
-
Franck Bret authored
For each origin it calls an HTTP API endpoint to retrieve extrinsic metadata for each version of a module. Author and package description are extracted from intrinsic metadata, parsing data from META.json or META.yml at the root of the archive. Related to T2833
-
- Sep 28, 2022
-
-
Antoine Lambert authored
Software Heritage's homemade RPC layer does not know how to serialize set objects, so we need to pass lists as parameters of the *_missing methods from the storage API.
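For instance, with the directory_missing endpoint of the swh.storage API; the surrounding helper is a minimal illustrative sketch::

    def filter_missing_directories(storage, directory_ids):
        # the RPC layer cannot serialize sets, so convert explicitly
        # before calling the *_missing endpoint
        return set(storage.directory_missing(list(directory_ids)))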
-
Raphaël Gomès authored
This will allow us to use this interface in async code like ``swh-scanner``. Unfortunately, this means calling ``asyncio.run`` for sync code, but the performance impact should be negligible. The ``swh_storage.*missing*`` APIs are inconsistent for each type, which requires a lot of boilerplate code. This should be addressed in a follow-up.
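A minimal sketch of the sync-over-async pattern described above; class and method names are illustrative, not the actual swh interface::

    import asyncio

    class AsyncDiscovery:
        async def missing(self, ids: list) -> set:
            return set()  # a real implementation would query swh-storage

    class SyncDiscovery:
        # sync facade: each call goes through asyncio.run, whose
        # overhead should be negligible for batch-sized requests
        def __init__(self) -> None:
            self._impl = AsyncDiscovery()

        def missing(self, ids: list) -> set:
            return asyncio.run(self._impl.missing(ids))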
-
- Sep 26, 2022
-
-
Raphaël Gomès authored
"Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
-
Antoine Lambert authored
-