- Oct 13, 2022
-
-
vlorentz authored
-
vlorentz authored
By using set equality, pytest can diff both operands, whereas failures of other equality comparisons are harder to read.
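A minimal sketch of the kind of assertion this describes. The URLs and the test name are hypothetical and are not taken from the lister test suite.

```python
# Hypothetical values, used only to illustrate the assertion style.
expected_urls = ["https://example.org/a", "https://example.org/b"]
listed_urls = ["https://example.org/b", "https://example.org/a"]


def test_listed_origins():
    # Comparing two sets ignores ordering; when the assertion fails,
    # pytest reports the items that appear on one side only.
    assert set(listed_urls) == set(expected_urls)
```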
-
vlorentz authored
In particular, there seems to be a negligible number of origins using SSH instead of HTTPS, which the git loader cannot deal with.
-
Antoine Lambert authored
-
vlorentz authored
Tests implemented roughly the same algorithm as the lister, and compared both values...
- Oct 11, 2022
-
-
Antoine Lambert authored
The CPAN API can return versions that are not of str type: either int or float. When the version equals 0, it means CPAN failed to parse it, so in that case we try to extract it from the release name. Otherwise we make sure to convert the version to str. Related to T2833
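The rule above can be sketched as follows. The function name, the regular expression and the assumed "<distribution>-<version>" shape of release names are illustrative assumptions, not the lister's actual code.

```python
import re
from typing import Union

# Assumption: a release name ends with "-<version>", e.g. "Foo-Bar-0.03".
_VERSION_IN_NAME = re.compile(r"-v?(\d[\d._]*)$")


def normalize_version(version: Union[str, int, float], release_name: str) -> str:
    """Return the version as a str, falling back to the release name."""
    if version == 0:
        # 0 signals that CPAN could not parse the version.
        match = _VERSION_IN_NAME.search(release_name)
        if match:
            return match.group(1)
    # int and float versions are converted to str.
    return str(version)
```

For example, `normalize_version(0, "Foo-Bar-0.03")` yields `"0.03"` under this assumption, while `normalize_version(1.5, "Foo-1.5")` yields `"1.5"`.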
-
Antoine Lambert authored
Instead of querying the metacpan distribution endpoint to list origins, use the release endpoint, which makes it possible to list all artifacts associated with CPAN packages by scrolling through results. Compared to the previous implementation, this makes it possible to compute a last_update date for all CPAN packages and also to obtain artifact sha256 checksums, which the CPAN loader will use to check download integrity. As the multiple versions of a module are spread across multiple pages from the CPAN API, origins are sent to the scheduler once all pages have been processed; it is also faster to proceed that way. Related to T2833
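A rough sketch of aggregating releases across pages before sending anything to the scheduler. The record keys ("distribution", "version", "sha256", "date") are hypothetical and do not claim to match the metacpan response format.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def group_releases(
    pages: Iterable[List[dict]],
) -> Tuple[Dict[str, List[dict]], Dict[str, datetime]]:
    """Collect artifacts per package and the most recent release date."""
    artifacts: Dict[str, List[dict]] = defaultdict(list)
    last_update: Dict[str, datetime] = {}
    for page in pages:
        for release in page:
            name = release["distribution"]  # hypothetical key
            artifacts[name].append(
                {"version": release["version"], "sha256": release["sha256"]}
            )
            date = datetime.fromisoformat(release["date"])
            # Keep the latest release date seen for this package.
            if name not in last_update or date > last_update[name]:
                last_update[name] = date
    # Only after every page is consumed is the per-package view complete.
    return dict(artifacts), last_update
```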
-
- Oct 07, 2022
-
-
Antoine Lambert authored
Instead of using an undocumented rubygems HTTP endpoint that only gives us the names of the gems, use the daily PostgreSQL dump of the rubygems.org database. It makes it possible to list all gems, and also all versions of a gem and its release artifacts. For each release artifact, the following information is extracted: version, download URL, sha256 checksum, release date, plus some extra metadata. The lister will now set the list of artifacts and the list of metadata as extra loader arguments when sending a listed origin to the scheduler database. A last_update date is also computed, which should ensure that loading tasks for rubygems are scheduled only when new releases have become available since the last loading. Note that the lister will spawn a temporary postgres instance, so this requires the initdb executable from a postgres server installation to be available in the execution environment. Related to T1777
-
Antoine R. Dumont authored
For now, those can be faulty as the manifest is missing 'critical' information about how to recompute the hash (e.g. fs layout, executable bit, ...). Related to T4608 Related to T3781
-
- Oct 05, 2022
-
-
Antoine R. Dumont authored
It's more explicit that way. Related to T3781
-
Franck Bret authored
To reduce the number of HTTP API calls made by the loader, download a crates.io database dump and parse its CSV files to get a last_update value for each version of a crate. Those values are sent to the loader through the extra_loader_arguments entry 'crates_metadata'. 'artifacts' and 'crates_metadata' now use "version" as key. Related T4104, D8171
-
Antoine R. Dumont authored
In that case, this falls back to using the "outputHash", which is a field equivalent to the integrity one except that it is for the "recursive" outputHashMode. This also adds the necessary assertions around this case so that correct data is sent to loaders. Related to T3781
-
Antoine R. Dumont authored
This now includes all query parameter values as paths to check. When paths have extensions, they are pattern-matched against tarballs, if any. When no extension is detected, it does as before and falls back to a HEAD request on the URL to get more information about the file. Prior to this commit, this only looked over a hard-coded list of values (for hard-coded keys: file, f, name, url) detected through docker runs. This way of doing it should decrease future misdetections (when new unknown "keys" show up in the wild). Related to T3781
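A simplified sketch of this detection, using only the standard library. The suffix list and function names are assumptions made for illustration, and the HEAD-request fallback is represented by a `None` return value rather than implemented.

```python
from pathlib import PurePosixPath
from typing import List, Optional
from urllib.parse import parse_qsl, urlparse

# Assumption: an illustrative, non-exhaustive list of tarball suffixes.
TARBALL_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def candidate_paths(url: str) -> List[str]:
    """Return the URL path plus every query parameter value."""
    parsed = urlparse(url)
    return [parsed.path] + [value for _, value in parse_qsl(parsed.query)]


def looks_like_tarball(url: str) -> Optional[bool]:
    """True/False when an extension decides; None when none is found."""
    saw_extension = False
    for path in candidate_paths(url):
        name = path.lower()
        if name.endswith(TARBALL_SUFFIXES):
            return True
        if PurePosixPath(name).suffix:
            saw_extension = True
    # None means: no extension anywhere, the caller would need to
    # fall back to a HEAD request to learn more about the file.
    return False if saw_extension else None
```

Checking every value, whatever its key, is what removes the dependency on a fixed list of keys.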
-
Antoine R. Dumont authored
The current content type detection was a bit off, mostly for content types which include a charset. This commit fixes it. Related to T3781
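One way to handle a Content-Type value that carries a charset is to drop the parameters before comparing. This is a sketch of that idea with a hypothetical helper name, not the code from the commit.

```python
def media_type(content_type: str) -> str:
    """Strip parameters such as "; charset=utf-8" from a Content-Type value."""
    # Everything after the first ";" is a parameter, not part of the type.
    return content_type.split(";", 1)[0].strip().lower()
```

With this helper, `media_type("application/x-gzip; charset=utf-8")` gives `"application/x-gzip"`, so an exact comparison against known types still works.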
-
Antoine R. Dumont authored
Without this, some git repositories are detected as files (due to upstream misqualification too). This makes some extra effort to detect those to avoid sending noise to loaders. This also refactors some common code to build vcs artifacts to avoid duplication. Related to T3781
-
Antoine R. Dumont authored
Without this, some tarballs hidden within query parameters are not detected. This does some extra effort to detect those to avoid sending noise to loaders. Related to T3781
-
- Oct 04, 2022
-
-
Antoine R. Dumont authored
Without this distinction the current directory or content loader will fail the download as they currently expect the checksums to be about the tarball. When a recursive "integrity" is provided, it's actually about the uncompressed tarball as per the nix-store computation. It's detailed within the code. Related to T3294 Related to T3781
-
Antoine R. Dumont authored
Related to T3294 Related to T3781
-
Antoine R. Dumont authored
When that arises, we skip the origins. Related to T3781
-
Antoine R. Dumont authored
Related to T3781
-
Antoine R. Dumont authored
When that arises, we skip the origins. Related to T3781
-
Antoine R. Dumont authored
Some origins are listed as urls while they are not. They are possibly vcs. So this commit tries to detect and deal with those if possible. If not possible, they are skipped. Related to T3781 Related to P1470
-
Antoine R. Dumont authored
The end goal is to ingest the origins sparsely, which would avoid hitting the various servers around the same time for colocated origins in the upstream manifest (especially file or tarball). Related to T3781
-
- Oct 03, 2022
-
-
Antoine R. Dumont authored
Related to T3781
-
Antoine R. Dumont authored
Related to T3781
-
Antoine R. Dumont authored
Related to T3781
-
- Sep 30, 2022
-
-
Antoine Lambert authored
In listers collecting artifacts for each package to load, add artifact checksums, when that information is available, to the parameters sent to loaders so that the integrity of downloaded artifacts can be checked.
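On the receiving side, a checksum passed along with an artifact could be verified as in this sketch. The function name is hypothetical and only sha256 is handled here; the actual loaders may proceed differently.

```python
import hashlib


def check_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True when the sha256 digest of data matches expected_hex."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()
```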
-
Franck Bret authored
'artifacts' extra_loader_arguments should be a list
-
- Sep 29, 2022
-
-
-
Antoine Lambert authored
Prefer to execute the lister through a celery task, as this also makes it possible to catch issues with the task implementation. Also use docker compose v2 commands.
-
Antoine Lambert authored
The base lister class now ensures the count of listed origins will be accurate.
-
Antoine Lambert authored
Previously, the run method was returning the total count of ListedOrigin objects sent to the scheduler database. However, some listers can send multiple ListedOrigin objects for a given origin URL during the listing process, for instance when an origin is contained in multiple pages (e.g. gogs listing) or when the listing is gathering multiple versions of an origin spread across multiple pages (e.g. maven listing). This change ensures an accurate count of listed origins by maintaining a set of origin URLs associated with the sent ListedOrigin objects.
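The counting approach can be illustrated with plain URL strings standing in for ListedOrigin objects. The function name is hypothetical and this is only a sketch of the idea, not the base lister class.

```python
from typing import Iterable, List, Set


def count_listed_origins(pages: Iterable[List[str]]) -> int:
    """Count distinct origin URLs across all pages of a listing."""
    seen: Set[str] = set()
    for page in pages:
        # An origin sent again from a later page is only counted once.
        seen.update(page)
    return len(seen)
```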
-
- Sep 27, 2022
-
-
Franck Bret authored
Related T1718
-
Franck Bret authored
The puppet lister retrieves origins from https://forge.puppet.com/modules Related T4519
-
Franck Bret authored
Related T2833
-
Franck Bret authored
Use the HTTP API endpoint to get package names and build origin URLs.
-
Franck Bret authored
Related T4547
-
- Sep 26, 2022
-
-
Antoine R. Dumont authored
With the extension, the readme is included in the swh-docs build and fails. It's not intended for the documentation build, so renaming it keeps it out of the doc build loop. This fixes build [1]. [1] https://jenkins.softwareheritage.org/view/all/job/DDOC/job/dev/2395/
-
Antoine Lambert authored
That HTTP header value will now contain the lister name but also a link to our contact form in order for sysadmins to easily reach us if needed. The following template is used to generate it: "Software Heritage <lister_name> lister v<swh-lister version> (+https://www.softwareheritage.org/contact)"
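Filling in the template quoted above could look like this. The constant and function names are hypothetical, and the version string in the example is a placeholder.

```python
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version} "
    "(+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name: str, version: str) -> str:
    """Render the User-Agent value for a given lister and package version."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)
```

For instance, `build_user_agent("cpan", "1.0.0")` gives `"Software Heritage cpan lister v1.0.0 (+https://www.softwareheritage.org/contact)"`.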
-