Commits · fed8fc3ee002502cd1a1824c0fe7d7c5de9aafa8 · Kumar Shivendu / swh-loader-core

Jan 13, 2023

Move py.typed from swh/loader/{package,core}/ to swh/loader/ · fed8fc3e

vlorentz authored 2 years ago

There is code in swh/loader/cli.py, and swh-loader-metadata will need
to import cli.py, causing mypy to complain when py.typed is missing.

fed8fc3e

Allow partial snapshot creation during ingestion · fc1adf07

Antoine R. Dumont authored 2 years ago

This introduces a `create_partial_snapshot` parameter to the base loader constructor.
When activated, each call of the `store_data` method creates a partial snapshot (and an
associated visit status) if there is more data to fetch.

The final loop behaves as before, creating the last visit with status 'full' targeting the
final snapshot.

The main difference between the 2 behaviors is that an ingestion with that parameter on
is more verbose in terms of origin_visit_status. This, in turn, allows subsequent visits
for the same origin to be incremental. This may be especially interesting for cases
where loading fails due to out-of-hand resource issues (e.g. large svn or git
repositories).

Related to T3625

fc1adf07
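
The behavior described in fc1adf07 can be sketched as follows. This is a minimal illustration, not the loader's actual code: only `create_partial_snapshot` and `store_data` come from the commit message, while `SketchLoader`, `batches`, `more_data_to_fetch` and `visit_statuses` are hypothetical names introduced here.

```python
from typing import Dict, List, Tuple

Branches = Dict[str, str]


class SketchLoader:
    """Toy stand-in for a base loader (hypothetical, not the swh API)."""

    def __init__(self, batches: List[Branches], create_partial_snapshot: bool = False):
        self.batches = batches  # hypothetical: branches fetched in each round
        self.create_partial_snapshot = create_partial_snapshot
        self.branches: Branches = {}
        # (status, snapshot branches) pairs standing in for origin_visit_status objects
        self.visit_statuses: List[Tuple[str, Branches]] = []

    def store_data(self, batch: Branches, more_data_to_fetch: bool) -> None:
        self.branches.update(batch)
        # Only record an intermediate snapshot when asked to, and only
        # while there is still data left to fetch.
        if self.create_partial_snapshot and more_data_to_fetch:
            self.visit_statuses.append(("partial", dict(self.branches)))

    def load(self) -> str:
        for index, batch in enumerate(self.batches):
            self.store_data(batch, more_data_to_fetch=index < len(self.batches) - 1)
        # The final step is the same with or without partial snapshots.
        self.visit_statuses.append(("full", dict(self.branches)))
        return "full"
```

With the flag on, a three-round ingestion records two 'partial' statuses before the final 'full' one; with the flag off, only the 'full' status is recorded.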

Dec 20, 2022

conda: Fix versions sorting and update release names · a63b39e5

Antoine Lambert authored 2 years ago

Release 22.0 of the packaging module can no longer parse invalid Python version
numbers; an exception is now raised instead.

The Conda loader sorted versions using the keys of the packages dict, which are
in the form "<arch>/<version>-<build>", but those can no longer be parsed.

So extract intrinsic version numbers of packages instead to sort the list of
versions.

Also update snapshot release names to "<version>-<build>-<arch>" as each
release for a given architecture targets a different directory.

a63b39e5
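
A rough sketch of the key handling described in a63b39e5, assuming keys of the form "<arch>/<version>-<build>" as stated above. The helper names are hypothetical, and the numeric-tuple sort key is a stdlib stand-in used only to keep the example self-contained; it is not the packaging module's version parsing.

```python
import re
from typing import List, Tuple


def split_key(key: str) -> Tuple[str, str, str]:
    """Split '<arch>/<version>-<build>' into (arch, version, build)."""
    arch, rest = key.split("/", 1)
    version, build = rest.split("-", 1)
    return arch, version, build


def version_sort_key(version: str) -> Tuple[int, ...]:
    """Simplified stand-in for a real version parser: numeric components only."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def release_name(key: str) -> str:
    """Build the '<version>-<build>-<arch>' release name."""
    arch, version, build = split_key(key)
    return f"{version}-{build}-{arch}"


def sort_keys(keys: List[str]) -> List[str]:
    """Sort package keys by their intrinsic version, not by the raw key."""
    return sorted(keys, key=lambda key: version_sort_key(split_key(key)[1]))
```

Sorting on the extracted version places 0.9.1 before 0.10.0, which a plain string sort of the raw keys would not do.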

rpm: Fix package versions sorting · b6231045

Antoine Lambert authored 2 years ago

Release 22.0 of the packaging module can no longer parse invalid Python version
numbers; an exception is now raised instead.

The RPM loader sorted versions using the keys of the packages dict, which are
in the form "<distribution>/<edition>/<package_version_number>",
but those can no longer be parsed.

So use intrinsic version numbers of packages instead to sort the list of
versions.

b6231045

Disable reporting of NotFound exceptions to Sentry · e7ac7a34

vlorentz authored 2 years ago

e7ac7a34

Dec 19, 2022

docs: Include module indices only when building standalone package doc · 9f94c73b

Antoine Lambert authored 2 years ago

In order to remove warnings about /apidoc/*.rst files being included
multiple times in toc when building full swh documentation, prefer to
include module indices only when building standalone package documentation.

Also include them the proper Sphinx way.

Related to T4496

9f94c73b

Nov 21, 2022

Hackage: Loads Hackage Listed origins · b9bd1287

Franck Bret authored 2 years ago

The loader makes an HTTP API call to retrieve the versions of a package.
It then downloads the tar.gz archive for each version.

b9bd1287

Nov 16, 2022

feat: Incremental RPM loader implementation · a196c85d

Kumar Shivendu authored 2 years ago

a196c85d

Nov 15, 2022

Drop unused core.loader.DVCSLoader class · 31ab1aa6

Antoine R. Dumont authored 2 years ago

It was migrated into swh-loader-git, the sole module using it. Related to D7868

31ab1aa6

Nov 14, 2022

maven: Add support for md5 checksums to check download integrity · 301502cb

Antoine Lambert authored 2 years ago

Some maven artifacts have no sha1 sums computed, only md5 ones, so
handle these edge cases to still check the download integrity of
jar files.

301502cb
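
The fallback described in 301502cb could look roughly like the sketch below. It assumes a hypothetical `checksums` dict mapping algorithm names to hex digests; `check_integrity` is an illustrative name, not the maven loader's function.

```python
import hashlib
from typing import Dict


def check_integrity(data: bytes, checksums: Dict[str, str]) -> str:
    """Verify `data` against sha1 when available, else md5.

    Returns the name of the algorithm that was checked and raises
    ValueError on a mismatch or when no supported checksum is given.
    """
    for algo in ("sha1", "md5"):  # prefer sha1, fall back to md5
        expected = checksums.get(algo)
        if expected is None:
            continue
        actual = hashlib.new(algo, data).hexdigest()
        if actual != expected.lower():
            raise ValueError(f"{algo} mismatch: expected {expected}, got {actual}")
        return algo
    raise ValueError("no supported checksum provided")
```

An artifact that only ships an md5 sum is then still verified instead of being accepted unchecked.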

maven: Simplify tests with requests_mock_datadir fixture · 5778cfce

Antoine Lambert authored 2 years ago

Use mocked network requests to get jar and pom files instead of
reading them from the datadir directory.

5778cfce

Nov 03, 2022

cpan: Add extid manifest to CpanPackageInfo · bf2cb039

Antoine Lambert authored 2 years ago

This avoids downloading and processing a release archive for
a CPAN module if it has already been archived by Software Heritage.

Related to T2833

bf2cb039

Nov 02, 2022

Rubygems: Improve lister to make use of artifacts and rubygems_metadata · 8e34a6d7

Franck Bret authored 2 years ago

Both are provided by the lister through extra_loader_arguments.

- Use artifacts and rubygems_metadata to get the list of versions, artifact checksums and the extrinsic metadata URL
- Add an EXTID manifest
- Set metadata from extrinsic metadata

8e34a6d7

Oct 27, 2022

Add rubygems loader · 0022bb50

Raphaël Gomès authored 2 years ago and Franck Bret committed 2 years ago

Reviewers: #reviewers, anlambert

Subscribers: anlambert

Maniphest Tasks: T4581

Differential Revision: https://forge.softwareheritage.org/D8569

0022bb50

Oct 25, 2022

Puppet: Artifacts as lists · e6847f36

Franck Bret authored 2 years ago

As a follow-up to the Puppet lister evolution (D8762), manage artifacts as lists.
Remove description from release message.

Related T4580

e6847f36

Oct 21, 2022

Conda: Anaconda packages archive loader · e7ba6316

Franck Bret authored 2 years ago

For each origin it takes advantage of the 'artifacts' data sent through
'extra_loader_arguments' of the conda lister, providing versions,
archive url, checksum, etc.
The author is extracted from intrinsic metadata.

Related T4579

e7ba6316

Oct 18, 2022

pre-commit, tox: Bump pre-commit, codespell, black and flake8 · 6d8e2aba

David Douard authored 2 years ago

- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror
being outdated).

6d8e2aba

Oct 17, 2022

cpan: Collect extrinsic metadata for each module release · 85963318

Antoine Lambert authored 2 years ago

Fetch extrinsic metadata by computing URLs from the metadata provided
by the lister and store them as release extrinsic metadata.

Related to T2833

85963318

Oct 11, 2022

cpan: Do not parse intrinsic metadata for getting module author · 7b929606

Antoine Lambert authored 2 years ago

Parsing Perl module metadata files triggers a lot of errors due to badly
formatted JSON or YAML. Module author info is already provided by
the cpan lister as extra loader arguments, so remove that metadata
parsing step, which is no longer needed.

Related to T2833

7b929606

cpan: Align loader implementation with latest lister improvements · a13e3e6f

Antoine Lambert authored 2 years ago

Artifact info for a package is now provided as loader arguments, so
there is no need to query the metacpan Web API anymore to get the list
of versions and their related info.

Related to T2833

a13e3e6f

cpan: Remove module description from release message · e17ee9e0

Antoine Lambert authored 2 years ago

Module description is not related to a particular release so we
should not add it in release message.

e17ee9e0

Pubdev: Do not rely on intrinsic metadata · 4cb85e15

Franck Bret authored 2 years ago

The loader gets enough information from extrinsic metadata to build a release object; checking intrinsic metadata was more error-prone than useful.

It should fix some errors reported by Sentry.

Remove 'information' and adapt the release message.

Adapt the loader specifications documentation.

Related T4465, T4530, T4583

4cb85e15

Oct 07, 2022

ContentLoader: Allow nar computation checks · 028b7c04

Antoine R. Dumont authored 2 years ago

"nar" computation checks can happen on files too.

This also deduplicates test code between the content and directory loader tests.

Related to T3781

028b7c04

Oct 05, 2022

{Cnt|Dir}Loader: Fix standard/nar hash mismatch behavior to fail loading · 8aa6dab7

Antoine R. Dumont authored 2 years ago

Prior to this commit, there was a discrepancy between the hash mismatch computations
with "standard" and "nar" computations. This commit fixes the gap between those.

When a hash mismatch occurs, either "nar" or "standard", the issue is caught and the
next mirror url is checked. In the end, if nothing was loaded and errors were
recorded, an error is raised. This fails the visit.

This also adds the missing tests.

Related to T3781

8aa6dab7
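
The mirror fallback described in 8aa6dab7 can be sketched as below, for the "standard" computation only (the "nar" case, which relies on an external command, is not covered). `fetch_first_valid`, `fetch` and the `checksums` dict are hypothetical names; this is an illustration under those assumptions, not the loaders' implementation.

```python
import hashlib
from typing import Callable, Dict, List


def fetch_first_valid(
    urls: List[str],
    fetch: Callable[[str], bytes],
    checksums: Dict[str, str],
) -> bytes:
    """Return the payload of the first url whose content matches all checksums.

    A download failure or hash mismatch is recorded and the next url is
    tried; if every url fails, the collected errors are raised together.
    """
    errors: List[Exception] = []
    for url in urls:
        try:
            data = fetch(url)
            for algo, expected in checksums.items():
                actual = hashlib.new(algo, data).hexdigest()
                if actual != expected:
                    raise ValueError(f"{algo} mismatch for {url}")
            return data
        except (OSError, ValueError) as exc:
            errors.append(exc)
    # Nothing was loaded: surface the failure so the visit fails.
    raise RuntimeError(f"all {len(urls)} urls failed: {errors}")
```

The caller passes the origin url first and the mirror urls after it, so mirrors are only contacted when the main url is unavailable or serves mismatching content.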

DirectoryLoader: Check nar hashes when provided · 4d51ad99

Antoine R. Dumont authored 2 years ago

The lister now provides the "checksums_computation". This is either "standard" (most
cases, i.e. bare checksums on the retrieved object) or "nar" for some edge cases. In the
latter case the computation is delegated to the "nix-store" command (which should be
present on the system running the loading).

This adapts the directory loader to deal with this case.

No work has been done for the ContentLoader yet, besides failing if it is called
with such a computation.

Related to T3781

4d51ad99

Oct 04, 2022

test: Deduplicate tests implementation for loader tasks creation · b26b9881

Antoine Lambert authored 2 years ago

Add a dedicated fixture implementing loader task creation check for
a given lister and listed origin and use it in tasks tests for
available loaders.

Also remove redundant tests performing the same checks as that
new fixture.

b26b9881

Oct 03, 2022

{Content|Directory}Loader: Register tasks · a5255f10

Antoine R. Dumont authored 2 years ago

Related to T3781

a5255f10

pre-commit: Fix tests data exclusion from codespell check · c631349a

Antoine Lambert authored 2 years ago

Previous regexp does not seem to work anymore so use a simpler one.

c631349a

cli: Use memory storage as fallback when no configuration detected · 13e9bf4d

Antoine Lambert authored 2 years ago

Also fix a debug log template.

13e9bf4d

package/utils: Fix download function documentation · f6a3ed11

Antoine Lambert authored 2 years ago

This function downloads a file and computes hashes on it, there is no
archive extraction step.

f6a3ed11

{Content|Directory}Loader: Adapt support for checksums · 39c33a66

Antoine R. Dumont authored 2 years ago

This adapts the content/directory loader implementations to directly use a checksums
dict, which is now sent by the listers.

This improves the loaders, which now check those checksums when retrieving the artifact
(content or tarball). Thanks to a bump in the swh.model version, they can now deal with
sha512 checksum checks as well.

This also aligns with the current package loaders, which now also check the
integrity of the tarballs they ingest.

Related to T3781

39c33a66

Add Directory Loader to allow tarball ingestion as Directory · dbf7f3dc

Antoine R. Dumont authored 2 years ago

In some marginal listing cases (Nix or Guix for now), we can receive a raw tarball to
ingest. This commit adds a loader to ingest those. The output of the ingestion is a
snapshot with a single HEAD branch targeting the ingested directory (contained
within the tarball).

This expects to receive a mandatory 'integrity' field. It is used to check the tarball
retrieved from the origin.

This can also optionally receive a list of mirror urls in case the main origin url is no
longer available. Those mirror urls are solely used as fallback to retrieve the tarball.

Related to T3781

dbf7f3dc

Sep 30, 2022

Use tarball checksum to check download integrity in package loaders · 5482a48e

Antoine Lambert authored 2 years ago

When one or multiple tarball checksums are available, either from lister
output or from Web API calls performed by some loaders, use them to check
the integrity of downloaded tarballs.

5482a48e

Add Content Loader to ingest raw content file · f774aba5

Antoine R. Dumont authored 2 years ago

In some marginal listing cases (Nix or Guix for now), we can receive a raw file to ingest.
This commit adds a loader to ingest those. The output of the ingestion is a snapshot
with a single HEAD branch targeting the ingested file content.

This expects to receive a mandatory 'integrity' field. It is used to check that the content
matches the declaration.

This can also optionally receive a list of mirror urls in case the main origin url is no
longer available. Those mirror urls are solely used as fallback to retrieve the content.

Related to T3781

f774aba5

Sep 29, 2022

Puppet: The puppet loader loads origins from https://forge.puppet.com · 6299c091

Franck Bret authored 2 years ago

For each origin it takes advantage of the 'artifacts' data sent through
'extra_loader_arguments' from the Puppet lister, providing versions,
archive url, last_update, filename.
Author and description are extracted from intrinsic metadata.

Related T4580

6299c091

Cpan: Cpan loader loads Perl modules from cpan.org · 2db1a754

Franck Bret authored 2 years ago

For each origin it calls an HTTP API endpoint to retrieve extrinsic
metadata for each version of a module.
Author and package description are extracted from intrinsic metadata,
parsed from the META.json or META.yml file at the root of the archive.

Related T2833

2db1a754

Sep 28, 2022

discovery: Fix compatibility with storage RPC API · 7375a83c

Antoine Lambert authored 2 years ago

The Software Heritage homemade RPC layer does not know how to serialize
set objects, so we need to pass lists as parameters of the *_missing
methods from the storage API.

7375a83c

Setup async interface for discovery module · 1facea3c

Raphaël Gomès authored 2 years ago

This will allow us to use this interface in async code like ``swh-scanner``.

Unfortunately, this means calling ``asyncio.run`` for sync code, but the
performance impact should be negligible.

The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
requires a lot of boilerplate code. This should be addressed in a
follow-up.

1facea3c
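
The sync-over-async arrangement described in 1facea3c can be sketched as follows. All names here (`InMemoryArchive`, `content_missing`, `discover_missing`, `discover_missing_sync`) are hypothetical stand-ins chosen for illustration; they are not the discovery module's actual interface.

```python
import asyncio
from typing import Iterable, List, Set


class InMemoryArchive:
    """Hypothetical async archive interface backed by a plain set."""

    def __init__(self, known: Iterable[str]):
        self.known = set(known)

    async def content_missing(self, ids: List[str]) -> List[str]:
        # Takes and returns lists, as sets may not be serializable over RPC.
        return [obj_id for obj_id in ids if obj_id not in self.known]


async def discover_missing(archive: InMemoryArchive, ids: Iterable[str]) -> Set[str]:
    """Async entry point, usable directly from async callers."""
    return set(await archive.content_missing(sorted(ids)))


def discover_missing_sync(archive: InMemoryArchive, ids: Iterable[str]) -> Set[str]:
    """Sync wrapper: pays for one event loop start-up per call."""
    return asyncio.run(discover_missing(archive, ids))
```

Async code awaits `discover_missing` directly, while sync code goes through the wrapper. Note that `asyncio.run` cannot be called from a thread that already has a running event loop.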

Sep 26, 2022

Use a Merkle discovery algorithm with archives · 798f749e

Raphaël Gomès authored 2 years ago

"Discovery" is the term used to find out the differences between two
Merkle graphs. Using such an algorithm is useful in that it drastically
reduces the amount of data that needs to be transferred.

This commit introduces an efficient but simple algorithm that is a good
starting point for improved performance: random sampling of directories,
the details of which are explained in the docstrings.

Mercurial uses a more sophisticated algorithm for its discovery, but it
is quite a bit more involved and would introduce too much complexity at
once. Also, the constraints for speed that Mercurial has (in the order
of milliseconds) don't apply as obviously to this context without
further investigation.

Benchmarks
==========

Setup
-----
- With a local postgresql storage (so no network overhead), a local
  tmpfs objstorage on a fast NVMe SSD, all of which should make this
  improvement look less good than it will be in production
- With a tarball of the linux kernel at commit
  d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
- Loading a tarball of 20 commits earlier
  (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
- Only taking into account the loading (not the downloading of the
  tarball, or its decompression)

Result
------

before: ~30s
after: ~17s

Reproduced 4 times.

798f749e
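
One plausible reading of "random sampling of directories" from 798f749e is sketched below. The commit's docstrings are not reproduced here, so this is an assumption-laden illustration rather than the loader's algorithm. It relies on two Merkle properties: a directory known to the archive implies its whole subtree is known, and a missing directory implies all of its ancestors are missing. `children`, `query_missing` and `sample_size` are hypothetical names.

```python
import random
from typing import Callable, Dict, Iterable, List, Set


def _closure(start: Iterable[str], edges: Dict[str, Iterable[str]]) -> Set[str]:
    """All nodes reachable from `start` by following `edges` (start included)."""
    seen: Set[str] = set()
    stack = list(start)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def discover_missing_dirs(
    children: Dict[str, List[str]],
    query_missing: Callable[[List[str]], List[str]],
    sample_size: int = 2,
    seed: int = 0,
) -> Set[str]:
    """Return the directories absent from the archive.

    `children` maps every directory id to its sub-directory ids.
    """
    parents: Dict[str, Set[str]] = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)

    rng = random.Random(seed)
    undecided = set(children)
    missing: Set[str] = set()
    while undecided:
        sample = rng.sample(sorted(undecided), min(sample_size, len(undecided)))
        sample_missing = set(query_missing(sample))
        known = set(sample) - sample_missing
        # A known directory settles its whole subtree without further queries.
        undecided -= _closure(known, children)
        # A missing directory settles all of its ancestors as missing.
        newly_missing = _closure(sample_missing, parents)
        missing |= newly_missing
        undecided -= newly_missing
    return missing
```

Each round settles at least the sampled directories, so the loop terminates; the saving comes from every answer also settling a subtree or a chain of ancestors that then never has to be queried.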

golang: Fix imports ordering reported by isort · 26fe954b

Antoine Lambert authored 2 years ago

26fe954b