  1. Oct 20, 2021
  2. Oct 11, 2021
  3. Oct 05, 2021
  4. Oct 01, 2021
  5. Sep 30, 2021
  6. Sep 28, 2021
  7. Sep 21, 2021
  8. Sep 16, 2021
  9. Aug 09, 2021
  10. Aug 06, 2021
  11. Jul 30, 2021
  12. Jul 26, 2021
  13. Jun 09, 2021
  14. May 11, 2021
  15. Apr 26, 2021
    • tox: Add sphinx environments to check sane doc build · 15e12fae
      Antoine Lambert authored
      Enable checking that the package documentation can be built without
      producing sphinx warnings.
      
      The sphinx environment is designed to be used in continuous
      integration, in order to prevent breaking the documentation build when
      committing changes.
      
      The sphinx-dev environment is designed to be used inside a full swh
      development environment.
      
      Related to T3258
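
      To make this concrete, a sphinx tox environment of this kind typically
      turns sphinx warnings into errors and delegates to the docs Makefile.
      The snippet below is only a rough sketch under that assumption: the
      environment names follow the commit, but the options, variables and
      commands shown here are illustrative, not the actual swh configuration.

        [testenv:sphinx]
        usedevelop = true
        whitelist_externals = make
        # turn sphinx warnings into errors so CI fails on a broken doc build
        setenv =
            SPHINXOPTS = -W
        commands = make -C docs html

        # same build, but meant to run from inside a full swh development
        # environment, against locally checked-out swh packages
        [testenv:sphinx-dev]
        usedevelop = true
        whitelist_externals = make
        setenv =
            SPHINXOPTS = -W
        commands = make -C docs html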
  16. Apr 07, 2021
  17. Apr 04, 2021
  18. Mar 16, 2021
    • Rename 'git_metadata' to 'extra_headers' · 1eb1c573
      vlorentz authored
      Because they are now stored in the 'extra_headers' field instead
      of the 'metadata' field.
      
      Motivation: consistency + keep it out of 'grep metadata */swh/ -r'
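
      For readers unfamiliar with the field: these are raw git commit headers
      (for instance gpgsig) that have no dedicated model field of their own
      and must be kept verbatim. The Python sketch below is purely
      illustrative; the header values are invented and the exact model layout
      is an assumption based on this message, not taken from the code.

        # Illustrative only: invented header values, assumed field layout.
        from typing import Tuple

        # Extra git headers kept verbatim as (key, value) byte pairs;
        # previously passed around as 'git_metadata', now as 'extra_headers'.
        extra_headers: Tuple[Tuple[bytes, bytes], ...] = (
            (b"gpgsig", b"-----BEGIN PGP SIGNATURE-----\n...\n-----END PGP SIGNATURE-----"),
            (b"nonstandard-header", b"some value the loader must not lose"),
        )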
  19. Feb 25, 2021
    • Hardcode the use of the tcp transport for GitHub origins · 342f8fde
      Nicolas Dandrimont authored
      This change is necessary because of a shortcoming in the Dulwich HTTP
      transport: even though the Dulwich API lets us process the packfile in
      chunks as it is received, the HTTP transport implementation allocates
      the packfile in memory in its entirety *twice*, once in the HTTP
      library and once in a BytesIO managed by Dulwich, before passing it on
      to us as a chunked reader. Overall this triples the memory usage before
      we even get a chance to interrupt the loader once it overruns its
      memory limit.
      
      In contrast, the Dulwich TCP transport just gives us the read handle on
      the underlying socket, doing no processing or copying of the bytes. We
      can interrupt it as soon as we've received too many bytes.
      v0.9.0
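
      To make the memory-bounded fetch concrete, here is a rough sketch. It
      is not the loader's actual code: MAX_PACK_SIZE, the error type and the
      full-clone graph walker are illustrative assumptions, while
      get_transport_and_path and fetch_pack are the standard Dulwich client
      entry points.

        # Sketch only: force the git:// (TCP) transport for a GitHub origin
        # and abort as soon as the received packfile exceeds a size limit.
        import tempfile

        from dulwich.client import get_transport_and_path
        from dulwich.object_store import ObjectStoreGraphWalker

        MAX_PACK_SIZE = 4 * 1024 * 1024 * 1024  # arbitrary example limit (4 GiB)


        class PackTooLargeError(Exception):
            pass


        def fetch_pack_bounded(origin_url: str):
            # https://github.com/owner/repo -> git://github.com/owner/repo, so
            # that Dulwich streams the pack straight from the socket instead
            # of buffering the whole HTTP response in memory.
            if origin_url.startswith("https://github.com/"):
                origin_url = "git://" + origin_url[len("https://"):]

            client, path = get_transport_and_path(origin_url)
            pack = tempfile.SpooledTemporaryFile(max_size=100 * 1024 * 1024)
            received = 0

            def write_pack_chunk(data: bytes) -> None:
                nonlocal received
                received += len(data)
                if received > MAX_PACK_SIZE:
                    # with the TCP transport, chunks arrive as they are read
                    # from the socket, so this fires before memory blows up
                    raise PackTooLargeError("packfile over %d bytes" % MAX_PACK_SIZE)
                pack.write(data)

            result = client.fetch_pack(
                path,
                determine_wants=lambda refs: list(
                    {sha for ref, sha in refs.items() if not ref.endswith(b"^{}")}
                ),
                # empty local history: fetch everything
                graph_walker=ObjectStoreGraphWalker([], lambda sha: []),
                pack_data=write_pack_chunk,
            )
            pack.seek(0)
            return result, pack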
    • Stop processing packfiles before sending objects · 61afbc56
      Nicolas Dandrimont authored
      Since its creation, the git loader has processed the packfile
      downloaded from the remote repository to build an index of all objects
      and filter them before sending them on to the storage. Now that this
      functionality is implemented as a filter proxy in the storage API
      itself, the built-in filtering in the git loader is redundant.
      
      As implemented in the loader, the filtering ran through the packfile
      six times: once for the basic object id indexing, once to collect
      content ids, then once for each object type. This change removes the
      first two runs. By dropping the duplicate filtering, we should also
      reduce the load on the backend storage (the <object_type>_missing
      endpoints were being called twice).
      
      Finally, as this change removes the global index of objects, and sends
      the converted objects to the storage as soon as they're read, the memory
      usage decreases substantially for large loads.
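
      As an illustration of the streaming approach, a sketch under
      assumptions: convert_commit and the batch size are hypothetical, a real
      run handles every object type rather than only commits, and
      revision_add stands for the storage endpoint receiving revisions;
      PackData and PackInflater are Dulwich's pack-reading primitives.

        # Sketch only: stream commits out of a packfile and send them to
        # storage in small batches, without building an in-memory index of
        # the whole pack first.
        from dulwich.objects import Commit
        from dulwich.pack import PackData, PackInflater

        BATCH_SIZE = 1000  # arbitrary example batch size


        def send_revisions(pack_file, pack_size, storage, convert_commit):
            # convert_commit is a hypothetical helper turning a Dulwich Commit
            # into the storage's revision representation.
            pack_data = PackData.from_file(pack_file, pack_size)
            batch = []
            for obj in PackInflater.for_pack_data(pack_data):
                if not isinstance(obj, Commit):
                    continue  # other object types are handled the same way
                batch.append(convert_commit(obj))
                if len(batch) >= BATCH_SIZE:
                    # the filter proxy on the storage side deduplicates
                    # already-known objects, so no pre-filtering happens here
                    storage.revision_add(batch)
                    batch = []
            if batch:
                storage.revision_add(batch)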
    • Nicolas Dandrimont authored · 5e434d6f
  20. Feb 23, 2021
  21. Feb 17, 2021
  22. Feb 11, 2021
  23. Feb 02, 2021
  24. Nov 24, 2020
  25. Nov 23, 2020
  26. Nov 13, 2020
  27. Oct 02, 2020