Commits · 705976f75bc30aa1451babe10b5449b91afce2bf · Nicolas Dandrimont / swh-loader-git-old

Jan 10, 2022

Remove unnecessary use of dulwich.client.HttpGitClient · 705976f7
vlorentz authored 3 years ago
```
'requests' does the job just fine with less complexity.
```
705976f7

vlorentz authored 3 years ago

response.content_type is set by Dulwich, but isn't part of urllib3's
HTTPResponse, so we shouldn't rely on it.
(And it makes mypy complain when the 'types-urllib3' package is installed)

d7481af6

Dec 20, 2021

tests: Remove the SWHTag mock, use dulwich.objects.Tag instead. · 0cc96c25

vlorentz authored 3 years ago

This mock was clunky because it didn't actually behave much like
dulwich's Tag.

Additionally, a future commit will need to access the as_raw_chunks()
method of ShaFile objects, so SWHTag isn't suitable anymore as it
would need to diverge even more by implementing its own serialization.

0cc96c25

Dec 16, 2021
- Pin mypy and drop type annotations which make mypy unhappy · 715fbe21
  Antoine R. Dumont authored 3 years ago
  
  This also drops spurious copyright headers from those files, if present. Related to T3812
  715fbe21
Oct 28, 2021

loader: Rename ignore_history parameter to incremental · 670e8c83

Antoine R. Dumont authored 3 years ago

This:
- unifies this parameter name with the similar names used in listers
- also documents it better

Related to T3695

670e8c83

Oct 21, 2021
- converters: Fix detection of tree entries with non-standard commit/tree mode. · a40f1e00
  vlorentz authored 3 years ago
  
  v1.1.4
  
  a40f1e00
Oct 20, 2021
- git: Deprecate no longer running or used code · c843ea5c
  Antoine R. Dumont authored 3 years ago
  
  c843ea5c
Oct 11, 2021
- converters: Prevent zero dates from being converted to null dates. · 0e824769
  vlorentz authored 3 years ago
  
  They are not serialized the same way, so they cause hash mismatches.
  v1.1.3
  
  0e824769
Oct 05, 2021

dumb: Handle missing or corrupted pack file · d3d60421

Antoine Lambert authored 3 years ago

Some dumb git servers might reference a pack file that no longer exists,
even though the repository can be fully loaded without it.

So remove the bogus pack file from the global packs list when encountering
such an edge case and try to continue loading anyway.

Related to T3618

d3d60421

dumb: Handle empty repository edge case · 6b21441d
Antoine Lambert authored 3 years ago
```
An error was raised previously when trying to fetch HEAD.

Related to T3618
```
6b21441d

Oct 01, 2021
- Add specific test to the filtering branch function · 8d371e81
  Antoine R. Dumont authored 3 years ago
  
  8d371e81
- git: Add debugging log around the packfile retrieval step · 1d69d2be
  Antoine R. Dumont authored 3 years ago
  
  1d69d2be
- Unify logging instructions to use module logger instance · a34aefbf
  Antoine R. Dumont authored 3 years ago
  
  This unifies logging instructions with swh packages.
  a34aefbf
Sep 30, 2021
- Drop spurious print statement · 940a5e7e
  Antoine R. Dumont authored 3 years ago
  
  940a5e7e
Sep 28, 2021

Use correct logging instruction, let log.info format entries · 36867474
Antoine R. Dumont authored 3 years ago

v1.1.1

36867474
Clarify local/remote heads type as those are hexadecimal bytes str · c662603f
Antoine R. Dumont authored 3 years ago
```
The current conversions were a bit ambiguous; specifying the types clarifies
what is needed.
```
c662603f

loader: Add support for dumb HTTP transfer protocol · d3976ca6

Antoine Lambert authored 3 years ago

Git supports two HTTP based transfer protocols to exchange data
between two repositories: the dumb protocol and the smart protocol.

Nowadays, the smart protocol is a common method of transferring
data because it is more efficient, but there are still some git
servers in the wild that only support the dumb protocol.

Unfortunately, the dulwich package does not support that protocol,
so this kind of git repository could not be loaded into the archive.

This commit adds support for loading such git repositories by fetching
objects according to the dumb HTTP transfer protocol specification.

Related to T2489

d3976ca6
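
As a rough illustration of the dumb protocol mentioned in d3976ca6: a dumb client starts by downloading the static `info/refs` file, which lists one ref per line as a hex object id, a tab, and the ref name. The sketch below only parses such a file. It is not the loader's code; `parse_info_refs` and the sample object ids are made up for the example.

```python
from typing import Dict


def parse_info_refs(payload: bytes) -> Dict[bytes, bytes]:
    """Map each ref name to its hex object id (hypothetical helper)."""
    refs: Dict[bytes, bytes] = {}
    for line in payload.splitlines():
        if not line:
            continue  # tolerate blank lines
        # Each record is "<hex object id>\t<ref name>"
        object_id, _, ref_name = line.partition(b"\t")
        refs[ref_name] = object_id
    return refs


# Made-up sample content; a real server would return its own refs.
sample = (
    b"95dcfa3633004da0049d3d0fa03f80589cbcaf31\trefs/heads/master\n"
    b"2cb58b79488a98d2721cea644875a8dd0026b115\trefs/tags/v1.0\n"
    b"a3c2e2402b99163d1d59756e5f207ae21cccba4c\trefs/tags/v1.0^{}\n"
)
refs = parse_info_refs(sample)
```

From there a dumb client has to walk the object graph itself, fetching loose objects or pack files over plain HTTP, which is the part the smart protocol makes more efficient.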

Sep 21, 2021
- Fix tests for Dulwich < 0.20.22 · 05d6d762
  vlorentz authored 3 years ago
  
  Current Dulwich versions (unconditionally) add \n at the end of tag messages
  v1.0.1
  
  05d6d762
Sep 16, 2021
- converters: Recompute hashes and check they match the originals · 6d7a998b
  vlorentz authored 3 years ago
  
  This makes sure we don't write corrupt objects to the storage, like the examples in T75.
  v1.0.0
  
  6d7a998b
- converters: Add typing · f413e171
  vlorentz authored 3 years ago
  
  f413e171
- Migrate to pytest-style tests · 85eb5401
  vlorentz authored 3 years ago
  
  I want to use parametrized tests in a future commit, but pytest does not support them on unittest-style classes. self.subTest() would work too, but I figured it's a good time to migrate these tests to be consistent with the rest of the codebase.
  85eb5401
- Fix pytest warning about undefined marker · 8ad9799f
  vlorentz authored 3 years ago
  
  8ad9799f
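
The hash check described for 6d7a998b above can be sketched in a few lines. A git object id is the SHA-1 of a header (`<type> <size>` followed by a NUL byte) plus the raw object bytes, so a converter can recompute it and compare it with the id it was given. This is a minimal sketch, not the loader's implementation; `git_object_id` and `check_id` are hypothetical names.

```python
import hashlib


def git_object_id(object_type: bytes, payload: bytes) -> str:
    """Compute the git SHA-1 object id for a raw object payload."""
    header = object_type + b" " + str(len(payload)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def check_id(object_type: bytes, payload: bytes, expected: str) -> None:
    """Raise if the recomputed id differs from the expected one."""
    actual = git_object_id(object_type, payload)
    if actual != expected:
        raise ValueError(f"hash mismatch: expected {expected}, got {actual}")


# The empty blob and the empty tree have well-known ids.
empty_blob_id = git_object_id(b"blob", b"")
empty_tree_id = git_object_id(b"tree", b"")
check_id(b"blob", b"", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
```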
Aug 09, 2021
- tests: remove debug print · e7476cae
  vlorentz authored 3 years ago
  
  e7476cae
Aug 06, 2021

from_disk: Do not drop tags with missing tagger or date · 5448b7b1

vlorentz authored 3 years ago

Old versions of Git didn't write them, e.g. see tags refs/tags/v2.6.11
to refs/tags/v2.6.13-rc3 in linux.git

5448b7b1

Jul 30, 2021

Do not exclude falsy git objects from being added. · 92ef526e

vlorentz authored 3 years ago

AFAICT that's only the empty tree, because trees are the only Dulwich object
with a __len__, and no Dulwich objects have a __bool__.

92ef526e
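
To illustrate the pitfall behind 92ef526e: in Python, an object that defines `__len__` but no `__bool__` is falsy whenever its length is zero, so a truthiness filter silently drops it. The sketch below uses a stand-in class rather than Dulwich itself; `FakeTree` is hypothetical.

```python
class FakeTree:
    """Stand-in for a tree-like object with __len__ and no __bool__."""

    def __init__(self, entries):
        self.entries = list(entries)

    def __len__(self):
        return len(self.entries)


objects = [FakeTree(["README"]), FakeTree([]), None]

# Truthiness check: the empty tree is dropped along with None.
kept_by_truthiness = [obj for obj in objects if obj]

# Explicit None check: only the missing object is dropped.
kept_by_none_check = [obj for obj in objects if obj is not None]
```

The second filter keeps the empty tree, which is the behavior the commit title asks for.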

converters: Preserve GPG signatures on releases · c67ab026

vlorentz authored 3 years ago

Since version 0.19.10 (more specifically, this commit:
<https://github.com/dulwich/dulwich/commit/72aec0c79fb395689e40f9228df93e2a39cf8fb0>),
Dulwich strips GPG signatures from the 'message' attribute of Tag objects,
and stores it in a new attribute, 'signature'.

This means we were silently dropping all signatures from releases.

c67ab026
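
A minimal sketch of what preserving the signature can look like, assuming (as c67ab026 describes) that the signature is split off the end of the message into a separate attribute. The `FakeTag` class and `full_message` helper are hypothetical stand-ins, not Dulwich's or the loader's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTag:
    """Stand-in for a tag whose signature is stored apart from its message."""

    message: bytes
    signature: Optional[bytes] = None


def full_message(tag: FakeTag) -> bytes:
    """Re-append the signature so the archived message matches the original."""
    if tag.signature:
        return tag.message + tag.signature
    return tag.message


signed = FakeTag(
    message=b"Release 1.0\n",
    signature=b"-----BEGIN PGP SIGNATURE-----\n...\n-----END PGP SIGNATURE-----\n",
)
unsigned = FakeTag(message=b"Release 0.9\n")
```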

Jul 26, 2021

from_disk: Improve error logging · 9e52708e

vlorentz authored 3 years ago

* Lazy substitution (instead of %)
* Log actual error message in the text
* Rename variable according to PEP 8

9e52708e
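
The "lazy substitution" point in 9e52708e refers to passing arguments to the logging call instead of formatting with `%` beforehand. The sketch below shows the difference when the log level is disabled; the logger name and the `CountingRepr` class are made up for the example.

```python
import logging

logger = logging.getLogger("example.from_disk")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # DEBUG records are disabled


class CountingRepr:
    """Counts how many times it is converted to a string."""

    def __init__(self) -> None:
        self.conversions = 0

    def __str__(self) -> str:
        self.conversions += 1
        return "object-id"


eager = CountingRepr()
# Eager: the % operator builds the string before logging decides to drop it.
logger.debug("skipping %s" % eager)

lazy = CountingRepr()
# Lazy: logging formats the message only if the record is actually emitted.
logger.debug("skipping %s", lazy)
```

With DEBUG disabled, the eager call still pays for the string conversion while the lazy call does not.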

Jun 09, 2021
- mypy: Fix errors with release >= v0.900 · 317d5c20
  Antoine Lambert authored 3 years ago
  
  317d5c20
May 11, 2021
- Spool large packfiles to disk instead of consuming tons of memory · 9823cd11
  Nicolas Dandrimont authored 3 years ago
  
  v0.10.0
  
  9823cd11
Apr 26, 2021

tox: Add sphinx environments to check sane doc build · 15e12fae

Antoine Lambert authored 3 years ago

Allows checking that the package documentation can be built without
producing sphinx warnings.

The sphinx environment is designed to be used in continuous integration
in order to prevent breaking documentation build when committing changes.

The sphinx-dev environment is designed to be used inside a full swh
development environment.

Related to T3258

15e12fae

Apr 07, 2021
- Add contributors file · 8327ec6d
  Aastha Asthana authored 3 years ago
  
  v0.9.1
  
  8327ec6d
Apr 04, 2021
- Fix Pack File too big error formatting · eaa0590c
  Aastha Asthana authored 3 years ago
  
  eaa0590c
Mar 16, 2021

Rename 'git_metadata' to 'extra_headers' · 1eb1c573

vlorentz authored 3 years ago

Because they are now stored in the 'extra_headers' field instead
of the 'metadata' field.

Motivation: consistency + keep it out of 'grep metadata */swh/ -r'

1eb1c573

Feb 25, 2021

Hardcode the use of the tcp transport for GitHub origins · 342f8fde

Nicolas Dandrimont authored 3 years ago

This change is necessary because of a shortcoming in the Dulwich HTTP
transport: even if the Dulwich API lets us process the packfile in
chunks as it's received, the HTTP transport implementation needs to
entirely allocate the packfile in memory *twice*, once in the HTTP
library, and once in a BytesIO managed by Dulwich, before passing it on
to us as a chunked reader. Overall this triples the memory usage before
we can even try to interrupt the loader before it overruns its memory limit.

In contrast, the Dulwich TCP transport just gives us the read handle on
the underlying socket, doing no processing or copying of the bytes. We
can interrupt it as soon as we've received too many bytes.

342f8fde
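
The advantage described in 342f8fde, reading from a raw handle and stopping as soon as too many bytes arrive, can be sketched with plain file objects. This is an illustration under assumed names (`copy_with_limit`, `SizeLimitExceeded`), not the loader's or Dulwich's code.

```python
import io
from typing import BinaryIO


class SizeLimitExceeded(Exception):
    """Raised when the source yields more bytes than allowed (hypothetical)."""


def copy_with_limit(
    source: BinaryIO, sink: BinaryIO, limit: int, chunk_size: int = 65536
) -> int:
    """Copy source to sink chunk by chunk, aborting once limit is exceeded."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return total  # end of stream
        total += len(chunk)
        if total > limit:
            # Stop early: the rest of the stream is never read into memory.
            raise SizeLimitExceeded(f"received more than {limit} bytes")
        sink.write(chunk)


small_sink = io.BytesIO()
copied = copy_with_limit(io.BytesIO(b"x" * 100), small_sink, limit=200, chunk_size=16)
```

Only one chunk is held in memory at a time, so the limit can be enforced without first buffering the whole packfile.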

Stop processing packfiles before sending objects · 61afbc56

Nicolas Dandrimont authored 3 years ago

Since its creation, the git loader would process the packfile downloaded
from the remote repository, to make an index of all objects, filtering
them before sending them on to the storage. Since this functionality has
been implemented as a filter proxy in the storage API itself, the
built-in filtering by the git loader is now redundant.

The way the filtering was implemented in the loader would run through
the packfile six times: once for the basic object id indexing, once to
get content ids, then once for each object type. This change removes the
first two runs. By eschewing the double filtering, we should also reduce
the load on the backend storage (we would call the <object_type>_missing
endpoints twice).

Finally, as this change removes the global index of objects, and sends
the converted objects to the storage as soon as they're read, the memory
usage decreases substantially for large loads.

61afbc56

Drop unused get_fetch_history_result methods · 5e434d6f
Nicolas Dandrimont authored 4 years ago

5e434d6f

Feb 23, 2021
- README: Update to a minimum · 11fe5b27
  Antoine R. Dumont authored 3 years ago
  
  11fe5b27
Feb 17, 2021
- Rework loader instantiation logic according to loader core api · b14c06e7
  Antoine R. Dumont authored 4 years ago
  
  Note that this also updated some docstrings and types along the way. Related to T1410
  v0.8.0
  
  b14c06e7
Feb 11, 2021

loader.git: Mark visit status as not_found or failed when relevant · c2cd5fed

Antoine R. Dumont authored 4 years ago

When the initial communication with the git server fails (e.g. the repository
is not found), this marks the visit status as not_found.

When the initial communication is ok but a failure occurs during the fetch step (e.g.
pack file too big, ...), the visit status is marked as failed.

Related to T3030

c2cd5fed

loader.git: Make the failure test cases explicit · 11074801

Antoine R. Dumont authored 4 years ago

With the new loader.core 0.17, the failed or partial status changed slightly.
This adds the necessary tests to make those cases explicit.

Related to T3030

11074801