Hardcode the use of the tcp transport for GitHub origins (!62) · Merge requests · Platform / Development / swh-loader-git

Nicolas Dandrimont requested to merge generated-differential-D5148-source into generated-differential-D5148-target Feb 25, 2021

This change is necessary because of a shortcoming in the Dulwich HTTP transport. Even though the Dulwich API lets us process the packfile in chunks as it's received, the HTTP transport implementation allocates the entire packfile in memory twice (once in the HTTP library, and once in a BytesIO managed by Dulwich) before passing it on to us as a chunked reader. Overall, this triples the memory usage before we even get a chance to stop the loader from overrunning its memory limit.

In contrast, the Dulwich TCP transport just gives us the read handle on the underlying socket, doing no processing or copying of the bytes. We can interrupt it as soon as we've received too many bytes.
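The two ideas above (forcing the `git://` scheme for GitHub origins, and aborting as soon as too many bytes have arrived) can be illustrated with a minimal standard-library sketch. This is not the code from this merge request: `force_tcp_transport`, `read_with_limit`, `PackTooLarge`, and the chunk size are hypothetical names and values chosen for illustration, and the stream stands in for whatever read handle the transport provides.

```python
import io
from urllib.parse import urlparse, urlunparse


class PackTooLarge(Exception):
    """Hypothetical error raised when the incoming pack exceeds the limit."""


def force_tcp_transport(origin_url: str) -> str:
    """Rewrite an http(s) GitHub URL to the git:// scheme.

    Illustrative only: URLs for any other host are returned unchanged.
    """
    parsed = urlparse(origin_url)
    if parsed.scheme in ("http", "https") and parsed.netloc == "github.com":
        return urlunparse(parsed._replace(scheme="git"))
    return origin_url


def read_with_limit(stream, limit: int, chunk_size: int = 65536):
    """Yield chunks from a file-like object, stopping once `limit` is exceeded.

    Because the stream is consumed incrementally, at most one chunk beyond
    the limit is ever held in memory before the error is raised.
    """
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        received += len(chunk)
        if received > limit:
            raise PackTooLarge(f"received {received} bytes, limit is {limit}")
        yield chunk


# Usage: an in-memory stream stands in for the socket read handle.
url = force_tcp_transport("https://github.com/user/repo")
data = b"".join(read_with_limit(io.BytesIO(b"x" * 10), limit=100, chunk_size=4))
```

With a reader that buffers the whole response first, the size check in `read_with_limit` could only run after the memory had already been allocated, which is the problem described for the HTTP transport.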

Test Plan

This used to be the behavior of this loader, and has been tested on some GitHub origins.

Migrated from D5148 (view on Phabricator)
