Stop processing packfiles before sending objects (!61) · Merge requests · Platform / Development / swh-loader-git

Nicolas Dandrimont requested to merge generated-differential-D5147-source into generated-differential-D5147-target Feb 25, 2021

Since its creation, the git loader has processed the packfile downloaded from the remote repository to build an index of all objects, filtering them before sending them on to the storage. Now that this functionality is implemented as a filter proxy in the storage API itself, the loader's built-in filtering is redundant.
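To illustrate why filtering in the loader becomes redundant, here is a minimal sketch of the filter-proxy idea. It is not the swh-storage implementation: `InMemoryBackend`, `FilteringProxy`, `objects_missing` and `objects_add` are hypothetical names chosen for this example only.

```python
from typing import Dict, Iterable, Set


class InMemoryBackend:
    """Toy stand-in for a storage backend (hypothetical, not the swh API)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.missing_calls = 0  # how many times the "missing" endpoint was hit

    def objects_missing(self, ids: Iterable[str]) -> Set[str]:
        """Return the subset of ids the backend does not hold yet."""
        self.missing_calls += 1
        return {i for i in ids if i not in self.objects}

    def objects_add(self, objects: Dict[str, bytes]) -> int:
        """Store the given objects and return how many were written."""
        self.objects.update(objects)
        return len(objects)


class FilteringProxy:
    """Forwards only the objects the backend does not already hold."""

    def __init__(self, backend: InMemoryBackend) -> None:
        self.backend = backend

    def objects_add(self, objects: Dict[str, bytes]) -> int:
        # Single round of filtering, done on the storage side of the API.
        missing = self.backend.objects_missing(objects.keys())
        return self.backend.objects_add(
            {i: data for i, data in objects.items() if i in missing}
        )


backend = InMemoryBackend()
storage = FilteringProxy(backend)
first = storage.objects_add({"a": b"1", "b": b"2"})   # both objects are new
second = storage.objects_add({"b": b"2", "c": b"3"})  # only "c" is new
```

With a proxy like this in place, a client that also filters before calling `objects_add` would query the "missing" endpoint a second time for the same ids, which is the double filtering this change removes.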

The loader's filtering ran through the packfile six times: once for the basic object id indexing, once to get content ids, then once for each of the four object types. This change removes the first two runs. By dropping the double filtering, we should also reduce the load on the backend storage, since the <object_type>_missing endpoints were previously called twice.

Finally, because this change removes the global index of objects and sends the converted objects to the storage as soon as they are read, memory usage decreases substantially for large loads.
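The remaining per-type passes and the streaming behaviour can be sketched as follows. This is a toy model under stated assumptions, not the loader's code: `PACK`, `iter_pack`, `load` and `send` are hypothetical, and the four type names are assumed from the arithmetic in the description (six passes minus the two removed).

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-in for a parsed packfile: (object type, object id) pairs.
PACK: List[Tuple[str, str]] = [
    ("content", "c1"), ("directory", "d1"), ("content", "c2"),
    ("revision", "r1"), ("release", "t1"),
]
# Assumed object types; one pass over the pack remains for each of them.
OBJECT_TYPES = ("content", "directory", "revision", "release")

passes = 0  # counts full runs through the pack


def iter_pack(pack: List[Tuple[str, str]], object_type: str) -> Iterator[str]:
    """One full pass over the pack, yielding only objects of one type."""
    global passes
    passes += 1
    for type_, id_ in pack:
        if type_ == object_type:
            yield id_


def load(pack: List[Tuple[str, str]], send: Callable[[str, str], None]) -> None:
    """Send each object as soon as it is read; no global index is built."""
    for object_type in OBJECT_TYPES:
        for object_id in iter_pack(pack, object_type):
            send(object_type, object_id)


received: Dict[str, List[str]] = {}
load(PACK, lambda t, i: received.setdefault(t, []).append(i))
```

Because `iter_pack` is a generator, only the object currently being sent is held by the loop, rather than an index of the whole pack.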

Test Plan

Integration tests unchanged

Migrated from D5147 (view on Phabricator)
