
Improve CVS loader performance

This diff contains two commits that should greatly improve loading performance for large CVS repositories.

loader: Reconstruct repo filesystem incrementally at each revision

Instead of creating a from_disk.Directory instance after each replayed
CVS revision by recursively scanning all directories of the repository,
keep a single instance as a class member and synchronize it with the
reconstructed filesystem after each revision replay.

This should improve loader performance, especially when dealing with
large repositories.

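The idea can be sketched as follows. This is an illustrative, self-contained model, not the actual swh loader code: the `RepoTree` and `apply_revision` names are hypothetical, and the real loader operates on a `from_disk.Directory` rather than plain dicts. The point is that each revision replay only touches the changed paths, rather than rescanning the whole tree.

```python
class RepoTree:
    """A minimal in-memory filesystem tree: directories are dicts,
    file contents are bytes. Kept as loader state across revisions."""

    def __init__(self):
        self.root = {}

    def write(self, path, content):
        # Walk (and create) parent directories, then set the file entry.
        *parents, name = path.split("/")
        node = self.root
        for part in parents:
            node = node.setdefault(part, {})
        node[name] = content

    def delete(self, path):
        # Walk parent directories, then remove the entry.
        *parents, name = path.split("/")
        node = self.root
        for part in parents:
            node = node[part]
        del node[name]


def apply_revision(tree, changes):
    """Patch the tree in place with one revision's changes.

    changes: iterable of ("write", path, content) or ("delete", path)
    tuples. Cost is proportional to the number of changed paths,
    not to the total size of the repository.
    """
    for change in changes:
        if change[0] == "write":
            tree.write(change[1], change[2])
        else:
            tree.delete(change[1])
    return tree
```

A full recursive scan after every revision is O(repository size) per revision; patching a persistent tree like this is O(changes) per revision, which is where the speedup for large repositories comes from.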
loader: Yield only modified objects in process_cvs_changesets

Previously, after each revision replay, all files and directories of the
CVS repository being loaded were collected and sent to the storage.
This is a real bottleneck in terms of loading performance, as it delegates
the filtering of new objects to archive to the storage filtering proxy.

As we know exactly the set of paths modified in a CVS revision, prefer
to do that filtering on the loader side and send only the modified
objects to storage, instead of the whole set of contents and directories
from the reconstructed filesystem.

This should greatly improve loading performance for large repositories
and also reduce loader memory consumption.
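A sketch of that loader-side filtering, under illustrative assumptions (the `modified_objects` helper is hypothetical, not the actual `process_cvs_changesets` code): when a file changes, its content must be re-sent, and so must every directory on the path from the root down to it, since a directory's identifier changes whenever any entry beneath it changes. Everything else is untouched and can be skipped.

```python
def modified_objects(modified_paths):
    """Yield the file paths changed in one revision plus every ancestor
    directory path ("" denotes the root) that must be re-sent to storage.

    This replaces sending the whole reconstructed filesystem and letting
    the storage filtering proxy discard the unchanged objects.
    """
    dirs = set()
    for path in modified_paths:
        parts = path.split("/")[:-1]  # parent directory components
        for i in range(len(parts) + 1):
            dirs.add("/".join(parts[:i]))  # root, then each ancestor
    yield from sorted(modified_paths)  # changed file contents
    yield from sorted(dirs)            # directories whose hash changed
```

For a revision touching only `src/main.c` in a large tree, this yields just `src/main.c`, `src`, and the root, regardless of how many other files the repository contains.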

Migrated from D8682 (view on Phabricator)
