
to_disk: Speed up directory cooking with multi-threading

Previously, when cooking a directory, content bytes were fetched sequentially, which could take a long time for large directories.

To speed up the cooking process, retrieve the content bytes in parallel using the concurrent.futures module from the Python standard library, which is well suited to running loops of I/O-bound tasks concurrently and to issuing tasks asynchronously.
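As a rough illustration of the approach (not the actual code of this merge request), the sketch below fetches a list of contents through a thread pool. The names `fetch_all`, `fetch_one`, `fake_fetch` and the `max_workers` default are hypothetical and chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(
    content_ids: Iterable[str],
    fetch_one: Callable[[str], bytes],
    max_workers: int = 8,  # assumed value, not taken from the merge request
) -> Dict[str, bytes]:
    """Fetch the bytes of every content id concurrently using threads."""
    ids = list(content_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order and re-raises
        # any exception raised by a worker when its result is consumed.
        results = executor.map(fetch_one, ids)
        return dict(zip(ids, results))


def fake_fetch(content_id: str) -> bytes:
    # Hypothetical stand-in for a remote storage call (I/O-bound in practice).
    return content_id.encode()


if __name__ == "__main__":
    contents = fetch_all(["a", "b", "c"], fake_fetch)
    print(sorted(contents))
```

Threads fit here because each fetch spends most of its time waiting on the network, so the global interpreter lock is not the bottleneck.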

Below are some cooking timings using the following vault config:

storage:
  cls: remote
  url: http://moma.internal.softwareheritage.org:5002/

Without multi-threading:

$ time swh vault cook -C /tmp/vault.yml swh:1:dir:86170c4a719bc655b893cd5b061c98ab0cadc860 /tmp/dir.tar.gz
WARNING:swh.core.cli:Could not load subcommand foo: ModuleNotFoundError("No module named 'swh.foo.cli'")
WARNING:swh.core.cli:Could not load subcommand swh.objstorage.replayer: ModuleNotFoundError("No module named 'swh.objstorage.replayer'")

real    0m2,414s
user    0m0,899s
sys     0m0,074s

With multi-threading:

$ time swh vault cook -C /tmp/vault.yml swh:1:dir:86170c4a719bc655b893cd5b061c98ab0cadc860 /tmp/dir.tar.gz

real    0m1,290s
user    0m0,854s
sys     0m0,088s

Without multi-threading:

$ time swh vault cook -C /tmp/vault.yml swh:1:dir:ee6d64e695081a9205726af023e29dfe18e39956 /tmp/dir.tar.gz
WARNING:swh.core.cli:Could not load subcommand foo: ModuleNotFoundError("No module named 'swh.foo.cli'")
WARNING:swh.core.cli:Could not load subcommand swh.objstorage.replayer: ModuleNotFoundError("No module named 'swh.objstorage.replayer'")

real    1m26,330s
user    0m5,055s
sys     0m0,439s

With multi-threading:

$ time swh vault cook -C /tmp/vault.yml swh:1:dir:ee6d64e695081a9205726af023e29dfe18e39956 /tmp/dir.tar.gz
WARNING:swh.core.cli:Could not load subcommand foo: ModuleNotFoundError("No module named 'swh.foo.cli'")
WARNING:swh.core.cli:Could not load subcommand swh.objstorage.replayer: ModuleNotFoundError("No module named 'swh.objstorage.replayer'")

real    0m10,774s
user    0m5,270s
sys     0m0,586s

Without multi-threading: I did not re-run the cooking of this directory today, but it took around three hours last Friday.

With multi-threading:

$ time swh vault cook -C /tmp/vault.yml swh:1:dir:44dde92e4dbd16f25c7ce50240bf53a7b753e7ad /tmp/dir.tar.gz

real    21m48,714s
user    7m2,126s
sys     0m47,035s
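To put the measurements above in perspective, the wall-clock ("real") times can be turned into speedup ratios. The helper names below are hypothetical. The third directory is left out because its baseline is only the "around three hours" estimate, not a measured value.

```python
import re


def parse_real_time(value: str) -> float:
    """Convert a `time` value such as '1m26,330s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[,.](\d+)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


def speedup(before: str, after: str) -> float:
    """Ratio of the sequential wall-clock time to the multi-threaded one."""
    return parse_real_time(before) / parse_real_time(after)


# "real" values copied from the two directories measured both ways above.
small = speedup("0m2,414s", "0m1,290s")
medium = speedup("1m26,330s", "0m10,774s")

if __name__ == "__main__":
    print(f"small directory: {small:.1f}x, medium directory: {medium:.1f}x")
```

With these figures, the gain is a little under 2x for the small directory and about 8x for the medium one.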
