
svn_repo: Optimize export_temporary performance significantly

The export_temporary method of the SvnRepo class exports the content of a Subversion repository at a given revision to a temporary directory.

As we also export the externals that might be associated with some paths in the repository, we first need to get all the svn:externals property values in order to determine whether there are recursive or relative externals and adjust some export parameters accordingly.
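
For illustration, here is a minimal sketch of what such a check could look like, assuming the property values are available as a mapping from path to svn:externals definition; the function name and the exact detection rules are hypothetical and do not mirror the loader's actual code.

# Hypothetical helper sketching the kind of check described above; the names
# and the detection rules are assumptions, not the loader's actual code.
def analyze_externals(externals_props: dict[str, str], repo_url: str) -> tuple[bool, bool]:
    """Given a mapping of repository path -> svn:externals property value,
    return (has_relative_externals, has_recursive_externals)."""
    has_relative = False
    has_recursive = False
    for prop_value in externals_props.values():
        for line in prop_value.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            tokens = line.split()
            # Relative external definitions start with ../, ^/, // or /
            if any(tok.startswith(("../", "^/", "//", "/")) for tok in tokens):
                has_relative = True
            # An external pointing back into the same repository may recurse
            if any(tok.startswith(repo_url) for tok in tokens):
                has_recursive = True
    return has_relative, has_recursive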

While that operation is fast when the Subversion repository is hosted locally, it is terribly slow when the repository is hosted on a remote server. Indeed, a recursive propget operation on a remote server sends a lot of network requests, which slows down the process considerably, especially with large repositories.

To improve performance, the previous implementation performed a full checkout of the repository to the local filesystem and read the svn:externals property values from it. Nevertheless, that process is time consuming for large repositories and can consume a lot of disk space.
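
For reference, a minimal sketch of that previous strategy using plain svn CLI calls (the helper name is hypothetical and the real implementation differs in detail):

import subprocess
import tempfile

def get_externals_via_checkout(repo_url: str, revision: int) -> str:
    """Previous approach: full checkout, then a local recursive propget."""
    checkout_dir = tempfile.mkdtemp(prefix="swh.loader.svn.")
    # Full checkout of the repository at the given revision
    # (time consuming and disk hungry for large repositories)
    subprocess.run(
        ["svn", "checkout", "-r", str(revision), "--ignore-externals",
         repo_url, checkout_dir],
        check=True,
    )
    # Read all svn:externals property values from the local working copy
    propget = subprocess.run(
        ["svn", "propget", "--recursive", "svn:externals", checkout_dir],
        check=True, capture_output=True, text=True,
    )
    return propget.stdout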

In order to remove that bottleneck and improve the overall performance of getting all property values, introduce a C++ extension module for Python that implements a fast way to crawl all paths of a repository and their associated properties. Unlike the svn ls --depth infinity or svn propget -R commands, it performs only one SVN request over the network, hence saving time, especially with large repositories. The code is freely inspired by the fast-svn-crawler project by Dmitry Pavlenko (https://sourceforge.net/projects/fastsvncrawler/).
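
As a usage sketch, the new module could be used along these lines, assuming crawl_repository returns a mapping from repository path to its properties (the benchmark below only iterates over its keys, so the exact value type is an assumption):

from swh.loader.svn.fast_crawler import crawl_repository

def get_externals_fast(repo_url: str) -> dict[str, str]:
    """Collect svn:externals property values with a single network request."""
    paths_and_props = crawl_repository(repo_url)
    # Keep only the paths carrying an svn:externals property
    # (assumes each value is a dict mapping property names to property values)
    return {
        path: props["svn:externals"]
        for path, props in paths_and_props.items()
        if "svn:externals" in props
    }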

The obtained speedup is quite impressive: on a large remote repository, listing all paths using svn ls --depth infinity or getting all svn:externals property values using svn propget -R takes around one hour, while it takes only a couple of minutes using the approach implemented in the C++ extension module. That approach also saves disk space, as we no longer need to perform a full checkout of the repository.

This change should greatly improve performance when reloading an SVN repository already visited by Software Heritage. Indeed, before possibly archiving the new commits made since the last visit, the loader checks that the repository history has not been altered by calling the export_temporary method with the remote repository URL.

Below are some benchmarks for listing all paths of a large repository:

  • using svn propget -R
$ time svn propget -R svn:externals https://svn.code.sf.net/p/swig/code
svn: E000110: Error running context: Connection timed out

real    148m14,425s
user    0m26,722s
sys     0m10,709s
  • using svn ls --depth infinity
$ time svn ls --depth infinity https://svn.code.sf.net/p/swig/code
...

real    61m47,829s
user    0m16,482s
sys     0m6,950s
  • using the C++ extension module for Python
$ time python -c "from swh.loader.svn.fast_crawler import crawl_repository; print('\n'.join(crawl_repository('https://svn.code.sf.net/p/swig/code').keys()))"
...

real    4m14,257s
user    0m13,626s
sys     0m3,227s

I have also tested in the docker environment the performance of reloading a large Subversion repository before and after that optimization:

  • before the optimization
swh-loader_1                         | [2023-05-25 19:40:57,029: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[f477e2da-b2a3-4095-a8cf-9ea35eef28a1] received
swh-loader_1                         | [2023-05-25 19:42:12,217: DEBUG/ForkPoolWorker-1] Loading config file /loader.yml
swh-loader_1                         | [2023-05-25 19:42:12,224: DEBUG/ForkPoolWorker-1] PID 148 is live, skipping
swh-loader_1                         | [2023-05-25 19:42:12,237: INFO/ForkPoolWorker-1] Load origin 'https://svn.code.sf.net/p/swig/code' with type 'svn'
swh-loader_1                         | [2023-05-25 19:42:19,243: DEBUG/ForkPoolWorker-1] Checking if history of repository got altered since last visit
swh-loader_1                         | [2023-05-25 19:42:19,244: DEBUG/ForkPoolWorker-1] svn checkout -r 13980 --depth infinity --ignore-externals https://svn.code.sf.net/p/swig/code /tmp/swh.loader.svn.3k_pdiul-148/checkout-revision-13980.yeuge3rj
swh-loader_1                         | [2023-05-25 20:13:30,930: DEBUG/ForkPoolWorker-1] svn propget --recursive svn:externals /tmp/swh.loader.svn.3k_pdiul-148/checkout-revision-13980.yeuge3rj
swh-loader_1                         | [2023-05-25 20:13:36,482: DEBUG/ForkPoolWorker-1] svn export -r 13980 --depth infinity --ignore-keywords https://svn.code.sf.net/p/swig/code /tmp/swh.loader.svn.3k_pdiul-148/check-revision-13980._0bxj55x/code
swh-loader_1                         | [2023-05-25 20:48:17,988: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.3k_pdiul-148/check-revision-13980._0bxj55x
swh-loader_1                         | [2023-05-25 20:48:23,034: DEBUG/ForkPoolWorker-1] snapshot: Snapshot(branches=ImmutableDict({b'HEAD': SnapshotBranch(target=hash_to_bytes('d7f64244c1081062e0d1e73e1a11d3709c540f5c'), target_type=TargetType.REVISION)}), id=hash_to_bytes('1e5620262edd4829aacd994fe4d1735d0f90d4c3'))
swh-loader_1                         | [2023-05-25 20:48:23,034: DEBUG/ForkPoolWorker-1] Flushing 1 objects of type snapshot
swh-loader_1                         | [2023-05-25 20:48:23,057: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.3k_pdiul-148
swh-loader_1                         | [2023-05-25 20:48:23,058: INFO/ForkPoolWorker-1] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[f477e2da-b2a3-4095-a8cf-9ea35eef28a1] succeeded in 3970.840362989009s: {'status': 'uneventful'}
  • after the optimization
swh-loader_1                         | [2023-05-25 20:57:52,666: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[5682f5b7-c34d-4c87-8897-f4d32ee4a13c] received
swh-loader_1                         | [2023-05-25 20:57:52,668: DEBUG/ForkPoolWorker-1] Loading config file /loader.yml
swh-loader_1                         | [2023-05-25 20:57:52,682: DEBUG/ForkPoolWorker-1] PID 166 is live, skipping
swh-loader_1                         | [2023-05-25 20:57:52,698: INFO/ForkPoolWorker-1] Load origin 'https://svn.code.sf.net/p/swig/code' with type 'svn'
swh-loader_1                         | [2023-05-25 20:57:52,698: DEBUG/ForkPoolWorker-1] lister_not provided, skipping extrinsic origin metadata
swh-loader_1                         | [2023-05-25 20:57:59,093: DEBUG/ForkPoolWorker-1] Checking if history of repository got altered since last visit
swh-loader_1                         | [2023-05-25 20:59:32,998: DEBUG/ForkPoolWorker-1] svn export -r 13980 --depth infinity --ignore-keywords https://svn.code.sf.net/p/swig/code /tmp/swh.loader.svn.hu376ned-166/check-revision-13980.7_s4yv8b/code
swh-loader_1                         | [2023-05-25 21:33:50,997: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.hu376ned-166/check-revision-13980.7_s4yv8b
swh-loader_1                         | [2023-05-25 21:33:56,083: DEBUG/ForkPoolWorker-1] snapshot: Snapshot(branches=ImmutableDict({b'HEAD': SnapshotBranch(target=hash_to_bytes('d7f64244c1081062e0d1e73e1a11d3709c540f5c'), target_type=TargetType.REVISION)}), id=hash_to_bytes('1e5620262edd4829aacd994fe4d1735d0f90d4c3'))
swh-loader_1                         | [2023-05-25 21:33:56,084: DEBUG/ForkPoolWorker-1] Flushing 1 objects of type snapshot
swh-loader_1                         | [2023-05-25 21:33:56,104: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.hu376ned-166
swh-loader_1                         | [2023-05-25 21:33:56,108: INFO/ForkPoolWorker-1] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[5682f5b7-c34d-4c87-8897-f4d32ee4a13c] succeeded in 2163.4371637250006s: {'status': 'uneventful'}

So we go from 3970s to 2163s for reloading the same repository, a gain of 1807s.

