Use a Merkle discovery algorithm with archives
"Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred.
This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings.
Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation.
Benchmarks
Setup
- With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production
- With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
- Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
- only taking into account the loading (not the downloading of the tarball, or its decompression)
Result
before: ~30s after: ~17s
Reproduced 4 times.
Migrated from D8521 (view on Phabricator)
Merge request reports
Activity
Build is green
Patch application report for D8521 (id=30687)
Rebasing onto e1f2135f...
Current branch diff-target is up to date.
Changes applied before test
commit cb93ebabeebbd592a791b93fc034c2433ac88b41 Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/885/ for more details.
Build has FAILED
Patch application report for D8521 (id=30695)
Rebasing onto e1f2135f...
Current branch diff-target is up to date.
Changes applied before test
commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537 Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/console
Build has FAILED
Patch application report for D8521 (id=30695)
Rebasing onto e1f2135f...
Current branch diff-target is up to date.
Changes applied before test
commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537 Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/console
Build is green
Patch application report for D8521 (id=30696)
Rebasing onto e1f2135f...
Current branch diff-target is up to date.
Changes applied before test
commit 995ae98a51f10e0028132ad8b13b143e1eafc2ea Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/888/ for more details.
Build has FAILED
Patch application report for D8521 (id=30695)
Rebasing onto e1f2135f...
Current branch diff-target is up to date.
Changes applied before test
commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537 Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/console
Build is green
Patch application report for D8521 (id=30702)
Rebasing onto e1f2135f...
Current branch diff-target is up to date.
Changes applied before test
commit dc781b5e67d5d6b3c09794b1040f56443a3a0b47 Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/890/ for more details.
Forget the bits about
|=
, they are fine. (thx to @olasd for pointing it out)! In !320 (closed), @vlorentz wrote: There seems to be two chunks in
get_sample
not covered by tests. Could you add some coverage for them?Sure, I'll check that out after lunch, I've addressed/answered your questions in the mean time.
Build has FAILED
Patch application report for D8521 (id=30759)
Rebasing onto e1f2135f...
Current branch diff-target is up to date.
Changes applied before test
commit 14df70ed3cef5ab5faf26dec1827b743b221ecf5 Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/console
Build is green
Patch application report for D8521 (id=30760)
Rebasing onto 26fe954b...
First, rewinding head to replay your work on top of it... Applying: Use a Merkle discovery algorithm with archives
Changes applied before test
commit 5add0728637e4374a2662beb5d608f0a24b49bef Author: Raphaël Gomès <rgomes@octobus.net> Date: Thu Sep 22 12:09:23 2022 +0200 Use a Merkle discovery algorithm with archives "Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred. This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings. Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation. Benchmarks ========== Setup ----- - With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production - With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded - Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78) - Only taking into account the loading (not the downloading of the tarball, or its decompression) Result ------ before: ~30s after: ~17s Reproduced 4 times.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/892/ for more details.