Use a Merkle discovery algorithm with archives

Build is green

Patch application report for D8521 (id=30687)

Rebasing onto e1f2135f...

Current branch diff-target is up to date.

Changes applied before test

commit cb93ebabeebbd592a791b93fc034c2433ac88b41
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/885/ for more details.

Nice, thanks.

I got some questions and suggestions inline.

Optimize loops, add types, improve docs and fix a bug

Build has FAILED

Patch application report for D8521 (id=30695)

Rebasing onto e1f2135f...

Current branch diff-target is up to date.

Changes applied before test

commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/console

Fix import

Build has FAILED

Patch application report for D8521 (id=30695)

Rebasing onto e1f2135f...

Current branch diff-target is up to date.

Changes applied before test

commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/console

Build is green

Patch application report for D8521 (id=30696)

Rebasing onto e1f2135f...

Current branch diff-target is up to date.

Changes applied before test

commit 995ae98a51f10e0028132ad8b13b143e1eafc2ea
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/888/ for more details.

Build has FAILED

Patch application report for D8521 (id=30695)

Rebasing onto e1f2135f...

Current branch diff-target is up to date.

Changes applied before test

commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/console

ok, thanks, it's clearer.

2 small nitpicks inline.

I'll let some other people from the team a while to have a look too.

Address nits

Build is green

Patch application report for D8521 (id=30702)

Rebasing onto e1f2135f...

Current branch diff-target is up to date.

Changes applied before test

commit dc781b5e67d5d6b3c09794b1040f56443a3a0b47
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/890/ for more details.

There seems to be two chunks in get_sample not covered by tests. Could you add some coverage for them?

Oops, sent too soon. Two more requests:

Use double backticks in docstrings (we use ReST, not Markdown)

Forget the bits about |=, they are fine. (thx to @olasd for pointing it out)

! In !320 (closed), @vlorentz wrote: There seems to be two chunks in get_sample not covered by tests. Could you add some coverage for them?

Sure, I'll check that out after lunch, I've addressed/answered your questions in the mean time.

Address comments

Build has FAILED

Patch application report for D8521 (id=30759)

Rebasing onto e1f2135f...

Current branch diff-target is up to date.

Changes applied before test

commit 14df70ed3cef5ab5faf26dec1827b743b221ecf5
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/console

thanks!

(maybe) fix CI?

Build is green

Patch application report for D8521 (id=30760)

Rebasing onto 26fe954b...

First, rewinding head to replay your work on top of it...
Applying: Use a Merkle discovery algorithm with archives

Changes applied before test

commit 5add0728637e4374a2662beb5d608f0a24b49bef
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Sep 22 12:09:23 2022 +0200

    Use a Merkle discovery algorithm with archives
    
    "Discovery" is the term used to find out the differences between two
    Merkle graphs. Using such an algorithm is useful in that it drastically
    reduces the amount of data that needs to be transferred.
    
    This commit introduces an efficient but simple algorithm that is a good
    starting point for improved performance: random sampling of directories,
    the details of which are explained in the docstrings.
    
    Mercurial uses a more sophisticated algorithm for its discovery, but it
    is quite a bit more involved and would introduce too much complexity at
    once. Also, the constraints for speed that Mercurial has (in the order
    of milliseconds) don't apply as obviously to this context without
    further investigation.
    
    Benchmarks
    ==========
    
    Setup
    -----
    - With a local postgresql storage (so no network overhead), a local
      tmpfs obstorage on a fast NVME SSD, all of which should make this
      improvement look less good than it will be in production
    - With a tarball of the linux kernel at commit
      d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
    - Loading a tarball of 20 commits earlier
      (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
    - Only taking into account the loading (not the downloading of the
      tarball, or its decompression)
    
    Result
    ------
    
    before: ~30s
    after: ~17s
    
    Reproduced 4 times.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/892/ for more details.

Use a Merkle discovery algorithm with archives

Benchmarks

Setup

Result

Merge request reports

Activity

Patch application report for D8521 (id=30687)

Changes applied before test

Patch application report for D8521 (id=30695)

Changes applied before test

Patch application report for D8521 (id=30695)

Changes applied before test

Patch application report for D8521 (id=30696)

Changes applied before test

Patch application report for D8521 (id=30695)

Changes applied before test

Patch application report for D8521 (id=30702)

Changes applied before test

Patch application report for D8521 (id=30759)

Changes applied before test

Patch application report for D8521 (id=30760)

Changes applied before test