Skip to content
Snippets Groups Projects

Use a Merkle discovery algorithm with archives

"Discovery" is the term used to find out the differences between two Merkle graphs. Using such an algorithm is useful in that it drastically reduces the amount of data that needs to be transferred.

This commit introduces an efficient but simple algorithm that is a good starting point for improved performance: random sampling of directories, the details of which are explained in the docstrings.

Mercurial uses a more sophisticated algorithm for its discovery, but it is quite a bit more involved and would introduce too much complexity at once. Also, the constraints for speed that Mercurial has (in the order of milliseconds) don't apply as obviously to this context without further investigation.

Benchmarks

Setup

  • With a local postgresql storage (so no network overhead), a local tmpfs obstorage on a fast NVME SSD, all of which should make this improvement look less good than it will be in production
  • With a tarball of the linux kernel at commit d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
  • Loading a tarball of 20 commits earlier (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
  • only taking into account the loading (not the downloading of the tarball, or its decompression)

Result

before: ~30s after: ~17s

Reproduced 4 times.


Migrated from D8521 (view on Phabricator)

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Build is green

    Patch application report for D8521 (id=30687)

    Rebasing onto e1f2135f...

    Current branch diff-target is up to date.
    Changes applied before test
    commit cb93ebabeebbd592a791b93fc034c2433ac88b41
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/885/ for more details.

  • Nice, thanks.

    I got some questions and suggestions inline.

  • Author Contributor

    Optimize loops, add types, improve docs and fix a bug

  • Build has FAILED

    Patch application report for D8521 (id=30695)

    Rebasing onto e1f2135f...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - Only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/886/console

  • Author Contributor

    Fix import

  • Build has FAILED

    Patch application report for D8521 (id=30695)

    Rebasing onto e1f2135f...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - Only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/887/console

  • Build is green

    Patch application report for D8521 (id=30696)

    Rebasing onto e1f2135f...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 995ae98a51f10e0028132ad8b13b143e1eafc2ea
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - Only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/888/ for more details.

  • Build has FAILED

    Patch application report for D8521 (id=30695)

    Rebasing onto e1f2135f...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 7f9f75ac74ba9f4d3ea38c8d019d54a8e04cc537
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - Only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/889/console

  • ok, thanks, it's clearer.

    2 small nitpicks inline.

    I'll let some other people from the team a while to have a look too.

  • Author Contributor

    Address nits

  • Build is green

    Patch application report for D8521 (id=30702)

    Rebasing onto e1f2135f...

    Current branch diff-target is up to date.
    Changes applied before test
    commit dc781b5e67d5d6b3c09794b1040f56443a3a0b47
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - Only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/890/ for more details.

  • There seems to be two chunks in get_sample not covered by tests. Could you add some coverage for them?

  • Oops, sent too soon. Two more requests:

    • Use double backticks in docstrings (we use ReST, not Markdown)
  • Forget the bits about |=, they are fine. (thx to @olasd for pointing it out)

  • Author Contributor

    ! In !320 (closed), @vlorentz wrote: There seems to be two chunks in get_sample not covered by tests. Could you add some coverage for them?

    Sure, I'll check that out after lunch, I've addressed/answered your questions in the mean time.

  • Author Contributor

    Address comments

  • Build has FAILED

    Patch application report for D8521 (id=30759)

    Rebasing onto e1f2135f...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 14df70ed3cef5ab5faf26dec1827b743b221ecf5
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - Only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/ See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/891/console

  • thanks!

  • Author Contributor

    (maybe) fix CI?

  • Build is green

    Patch application report for D8521 (id=30760)

    Rebasing onto 26fe954b...

    First, rewinding head to replay your work on top of it...
    Applying: Use a Merkle discovery algorithm with archives
    Changes applied before test
    commit 5add0728637e4374a2662beb5d608f0a24b49bef
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Thu Sep 22 12:09:23 2022 +0200
    
        Use a Merkle discovery algorithm with archives
        
        "Discovery" is the term used to find out the differences between two
        Merkle graphs. Using such an algorithm is useful in that it drastically
        reduces the amount of data that needs to be transferred.
        
        This commit introduces an efficient but simple algorithm that is a good
        starting point for improved performance: random sampling of directories,
        the details of which are explained in the docstrings.
        
        Mercurial uses a more sophisticated algorithm for its discovery, but it
        is quite a bit more involved and would introduce too much complexity at
        once. Also, the constraints for speed that Mercurial has (in the order
        of milliseconds) don't apply as obviously to this context without
        further investigation.
        
        Benchmarks
        ==========
        
        Setup
        -----
        - With a local postgresql storage (so no network overhead), a local
          tmpfs obstorage on a fast NVME SSD, all of which should make this
          improvement look less good than it will be in production
        - With a tarball of the linux kernel at commit
          d96d875ef5dd372f533059a44f98e92de9cf0d42 already loaded
        - Loading a tarball of 20 commits earlier
          (bf3f401db6cbe010095fe3d1e233a5fde54e8b78)
        - Only taking into account the loading (not the downloading of the
          tarball, or its decompression)
        
        Result
        ------
        
        before: ~30s
        after: ~17s
        
        Reproduced 4 times.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/892/ for more details.

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply
Loading