cassandra: Use concurrent queries in *_missing() instead of naive grouping
Instead of grouping ids into arbitrary batches within a query (which forces the server node receiving the query to coordinate with other nodes to complete it), this sends queries with one id each, directly to the right node.
This is the 'concurrent' algorithm from https://forge.softwareheritage.org/swh/infra/sysadm-environment#3577 which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.
This is essentially !727 (closed), minus the option to select other algorithms.
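For readers unfamiliar with the approach, here is a minimal sketch of the one-query-per-id idea, not the code from this merge request. The names `ids_missing`, `exists`, and `concurrency` are hypothetical; `exists` stands in for a single-partition lookup (something like `SELECT id FROM <table> WHERE id = ?`), and a thread pool stands in for the driver's own concurrency. With the DataStax Python driver, the same fan-out would more likely go through `cassandra.concurrent.execute_concurrent_with_args` with a prepared statement, but that is an assumption about how one might write it, not a description of this change.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def ids_missing(
    ids: Iterable[bytes],
    exists: Callable[[bytes], bool],
    concurrency: int = 16,
) -> List[bytes]:
    """Return the ids for which `exists` reports no row, in input order.

    `exists` is a hypothetical stand-in for a query that looks up exactly
    one id, so that each lookup can be routed to a node owning that id.
    """
    ids = list(ids)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # One lookup per id, issued concurrently; map() preserves input order.
        found = list(pool.map(exists, ids))
    return [id_ for id_, present in zip(ids, found) if not present]


# Toy in-memory "table" used instead of a real Cassandra cluster.
stored = {b"\x01", b"\x02"}
missing = ids_missing([b"\x01", b"\x03", b"\x02", b"\x04"], stored.__contains__)
print(missing)
```

The sketch issues many small lookups concurrently instead of a few large batched ones, so no single node has to gather rows from other nodes on behalf of a batch.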
Migrated from D6885 (view on Phabricator)
Some references in the commit message have been migrated:
- T3577#72791 is now swh/infra/sysadm-environment#3577 (closed)
Build is green
Patch application report for D6885 (id=24967)
Rebasing onto 259bf6fe...
Current branch diff-target is up to date.
Changes applied before test
commit 4a24505049d5c34c264d2b27e5feb24719b9e674
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 6 12:41:45 2022 +0100

    cassandra: Use concurrent queries in *_missing() instead of naive grouping

    Instead of grouping ids in queries in arbitrary batches (which forces the server node to coordinate with other nodes to complete the query), this sends queries with one id each, directly to the right node.

    This is the 'concurrent' algorithm from https://forge.softwareheritage.org/swh/infra/sysadm-environment#3577 which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1517/ for more details.
mentioned in merge request !756 (closed)
mentioned in merge request !727 (closed)