The benchmarks are not fully functional yet, but they produce a write load that matches the object storage design. They run (see README.txt) via libvirt and are being tested on Grid5000 to verify that all the pieces are in place (i.e. that reserving machines, provisioning them, and running the benchmark actually works) before moving forward.
Today I figured out that the bottleneck of the benchmark was the CPU usage of the benchmark itself, caused by an excessive number of transactions. Adding more than 5 workers tops out at ~2.5K object inserts per second because of the CPU; a quick hack showed it can reach 7K object writes per second. I rewrote the benchmark to fix this properly, in commit https://git.easter-eggs.org/biceps/biceps/-/commit/c0e79a2b6751cacb19ad4fad804a3b942047eb7f.
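For reference, a minimal sketch of the batching idea behind the fix, assuming a psycopg2 connection; the objects table and its columns are hypothetical placeholders, not the actual benchmark schema:

```python
import psycopg2
from psycopg2.extras import execute_values

def insert_objects(dsn, objects, batch_size=1000):
    # One transaction per batch instead of one per object: the
    # per-transaction overhead is what was saturating the benchmark's CPU.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            batch = []
            for obj in objects:  # obj is an (id, data) tuple
                batch.append(obj)
                if len(batch) >= batch_size:
                    execute_values(cur, "INSERT INTO objects (id, data) VALUES %s", batch)
                    conn.commit()
                    batch.clear()
            if batch:
                execute_values(cur, "INSERT INTO objects (id, data) VALUES %s", batch)
                conn.commit()
    finally:
        conn.close()
```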
I ran out of time on Grid5000 to verify that the rewrite works as expected, but I'm confident it will. The next steps are to:
- verify that --fake-ro can reach 10K object insertions per second with 20 workers
- run the benchmark without --fake-ro and see how many MB/s it achieves
- add a reader that continuously reads from images, to simulate a read workload
- randomize the payload instead of using easily compressible data, since Postgres compresses it well and that does not reflect reality (see the sketch below)
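To illustrate the payload problem, a standalone demonstration (not benchmark code): a repeated pattern compresses to almost nothing, while os.urandom data does not.

```python
import os
import zlib

def make_payload(size=4096):
    # Random bytes are effectively incompressible, so Postgres-side
    # compression can no longer flatter the write throughput numbers.
    return os.urandom(size)

compressible = b"x" * 4096
print(len(zlib.compress(compressible)))    # a few dozen bytes
print(len(zlib.compress(make_payload())))  # ~4096 bytes: no gain
```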
I chased various Ceph installation issues when using 14 machines and got to the point where it is reliable by:
- zapping known devices even when they do not show up via the orchestrator: the internal logic waits for the required data to be available and does a better job than waiting for the devices to appear (sometimes they never do, for unclear reasons); see the sketch after this list
- using a Docker mirror registry to avoid hitting the rate limit
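A sketch of the zapping workaround, assuming the standard `ceph orch device zap` command from cephadm; the host and device names here are hypothetical:

```python
import subprocess

def zap_devices(host, devices):
    # Zap each device we already know about instead of waiting for the
    # orchestrator inventory to list it (sometimes it never shows up).
    for dev in devices:
        subprocess.run(
            ["ceph", "orch", "device", "zap", host, dev, "--force"],
            check=True,
        )

zap_devices("node-1", ["/dev/sdb", "/dev/sdc"])
```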
The rbd bench invocations: the first drives 1GB of mixed read/write I/O against each of the first 6 images, the second runs 4K random reads (10MB total per image) against the first 12 images:

for i in $(rbd --pool ro ls | head -6) ; do rbd --pool ro bench --io-type readwrite --io-threads 16 --io-total 1G $i > $i.out & done
rm *.out ; for i in $(rbd --pool ro ls | head -12) ; do rbd --pool ro bench --io-size 4K --io-pattern rand --io-type read --io-threads 16 --io-total 10M $i > $i.out & done
I struggled most of today because there is a bottleneck when using threads with Postgres from a single client. However, when running 4 processes, it performs as expected. The benchmark should be rewritten to use a process pool instead of a thread pool, which should not be too complicated (a sketch follows below). I tried to add a warmup phase so that the concurrent threads/processes do not all start at the same time, but it makes no visible difference.
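A minimal sketch of the process-pool shape with concurrent.futures; bench_worker here is a hypothetical stand-in for the real insert loop:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def bench_worker(worker_id, count):
    # Stand-in for the real loop: each process would open its own
    # Postgres connection here, so workers no longer contend on the
    # GIL or funnel through a single client.
    start = time.time()
    for _ in range(count):
        pass  # insert one object
    return count / (time.time() - start)

def run_bench(workers=4, count=100_000):
    # Same interface as ThreadPoolExecutor, which is why the rewrite
    # should be simple: only the executor class changes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(bench_worker, i, count) for i in range(workers)]
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(run_bench())
```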
The rewrite to use processes was trivial, and preliminary tests yield the expected results. Most of the time was spent on two problems:
- the Postgres server-side limit of 100 clients, which I originally mistook for a client-side limitation because of the SQLAlchemy pools (and the code indeed keeps far too many connections open)
- the SELECT issued when packing turned out to be very expensive, and is the reason the benchmark slows to a halt with 100GB Shards. I switched to psycopg2 in order to use server-side cursors (and not use an insane amount of RAM) while enumerating all objects at a constant speed; see the sketch after this list
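A sketch of the server-side cursor pattern with psycopg2; the table name and query are hypothetical placeholders for the actual packing SELECT:

```python
import psycopg2

def enumerate_objects(dsn, batch=10_000):
    conn = psycopg2.connect(dsn)
    try:
        # A named cursor is a server-side cursor: rows stream over in
        # chunks of `itersize` instead of being materialized in client
        # RAM, so memory use stays constant even for a 100GB Shard.
        with conn.cursor(name="pack_objects") as cur:
            cur.itersize = batch
            cur.execute("SELECT id, data FROM objects ORDER BY id")
            for row in cur:
                yield row
    finally:
        conn.close()
```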
The Grid5000 cluster is reserved for this weekend (from Friday night to Sunday night) to run the tests and collect what are hopefully the final results for the layer 0 benchmarks, using 100GB images and all available disk space. I will babysit the run to catch any unexpected behavior.
I think the read results are too optimistic because measurement starts during the warmup phase (i.e. when only a single Shard exists in the Read Storage) and that Shard fits entirely in RAM.