The benchmarks are not fully functional yet, but they produce a write load that matches the object storage design. They run (see README.txt) via libvirt and are being tested on Grid5000 to verify that all the pieces are in place (i.e. that reserving machines, provisioning them, and running the benchmark actually works) before moving forward.
Today I figured out that the bottleneck of the benchmark was the CPU usage of the benchmark itself, caused by an excessive number of transactions. Adding more than 5 workers tops out at ~2.5K object inserts per second because of the CPU; a quick hack showed it can reach 7K object writes per second. I rewrote the benchmark to fix this properly, in commit https://git.easter-eggs.org/biceps/biceps/-/commit/c0e79a2b6751cacb19ad4fad804a3b942047eb7f.
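For reference, a minimal sketch of the batching idea behind the fix, assuming a psycopg2 connection; the objects table and its columns are hypothetical placeholders, not the actual benchmark schema:

```python
import psycopg2
from psycopg2.extras import execute_values

def insert_objects(dsn, objects, batch_size=1000):
    # One transaction per batch instead of one per object: the
    # per-transaction overhead is what was saturating the benchmark's CPU.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            batch = []
            for obj in objects:  # obj is an (id, data) tuple
                batch.append(obj)
                if len(batch) >= batch_size:
                    execute_values(cur, "INSERT INTO objects (id, data) VALUES %s", batch)
                    conn.commit()
                    batch.clear()
            if batch:
                execute_values(cur, "INSERT INTO objects (id, data) VALUES %s", batch)
                conn.commit()
    finally:
        conn.close()
```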
I ran out of time on Grid5000 to verify that the rewrite works as expected, but I'm confident it will. The next steps are to:
- verify that --fake-ro can reach 10K object insertions per second with 20 workers
- run the benchmark without --fake-ro and see how many MB/s it achieves
- add a reader that continuously reads from images, to simulate a read workload
- randomize the payload instead of using easily compressible data, since Postgres compresses it well and that does not reflect reality (see the sketch below)
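To illustrate the payload problem, a standalone demonstration (not benchmark code): a repeated pattern compresses to almost nothing, while os.urandom data does not.

```python
import os
import zlib

def make_payload(size=4096):
    # Random bytes are effectively incompressible, so Postgres-side
    # compression can no longer flatter the write throughput numbers.
    return os.urandom(size)

compressible = b"x" * 4096
print(len(zlib.compress(compressible)))    # a few dozen bytes
print(len(zlib.compress(make_payload())))  # ~4096 bytes: no gain
```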
I chased various Ceph installation issues when using 14 machines and got to the point where it is reliable by:
- zapping known devices even when they do not show up via the orchestrator: the internal logic waits for the required data to be available and does a better job than waiting for the devices to appear (sometimes they never do, for unclear reasons); see the sketch after this list
- using a Docker mirror registry to avoid hitting the rate limit
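A sketch of the zapping workaround, assuming the standard `ceph orch device zap` command from cephadm; the host and device names here are hypothetical:

```python
import subprocess

def zap_devices(host, devices):
    # Zap each device we already know about instead of waiting for the
    # orchestrator inventory to list it (sometimes it never shows up).
    for dev in devices:
        subprocess.run(
            ["ceph", "orch", "device", "zap", host, dev, "--force"],
            check=True,
        )

zap_devices("node-1", ["/dev/sdb", "/dev/sdc"])
```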
The rbd bench invocations: the first drives 1GB of mixed read/write I/O against each of the first 6 images, the second runs 4K random reads (10MB total per image) against the first 12 images:

for i in $(rbd --pool ro ls | head -6) ; do rbd --pool ro bench --io-type readwrite --io-threads 16 --io-total 1G $i > $i.out & done
rm *.out ; for i in $(rbd --pool ro ls | head -12) ; do rbd --pool ro bench --io-size 4K --io-pattern rand --io-type read --io-threads 16 --io-total 10M $i > $i.out & done
I struggled most of today because there is a bottleneck when using threads with Postgres from a single client. However, when running 4 processes, it performs as expected. The benchmark should be rewritten to use a process pool instead of a thread pool, which should not be too complicated (a sketch follows below). I tried to add a warmup phase so that the concurrent threads/processes do not all start at the same time, but it makes no visible difference.
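A minimal sketch of the process-pool shape with concurrent.futures; bench_worker here is a hypothetical stand-in for the real insert loop:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def bench_worker(worker_id, count):
    # Stand-in for the real loop: each process would open its own
    # Postgres connection here, so workers no longer contend on the
    # GIL or funnel through a single client.
    start = time.time()
    for _ in range(count):
        pass  # insert one object
    return count / (time.time() - start)

def run_bench(workers=4, count=100_000):
    # Same interface as ThreadPoolExecutor, which is why the rewrite
    # should be simple: only the executor class changes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(bench_worker, i, count) for i in range(workers)]
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(run_bench())
```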
The rewrite to use processes was trivial, and preliminary tests yield the expected results. Most of the time was spent on two problems:
- the Postgres server-side limit of 100 clients, which I originally mistook for a client-side limitation because of the SQLAlchemy pools (and the code indeed keeps far too many connections open)
- the SELECT issued when packing turned out to be very expensive, and is the reason the benchmark slows to a halt with 100GB Shards. I switched to psycopg2 in order to use server-side cursors (and not use an insane amount of RAM) while enumerating all objects at a constant speed; see the sketch after this list
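A sketch of the server-side cursor pattern with psycopg2; the table name and query are hypothetical placeholders for the actual packing SELECT:

```python
import psycopg2

def enumerate_objects(dsn, batch=10_000):
    conn = psycopg2.connect(dsn)
    try:
        # A named cursor is a server-side cursor: rows stream over in
        # chunks of `itersize` instead of being materialized in client
        # RAM, so memory use stays constant even for a 100GB Shard.
        with conn.cursor(name="pack_objects") as cur:
            cur.itersize = batch
            cur.execute("SELECT id, data FROM objects ORDER BY id")
            for row in cur:
                yield row
    finally:
        conn.close()
```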
The Grid5000 cluster is reserved for this weekend (from Friday night to Sunday night) to run the tests and collect what are hopefully the final results for the layer 0 benchmarks, using 100GB images and all available disk space. I will babysit the run to catch any unexpected behavior.
I think the read results are too optimistic because measurement starts during the warmup phase (i.e. when only a single Shard exists in the Read Storage) and that Shard fits entirely in RAM.