The replication of the directory_entry table on cassandra07 is ~65% done.
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS repair -pr --full -j4 -os swh directory_entry
[2023-06-05 14:07:39,287] Starting repair command #2 (548af2e0-03aa-11ee-ac72-7f7ab325e955), repairing keyspace swh with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 4, ColumnFamilies: [directory_entry], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 16, pull repair: false, force repair: false, optimise streams: true, ignore unreplicated keyspaces: false)
[2023-06-05 20:00:03,334] Repair session 54909830-03aa-11ee-ac72-7f7ab325e955 for range [(-5671075333212739092,-5551702937750253325]] finished (progress: 20%)
[2023-06-05 22:28:12,625] Repair session 54918290-03aa-11ee-ac72-7f7ab325e955 for range [(-3189361288579214217,-3119256184926673420]] finished (progress: 25%)
[2023-06-05 22:35:21,883] Repair session 54904a11-03aa-11ee-ac72-7f7ab325e955 for range [(1809754653662232774,1912181077780678886]] finished (progress: 30%)
[2023-06-06 02:23:00,829] Repair session 5491a9a1-03aa-11ee-ac72-7f7ab325e955 for range [(-2290959842066850949,-2177579945957392766]] finished (progress: 35%)
[2023-06-06 02:34:47,696] Repair session 54902300-03aa-11ee-ac72-7f7ab325e955 for range [(6690016262382860541,6830452116972767102]] finished (progress: 40%)
[2023-06-06 03:48:54,279] Repair session 54913471-03aa-11ee-ac72-7f7ab325e955 for range [(-3701625211983488138,-3625502822596996061]] finished (progress: 45%)
[2023-06-06 04:37:30,208] Repair session 5491f7c1-03aa-11ee-ac72-7f7ab325e955 for range [(-819588852886591199,-696379350469125012]] finished (progress: 50%)
[2023-06-06 05:07:03,173] Repair session 54910d61-03aa-11ee-ac72-7f7ab325e955 for range [(-7863147936251041151,-7755395066018374130]] finished (progress: 55%)
[2023-06-06 06:07:18,987] Repair session 548fadd0-03aa-11ee-ac72-7f7ab325e955 for range [(7251759680742486785,7351090393829664982], (-8794965904141383817,-8649953041069444200]] finished (progress: 60%)
[2023-06-06 06:46:28,242] Repair session 5491d0b1-03aa-11ee-ac72-7f7ab325e955 for range [(5624441674498865760,5832260504284685700]] finished (progress: 65%)
The cluster needs to stabilize, with a lot of sstables left to compact, but that is expected as a lot of data is still coming in:
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS tablestats swh.directory_entry -H
Total number of tables: 85
----------------
Keyspace : swh
    Read Count: 83390045
    Read Latency: 1.0857913792347755 ms
    Write Count: 45931622
    Write Latency: 0.057805239340339425 ms
    Pending Flushes: 0
        Table: directory_entry
        SSTable count: 879
        Old SSTable count: 0
        SSTables in each level: [869/4, 14/10, 0, 0, 0, 0, 0, 0, 0]
        Space used (live): 2.18 TiB
        Space used (total): 2.18 TiB
        Space used by snapshots (total): 0 bytes
        Off heap memory used (total): 3.97 GiB
        SSTable Compression Ratio: 0.502868825436812
        Number of partitions (estimate): 899741333
        Memtable cell count: 257863
        Memtable data size: 26.7 MiB
        Memtable off heap memory used: 0 bytes
        Memtable switch count: 68
        Local read count: 28
        Local read latency: NaN ms
        Local write count: 37860123
        Local write latency: 0.040 ms
        Pending flushes: 0
        Percent repaired: 99.35
        Bytes repaired: 4289.571GiB
        Bytes unrepaired: 28.252GiB
        Bytes pending repair: 0.000KiB
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1.2 GiB
        Bloom filter off heap memory used: 1.46 GiB
        Index summary off heap memory used: 226.8 MiB
        Compression metadata off heap memory used: 2.29 GiB
        Compacted partition minimum bytes: 73
        Compacted partition maximum bytes: 74975550
        Compacted partition mean bytes: 2731
        Average live cells per slice (last five minutes): NaN
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): NaN
        Maximum tombstones per slice (last five minutes): 0
        Dropped Mutations: 0 bytes
        Droppable tombstone ratio: 0.00000
----------------
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS repair -pr --full -j4 -os swh directory_entry
[2023-06-05 14:07:39,287] Starting repair command #2 (548af2e0-03aa-11ee-ac72-7f7ab325e955), repairing keyspace swh with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 4, ColumnFamilies: [directory_entry], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 16, pull repair: false, force repair: false, optimise streams: true, ignore unreplicated keyspaces: false)
[2023-06-05 20:00:03,334] Repair session 54909830-03aa-11ee-ac72-7f7ab325e955 for range [(-5671075333212739092,-5551702937750253325]] finished (progress: 20%)
[2023-06-05 22:28:12,625] Repair session 54918290-03aa-11ee-ac72-7f7ab325e955 for range [(-3189361288579214217,-3119256184926673420]] finished (progress: 25%)
[2023-06-05 22:35:21,883] Repair session 54904a11-03aa-11ee-ac72-7f7ab325e955 for range [(1809754653662232774,1912181077780678886]] finished (progress: 30%)
[2023-06-06 02:23:00,829] Repair session 5491a9a1-03aa-11ee-ac72-7f7ab325e955 for range [(-2290959842066850949,-2177579945957392766]] finished (progress: 35%)
[2023-06-06 02:34:47,696] Repair session 54902300-03aa-11ee-ac72-7f7ab325e955 for range [(6690016262382860541,6830452116972767102]] finished (progress: 40%)
[2023-06-06 03:48:54,279] Repair session 54913471-03aa-11ee-ac72-7f7ab325e955 for range [(-3701625211983488138,-3625502822596996061]] finished (progress: 45%)
[2023-06-06 04:37:30,208] Repair session 5491f7c1-03aa-11ee-ac72-7f7ab325e955 for range [(-819588852886591199,-696379350469125012]] finished (progress: 50%)
[2023-06-06 05:07:03,173] Repair session 54910d61-03aa-11ee-ac72-7f7ab325e955 for range [(-7863147936251041151,-7755395066018374130]] finished (progress: 55%)
[2023-06-06 06:07:18,987] Repair session 548fadd0-03aa-11ee-ac72-7f7ab325e955 for range [(7251759680742486785,7351090393829664982], (-8794965904141383817,-8649953041069444200]] finished (progress: 60%)
[2023-06-06 06:46:28,242] Repair session 5491d0b1-03aa-11ee-ac72-7f7ab325e955 for range [(5624441674498865760,5832260504284685700]] finished (progress: 65%)
[2023-06-06 09:40:51,589] Repair session 548d63e0-03aa-11ee-ac72-7f7ab325e955 for range [(-6980711446864766273,-6854581648503231791], (295734264205959219,444559856446059031], (2219460350136017224,2344955064167581048]] finished (progress: 70%)
[2023-06-06 13:03:56,750] Repair session 5490bf40-03aa-11ee-ac72-7f7ab325e955 for range [(-4937151099428690793,-4808933773461638178], (1289501923344847682,1397333404541601956]] finished (progress: 75%)
[2023-06-06 13:03:56,752] Repair completed successfully
[2023-06-06 13:03:56,755] Repair command #2 finished in 22 hours 56 minutes 17 seconds
[2023-06-06 13:03:56,760] condition satisfied queried for parent session status and discovered repair completed.
[2023-06-06 13:03:56,762] Repair completed successfully
Unfortunately, adding the -pr option was a mistake, as it only repaired the token ranges for which the node is the primary replica.
-pr works well when the command is run sequentially on all the nodes of the cluster. That is not the case here: since we remove the data of one node at a time, we need to have everything replicated back onto that node in order to always maintain an RF >= 2.
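For reference, the full repair of all the token ranges replicated on the node is the same command without the -pr flag (same credentials and options as above):

% /opt/cassandra/bin/nodetool -u cassandra --password $PASS repair --full -j4 -os swh directory_entry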
In order to perform the STCS to LCS compaction strategy migration, several approaches were tested:
- On cassandra07 and cassandra01: completely remove the directory_entry content and perform a full repair.
  - It works well but it is very slow.
  - It impacts all the other servers, as some space is needed on them to prepare the data to stream to the node being repaired.
- On the other servers:
  - Offload some tables to the disk dedicated to the commitlog [1]
  - Create an LVM logical volume with the space remaining on the boot device
  - If that is still not enough, remove a small table (the repair of a "small" table has less impact on the cluster)
  - When there is enough free space, the migration starts automatically (a sketch of the underlying compaction strategy change follows this list)
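The strategy switch itself is a per-table schema option change; the exact statement and options used for this migration are not recorded in this log, but it would look roughly like the following (sstable_size_in_mb is an assumed, typical value):

% cqlsh -u cassandra -p $PASS -e "ALTER TABLE swh.directory_entry WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};"

Once the schema change is applied, each node re-levels its existing STCS sstables from L0 into the higher levels as soon as it has enough disk headroom to run the compactions, which is why the steps above focus on freeing space.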
This is the current status of the nodes in the cluster:
cassandra01:
- Full compaction done
- Removal of the unused temporary lvm lv to do

cassandra02:

cassandra03:
- Full compaction done
- Removal of the unused temporary lvm lvs to do

cassandra04:
- Wait for the cassandra05 content table repair
- Find 500+ GB of disk space (possibly by temporarily removing the content table)
- Compaction of a couple of big sstables to do (1.4 TB, 2.2 TB, ...)
- Wait for the end of the migration to LCS compaction
- Repair the removed table (content?)
- Move the offloaded tables from the write-intensive disk back to the data directory
- Clean up zfs
- Clean up the lvm lvs

cassandra05:
- Migration to LCS of 2 big sstables (2.2 TB each) in progress
- Repair the content table
- Move the offloaded table from the write-intensive disk
- Clean up zfs
- Remove the temporary lvm lvs from the mixeduse zfs datapool
- Remove the temporary lvm lvs

cassandra06:
- Full compaction done
- Move the offloaded table from the write-intensive disk
- Clean up zfs
- Removal of the unused temporary lvm lv to do

cassandra07:
- Full compaction done
- Wait for the accumulated L0 -> L1 compaction backlog to be absorbed [2]
- Repair the content_by_blake2s256 table
- Remove the temporary lvs from the mixeduse zfs pool
- Remove the temporary lvm lvs
- Move the offloaded table from the write-intensive disk
- Clean up zfs

cassandra08:
- Full compaction done
- Remove the temporary lvs from the mixeduse zfs pool
- Remove the temporary lvm lvs
- Move the offloaded table from the write-intensive disk
- Clean up zfs
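The per-node cleanup steps listed above are essentially the same everywhere. A rough sketch, with placeholder volume group, pool, dataset and table names (the real ones are not recorded in this log):

# Move the offloaded table back to the regular data directory (reverse of the procedure in [1]):
% systemctl stop cassandra@instance1                   # assumed service unit name
% rm /srv/cassandra/instance1/data/swh/<table_dir>     # drop the symlink pointing to data2
% mv /srv/cassandra/instance1/data2/<table_dir> /srv/cassandra/instance1/data/swh/
% systemctl start cassandra@instance1
# ZFS and LVM cleanup:
% zpool remove <mixeduse_pool> <temporary_lv_device>   # if a temporary lv was added to the zfs pool as a vdev
% zfs destroy <write_intensive_pool>/instance1-data2   # drop the temporary offload dataset
% lvremove <vg>/<temporary_lv>                         # remove the temporary lvm lv(s)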
[1]
- Add a zfs dataset mounted on /srv/cassandra/instance1/data2
- Stop cassandra
- Move the table's data directory from /srv/cassandra/instance1/data/swh/ to /srv/cassandra/instance1/data2
- Create a symlink from data/swh/<data_dir> to data2/<data_dir>
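A concrete sketch of these steps (the zpool name, systemd unit name and table directory are placeholders, not taken from this log):

% zfs create -o mountpoint=/srv/cassandra/instance1/data2 <write_intensive_pool>/instance1-data2
% systemctl stop cassandra@instance1
% mv /srv/cassandra/instance1/data/swh/<table_dir> /srv/cassandra/instance1/data2/
% ln -s /srv/cassandra/instance1/data2/<table_dir> /srv/cassandra/instance1/data/swh/<table_dir>
% systemctl start cassandra@instance1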
[2]
...SSTables in each level: [3800/4, 65/10, 105/100, 43, 0, 0, 0, 0, 0]...
L0 sstables are not well compressed and a lot of space is wasted.
The more space is available, the faster and more efficient the compaction will be, so the server cleanup has not been done yet.
INFO [CompactionExecutor:1941] 2023-08-07 10:56:36,429 CompactionTask.java:241 - Compacted (ac13e560-33e0-11ee-a10a-e7994b5b327f) 38 sstables to [/srv/cassandra/instance1/data/swh/directory_entry-b74d85d02d2e11ed970a612d80206516/nb-36993-big,/srv/cassandra/instance1/data/swh/directory_entry-b74d85d02d2e11ed970a612d80206516/nb-36995-big,...,/srv/cassandra/instance1/data/swh/directory_entry-b74d85d02d2e11ed970a612d80206516/nb-37060-big,] to level=1. 1486.245GiB to 1329.002GiB (~89% of original) in 130,741,639ms. Read Throughput = 11.641MiB/s, Write Throughput = 10.409MiB/s, Row Throughput = ~9,243/s. 1,283,208,241 total partitions merged to 1,145,501,383. Partition merge counts were {1:1008095317, 2:137105678, 3:299984, 4:404, }
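The remaining backlog can be followed with nodetool, using the same authentication flags as the commands above:

% /opt/cassandra/bin/nodetool -u cassandra --password $PASS tablestats swh.directory_entry -H | grep 'SSTables in each level'
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS compactionstats -H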
The directory_entry table on cassandra07 was finally deleted and rebuilt from scratch with a full repair. The repair is done and the L0 -> L1 compaction is in progress (112 L0 sstables to compact).
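For the record, the "delete and rebuild" operation boils down to removing the sstables of the table on that node only, then streaming everything back with a full (non -pr) repair. A sketch, assuming the instance1 paths seen above and a systemd-managed instance:

% systemctl stop cassandra@instance1                             # assumed service unit name
% rm -rf /srv/cassandra/instance1/data/swh/directory_entry-*/*   # remove this node's on-disk data for the table only
% systemctl start cassandra@instance1
# then re-stream the data with the full (non -pr) repair command shown earlier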
cassandra04 has finished the repair of the content table and the compactions to the different levels.
It has also finished the migration of directory_entry and the subsequent compactions. It is ready for the cleanup.
The compaction on cassandra07 is still in progress but almost done (~60 sstables left in L0 to compact).