The replication of the directory_entry table on cassandra07 is ~65% done.
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS repair -pr --full -j4 -os swh directory_entry
[2023-06-05 14:07:39,287] Starting repair command #2 (548af2e0-03aa-11ee-ac72-7f7ab325e955), repairing keyspace swh with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 4, ColumnFamilies: [directory_entry], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 16, pull repair: false, force repair: false, optimise streams: true, ignore unreplicated keyspaces: false)
[2023-06-05 20:00:03,334] Repair session 54909830-03aa-11ee-ac72-7f7ab325e955 for range [(-5671075333212739092,-5551702937750253325]] finished (progress: 20%)
[2023-06-05 22:28:12,625] Repair session 54918290-03aa-11ee-ac72-7f7ab325e955 for range [(-3189361288579214217,-3119256184926673420]] finished (progress: 25%)
[2023-06-05 22:35:21,883] Repair session 54904a11-03aa-11ee-ac72-7f7ab325e955 for range [(1809754653662232774,1912181077780678886]] finished (progress: 30%)
[2023-06-06 02:23:00,829] Repair session 5491a9a1-03aa-11ee-ac72-7f7ab325e955 for range [(-2290959842066850949,-2177579945957392766]] finished (progress: 35%)
[2023-06-06 02:34:47,696] Repair session 54902300-03aa-11ee-ac72-7f7ab325e955 for range [(6690016262382860541,6830452116972767102]] finished (progress: 40%)
[2023-06-06 03:48:54,279] Repair session 54913471-03aa-11ee-ac72-7f7ab325e955 for range [(-3701625211983488138,-3625502822596996061]] finished (progress: 45%)
[2023-06-06 04:37:30,208] Repair session 5491f7c1-03aa-11ee-ac72-7f7ab325e955 for range [(-819588852886591199,-696379350469125012]] finished (progress: 50%)
[2023-06-06 05:07:03,173] Repair session 54910d61-03aa-11ee-ac72-7f7ab325e955 for range [(-7863147936251041151,-7755395066018374130]] finished (progress: 55%)
[2023-06-06 06:07:18,987] Repair session 548fadd0-03aa-11ee-ac72-7f7ab325e955 for range [(7251759680742486785,7351090393829664982], (-8794965904141383817,-8649953041069444200]] finished (progress: 60%)
[2023-06-06 06:46:28,242] Repair session 5491d0b1-03aa-11ee-ac72-7f7ab325e955 for range [(5624441674498865760,5832260504284685700]] finished (progress: 65%)
The cluster needs to stabilize, with a lot of sstables left to compact, but that is expected as a lot of data is still coming in:
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS tablestats swh.directory_entry -H
Total number of tables: 85
----------------
Keyspace : swh
    Read Count: 83390045
    Read Latency: 1.0857913792347755 ms
    Write Count: 45931622
    Write Latency: 0.057805239340339425 ms
    Pending Flushes: 0
        Table: directory_entry
        SSTable count: 879
        Old SSTable count: 0
        SSTables in each level: [869/4, 14/10, 0, 0, 0, 0, 0, 0, 0]
        Space used (live): 2.18 TiB
        Space used (total): 2.18 TiB
        Space used by snapshots (total): 0 bytes
        Off heap memory used (total): 3.97 GiB
        SSTable Compression Ratio: 0.502868825436812
        Number of partitions (estimate): 899741333
        Memtable cell count: 257863
        Memtable data size: 26.7 MiB
        Memtable off heap memory used: 0 bytes
        Memtable switch count: 68
        Local read count: 28
        Local read latency: NaN ms
        Local write count: 37860123
        Local write latency: 0.040 ms
        Pending flushes: 0
        Percent repaired: 99.35
        Bytes repaired: 4289.571GiB
        Bytes unrepaired: 28.252GiB
        Bytes pending repair: 0.000KiB
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 1.2 GiB
        Bloom filter off heap memory used: 1.46 GiB
        Index summary off heap memory used: 226.8 MiB
        Compression metadata off heap memory used: 2.29 GiB
        Compacted partition minimum bytes: 73
        Compacted partition maximum bytes: 74975550
        Compacted partition mean bytes: 2731
        Average live cells per slice (last five minutes): NaN
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): NaN
        Maximum tombstones per slice (last five minutes): 0
        Dropped Mutations: 0 bytes
        Droppable tombstone ratio: 0.00000
----------------
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS repair -pr --full -j4 -os swh directory_entry
[2023-06-05 14:07:39,287] Starting repair command #2 (548af2e0-03aa-11ee-ac72-7f7ab325e955), repairing keyspace swh with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 4, ColumnFamilies: [directory_entry], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 16, pull repair: false, force repair: false, optimise streams: true, ignore unreplicated keyspaces: false)
[2023-06-05 20:00:03,334] Repair session 54909830-03aa-11ee-ac72-7f7ab325e955 for range [(-5671075333212739092,-5551702937750253325]] finished (progress: 20%)
[2023-06-05 22:28:12,625] Repair session 54918290-03aa-11ee-ac72-7f7ab325e955 for range [(-3189361288579214217,-3119256184926673420]] finished (progress: 25%)
[2023-06-05 22:35:21,883] Repair session 54904a11-03aa-11ee-ac72-7f7ab325e955 for range [(1809754653662232774,1912181077780678886]] finished (progress: 30%)
[2023-06-06 02:23:00,829] Repair session 5491a9a1-03aa-11ee-ac72-7f7ab325e955 for range [(-2290959842066850949,-2177579945957392766]] finished (progress: 35%)
[2023-06-06 02:34:47,696] Repair session 54902300-03aa-11ee-ac72-7f7ab325e955 for range [(6690016262382860541,6830452116972767102]] finished (progress: 40%)
[2023-06-06 03:48:54,279] Repair session 54913471-03aa-11ee-ac72-7f7ab325e955 for range [(-3701625211983488138,-3625502822596996061]] finished (progress: 45%)
[2023-06-06 04:37:30,208] Repair session 5491f7c1-03aa-11ee-ac72-7f7ab325e955 for range [(-819588852886591199,-696379350469125012]] finished (progress: 50%)
[2023-06-06 05:07:03,173] Repair session 54910d61-03aa-11ee-ac72-7f7ab325e955 for range [(-7863147936251041151,-7755395066018374130]] finished (progress: 55%)
[2023-06-06 06:07:18,987] Repair session 548fadd0-03aa-11ee-ac72-7f7ab325e955 for range [(7251759680742486785,7351090393829664982], (-8794965904141383817,-8649953041069444200]] finished (progress: 60%)
[2023-06-06 06:46:28,242] Repair session 5491d0b1-03aa-11ee-ac72-7f7ab325e955 for range [(5624441674498865760,5832260504284685700]] finished (progress: 65%)
[2023-06-06 09:40:51,589] Repair session 548d63e0-03aa-11ee-ac72-7f7ab325e955 for range [(-6980711446864766273,-6854581648503231791], (295734264205959219,444559856446059031], (2219460350136017224,2344955064167581048]] finished (progress: 70%)
[2023-06-06 13:03:56,750] Repair session 5490bf40-03aa-11ee-ac72-7f7ab325e955 for range [(-4937151099428690793,-4808933773461638178], (1289501923344847682,1397333404541601956]] finished (progress: 75%)
[2023-06-06 13:03:56,752] Repair completed successfully
[2023-06-06 13:03:56,755] Repair command #2 finished in 22 hours 56 minutes 17 seconds
[2023-06-06 13:03:56,760] condition satisfied queried for parent session status and discovered repair completed.
[2023-06-06 13:03:56,762] Repair completed successfully
Unfortunately, adding the -pr option was a mistake, as it only repaired the token ranges for which the node is the primary replica.
-pr works well when the command is run sequentially on all the nodes of the cluster. That is not the case here: since we remove the data of one node at a time, we need to have everything replicated back onto that node in order to always maintain an RF >= 2.
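For reference, the full repair of all the token ranges replicated on the node is the same command without the -pr flag (same credentials and options as above):

% /opt/cassandra/bin/nodetool -u cassandra --password $PASS repair --full -j4 -os swh directory_entry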
In order to perform the STCS to LCS compaction strategy migration, several approaches were tested:
- On cassandra07 and cassandra01: completely remove the directory_entry content and perform a full repair.
  - It works well but it is very slow.
  - It impacts all the other servers, as some space is needed on them to prepare the data to stream to the node being repaired.
- On the other servers:
  - Offload some tables to the disk dedicated to the commitlog [1]
  - Create an LVM logical volume with the space remaining on the boot device
  - If that is still not enough, remove a small table (the repair of a "small" table has less impact on the cluster)
  - When there is enough free space, the migration starts automatically (a sketch of the underlying compaction strategy change follows this list)
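The strategy switch itself is a per-table schema option change; the exact statement and options used for this migration are not recorded in this log, but it would look roughly like the following (sstable_size_in_mb is an assumed, typical value):

% cqlsh -u cassandra -p $PASS -e "ALTER TABLE swh.directory_entry WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};"

Once the schema change is applied, each node re-levels its existing STCS sstables from L0 into the higher levels as soon as it has enough disk headroom to run the compactions, which is why the steps above focus on freeing space.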
This is the current status of the nodes in the cluster:
cassandra01:
- Full compaction done
- Removal of the unused temporary lvm lv to do

cassandra02:

cassandra03:
- Full compaction done
- Removal of the unused temporary lvm lvs to do

cassandra04:
- Wait for the cassandra05 content table repair
- Find 500+ GB of disk space (possibly by temporarily removing the content table)
- Compaction of a couple of big sstables to do (1.4 TB, 2.2 TB, ...)
- Wait for the end of the migration to LCS compaction
- Repair the removed table (content?)
- Move the offloaded tables from the write-intensive disk back to the data directory
- Clean up zfs
- Clean up the lvm lvs

cassandra05:
- Migration to LCS of 2 big sstables (2.2 TB each) in progress
- Repair the content table
- Move the offloaded table from the write-intensive disk
- Clean up zfs
- Remove the temporary lvm lvs from the mixeduse zfs datapool
- Remove the temporary lvm lvs

cassandra06:
- Full compaction done
- Move the offloaded table from the write-intensive disk
- Clean up zfs
- Removal of the unused temporary lvm lv to do

cassandra07:
- Full compaction done
- Wait for the accumulated L0 -> L1 compaction backlog to be absorbed [2]
- Repair the content_by_blake2s256 table
- Remove the temporary lvs from the mixeduse zfs pool
- Remove the temporary lvm lvs
- Move the offloaded table from the write-intensive disk
- Clean up zfs

cassandra08:
- Full compaction done
- Remove the temporary lvs from the mixeduse zfs pool
- Remove the temporary lvm lvs
- Move the offloaded table from the write-intensive disk
- Clean up zfs
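The per-node cleanup steps listed above are essentially the same everywhere. A rough sketch, with placeholder volume group, pool, dataset and table names (the real ones are not recorded in this log):

# Move the offloaded table back to the regular data directory (reverse of the procedure in [1]):
% systemctl stop cassandra@instance1                   # assumed service unit name
% rm /srv/cassandra/instance1/data/swh/<table_dir>     # drop the symlink pointing to data2
% mv /srv/cassandra/instance1/data2/<table_dir> /srv/cassandra/instance1/data/swh/
% systemctl start cassandra@instance1
# ZFS and LVM cleanup:
% zpool remove <mixeduse_pool> <temporary_lv_device>   # if a temporary lv was added to the zfs pool as a vdev
% zfs destroy <write_intensive_pool>/instance1-data2   # drop the temporary offload dataset
% lvremove <vg>/<temporary_lv>                         # remove the temporary lvm lv(s)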
[1]
- Add a zfs dataset mounted on /srv/cassandra/instance1/data2
- Stop cassandra
- Move the table's data directory from /srv/cassandra/instance1/data/swh/ to /srv/cassandra/instance1/data2
- Create a symlink from data/swh/<data_dir> to data2/<data_dir>
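A concrete sketch of these steps (the zpool name, systemd unit name and table directory are placeholders, not taken from this log):

% zfs create -o mountpoint=/srv/cassandra/instance1/data2 <write_intensive_pool>/instance1-data2
% systemctl stop cassandra@instance1
% mv /srv/cassandra/instance1/data/swh/<table_dir> /srv/cassandra/instance1/data2/
% ln -s /srv/cassandra/instance1/data2/<table_dir> /srv/cassandra/instance1/data/swh/<table_dir>
% systemctl start cassandra@instance1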
[2]
...SSTables in each level: [3800/4, 65/10, 105/100, 43, 0, 0, 0, 0, 0]...
L0 sstables are not well compressed and a lot of space is wasted.
The more space is available, the faster and more efficient the compaction will be, so the server cleanup has not been done yet.
INFO [CompactionExecutor:1941] 2023-08-07 10:56:36,429 CompactionTask.java:241 - Compacted (ac13e560-33e0-11ee-a10a-e7994b5b327f) 38 sstables to [/srv/cassandra/instance1/data/swh/directory_entry-b74d85d02d2e11ed970a612d80206516/nb-36993-big,/srv/cassandra/instance1/data/swh/directory_entry-b74d85d02d2e11ed970a612d80206516/nb-36995-big,...,/srv/cassandra/instance1/data/swh/directory_entry-b74d85d02d2e11ed970a612d80206516/nb-37060-big,] to level=1. 1486.245GiB to 1329.002GiB (~89% of original) in 130,741,639ms. Read Throughput = 11.641MiB/s, Write Throughput = 10.409MiB/s, Row Throughput = ~9,243/s. 1,283,208,241 total partitions merged to 1,145,501,383. Partition merge counts were {1:1008095317, 2:137105678, 3:299984, 4:404, }
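The remaining backlog can be followed with nodetool, using the same authentication flags as the commands above:

% /opt/cassandra/bin/nodetool -u cassandra --password $PASS tablestats swh.directory_entry -H | grep 'SSTables in each level'
% /opt/cassandra/bin/nodetool -u cassandra --password $PASS compactionstats -H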
The directory_entry table on cassandra07 was finally deleted and rebuilt from scratch with a full repair. The repair is done and the L0 -> L1 compaction is in progress (112 L0 sstables to compact).
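For the record, the "delete and rebuild" operation boils down to removing the sstables of the table on that node only, then streaming everything back with a full (non -pr) repair. A sketch, assuming the instance1 paths seen above and a systemd-managed instance:

% systemctl stop cassandra@instance1                             # assumed service unit name
% rm -rf /srv/cassandra/instance1/data/swh/directory_entry-*/*   # remove this node's on-disk data for the table only
% systemctl start cassandra@instance1
# then re-stream the data with the full (non -pr) repair command shown earlier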
cassandra04 has finished the repair of the content table and the compactions to the different levels.
It has also finished the migration of directory_entry and the subsequent compactions. It is ready for the cleanup.
The compaction on cassandra07 is still in progress but almost done (~60 sstables left in L0 to compact).