Looks like the raw_extrinsic_metadata table has a lot of tombstones.
Some tablestats info:
```
Percent repaired: 99.97
Bytes repaired: 788.622GiB
Bytes unrepaired: 258.229MiB
Bytes pending repair: 0.000KiB
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Droppable tombstone ratio: 0.25052
```
The checker script is spamming the logs:
```
[2024-03-08 18:27:57,292] Server warning: Read 5000 live rows and 5805 tombstone cells for query
SELECT directory, fetcher_name, fetcher_version, format, metadata, origin, path, release, revision, snapshot, type, visit
FROM swh.raw_extrinsic_metadata
WHERE target = swh:1:dir:0c00174e41033b337f67575ce08c7492eedb5f9a
  AND authority_type = forge
  AND authority_url = https://opam.ocaml.org
  AND discovery_date > 2024-02-16T03:53:45.000Z
  AND id > 7339181526ca084c8826499c2f589ab313236a23
LIMIT 5000; token 3702119759087180356 (see tombstone_warn_threshold)
```
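For reference, this warning is governed by `tombstone_warn_threshold` in cassandra.yaml (its sibling `tombstone_failure_threshold` makes the read fail outright instead of just logging). The stock defaults are below; our production values may differ:

```yaml
# Stock cassandra.yaml defaults (production values may differ):
# a read scanning more tombstones than the warn threshold logs the
# warning above; crossing the failure threshold aborts the query.
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
```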
Having looked a bit, that table indeed has a high number of tombstones [1]. It's off the charts [2] compared with other tables.
In the current storage implementation [3], we write null cells when adding new raw extrinsic metadata. So either our Cassandra model should be reworked to avoid this (option 1), or the storage implementation should be adapted (option 2).
1. Rework the model: add one table per combination of nullable fields (currently the nullable values are release, revision, snapshot, directory)... That translates to more RawExtrinsicMetadata(name_1, ..., name_n) tables.
2. Adapt the storage implementation: maybe simpler, but no idea whether that's viable and/or what the implications are (see the sketch after this list):
   - Other references [3] [4] mention we could avoid setting any null values: when a field is None, do not mention the field at all in the insert statement.
   - Write empty strings instead of null when the data is None (sounds like a workaround, and we'd need to adapt the reading part too, with a risk of confusion if empty data strings can legitimately exist).
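For the first sub-option, the DataStax Python driver has a ready-made mechanism: binding `UNSET_VALUE` instead of `None` to a prepared statement leaves the column entirely unset, so no null cell (and thus no tombstone) is written. A minimal sketch, assuming a bare driver session and only a subset of the real columns (the actual swh-storage code is structured differently):

```python
# Minimal sketch: skip absent fields instead of writing null cells.
# Requires Cassandra >= 2.2 (native protocol v4); the column list below
# is a subset of swh.raw_extrinsic_metadata, for illustration only.
from cassandra.cluster import Cluster
from cassandra.query import UNSET_VALUE

cluster = Cluster(["cassandra01"])
session = cluster.connect("swh")

FIELDS = ("id", "target", "authority_type", "authority_url", "discovery_date",
          "format", "metadata", "origin", "path",
          "directory", "release", "revision", "snapshot")

insert = session.prepare(
    f"INSERT INTO raw_extrinsic_metadata ({', '.join(FIELDS)}) "
    f"VALUES ({', '.join('?' for _ in FIELDS)})"
)

def insert_rem(row: dict) -> None:
    # UNSET_VALUE leaves the column untouched on the server side;
    # binding None instead would write a null cell, i.e. a tombstone.
    values = tuple(UNSET_VALUE if row.get(f) is None else row[f] for f in FIELDS)
    session.execute(insert, values)
```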
[1]

```
root@cassandra01:~# /opt/cassandra/bin/nodetool -u $USER -pw $PASS tablestats swh.raw_extrinsic_metadata | grep tombstone
    Average tombstones per slice (last five minutes): 230.452850877193
    Maximum tombstones per slice (last five minutes): 17084
    Droppable tombstone ratio: 0.24825
```
[2]

```
root@cassandra01:~# for obj_type in content directory release revision extid origin origin_visit origin_visit_status snapshot raw_extrinsic_metadata; do date; echo "object_type: $obj_type"; /opt/cassandra/bin/nodetool -u $USER -pw $PASS tablestats swh.$obj_type | grep tombstone; done
Wed 20 Mar 2024 09:58:01 AM UTC
object_type: content
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.00000
Wed 20 Mar 2024 09:58:03 AM UTC
object_type: directory
    Average tombstones per slice (last five minutes): 1.0004125715872978
    Maximum tombstones per slice (last five minutes): 10
    Droppable tombstone ratio: 0.09998
Wed 20 Mar 2024 09:58:05 AM UTC
object_type: release
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.02080
Wed 20 Mar 2024 09:58:07 AM UTC
object_type: revision
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.00494
Wed 20 Mar 2024 09:58:09 AM UTC
object_type: extid
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.00000
Wed 20 Mar 2024 09:58:11 AM UTC
object_type: origin
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.00000
Wed 20 Mar 2024 09:58:13 AM UTC
object_type: origin_visit
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.00000
Wed 20 Mar 2024 09:58:15 AM UTC
object_type: origin_visit_status
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.08160
Wed 20 Mar 2024 09:58:17 AM UTC
object_type: snapshot
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Droppable tombstone ratio: 0.00000
Wed 20 Mar 2024 09:58:18 AM UTC
object_type: raw_extrinsic_metadata
    Average tombstones per slice (last five minutes): 303.1795989537925
    Maximum tombstones per slice (last five minutes): 17084
    Droppable tombstone ratio: 0.24826
```
In the specific case of the raw_extrinsic_metadata table, non-null values for the context fields will always be SWHIDs or a non-empty path, so I believe the empty string would be a good, unambiguous substitute for null values.
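A toy sketch of that convention (hypothetical helpers; the field list is taken from this discussion, and all fields are assumed to be text columns), mapping None to "" on write and back on read:

```python
# Hypothetical helpers for the empty-string convention: real values for
# these context fields are always SWHIDs or a non-empty path, so ""
# cannot collide with legitimate data.
CONTEXT_FIELDS = frozenset(
    {"origin", "path", "directory", "release", "revision", "snapshot"}
)

def to_db(row: dict) -> dict:
    """Replace None context fields by "" before inserting (no null cells)."""
    return {k: "" if k in CONTEXT_FIELDS and v is None else v
            for k, v in row.items()}

def from_db(row: dict) -> dict:
    """Restore None when reading back, so callers never see the sentinel."""
    return {k: None if k in CONTEXT_FIELDS and v == "" else v
            for k, v in row.items()}
```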
Perhaps this table would also deserve a bit of remodeling. It's the table with the big 1G partition (I thought it was directory_entry, but I was wrong).
The following grafana/prometheus query gives the exact numbers:

```
max(cassandra_table_maxpartitionsize{environment="production"}) by (table)
```
The dsbulk tool allows retrieving the ids of the N biggest partitions of a table.
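Presumably an invocation along these lines (a sketch using dsbulk's `count` mode with per-partition stats; option names may vary between dsbulk versions):

```
dsbulk count -k swh -t raw_extrinsic_metadata \
  --stats.modes partitions --stats.numPartitions 50
```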
After randomly picking among the 50 biggest partitions, it seems the directories are all related to opam, and more precisely to the tezos project.
For example:
```
cqlsh:swh> select count(*) from raw_extrinsic_metadata where target='swh:1:dir:ab964a5bd58ffce2bb24667bc08dd90b7ffaea84';

 count
--------
 138770

(1 rows)
```
A little more than 700 rows are added each day for this directory, and those rows are not the smallest ones.
I've no idea how this table is queried or how it could be partitioned differently. I'll create a dedicated issue on swh-storage to track the problem.