After the migration to G1, the servers are now subject to OOMs.
They are under heavy load (many replayers, a big repair in progress):
```
Dec 29 14:39:47 cassandra01 cassandra[777199]: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fa718b13000, 16384, 0) failed; error='Not enough space' (errno=12)
Dec 29 14:39:47 cassandra01 cassandra[777199]: #
Dec 29 14:39:47 cassandra01 cassandra[777199]: # There is insufficient memory for the Java Runtime Environment to continue.
Dec 29 14:39:47 cassandra01 cassandra[777199]: # Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
Dec 29 14:39:47 cassandra01 cassandra[777199]: # An error report file with more information is saved as:
Dec 29 14:39:47 cassandra01 cassandra[777199]: # /tmp/hs_err_pid777199.log
Dec 29 14:39:47 cassandra01 cassandra[777199]: [thread 828675 also had an error][thread 828704 also had an error][thread 843094 also had an error]
Dec 29 14:39:47 cassandra01 cassandra[777199]: [thread 1033436 also had an error][thread 1007071 also had an error][thread 1087563 also had an error][thread 841719 also had an error]
Dec 29 14:39:47 cassandra01 cassandra[777199]: [thread 1087561 also had an error]
Dec 29 14:40:00 cassandra01 systemd[1]: cassandra@instance1.service: Main process exited, code=exited, status=1/FAILURE
Dec 29 14:40:00 cassandra01 systemd[1]: cassandra@instance1.service: Failed with result 'exit-code'.
```
```
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:3127), pid=3357435, tid=2418466
#
# JRE version: OpenJDK Runtime Environment (11.0.16+8) (build 11.0.16+8-post-Debian-1deb11u1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.16+8-post-Debian-1deb11u1, mixed mode, tiered, g1 gc, linux-amd64)
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
```
```
% /opt/cassandra/bin/nodetool -h cassandra04 -u cassandra --password [REDACTED] tablestats swh.directory_entry
Keyspace : swh
        Read Count: 174359996
        Read Latency: 0.23323904041039323 ms
        Write Count: 510354710
        Write Latency: 0.036647582864474795 ms
        Pending Flushes: 0
                Table: directory_entry
                SSTable count: 90
                Old SSTable count: 0
                Space used (live): 5014401576598
                Space used (total): 5138722769348
                Space used by snapshots (total): 0
                Off heap memory used (total): 13214698768
                SSTable Compression Ratio: 0.6350494262508151
                Number of partitions (estimate): 3979350143
                Memtable cell count: 236422
                Memtable data size: 30414536
                Memtable off heap memory used: 0
                Memtable switch count: 311
                Local read count: 0
                Local read latency: NaN ms
                Local write count: 73808532
                Local write latency: 0.234 ms
                Pending flushes: 0
                Percent repaired: 91.43
                Bytes repaired: 6544.043GiB
                Bytes unrepaired: 612.012GiB
                Bytes pending repair: 1.033GiB
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 5772287376
                Bloom filter off heap memory used: 8691388208
                Index summary off heap memory used: 351504608
                Compression metadata off heap memory used: 4171805952
                Compacted partition minimum bytes: 73
                Compacted partition maximum bytes: 89970660
                Compacted partition mean bytes: 2054
                Average live cells per slice (last five minutes): NaN
                Maximum live cells per slice (last five minutes): 0
                Average tombstones per slice (last five minutes): NaN
                Maximum tombstones per slice (last five minutes): 0
                Dropped Mutations: 0
                Droppable tombstone ratio: 0.00000
----------------
```
Update:
cqlsh:swh> alter table swh.directory_entry WITH bloom_filter_fp_chance = 0.1;
The SSTables need to be rewritten before the change takes effect.
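As a sanity check on the expected saving, the textbook Bloom-filter sizing formula (bits per key = -ln(p) / ln(2)^2) can be applied to the partition count from the tablestats output above. This is only a back-of-the-envelope sketch: Cassandra sizes its filters per SSTable and in buckets, so the real numbers differ, but the ratio is what matters here.

```python
import math

def bloom_bits_per_key(p: float) -> float:
    """Optimal Bloom filter size, in bits per key, for false-positive rate p."""
    return -math.log(p) / (math.log(2) ** 2)

# "Number of partitions (estimate)" from the tablestats output above.
partitions = 3_979_350_143

for p in (0.01, 0.1):  # default fp_chance vs. the new value
    gib = partitions * bloom_bits_per_key(p) / 8 / 2**30
    print(f"fp_chance={p}: ~{bloom_bits_per_key(p):.1f} bits/key, ~{gib:.1f} GiB total")

# Since ln(0.01) = 2 * ln(0.1), moving fp_chance from 0.01 to 0.1
# exactly halves the per-key filter cost.
```

So the alter above should roughly halve the Bloom-filter off-heap footprint once the SSTables are rewritten, at the cost of ten times more false positives on reads.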
Switching to mmap_index_only has improved performance in these use cases:
Linux OOM killer terminates DSE due to rapid off-heap memory growth (mmap allocating too much too fast)
High end percentile latencies (for example, 99% and Max)
Sporadic read timeouts occur, usually in conjunction with latencies as above
If OOM errors, high latencies, and read timeouts are constantly observed, then consider setting disk_access_mode: mmap_index_only.
Note: these read errors are not observed in DSE 6.x and later, where the standard mode is the default and data files are not mapped into memory.
I will let the current repair finish (or fail) and try to test this configuration
Let's temporarily apply the same configuration as in https://forge.softwareheritage.org/T2492#46833 and check whether it's better.
(I will test that before changing the disk_access_mode configuration)
It looks like the OOM issue is still present.
Let's try the magic parameter `disk_access_mode: mmap_index_only`.
The parameter was tested in Vagrant and seems to be valid, even if it's not documented in the official Cassandra documentation.
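For reference, the change is a one-line addition to cassandra.yaml. The setting is the one tested above; since it is undocumented upstream, treat it as best-effort:

```yaml
# cassandra.yaml: mmap only the index files, read data files
# through standard buffered I/O instead of mapping them.
disk_access_mode: mmap_index_only
```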
If the repair completes correctly after that, the possible performance impact could be checked by comparing before and after with the same number of replayers (the arc config was also changed, which can have an impact too).
If the performance is very bad, we could check whether it's possible to roll back after the replaying phase.
Finally, the repairs finished correctly with the option activated.
The replayers were reconfigured with 64 content replayers and 48 directory replayers.
The directory_entry reaper scheduler was also reconfigured to launch a repair each day (as it already did), but additionally to launch one when the unrepaired level reaches 1%.
Regarding performance, there seems to be no impact on the directory or content replaying speed, at least for the write rate.
Some OOMs still occur with the disk_access_mode configuration, but the behavior is different: the system seems to still have free memory when the memory allocation is refused:
A possible cause is that the Cassandra process tries to reserve too many memory-mapped areas, which the kernel's vm.max_map_count limits to 65530 by default.
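If that's the limit being hit, it can be verified by comparing the process's mapping count against the kernel limit. A minimal sketch (Linux only; it reads /proc/self for illustration, substitute the Cassandra pid in practice):

```python
# Compare a process's memory-map usage against the kernel limit
# (vm.max_map_count, 65530 by default on most distributions).

def max_map_count() -> int:
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())

def map_count(pid: str = "self") -> int:
    # Each line of /proc/<pid>/maps describes one mapped region.
    with open(f"/proc/{pid}/maps") as f:
        return sum(1 for _ in f)

limit = max_map_count()
used = map_count("self")  # replace "self" with the Cassandra pid
print(f"{used}/{limit} mappings used ({100 * used / limit:.1f}%)")
```

If the count is near the limit, raising it (e.g. `sysctl -w vm.max_map_count=1048575`, persisted in /etc/sysctl.d/) would be the next thing to try; the exact value would need tuning.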