After the migration to G1, the servers are now subject to OOMs.
They are under heavy load (many replayers, a big repair in progress):
```
Dec 29 14:39:47 cassandra01 cassandra[777199]: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fa718b13000, 16384, 0) failed; error='Not enough space' (errno=12)
Dec 29 14:39:47 cassandra01 cassandra[777199]: #
Dec 29 14:39:47 cassandra01 cassandra[777199]: # There is insufficient memory for the Java Runtime Environment to continue.
Dec 29 14:39:47 cassandra01 cassandra[777199]: # Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
Dec 29 14:39:47 cassandra01 cassandra[777199]: # An error report file with more information is saved as:
Dec 29 14:39:47 cassandra01 cassandra[777199]: # /tmp/hs_err_pid777199.log
Dec 29 14:39:47 cassandra01 cassandra[777199]: [thread 828675 also had an error][thread 828704 also had an error][thread 843094 also had an error]
Dec 29 14:39:47 cassandra01 cassandra[777199]: [thread 1033436 also had an error][thread 1007071 also had an error][thread 1087563 also had an error][thread 841719 also had an error]
Dec 29 14:39:47 cassandra01 cassandra[777199]: [thread 1087561 also had an error]
Dec 29 14:40:00 cassandra01 systemd[1]: cassandra@instance1.service: Main process exited, code=exited, status=1/FAILURE
Dec 29 14:40:00 cassandra01 systemd[1]: cassandra@instance1.service: Failed with result 'exit-code'.
```
```
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:3127), pid=3357435, tid=2418466
#
# JRE version: OpenJDK Runtime Environment (11.0.16+8) (build 11.0.16+8-post-Debian-1deb11u1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.16+8-post-Debian-1deb11u1, mixed mode, tiered, g1 gc, linux-amd64)
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
```
```
% /opt/cassandra/bin/nodetool -h cassandra04 -u cassandra --password [REDACTED] tablestats swh.directory_entry
Keyspace : swh
        Read Count: 174359996
        Read Latency: 0.23323904041039323 ms
        Write Count: 510354710
        Write Latency: 0.036647582864474795 ms
        Pending Flushes: 0
                Table: directory_entry
                SSTable count: 90
                Old SSTable count: 0
                Space used (live): 5014401576598
                Space used (total): 5138722769348
                Space used by snapshots (total): 0
                Off heap memory used (total): 13214698768
                SSTable Compression Ratio: 0.6350494262508151
                Number of partitions (estimate): 3979350143
                Memtable cell count: 236422
                Memtable data size: 30414536
                Memtable off heap memory used: 0
                Memtable switch count: 311
                Local read count: 0
                Local read latency: NaN ms
                Local write count: 73808532
                Local write latency: 0.234 ms
                Pending flushes: 0
                Percent repaired: 91.43
                Bytes repaired: 6544.043GiB
                Bytes unrepaired: 612.012GiB
                Bytes pending repair: 1.033GiB
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 5772287376
                Bloom filter off heap memory used: 8691388208
                Index summary off heap memory used: 351504608
                Compression metadata off heap memory used: 4171805952
                Compacted partition minimum bytes: 73
                Compacted partition maximum bytes: 89970660
                Compacted partition mean bytes: 2054
                Average live cells per slice (last five minutes): NaN
                Maximum live cells per slice (last five minutes): 0
                Average tombstones per slice (last five minutes): NaN
                Maximum tombstones per slice (last five minutes): 0
                Dropped Mutations: 0
                Droppable tombstone ratio: 0.00000
----------------
```
Update:
cqlsh:swh> alter table swh.directory_entry WITH bloom_filter_fp_chance = 0.1;
The SSTables need to be rewritten before the change takes effect.
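As a sanity check on the expected saving, the textbook Bloom-filter sizing formula (bits per key = -ln(p) / ln(2)^2) can be applied to the partition count from the tablestats output above. This is only a back-of-the-envelope sketch: Cassandra sizes its filters per SSTable and in buckets, so the real numbers differ, but the ratio is what matters here.

```python
import math

def bloom_bits_per_key(p: float) -> float:
    """Optimal Bloom filter size, in bits per key, for false-positive rate p."""
    return -math.log(p) / (math.log(2) ** 2)

# "Number of partitions (estimate)" from the tablestats output above.
partitions = 3_979_350_143

for p in (0.01, 0.1):  # default fp_chance vs. the new value
    gib = partitions * bloom_bits_per_key(p) / 8 / 2**30
    print(f"fp_chance={p}: ~{bloom_bits_per_key(p):.1f} bits/key, ~{gib:.1f} GiB total")

# Since ln(0.01) = 2 * ln(0.1), moving fp_chance from 0.01 to 0.1
# exactly halves the per-key filter cost.
```

So the alter above should roughly halve the Bloom-filter off-heap footprint once the SSTables are rewritten, at the cost of ten times more false positives on reads.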
Switching to mmap_index_only has improved performance in these use cases:
Linux OOM killer terminates DSE due to rapid off-heap memory growth (mmap allocating too much too fast)
High end percentile latencies (for example, 99% and Max)
Sporadic read timeouts occur, usually in conjunction with latencies as above
If OOM errors, high latencies, and read timeouts are constantly observed, then consider setting disk_access_mode: mmap_index_only.
Note: these read errors are not observed in DSE 6.x and later, where the standard mode is the default and data files are not mapped into memory.
I will let the current repair finish (or fail) and try to test this configuration
Let's temporarily apply the same configuration as in https://forge.softwareheritage.org/T2492#46833 and check whether it's better.
(I will test that before changing the disk_access_mode configuration)
It looks like the OOM issue is still present.
Let's try the magic parameter `disk_access_mode: mmap_index_only`.
The parameter was tested in Vagrant and seems to be valid, even if it's not documented in the official Cassandra documentation.
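For reference, the change is a one-line addition to cassandra.yaml. The setting is the one tested above; since it is undocumented upstream, treat it as best-effort:

```yaml
# cassandra.yaml: mmap only the index files, read data files
# through standard buffered I/O instead of mapping them.
disk_access_mode: mmap_index_only
```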
If the repair completes correctly after that, the possible performance impact could be checked by comparing before and after with the same number of replayers (the arc config was also changed, which can have an impact too).
If the performance is very bad, we could check whether it's possible to roll back after the replaying phase.
Finally, the repairs finished correctly with the option activated.
The replayers were reconfigured with 64 content replayers and 48 directory replayers.
The directory_entry reaper scheduler was also reconfigured to launch a repair each day (as it already did), but additionally to launch one when the unrepaired level reaches 1%.
Regarding performance, there seems to be no impact on the directory or content replaying speed, at least for the write rate.
Some OOMs still occur with the disk_access_mode configuration, but the behavior is different: the system seems to still have free memory when the memory allocation is refused:
A possible cause is that the Cassandra process tries to reserve too many memory-mapped areas, which the kernel's vm.max_map_count limits to 65530 by default.
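If that's the limit being hit, it can be verified by comparing the process's mapping count against the kernel limit. A minimal sketch (Linux only; it reads /proc/self for illustration, substitute the Cassandra pid in practice):

```python
# Compare a process's memory-map usage against the kernel limit
# (vm.max_map_count, 65530 by default on most distributions).

def max_map_count() -> int:
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())

def map_count(pid: str = "self") -> int:
    # Each line of /proc/<pid>/maps describes one mapped region.
    with open(f"/proc/{pid}/maps") as f:
        return sum(1 for _ in f)

limit = max_map_count()
used = map_count("self")  # replace "self" with the Cassandra pid
print(f"{used}/{limit} mappings used ({100 * used / limit:.1f}%)")
```

If the count is near the limit, raising it (e.g. `sysctl -w vm.max_map_count=1048575`, persisted in /etc/sysctl.d/) would be the next thing to try; the exact value would need tuning.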