OSD crashes after autoscaler PG scale up
After deploying some monitoring on the Ceph cluster nodes, we finally started the benchmark suite on Friday 2024-01-26 afternoon. While doing so, we did a quick review of the Ceph pool settings for the shards/shards-data rbd pools, on which we had started to ingest images with winery.

During the review we noticed that shards-data had very few PGs (64), which kept most OSDs idle, or at least very unevenly loaded. As the autoscaler was set up, we decided to just go ahead and enable the "bulk" flag on the shards-data pool to let the autoscaler scale up the number of PGs.
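For reference, enabling the bulk flag is a single pool-level command (a sketch of what was run, assuming a standard admin keyring; the exact invocation is not in the original notes):

```shell
# Mark the pool as "bulk" so the PG autoscaler immediately targets a
# high PG count instead of scaling up gradually.
ceph osd pool set shards-data bulk true

# Inspect what the autoscaler now plans to do for each pool.
ceph osd pool autoscale-status
```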
The autoscaler immediately moved the pool to 4096 PGs and started the data movement process.
As soon as the reallocation started, 10-15% of the OSDs crashed hard. The crash is persistent (the OSDs crash again as soon as systemd restarts them), so we consider the data lost and the cluster unavailable.
Remedial steps attempted (some of them happened multiple times, so the order isn't guaranteed):
- manual restart of the OSDs that were disabled by systemd after consecutive crashes
  - no difference: the crash is persistent
- review of similar upstream tickets:
  - https://tracker.ceph.com/issues/53584
  - https://tracker.ceph.com/issues/55662
- attempt to set osd_read_ec_check_for_errors = true on all OSDs
  - no mitigation of the crash
- attempt to set
- revert of the bulk flag on the pool
  - autoscaler target config moved back to 64 PGs
  - no impact on data availability after restarting the crashed OSDs
- ceph osd set noout
  - stabilized the number of crashed OSDs (as no new reallocations are happening)
  - restarting the dead OSDs still does not revive them
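The steps above roughly correspond to the following commands (a hedged sketch, not a transcript; pool and option names are taken from the notes, the exact syntax used at the time was not recorded):

```shell
# Re-enable and restart an OSD that systemd disabled after repeated crashes
# (osd id 42 is a placeholder).
systemctl reset-failed ceph-osd@42
systemctl restart ceph-osd@42

# Runtime option change tried on all OSDs (no effect on the crash).
ceph config set osd osd_read_ec_check_for_errors true

# Revert the bulk flag and move the autoscaler target back to 64 PGs.
ceph osd pool set shards-data bulk false
ceph osd pool set shards-data pg_num 64

# Stop further reallocations so no additional OSDs are dragged down.
ceph osd set noout
```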
All the current diagnostic information is dumped below:
- ceph status: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-status-2024-01-29-143117.txt
cluster:
id: e0a98ad0-fd1f-4079-894f-ed4554ce40c6
health: HEALTH_ERR
noout flag(s) set
25 osds down
7055371 scrub errors
Reduced data availability: 138 pgs inactive, 103 pgs down
Possible data damage: 30 pgs inconsistent
Degraded data redundancy: 1797720/26981188 objects degraded (6.663%), 47 pgs degraded, 130 pgs undersized
49 daemons have recently crashed
services:
mon: 3 daemons, quorum dwalin001,dwalin003,dwalin002 (age 2d)
mgr: dwalin003(active, since 2d), standbys: dwalin001, dwalin002
osd: 240 osds: 190 up (since 7h), 215 in (since 2d); 73 remapped pgs
flags noout
data:
pools: 6 pools, 389 pgs
objects: 3.85M objects, 15 TiB
usage: 18 TiB used, 2.0 PiB / 2.0 PiB avail
pgs: 35.476% pgs not active
1797720/26981188 objects degraded (6.663%)
134 active+clean
73 down+remapped
62 active+undersized
29 down
29 active+undersized+degraded
22 active+clean+inconsistent
21 undersized+peered
11 undersized+degraded+peered
4 active+undersized+degraded+inconsistent
3 undersized+degraded+inconsistent+peered
1 down+inconsistent
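As a quick sanity check (ours, not part of the original dump), the percentages in the status output are internally consistent with the raw counts it reports:

```python
# Cross-check the ratios printed by `ceph status` against its raw counts.
degraded, total = 1_797_720, 26_981_188   # degraded / total object copies
inactive, pgs = 138, 389                  # inactive PGs / total PGs

assert round(degraded / total * 100, 3) == 6.663    # "6.663% objects degraded"
assert round(inactive / pgs * 100, 3) == 35.476     # "35.476% pgs not active"
print("status percentages are consistent")
```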
- ceph report: ceph-report-2024-01-29-152825.txt
- ceph health detail: ceph-health-detail-2024-01-29-143133.txt
- ceph crash ls: ceph-crash-ls-2024-01-29-143402.txt
- full logs (1.1 GB compressed, 31 GB uncompressed): https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-2024-01-26.tar.zst