OSD crashes after autoscaler PG scale up
After deploying some monitoring on the Ceph cluster nodes, we finally started the benchmark suite on Friday 2024-01-26 afternoon. While doing so, we did a quick review of the Ceph pool settings for the shards/shards-data rbd pools, on which we had started to ingest images with winery.

During the review we noticed that shards-data had very few PGs (64), which kept most OSDs idle, or at least very unevenly loaded. As the autoscaler was set up, we decided to just go ahead and enable the "bulk" flag on the shards-data pool to let the autoscaler scale up the number of PGs.
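For reference, enabling the bulk flag is a single pool-level command (a sketch of what was run, assuming a standard admin keyring; the exact invocation is not in the original notes):

```shell
# Mark the pool as "bulk" so the PG autoscaler immediately targets a
# high PG count instead of scaling up gradually.
ceph osd pool set shards-data bulk true

# Inspect what the autoscaler now plans to do for each pool.
ceph osd pool autoscale-status
```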
The autoscaler immediately moved the pool to 4096 PGs and started the data movement process.
As soon as the reallocation started, 10-15% of the OSDs crashed hard. The crash is persistent (the OSDs crash again as soon as systemd restarts them), so we consider the data lost and the cluster unavailable.
Remedial steps attempted (some of them happened multiple times, so the order isn't guaranteed):
- manual restart of the OSDs that were disabled by systemd after consecutive crashes
  - no difference: the crash is persistent
- review of similar upstream tickets:
  - https://tracker.ceph.com/issues/53584
  - https://tracker.ceph.com/issues/55662
- attempt to set osd_read_ec_check_for_errors = true on all OSDs
  - no mitigation of the crash
- attempt to set
- revert of the bulk flag on the pool
  - autoscaler target config moved back to 64 PGs
  - no impact on data availability after restarting the crashed OSDs
- ceph osd set noout
  - stabilized the number of crashed OSDs (as no new reallocations are happening)
  - restarting the dead OSDs still does not revive them
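The steps above roughly correspond to the following commands (a hedged sketch, not a transcript; pool and option names are taken from the notes, the exact syntax used at the time was not recorded):

```shell
# Re-enable and restart an OSD that systemd disabled after repeated crashes
# (osd id 42 is a placeholder).
systemctl reset-failed ceph-osd@42
systemctl restart ceph-osd@42

# Runtime option change tried on all OSDs (no effect on the crash).
ceph config set osd osd_read_ec_check_for_errors true

# Revert the bulk flag and move the autoscaler target back to 64 PGs.
ceph osd pool set shards-data bulk false
ceph osd pool set shards-data pg_num 64

# Stop further reallocations so no additional OSDs are dragged down.
ceph osd set noout
```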
All the current diagnostic information is dumped below:
- ceph status: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-status-2024-01-29-143117.txt
cluster:
id: e0a98ad0-fd1f-4079-894f-ed4554ce40c6
health: HEALTH_ERR
noout flag(s) set
25 osds down
7055371 scrub errors
Reduced data availability: 138 pgs inactive, 103 pgs down
Possible data damage: 30 pgs inconsistent
Degraded data redundancy: 1797720/26981188 objects degraded (6.663%), 47 pgs degraded, 130 pgs undersized
49 daemons have recently crashed
services:
mon: 3 daemons, quorum dwalin001,dwalin003,dwalin002 (age 2d)
mgr: dwalin003(active, since 2d), standbys: dwalin001, dwalin002
osd: 240 osds: 190 up (since 7h), 215 in (since 2d); 73 remapped pgs
flags noout
data:
pools: 6 pools, 389 pgs
objects: 3.85M objects, 15 TiB
usage: 18 TiB used, 2.0 PiB / 2.0 PiB avail
pgs: 35.476% pgs not active
1797720/26981188 objects degraded (6.663%)
134 active+clean
73 down+remapped
62 active+undersized
29 down
29 active+undersized+degraded
22 active+clean+inconsistent
21 undersized+peered
11 undersized+degraded+peered
4 active+undersized+degraded+inconsistent
3 undersized+degraded+inconsistent+peered
1 down+inconsistent
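As a quick sanity check (ours, not part of the original dump), the percentages in the status output are internally consistent with the raw counts it reports:

```python
# Cross-check the ratios printed by `ceph status` against its raw counts.
degraded, total = 1_797_720, 26_981_188   # degraded / total object copies
inactive, pgs = 138, 389                  # inactive PGs / total PGs

assert round(degraded / total * 100, 3) == 6.663    # "6.663% objects degraded"
assert round(inactive / pgs * 100, 3) == 35.476     # "35.476% pgs not active"
print("status percentages are consistent")
```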
- ceph report: ceph-report-2024-01-29-152825.txt
- ceph health detail: ceph-health-detail-2024-01-29-143133.txt
- ceph crash ls: ceph-crash-ls-2024-01-29-143402.txt
- full logs (1.1 GB compressed, 31 GB uncompressed): https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-2024-01-26.tar.zst