All servers were set up by @vsellier first and then by me, following the docs vince started (cea/Readme.md).
I've iterated over the documentation to clarify some tidbits ;).
The plan got updated accordingly.
What remains is the actual ceph massaging.
I've now installed OSDs on all 12 disks on each host.
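(For the record, on a non-cephadm deployment like this one, the OSD creation on the 12 data disks boils down to something like the ceph-volume call below; the real invocation, including db devices and sizing, is the one documented in cea/Readme.md, so treat this strictly as a sketch with illustrative device names.)

[on each storage host, sketch only]
$ sudo ceph-volume lvm batch --bluestore /dev/sdb /dev/sdc /dev/sdd   # ... one entry per data disk
$ sudo ceph-volume lvm list                                           # check the OSDs were created and activated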
The data movement is in progress.
$ ceph status
  cluster:
    id:     e0a98ad0-fd1f-4079-894f-ed4554ce40c6
    health: HEALTH_WARN
            104 OSD(s) experiencing BlueFS spillover
            4914 pgs not deep-scrubbed in time
            6244 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum dwalin001,dwalin003,dwalin002 (age 34h)
    mgr: dwalin002(active, since 34h), standbys: dwalin003, dwalin001
    osd: 312 osds: 312 up (since 9h), 312 in (since 9h); 4328 remapped pgs

  data:
    pools:   7 pools, 6497 pgs
    objects: 130.12M objects, 496 TiB
    usage:   643 TiB used, 2.7 PiB / 3.3 PiB avail
    pgs:     170710916/799357874 objects misplaced (21.356%)
             4257 active+remapped+backfill_wait
             2169 active+clean
             71   active+remapped+backfilling

  io:
    client:   1.2 MiB/s rd, 19 op/s rd, 0 op/s wr
    recovery: 2.5 GiB/s, 645 objects/s
I've done some tuning to try and speed up the recovery.
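(For reference, the tuning in question is the usual recovery/backfill throttling knobs; the values below are a sketch, not necessarily the exact ones applied here.)

$ ceph config set osd osd_max_backfills 4             # allow more concurrent backfills per OSD
$ ceph config set osd osd_recovery_max_active_hdd 8   # more in-flight recovery ops per OSD
$ ceph config set osd osd_recovery_sleep_hdd 0        # remove the sleep between recovery ops on HDDs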
The pgs won't be scrubbed properly until the backfill is over, which will take a couple of weeks.
If the recovery seems stuck, restarting one of the OSDs acting as primary for one of the pgs in backfill or backfill_wait status seems to unstick it. To find such a pg (and its primary OSD), run:
$ ceph pg ls | grep backfill | head -10
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP LAST_SCRUB_DURATION SCRUB_SCHEDULING
8.0 27323 0 54646 0 114595454976 0 0 1929 3000 active+remapped+backfill_wait 9h 109626'203400 111242:892716 [219,18,191,304,6,97]p219 [219,18,191,37,257,97]p219 2024-07-22T21:14:19.817114+0200 2024-07-22T21:14:19.817114+0200 3055 queued for deep scrub
8.1 27324 0 54648 0 114603294720 0 0 2017 3000 active+remapped+backfill_wait 9h 109586'203105 111242:785986 [254,128,87,0,306,53]p254 [217,128,87,0,256,53]p217 2024-07-22T16:46:08.040455+0200 2024-07-22T16:46:08.040455+0200 3546 queued for deep scrub
8.7 27124 0 54248 0 113761914880 0 0 1866 3000 active+remapped+backfill_wait 9h 109641'203703 111243:937106 [7,199,32,224,248,177]p7 [7,199,32,224,46,251]p7 2024-07-21T13:23:38.208798+0200 2024-07-16T01:10:32.254270+0200 105 queued for deep scrub
8.8 27233 0 81699 0 114219597824 0 0 1856 3000 active+remapped+backfill_wait 9h 109626'203644 111242:895743 [121,268,251,252,180,154]p121 [121,109,34,142,180,154]p121 2024-07-21T07:25:57.932327+0200 2024-07-21T07:25:57.932327+0200 3279 queued for deep scrub
8.9 27531 0 82593 0 115473383424 0 0 1873 3000 active+remapped+backfill_wait 9h 109560'204017 111242:851377 [263,176,102,6,62,307]p263 [107,176,24,6,62,78]p107 2024-07-23T03:04:33.086225+0200 2024-07-21T19:12:20.314656+0200 107 queued for deep scrub
8.b 27412 0 54824 0 114974261248 0 0 1831 3000 active+remapped+backfill_wait 9h 109630'203070 111243:846876 [80,278,66,280,139,178]p80 [80,253,66,151,139,178]p80 2024-07-22T16:04:47.769489+0200 2024-07-18T17:41:28.716859+0200 106 queued for deep scrub
8.c 27164 0 81492 0 113930366976 0 0 1730 3000 active+remapped+backfill_wait 9h 109641'202429 111242:831995 [243,55,262,274,213,219]p243 [29,55,11,125,213,219]p29 2024-07-22T06:26:30.674646+0200 2024-07-19T13:51:01.415851+0200 105 queued for deep scrub
8.f 27605 0 55210 0 115783561216 0 0 2115 3000 active+remapped+backfill_wait 9h 109641'202710 111242:899567 [35,182,224,301,58,271]p35 [35,182,224,146,58,244]p35 2024-07-19T18:38:09.383061+0200 2024-07-18T07:23:06.313555+0200 105 queued for deep scrub
8.10 27313 0 27313 0 114559025152 0 0 1961 3000 active+remapped+backfill_wait 9h 109625'204192 111243:814538 [18,126,286,181,36,198]p18 [18,126,162,181,36,198]p18 2024-07-22T09:56:04.555345+0200 2024-07-20T22:55:19.531073+0200 36 queued for deep scrub
8.11 27328 0 54656 0 114621939712 0 0 1780 3000 active+remapped+backfill_wait 9h 109641'204344 111243:1013967 [111,262,299,134,81,295]p111 [111,262,196,134,81,245]p111 2024-07-22T15:35:55.708052+0200 2024-07-22T15:35:55.708052+0200 3749 queued for deep scrub
The primary OSD for the PG is the number after the closing bracket in the "ACTING" column (for instance, for pg 8.0 this is OSD 219; for pg 8.c it would be OSD 29).
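(If grepping that wide table gets tedious, the json output can be parsed instead. This is a sketch assuming the json form of `ceph pg ls` exposes pgid, state and acting_primary fields, which it does on recent releases as far as I can tell.)

$ ceph pg ls -f json \
    | jq -r '.pg_stats[] | select(.state | test("backfill")) | "\(.pgid) \(.acting_primary)"' \
    | head -10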
To restart an OSD cleanly, we need to prevent (further) data movement while it is down, then restart the OSD:
[on the OSD host]
$ ceph osd set noout
$ sudo systemctl restart ceph-osd@${osd_number}
$ ceph status   # wait for the OSD to be back "up" and "in"
$ ceph osd unset noout
All data movement has completed; the OSDs are fully in service.
Pending scrubs are running; it's not clear yet whether they'll catch up (that was already tracked in #5368).
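(To keep an eye on whether the scrub backlog is actually shrinking, the counters in the health summary are enough.)

$ ceph health | tr ';' '\n' | grep 'scrubbed in time'   # shows the "pgs not (deep-)scrubbed in time" counts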
It seems that some combination of pgs stuck in backfilling mode and OSDs being restarted has caused heavy issues on the frontend servers (blocking all reads); this is tracked separately in #5378.