The flapping subsided when beaubourg was upgraded. My guess is that there is some interaction issue when the array is connected to servers running different kernel versions.
When both servers were upgraded, DRBD wouldn't recognize the disks: drbd-overview showed the replica pair as Diskless/Diskless, and lvm wouldn't come back up. Trying to bring the disks back up with drbdadm attach r0 failed with a "no meta data found" error. After poking around for a while, I rebooted beaubourg (which was Primary before its upgrade) on the old kernel, which let it come back up as Primary for the drbd pair.
After the reboot of beaubourg on the old kernel, I did a drbdadm create-md r0 on louvre, to try to re-create the replica. It turns out that /that/ recognized the metadata on disk:
    You want me to create a v08 style flexible-size internal meta data block.
    There appears to be a v09 flexible-size internal meta data block
    already in place on /dev/vg-louvre/drbd at byte offset 300647706624

    Valid v09 meta-data found, convert to v08?
    [need to type 'yes' to confirm] yes

    md_offset 300647706624
    al_offset 300647673856
    bm_offset 300638498816

    Found LVM2 physical volume signature
       293588992 kB data area apparently used
       293592284 kB left usable by current configuration

    Even though it looks like this would place the new meta data into
    unused space, you still need to confirm, as this is only a guess.

    Do you want to proceed?
    [need to type 'yes' to confirm] yes

    Writing meta data...
    New drbd meta data block successfully created.
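As a sanity check on that output, the reported offsets can be compared directly. This is only a sketch: the values are copied from the create-md output, and reading bm_offset as the upper bound of the usable data area is an inference from the figures lining up, not something the tool states.

```python
# Values copied from the drbdadm create-md output (bytes unless noted).
md_offset = 300647706624
al_offset = 300647673856
bm_offset = 300638498816
data_used_kb = 293588992   # "data area apparently used"
usable_kb = 293592284      # "left usable by current configuration"

KIB = 1024

# Assumption: the usable data area ends where the bitmap region starts.
usable_from_bm_kb = bm_offset // KIB
# Gap between the end of the data in use and the start of the metadata.
headroom_kb = usable_kb - data_used_kb

print("usable area derived from bm_offset:", usable_from_bm_kb, "kB")
print("matches reported usable size:", usable_from_bm_kb == usable_kb)
print("headroom between used data and metadata:", headroom_kb, "kB")
print("bitmap region:", (al_offset - bm_offset) // KIB, "kB")
print("activity log region:", (md_offset - al_offset) // KIB, "kB")
```

The data in use ends 3292 kB below the start of the metadata, which is consistent with the tool's guess that the new block lands in unused space.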
After a drbdadm attach r0 on louvre, the sync completed and louvre was a proper secondary again.
I then stopped all services on beaubourg, then manually set the drbd device to secondary (ensuring no further changes, and that louvre's copy was UpToDate). I rebooted under the new kernel, and the drbd device came back up as Primary... with a Diskless status ???
I did the drbdadm create-md r0 dance on beaubourg as well, and after drbdadm attach r0 the replica pair is back to healthy status.
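For the record, the sequence that worked on each node can be written down as a checklist. The sketch below only prints the steps and runs nothing; the function names are made up for illustration, and the demotion is assumed to have been done with drbdadm secondary. Note that create-md is interactive and asks for the 'yes' confirmations shown earlier.

```python
def recovery_plan(resource="r0", was_primary=False):
    """Return (note, argv) steps; argv is None for manual steps."""
    steps = []
    if was_primary:
        # Only needed on the node that currently holds the Primary role.
        steps.append(("stop all services using the volume (manual)", None))
        steps.append(("demote to Secondary", ["drbdadm", "secondary", resource]))
        steps.append(("reboot into the new kernel (manual)", None))
    # Despite its name, create-md here converts the metadata already on disk.
    steps.append(("convert the existing metadata", ["drbdadm", "create-md", resource]))
    steps.append(("attach the backing disk", ["drbdadm", "attach", resource]))
    return steps


def format_plan(steps):
    """Render the steps as printable lines, one per step."""
    lines = []
    for note, argv in steps:
        command = " ".join(argv) if argv else "(no command)"
        lines.append(f"{command:<24} # {note}")
    return lines


# Checklist for the node that was Primary (beaubourg in this case).
for line in format_plan(recovery_plan("r0", was_primary=True)):
    print(line)
```

Calling recovery_plan("r0") without was_primary gives the two-step variant used on louvre.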
Considering those issues, as well as the stability issues we had when using drbd in Primary/Primary mode in the past, I'll migrate the remaining machines away from this drbd volume and get rid of it once and for all. We'll be able to move VM storage to Ceph (which has better integration with Proxmox) soon enough.
For info: it looks like the fix for #755 (closed), which is now deployed on pergamon, requires the version of monitoring-plugins-basic that is on Debian Stretch; previous versions do not have the --only-critical flag.
So, until this is fixed, the pending package upgrades check reports status "unknown" on all workers. (Which isn't a big deal.)
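A quick way to tell whether a given host is affected is to look for the flag in the plugin's help output. This is a rough sketch under two assumptions that are not stated above: that the check in question is check_apt, and that it lives at the usual Debian plugin path. The helper names are hypothetical.

```python
import subprocess

# Assumption: usual Debian install path for the plugin; adjust as needed.
CHECK_APT = "/usr/lib/nagios/plugins/check_apt"


def help_mentions_flag(help_text, flag="--only-critical"):
    """Return True if the flag appears in the plugin's help text."""
    return flag in help_text


def plugin_supports_flag(plugin=CHECK_APT, flag="--only-critical"):
    """Run '<plugin> --help' and look for the flag; False if it cannot run."""
    try:
        result = subprocess.run(
            [plugin, "--help"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return help_mentions_flag(result.stdout + result.stderr, flag)


if __name__ == "__main__":
    print("--only-critical supported:", plugin_supports_flag())
```

Hosts where this prints False would keep reporting "unknown" for that check until their monitoring-plugins-basic package is upgraded.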