upgrade all machines to Debian Stretch

added System administration priority:Normal labels

added state:wip label

changed the description

changed title from upgrade all machines to debian stretch to upgrade all machines to Debian Stretch

marked the checklist item getty as completed

marked the checklist item pergamon as completed

marked the checklist item saatchi as completed

marked the checklist item banco as completed

marked the checklist item tate as completed

tate still has PHP 5 installed as that's what our current version of mediawiki supports (pending #434 (closed)).

! In #761 (closed), @olasd wrote: tate still has PHP 5 installed as that's what our current version of mediawiki supports (pending #434 (closed)).

Fixed now.

marked the checklist item somerset as completed

Somerset has been upgraded and migrated to PostgreSQL 10 (via pg_upgrade). However, pglogical fails to resume replication; more investigation is needed.

marked the checklist item beaubourg as completed

marked the checklist item louvre as completed

marked the checklist item prado as completed

marked the checklist item uffizi as completed

Only the workers are left to be upgraded.

Multipath on the MD3260 was flapping on louvre while beaubourg wasn't upgraded, with the following messages:

[ 4393.957050] device-mapper: multipath: Reinstating path 8:224.
[ 4393.957556] device-mapper: multipath: Reinstating path 8:240.
[ 4393.970855] sd 1:0:1:1: rdac: array SoftwareHeritage1, ctlr 0, queueing MODE_SELECT command
[ 4393.970997] sd 1:0:1:1: rdac: array SoftwareHeritage1, ctlr 0, MODE_SELECT returned with sense 05/24/00
[ 4393.970999] device-mapper: multipath: Failing path 8:224.
[ 4393.971567] sd 1:0:1:2: rdac: array SoftwareHeritage1, ctlr 0, queueing MODE_SELECT command
[ 4393.971700] sd 1:0:1:2: rdac: array SoftwareHeritage1, ctlr 0, MODE_SELECT returned with sense 05/24/00
[ 4393.971703] device-mapper: multipath: Failing path 8:240.

The flapping subsided when beaubourg was upgraded. I'm guessing there are some interaction issues when the array is connected to servers with different kernel versions.
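
To quantify this kind of flapping, the kernel messages above can be paired up per path. The following is a minimal sketch, not something that was run during the upgrade: the function names are made up for illustration, and the regular expression only assumes the exact message format shown in the log excerpt.

```python
import re

# Log excerpt copied verbatim from the kernel messages above.
LOG = """\
[ 4393.957050] device-mapper: multipath: Reinstating path 8:224.
[ 4393.957556] device-mapper: multipath: Reinstating path 8:240.
[ 4393.970855] sd 1:0:1:1: rdac: array SoftwareHeritage1, ctlr 0, queueing MODE_SELECT command
[ 4393.970997] sd 1:0:1:1: rdac: array SoftwareHeritage1, ctlr 0, MODE_SELECT returned with sense 05/24/00
[ 4393.970999] device-mapper: multipath: Failing path 8:224.
[ 4393.971567] sd 1:0:1:2: rdac: array SoftwareHeritage1, ctlr 0, queueing MODE_SELECT command
[ 4393.971700] sd 1:0:1:2: rdac: array SoftwareHeritage1, ctlr 0, MODE_SELECT returned with sense 05/24/00
[ 4393.971703] device-mapper: multipath: Failing path 8:240.
"""

# Matches only the device-mapper multipath state changes; rdac lines are ignored.
EVENT = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+device-mapper: multipath: "
    r"(Reinstating|Failing) path (\d+:\d+)\.?$"
)


def path_events(log):
    """Return (timestamp, path, state) tuples in log order."""
    events = []
    for line in log.splitlines():
        match = EVENT.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(3), match.group(2)))
    return events


def flaps(events):
    """Pair each 'Reinstating' with the next 'Failing' on the same path.

    Returns (path, seconds_the_path_stayed_up) tuples.
    """
    reinstated_at = {}
    result = []
    for timestamp, path, state in events:
        if state == "Reinstating":
            reinstated_at[path] = timestamp
        elif path in reinstated_at:
            result.append((path, timestamp - reinstated_at.pop(path)))
    return result


if __name__ == "__main__":
    for path, seconds in flaps(path_events(LOG)):
        print(f"path {path} failed {seconds * 1000:.1f} ms after being reinstated")
```

On the excerpt above, both paths (8:224 and 8:240) fail roughly 14 ms after being reinstated, right after the rdac MODE_SELECT command returns.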

When both servers were upgraded, DRBD wouldn't recognize the disks (drbd-overview would show the replica pair as Diskless/Diskless, and lvm wouldn't come back up). Trying to bring the disks back up with drbdadm attach r0 would fail with a no meta data found error. After poking around for a while, I rebooted beaubourg (which was Primary before its upgrade) on the old kernel, which let it come back up as Primary for the drbd pair.

After the reboot of beaubourg on the old kernel, I did a drbdadm create-md r0 on louvre, to try to re-create the replica. It turns out that /that/ recognized the metadata on disk:

You want me to create a v08 style flexible-size internal meta data block.
There appears to be a v09 flexible-size internal meta data block
already in place on /dev/vg-louvre/drbd at byte offset 300647706624

Valid v09 meta-data found, convert to v08?
[need to type 'yes' to confirm] yes

md_offset 300647706624
al_offset 300647673856
bm_offset 300638498816

Found LVM2 physical volume signature
   293588992 kB data area apparently used
   293592284 kB left usable by current configuration

Even though it looks like this would place the new meta data into
unused space, you still need to confirm, as this is only a guess.

Do you want to proceed?
[need to type 'yes' to confirm] yes

Writing meta data...
New drbd meta data block successfully created.
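
The offsets in this output can be cross-checked against the reported usable size. The sketch below only does arithmetic on the numbers printed above; reading al_offset as the start of the activity log and bm_offset as the start of the bitmap follows DRBD's usual internal meta data layout, which is an assumption here rather than something the output states.

```python
# Values copied from the drbdadm create-md output above.
MD_OFFSET = 300647706624  # bytes
AL_OFFSET = 300647673856  # bytes
BM_OFFSET = 300638498816  # bytes
USED_KB = 293588992       # "data area apparently used"
USABLE_KB = 293592284     # "left usable by current configuration"

KIB = 1024


def metadata_layout(md_offset, al_offset, bm_offset):
    """Derive region sizes from the three offsets (all in bytes).

    Assumes the regions are laid out as data, bitmap, activity log,
    meta data block, in increasing offset order.
    """
    return {
        "usable_data_kb": bm_offset // KIB,
        "bitmap_bytes": al_offset - bm_offset,
        "activity_log_bytes": md_offset - al_offset,
    }


if __name__ == "__main__":
    layout = metadata_layout(MD_OFFSET, AL_OFFSET, BM_OFFSET)
    print(layout)
    # The data area ends where the bitmap starts, which matches the
    # "left usable" figure printed by drbdadm.
    print("matches reported usable size:", layout["usable_data_kb"] == USABLE_KB)
    print("headroom above the LVM data area (kB):", USABLE_KB - USED_KB)
```

With these numbers, bm_offset divided by 1024 is exactly 293592284 kB, the bitmap spans 9175040 bytes, the activity log spans 32768 bytes, and the existing LVM data area ends 3292 kB below the start of the meta data. That is consistent with drbdadm's guess that the new meta data lands in unused space.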

After a drbdadm attach r0 on louvre, the sync completed and louvre was a proper secondary again.

I then stopped all services on beaubourg, then manually set the drbd device to secondary (ensuring no further changes, and that louvre's copy was UpToDate). I rebooted under the new kernel, and the drbd device came back up as Primary... with a Diskless status ???

I did the drbdadm create-md r0 dance on beaubourg as well, and after drbdadm attach r0 the replica pair is back to healthy status.

Considering those issues, as well as the stability issues we had when using drbd in Primary/Primary mode in the past, I'll migrate the remaining machines away from this drbd volume and get rid of it once and for all. We'll be able to move VM storage to Ceph (which has better integration with Proxmox) soon enough.

Following the woes with drbd, I have now removed all traces of it from our hypervisors.

mentioned in commit swh-sysadmin-provisioning@d8b0bcdb

I've now upgraded PostgreSQL to version 10 everywhere. Replication needs to be reset.

I've started doing the work to reinstall workers under stretch. worker01 is back up and running, but there are two issues:

nfs mounts fail on boot
the new "persistent" interface naming is weird (interfaces come up as ens18/19), and confuses our configuration management
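
For context on the interface names: under the systemd/udev predictable naming scheme, a name like ens18 is an Ethernet device (en) identified by its PCI hotplug slot (s18). The following sketch classifies names according to a subset of that scheme. The function name and the scheme labels are made up for illustration, and this is not how our configuration management handles it.

```python
import re

# Subset of the systemd/udev predictable interface naming scheme.
# Prefix: en = Ethernet, wl = WLAN, ww = WWAN.
SCHEMES = [
    ("onboard index", re.compile(r"^(?:en|wl|ww)o\d+$")),            # e.g. eno1
    ("hotplug slot", re.compile(r"^(?:en|wl|ww)s\d+(?:f\d+)?$")),    # e.g. ens18
    ("PCI location", re.compile(r"^(?:en|wl|ww)p\d+s\d+(?:f\d+)?$")),  # e.g. enp0s3
    ("MAC address", re.compile(r"^(?:en|wl|ww)x[0-9a-f]{12}$")),     # e.g. enx001122334455
    ("kernel-assigned", re.compile(r"^(?:eth|wlan)\d+$")),           # e.g. eth0
]


def classify(ifname):
    """Return which naming scheme an interface name appears to follow."""
    for scheme, pattern in SCHEMES:
        if pattern.match(ifname):
            return scheme
    return "unknown"


if __name__ == "__main__":
    for name in ("eth0", "ens18", "ens19"):
        print(name, "->", classify(name))
```

Configuration that hardcodes eth0/eth1 would need to either match on these new names or look interfaces up by another property, such as the MAC address.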

For info: it looks like the fix for #755 (closed), which is now deployed on pergamon, requires the version of monitoring-plugins-basic that is on Debian Stretch; previous versions do not have the --only-critical flag. So, until this is fixed, pending package upgrades on all workers have status "unknown". (Which isn't a big deal.)

marked the checklist item worker{01-16} as completed

After some thorough massaging, I've finished updating our preseeding configuration for Debian Stretch, and re-created the 16 local workers.

worker08.euwest.azure has been migrated (scrapped and recreated from scratch).

Wiki documentation about it has been updated.

The only gotcha I hit was the /etc/facter/facts.d/location.txt file that we need to install ourselves for puppet to be satisfied.
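
For reference, files with a .txt extension under /etc/facter/facts.d/ are Facter external facts, written as one key=value pair per line. Below is a minimal sketch of generating such a file. The fact name `location` and the value `azure_euwest` are hypothetical placeholders (the actual contents of our location.txt are not shown here), and the example writes to a temporary directory rather than /etc.

```python
import tempfile
from pathlib import Path


def render_facts(facts):
    """Render a dict as Facter external facts in key=value text format."""
    lines = []
    for key, value in facts.items():
        text = str(value)
        # The format is line-based and splits on the first '=', so keys
        # must not contain '=' and neither side may contain a newline.
        if "=" in key or "\n" in key or "\n" in text:
            raise ValueError(f"cannot encode fact {key!r}")
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


def install_facts(facts, facts_dir, filename="location.txt"):
    """Write the facts file into facts_dir and return its path."""
    directory = Path(facts_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename
    target.write_text(render_facts(facts))
    return target


if __name__ == "__main__":
    # Hypothetical fact name and value, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        path = install_facts({"location": "azure_euwest"}, tmp)
        print(path.read_text(), end="")
```

On a real machine, facts_dir would be /etc/facter/facts.d and the script would need to run as root.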

marked the checklist item worker{01-08}.euwest.azure as completed

removed state:wip label

closed

mentioned in commit swh/devel/snippets@bab6b833

mentioned in commit swh/devel/snippets@f1534e28

mentioned in issue swh/meta#850 (closed)
