I had to use a system shell to manually mount the lvmetad socket inside the alternate root, because the vg configuration backup was hanging. Gotta love switches running Debian, I guess.
The bifur upgrade hung in the same place, but I noticed that the lvm2 postinst runs vgcfgbackup || :, so I just killed the hung vgcfgbackup process.
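For reference, the workaround boiled down to roughly the following (a sketch from memory; /run/lvm is the usual Debian location of the lvmetad socket, and /mnt/root stands in for wherever the alternate root is mounted):

    # expose the host's lvmetad socket inside the alternate root
    mount --bind /run/lvm /mnt/root/run/lvm
    # since the lvm2 postinst guards the call with "vgcfgbackup || :",
    # killing a still-hung backup lets the upgrade carry on
    pkill -f vgcfgbackup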
The basic configuration of the dori and nori switches is done; admin connections are possible via their IPMI addresses.
The user swh is configured on both switches.
A first MLAG configuration attempt was made but is not yet working.
The C06 nodes' iLOs are configured with the IPs declared in the inventory.
There is an additional server in the rack that was apparently ordered as a bastion. It's currently named gloin003, matching its label in the rack, but it doesn't have a frontend configuration, so we'll see how to rename it later.
For the record, the iLO IPs were already configured via a DHCP server running on angrenost, so we were just missing the password to be able to access the servers right after their installation.
bifur and bofur have been configured via ansible (a rough sketch of the resulting switch config follows the list):
VLT setup with 1 x 100G cross-connect on vlt domain 1
port-channels have been configured for link aggregation, each with a matching vlt-port-channel id
1-2 for gloin001/002
11-12 for dwalin001/002
21-40 for balin001-020
port-channel 999 for interconnection with nori & dori (manually configured)
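For the record, that config should look roughly like this on each Dell switch (a from-memory sketch assuming Dell OS10, which the VLT terminology and the eth 1/1/x port naming suggest; the discovery-interface and the server port-channel shown are illustrative, the ansible roles hold the real values):

    ! VLT domain, using the 100G cross-connect as discovery interface
    vlt-domain 1
     discovery-interface ethernet1/1/25
    ! one port-channel per server, with a matching vlt-port-channel id
    interface port-channel1
     description gloin001
     vlt-port-channel 1
     no shutdown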
nori and dori have been configured manually (a rough sketch of the Onyx config follows the list):
mlag setup with 2 x 100G cross connect on port channel 1000 (peer ip addresses 10.25.254.1/30 + 10.25.254.2/30; mlag-vip on the management network 10.25.3.246/24)
mlag port channel 999 configured for interconnect with Dell switches
1 x 100G link between bifur (eth 1/1/26) and nori (eth 1/20)
2 x 25G backup links between bofur (eth 1/1/21-1/1/22) and dori (eth 1/16-1/17), to be upgraded to 1x100G when a 3m-long QSFP28 DAC is acquired
(the dori eth 1/16 link is reporting "bad signal integrity"; I might have kinked the DAC when pulling it through the floor tiles. It's only a temporary fallback link, so it's probably not a problem, but we should monitor it once we start installing the OSDs)
mlag port channel 13 has been configured for dwalin003 (using port 8 on both switches)
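The nori/dori side boils down to roughly this (a from-memory Onyx sketch; the IPL member ports, the IPL VLAN id and the mlag-vip name are illustrative, while the addresses and port-channel numbers are the ones listed above; nori shown, dori mirrors it with the peer addresses swapped):

    ## MLAG + IPL over the 2 x 100G cross-connect
    protocol mlag
    lacp
    interface port-channel 1000
    interface ethernet 1/31 channel-group 1000 mode active
    interface ethernet 1/32 channel-group 1000 mode active
    vlan 4000
    interface vlan 4000 ip address 10.25.254.1 255.255.255.252
    interface port-channel 1000 ipl 1
    interface vlan 4000 ipl 1 peer-address 10.25.254.2
    mlag-vip swh-mlag-vip ip 10.25.3.246 /24 force
    no mlag shutdown
    ## mlag port-channel 999 towards the Dell switches (nori: eth 1/20)
    interface mlag-port-channel 999
    interface ethernet 1/20 mlag-channel-group 999 mode active
    interface mlag-port-channel 999 no shutdown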
The gloin003 server has been connected to port 1 on both nori & dori with a pair of leftover 10 Gbps DACs.
Network equipment shopping list:
2 x 3m QSFP28 (100 Gbps) DAC (interconnect bifur/bofur + bofur/dori) - we need 3m as we have to skip over one rack
2 x 50cm or 1m SFP28 (25 Gbps) DAC (gloin003 - {dori,nori})
2 x 3m SFP28 (25 Gbps) DAC (extra for angrenost (?) + 1 spare + 1 spare recovered from the bofur-dori links)
1 x additional 50cm QSFP28 DAC (extra cross-connect bifur - bofur if we have a spare QSFP28 port on both switches, or as a spare otherwise)
Is 50cm long enough to connect the switches in 2 different racks?
The switches should have enough free ports to move the 4 servers off one of the QSFP28 -> 4x SFP28 breakout adapters (freeing a QSFP28 port).
We can order a couple of additional SFP28 DACs to have some spares, just in case ;)
Actually, we need to jump below one rack, so 2 x 3m it is (we can move the 2m bifur/nori link back between bifur and bofur, and use the 3m cables to go between the Dell and HP switches with more slack).
The ports of the C06 OSDs on nori & dori are configured with untagged vlan 1 and no aggregation, to facilitate the install process (it's not clear that mlag supports a fallback when no lacp negotiation happens, so it's much easier that way). They should be moved to mlag port-channels once the servers are installed (a rough sketch of that move is below).
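When the time comes, the move per OSD is roughly this (an Onyx sketch with illustrative port and mlag-port-channel numbers, to be applied on both switches):

    ## during install: plain access port, untagged vlan 1, no aggregation
    interface ethernet 1/2 switchport access vlan 1
    ## once installed: move the port into an mlag port-channel instead
    interface mlag-port-channel 50
    interface ethernet 1/2 mlag-channel-group 50 mode active
    interface mlag-port-channel 50 no shutdown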
Upgrading nori & dori successively (both switches upgraded twice in a row, because of the onyx upgrade process) didn't affect connectivity from gloin003 to the rest of the cluster.