In August, we ordered a new storage enclosure from supermicro and a new server from dell, to extend the main storage and replace the server hosting it. After a long delay building the enclosure and testing disks on the supermicro side, the deliveries were all completed in the first week of November.
We therefore planned the installation in Rocquencourt on 2020-11-18.
As the rack with the existing storage enclosures was almost full, we decided to go ahead with the decommissioning of orsay, and to move uffizi to the almost empty rack next to it to make space. The planned operations were:
- decommission orsay and the attached storage array
- remove uffizi from the rack
- retrieve the HBAs and NVMe storage from uffizi to move them to the new server
- rack the new server in place of uffizi
- install the new storage enclosure, connect it to the new server, and chain the other supermicro enclosure to it
- reinstall uffizi in the other rack
We've also decided to use our presence in the DC to perform the following pending operations:
- replace the RAM in ceph-mon1 with the RAM bought over the summer
- reinstall ceph-mon1 from scratch to prepare it for becoming a hypervisor
- add the RAM retrieved from ceph-mon1 to db1.staging and storage1.staging
The disassembly of orsay and uffizi went without issue.
We had two relatively new 4TB SSDs in orsay, which were moved to uffizi.
We picked saam as name for the new server, and @vsellier prepared its inventory entry and puppet configuration.
saam had exactly the right number of free PCIe slots to receive the add-on cards:
- 1 SAS HBA for the Dell MD3460 array (full height, half length)
- 2 SAS HBAs for the supermicro arrays (full height, half length)
- 1 Intel Optane SSD DC P4800X card (half height; a full height bracket is available in the SWH "stock" in the DSI office)
- 2 M.2 NVMe to PCIe adapters (half height cards with full height brackets; the half height brackets are in the SWH stock in the DSI office)
The server came with two PCIe cards already fitted: the card for the boot storage M.2 SSDs, and an HBA for the front SAS.
We installed the rack mount rails for saam and for the new storage array, and rack-mounted both.
We also reinstalled the rack mount hardware for uffizi in the other cabinet and racked it back in.
I went on to cable saam and the new storage array while @vsellier cabled uffizi.
saam cabling:
- attached the SAS cables from the MD3460 to its HBA
- attached the SAS cables left over from the old supermicro array to the new supermicro array
- attached the new SAS cables from the new supermicro array to both HBAs on saam
- reused the iDRAC and SFP+ cables from uffizi
uffizi cabling:
- attached the new SFP+ cabling to the top-of-rack switch
- attached the new iDRAC cable to the top-of-rack switch
After lunch, we finished power cabling and went on to set up an OS on saam.
The boot hung while initializing devices, before reaching system setup. After trying again, on a hunch, I disconnected all the SAS cables from the back of the server. This allowed us to access system setup, where we:
- set the iDRAC IP address to the next IP in our range, and set the iDRAC password to the default Dell value
- disabled the boot ROM on the PCIe slots for the external HBAs
Before rebooting, I plugged the SAS cables back in and popped in a Debian installer USB stick.
The system booted to the Debian installer, which failed to find the USB stick (I guess it's not happy when the USB drive ends up as /dev/sdrq1, yes, two letters). I unplugged the SAS cables again and we could finally install Debian.
After the Debian install, I plugged the SAS cables back in and rebooted. At first the system failed to boot: I had disabled the boot ROM on all PCIe cards, including the one carrying the boot storage; after re-enabling that one, Debian booted.
The Dell MD3460 virtual devices were detected on boot, but the other SAS enclosures would time out after enumerating some of the disks. After rebooting a few times, no dice.
I then opened the manual for the SAS enclosures again, and noticed that it is very specific about how multiple HBAs and multiple chained enclosures must be wired. After removing all the SAS cables and reconnecting them one at a time, following the manufacturer-mandated wiring scheme, all the disks showed up on the system properly.
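To double-check the topology after that kind of re-cabling, commands along these lines are enough (a generic sketch, not a transcript of what we ran):

# Each external disk should appear once as a multipath device, with one path
# per HBA port it is wired to.
multipath -ll

# Quick sanity check on the number of SCSI disks enumerated by the kernel.
lsblk -S | wc -l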
Once that was done, we could import the ZFS pools.
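The import itself is the usual zpool sequence, roughly (the pool name below is a placeholder):

# List the pools visible on the newly attached devices.
zpool import

# Import a pool by name, scanning only the multipath devices.
zpool import -d /dev/mapper <poolname>

# Check pool health once imported.
zpool status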
We also validated iDRAC access to saam, and recorded the credentials to the password store.
uffizi was cabled again and rebooted. The network setup is pending actions from the DSI network admins.
Once the main operations were done, we went on to the bonus track:
We've replaced the RAM in ceph-mon1. @vsellier reinstalled a plain Debian so we can recycle it as a hypervisor. We validated that IPMI access was still OK after the re-racking.
We've expanded the RAM in db1.staging and storage1.staging. IPMI access was validated too.
We still have issues with boot ordering on saam, but we had done enough physical setup to be able to handle them remotely, so we decided that the physical part of the operation was complete and that we could follow up remotely.
Takeaways from the physical setup:
- the rack mount arm of the supermicro array is a bit too small for 8 micro-SAS cables (16 conductors) + IPMI network + power; the swinging arm also interferes with the (pretty bulky) SAS connectors on the enclosure controllers, preventing the array from sliding fully into the rack.
- the rack mount rails of the supermicro array protrude far enough at the back that they block some PDU ports.
- multipath SAS cabling is very sensitive and needs to be done carefully, which is really hard with the bulky connectors and the very limited space at the back of the rack.
- ceph-mon1 is not on pull-out rails, so it needs to be supported while being taken out of the rack.
- none of the three supermicro servers have cable management arms, so you need to pull some cable slack at the back of the rack before sliding them out.
The boot times out because of a race condition between systemd-udev-settle.service and multipathd.service: udev calls multipath -c for every drive, but that requires multipathd to be running, which systemd doesn't start before systemd-udev-settle returns.
It turns out we already had that issue on uffizi, and solved it by:
- overriding the multipathd.service unit so it is ordered before systemd-udev-settle.service instead of after it
- adding an override to zfs-import-cache.service so it is ordered after multipathd.socket
Both of these are implemented in infra/puppet/puppet-swh-site!262.
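Concretely, such overrides boil down to systemd drop-ins along these lines (a sketch with made-up file names; the authoritative version is in the merge request above):

# Start multipathd before udev settles, so the multipath -c calls made from
# udev rules can be answered.
mkdir -p /etc/systemd/system/multipathd.service.d
cat > /etc/systemd/system/multipathd.service.d/order.conf <<'EOF'
[Unit]
Before=systemd-udev-settle.service
EOF

# Only import the ZFS pool cache once multipathd's socket is up.
mkdir -p /etc/systemd/system/zfs-import-cache.service.d
cat > /etc/systemd/system/zfs-import-cache.service.d/order.conf <<'EOF'
[Unit]
After=multipathd.socket
EOF

systemctl daemon-reload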
There was an issue with the indexer storage package missing the swh.indexer setuptools metadata. I've moved the metadata to the swh.indexer.storage package in rDCIDX3809bb03.
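A quick way to check that the metadata is visible on a host (generic check; it assumes the distribution is installed under the name swh.indexer):

# Confirm the setuptools metadata for swh.indexer is visible to pkg_resources.
python3 -c "import pkg_resources as pr; d = pr.get_distribution('swh.indexer'); print(d.project_name, d.version, d.location)"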
infra/puppet/puppet-swh-site!73 carries over the local storage/objstorage configuration from uffizi to saam.
infra/puppet/puppet-swh-site!74 and infra/puppet/puppet-swh-site!75 add all the local mountpoints needed for this local configuration to work.
A test import of the puppet-swh-site repository was run:
root@worker01:/etc/softwareheritage# sudo -u swhworker SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/SoftwareHeritage/puppet-swh-site
The first try returned this exception:
swh.core.api.RemoteException: <RemoteException 500 ValueError: ["Storage class azure-prefixed is not available: No module named 'swh.objstorage.backends.azure'"]>
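The error points at a missing Python module on the objstorage side; a quick way to confirm it on the storage host (generic commands, not a transcript of what we ran):

# Is the azure backend importable at all?
python3 -c "import swh.objstorage.backends.azure" && echo "azure backend importable"

# Which Debian packages ship swh.objstorage modules?
dpkg -l | grep swh.objstorage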
The last commit of the diff is indeed present [1] and the file is correctly stored on the saam storage:
softwareheritage=> select * from content where sha1_git='\x1781d66d33737d1e422cd54add562f7f04f16b30';
-[ RECORD 1 ]------------------------------------------------------------------
sha1       | \xa12d17353c310908068110a859f9b54e618c775a
sha1_git   | \x1781d66d33737d1e422cd54add562f7f04f16b30
sha256     | \xc927814db44f633cf72ac735cc740d950e3bfe9d75dd8409564708759203f03d
length     | 235
ctime      | 2020-11-20 09:35:43.992683+00
status     | visible
object_id  | 9177836614
blake2s256 | \x42a45c33b44a8ebb6011dc3679dbc6a90b389cb3ef6b92f7651988ed72f13a93
root@saam:/srv/softwareheritage/objects/a1/a12d1# ls -alh /srv/softwareheritage/objects/a1/a12d1/a12d17353c310908068110a859f9b54e618c775a
-rw-r--r-- 1 swhstorage swhstorage 235 Nov 20 09:35 /srv/softwareheritage/objects/a1/a12d1/a12d17353c310908068110a859f9b54e618c775a
root@saam:/srv/softwareheritage/objects/a1/a12d1# cat /srv/softwareheritage/objects/a1/a12d1/a12d17353c310908068110a859f9b54e618c775a
class role::swh_storage_baremetal inherits role::swh_storage {
  include profile::dar::server
  include profile::megacli
  include profile::multipath
  include profile::mountpoints
  include ::profile::swh::deploy::objstorage_cloud
}
mercurial
swhworker@worker01:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_mercurial.yml swh loader run mercurial https://foss.heptapod.net/fluiddyn/fluidfft
INFO:swh.loader.mercurial.Bundle20Loader:Load origin 'https://foss.heptapod.net/fluiddyn/fluidfft' with type 'hg'
{'status': 'eventful'}
swhworker@worker01:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_mercurial.yml swh loader run mercurial https://hg.mozilla.org/projects/nss
INFO:swh.loader.mercurial.Bundle20Loader:Load origin 'https://hg.mozilla.org/projects/nss' with type 'hg'
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_15_5_BETA2 (hg changeset: e5d3ec1d9a35f7cac554543d52775092de9f6a01). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_15_5_BETA2 (hg changeset: 0000000000000000000000000000000000000000). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_18_RTM (hg changeset: 0000000000000000000000000000000000000000). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_18_RTM (hg changeset: 0000000000000000000000000000000000000000). Skipping
WARNING:swh.loader.mercurial.Bundle20Loader:No matching revision for tag NSS_3_24_BETA3 (hg changeset: 0000000000000000000000000000000000000000). Skipping
{'status': 'eventful'}
svn
root@worker01:~# SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_svn.yml swh loader run svn svn://svn.appwork.org/utils
INFO:swh.loader.svn.SvnLoader:Load origin 'svn://svn.appwork.org/utils' with type 'svn'
INFO:swh.loader.svn.SvnLoader:Processing revisions [3428-3436] for {'swh-origin': 'svn://svn.appwork.org/utils', 'remote_url': 'svn://svn.appwork.org/utils', 'local_url': b'/tmp/swh.loader.svn.dojsubkd-890577/utils', 'uuid': b'21714237-3853-44ef-a1f0-ef8f03a7d1fe'}
{'status': 'eventful'}
npm
root@worker01:~# SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_npm.yml swh loader run npm https://www.npmjs.com/package/bootstrap-vue
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
ERROR:swh.loader.package.loader:Failed loading branch releases/2.18.0 for https://www.npmjs.com/package/bootstrap-vue
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 333, in call
    result = fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/retry.py", line 117, in raw_extrinsic_metadata_add
    return self.storage.raw_extrinsic_metadata_add(metadata)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 352, in _decode_response
    self.raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 29, in raise_for_status
    super().raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 342, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 TypeError: ["__init__() got an unexpected keyword argument 'id'"]>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 424, in load
    res = self._load_revision(p_info, origin)
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 577, in _load_revision
    self._load_metadata_objects([original_artifact_metadata])
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 788, in _load_metadata_objects
    self.storage.raw_extrinsic_metadata_add(metadata_objects)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 241, in wrapped_f
    return self.call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 330, in call
    start_time=start_time)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 298, in iter
    six.raise_from(retry_exc, fut.exception())
  File "<string>", line 3, in raise_from
tenacity.RetryError: RetryError[<Future at 0x7f6fe4e98cf8 state=finished raised RemoteException>]
deposit
Same issue as npm:
swhworker@worker01:~$ SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_deposit.yml swh loader run deposit https://www.softwareheritage.org/check-deposit-2020-11-17T20:48:13.534821 deposit_id=1114
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
WARNING:swh.storage.retry:Retry adding a batch
ERROR:swh.loader.package.loader:Failed loading branch HEAD for https://www.softwareheritage.org/check-deposit-2020-11-17T20:48:13.534821
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 333, in call
    result = fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/retry.py", line 117, in raw_extrinsic_metadata_add
    return self.storage.raw_extrinsic_metadata_add(metadata)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 352, in _decode_response
    self.raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 29, in raise_for_status
    super().raise_for_status(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 342, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 TypeError: ["__init__() got an unexpected keyword argument 'id'"]>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 424, in load
    res = self._load_revision(p_info, origin)
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 577, in _load_revision
    self._load_metadata_objects([original_artifact_metadata])
  File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 788, in _load_metadata_objects
    self.storage.raw_extrinsic_metadata_add(metadata_objects)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 241, in wrapped_f
    return self.call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 330, in call
    start_time=start_time)
  File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 298, in iter
    six.raise_from(retry_exc, fut.exception())
  File "<string>", line 3, in raise_from
tenacity.RetryError: RetryError[<Future at 0x7fdecca288d0 state=finished raised RemoteException>]