Install new bare metal hypervisor (chaillot)
Installation procedure: https://docs.softwareheritage.org/sysadm/server-architecture/howto-install-new-physical-server.html
Inventory: https://inventory.internal.admin.swh.network/dcim/devices/291/
Environment: production
Summary:
- Management address (DNS): 128.93.134.54
- VLAN configuration: VLAN440
- Internal IP(s): 192.168.100.35
- Internal DNS name(s): chaillot.internal.softwareheritage.org
Tasks:
- Declare the servers in the inventory
- Add the management info in the credential store (root and idrac/ilo access)
- Check out of band access is ok
- Install the OS
- (if needed) Add puppet configuration
- [for kube nodes] Register node in rancher cluster
- [for kube nodes] Create the required kubernetes labels (e.g. swh/journal_client=true, ...)
- Create swap space at least the size of the machine's memory
- Update firewall rules with the new machine's ip (e.g. swh_$environment_kube_workers, ...)
- (other actions if needed, drop unneeded actions)
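The swap task above can be sketched as follows (assumption: file-based swap at a hypothetical `/swapfile` path; a dedicated partition works the same way). The commands are printed rather than executed so they can be reviewed before running them as root:

```shell
# Compute a swap size equal to the machine's RAM from /proc/meminfo,
# then print the commands that would create and enable the swap file.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "fallocate -l ${mem_kb}K /swapfile"
echo "chmod 600 /swapfile"
echo "mkswap /swapfile"
echo "swapon /swapfile"
echo "echo '/swapfile none swap sw 0 0' >> /etc/fstab"
```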
- Antoine R. Dumont added the activity::Deployment label
- Antoine R. Dumont changed the description
- Antoine R. Dumont changed the description
- Antoine R. Dumont marked the checklist item Add the management info in the credential store (root and idrac/ilo access) as completed
- Author (Owner)

  First install tryout failed: the network interfaces are named differently on this machine (it's now possible to declare such names in our ipxe repository).

  ```
  2: n: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000
      link/ether 6c:92:cf:b9:31:10 brd ff:ff:ff:ff:ff:ff
  3: ens22f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000
      link/ether 6c:92:cf:b9:31:11 brd ff:ff:ff:ff:ff:ff
  ```
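To collect the name-to-MAC mapping that needs declaring in the ipxe repository, a plain sysfs listing is enough (a sketch, no extra tooling assumed):

```shell
# Print each network interface with its MAC address; the output pairs
# can be copied into the ipxe interface-name declaration.
for dev in /sys/class/net/*; do
    printf '%s %s\n' "${dev##*/}" "$(cat "$dev/address")"
done
```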
- Author (Owner), resolved by Antoine R. Dumont

  After that adaptation the network detection passes, but the installer now gets stuck in the disk partitioning view: it sees the disks but fails to choose one.
  Maybe because each disk appears under two similar names (one with an `_1` suffix)?

  ```
  ~ # ls /dev/disk/by-id
  nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q
  nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q_1
  nvme-VO001920KYDMT_S70RNN0X400826
  nvme-VO001920KYDMT_S70RNN0X400826_1
  nvme-VO001920KYDMT_S70RNN0X400909
  nvme-VO001920KYDMT_S70RNN0X400909_1
  nvme-VO001920KYDMT_S70RNN0X400910
  nvme-VO001920KYDMT_S70RNN0X400910_1
  nvme-VO001920KYDMT_S70RNN0X402639
  nvme-VO001920KYDMT_S70RNN0X402639_1
  nvme-VO001920KYDMT_S70RNN0X402848
  nvme-VO001920KYDMT_S70RNN0X402848_1
  nvme-VO001920KYDMT_S70RNN0X402854
  nvme-VO001920KYDMT_S70RNN0X402854_1
  nvme-eui.0050434ef1e70001
  nvme-eui.37305230584008260025384e00000002
  nvme-eui.37305230584009090025384e00000002
  nvme-eui.37305230584009100025384e00000002
  nvme-eui.37305230584026390025384e00000002
  nvme-eui.37305230584028480025384e00000002
  nvme-eui.37305230584028540025384e00000002
  ~ # ls /dev/disk/by-id/*_Boot_Controller_*
  /dev/disk/by-id/nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q
  /dev/disk/by-id/nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q_1
  ```
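One way to check whether the `_1`-suffixed names are just extra links to the same block devices is to resolve each symlink. A sketch (`list_disk_links` is a hypothetical helper name, not an existing tool):

```shell
# Print every by-id entry alongside the device node it resolves to,
# so duplicate links (e.g. the "_1" variants) become obvious.
list_disk_links() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -e "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
list_disk_links
```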
- Antoine R. Dumont mentioned in commit ipxe@b5cdf95a
- Antoine R. Dumont mentioned in commit ipxe@bb8dd0e7
- Antoine R. Dumont marked the checklist item Install the OS as completed
- Author (Owner)

  [x] Install the OS

  ```
  root@chaillot:~# uname -a
  Linux chaillot 6.1.0-30-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.124-1 (2025-01-12) x86_64 GNU/Linux
  root@chaillot:~# lsb_release -c
  No LSB modules are available.
  Codename: bookworm
  root@chaillot:~# uptime
   11:19:32 up 5 min,  1 user,  load average: 0.00, 0.05, 0.02
  root@chaillot:~# cat /etc/hosts
  127.0.0.1       localhost
  ::1             localhost ip6-localhost ip6-loopback
  ff02::1         ip6-allnodes
  ff02::2         ip6-allrouters
  127.0.1.1       chaillot chaillot.internal.softwareheritage.org
  ```
- Author (Owner)

  I've done a basic Puppet run so we can connect to the machine. Since pergamon got upgraded, the certificate exchange triggered by `puppet agent --test` worked immediately.
- Antoine R. Dumont marked the checklist item Declare the servers in the inventory as completed
- Nicolas Dandrimont assigned to @olasd
- Nicolas Dandrimont mentioned in commit swh/infra/puppet/puppet-swh-site@005e2b15
- Nicolas Dandrimont mentioned in commit swh/infra/puppet/puppet-swh-site@005e2b15
- Owner
Next step is integrating the machine in proxmox.
Following the instructions on: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_12_Bookworm
- Owner
- edited /etc/hosts to resolve to the actual node IP address
- added chaillot's IP to the "Ceph on proxmox" firewall alias
- installed proxmox kernel, rebooted
After installing the proxmox kernel, unbound failed to start due to an apparmor issue (apparmor was preventing it from binding its control socket).
After investigating, I've figured out the following:
- Debian 12 ships Apparmor 3 with a "vanilla kernel" profile
- The Proxmox kernel is based on the Ubuntu 24.04 (6.8) kernel, with the apparmor 4 patches applied
- The Apparmor 3 userspace (built for a vanilla kernel) is incompatible with the Apparmor 4 kernel patches, specifically because those patches introduce unix socket enforcement
- Backporting the Apparmor 4 userspace from Ubuntu 24.04 (straight build with a new changelog entry for bookworm) fixes the issue
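The mismatch described above can be checked mechanically. A sketch using `sort -V` (3.0.8 is the apparmor userspace version Debian 12 ships; 4.0 stands for the Ubuntu 24.04 series mentioned above):

```shell
# Succeeds when the first version sorts strictly before the second;
# used here to flag an apparmor userspace older than the 4.x series.
version_lt() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

installed=3.0.8   # apparmor userspace in Debian 12 (bookworm)
if version_lt "$installed" 4.0; then
    echo "apparmor userspace $installed predates the apparmor 4 kernel patches"
fi
```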
I'm looking into getting in touch with Proxmox to report this issue; I'm surprised none of their paying customers has reported it yet...
- Owner
Reported upstream: https://bugzilla.proxmox.com/show_bug.cgi?id=6101
- Owner
After chatting with upstream, it seems that we hit this because we rebooted to the new kernel before pve-lxc (which fixes the apparmor config) was installed. Reverting to apparmor 3.x once that package was installed made things work fine.
(Which explains why the issue doesn't appear on nodes that were upgraded from PVE 7)
- Owner
Updated the networking config to use /etc/network/interfaces rather than systemd-networkd (for compatibility with the proxmox UI)
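For reference, a minimal `/etc/network/interfaces` of the shape the Proxmox UI expects. The physical interface name, bridge name, and gateway below are illustrative assumptions; only the address matches chaillot's internal IP from the summary:

```
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.100.35/24
    gateway 192.168.100.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
```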
- Owner
Added the node to the proxmox cluster, following instructions on https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_join_node_to_cluster (using mucem as bootstrap node)
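The join itself boils down to a single `pvecm` invocation on the new node. The commands are printed here for review rather than executed (assumption: mucem resolves under the same internal domain as chaillot):

```shell
# Print the cluster-join commands to run as root on chaillot.
# pvecm is Proxmox's cluster manager CLI; mucem is the bootstrap node.
cat <<'EOF'
pvecm add mucem.internal.softwareheritage.org
pvecm status
EOF
```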
- Owner
- Owner
Installed Ceph via the GUI (selecting the no-subscription repository and Quincy, the version currently running in the rest of the cluster)
- Owner
Configuring ceph on chaillot:
- add 1 mon instance on chaillot
- add 1 mgr instance on chaillot
- add 4 mds instances on chaillot (`chaillot{,-1,-2,-3}`)
- decommission mon and mgr from hypervisor3, update config for rook on kubernetes clusters
- added one osd on one of the nvme drives
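A hedged sketch of the steps above using Proxmox's `pveceph` CLI, printed for review rather than executed. The OSD device path is a placeholder, and the exact flags for naming the extra mds instances should be double-checked against the `pveceph` manual:

```shell
# Print the pveceph commands corresponding to the list above.
cat <<'EOF'
pveceph mon create
pveceph mgr create
pveceph mds create
pveceph osd create /dev/nvme0n1
EOF
```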