Install new bare metal hypervisor (chaillot)
Installation procedure: https://docs.softwareheritage.org/sysadm/server-architecture/howto-install-new-physical-server.html
Inventory: https://inventory.internal.admin.swh.network/dcim/devices/291/
Environment: production
Summary:
- Management address (DNS): 128.93.134.54
- VLAN configuration: VLAN440
- Internal IP(s): 192.168.100.35
- Internal DNS name(s): chaillot.internal.softwareheritage.org
Tasks:
- Declare the servers in the inventory
- Add the management info in the credential store (root and idrac/ilo access)
- Check out of band access is ok
- Install the OS
- (if needed) Add puppet configuration
- [for kube nodes] Register node in rancher cluster
- [for kube nodes] Create the required kubernetes labels (e.g. swh/journal_client=true, ...)
- Create swap space at least the size of the machine's memory
- Update firewall rules with the new machine's ip (e.g. swh_$environment_kube_workers, ...)
- (other actions if needed, drop unneeded actions)
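The swap task above can be sketched as follows (assumption: file-based swap at a hypothetical `/swapfile` path; a dedicated partition works the same way). The commands are printed rather than executed so they can be reviewed before running them as root:

```shell
# Compute a swap size equal to the machine's RAM from /proc/meminfo,
# then print the commands that would create and enable the swap file.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "fallocate -l ${mem_kb}K /swapfile"
echo "chmod 600 /swapfile"
echo "mkswap /swapfile"
echo "swapon /swapfile"
echo "echo '/swapfile none swap sw 0 0' >> /etc/fstab"
```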
- Antoine R. Dumont added the activity::Deployment label
- Antoine R. Dumont changed the description
- Antoine R. Dumont changed the description
- Antoine R. Dumont marked the checklist item Add the management info in the credential store (root and idrac/ilo access) as completed
- Author (Owner)

  First install tryout failed: the network interfaces are named differently on this machine (it's now possible to declare such names in our ipxe repository).

  ```
  2: n: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000
      link/ether 6c:92:cf:b9:31:10 brd ff:ff:ff:ff:ff:ff
  3: ens22f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000
      link/ether 6c:92:cf:b9:31:11 brd ff:ff:ff:ff:ff:ff
  ```
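To collect the name-to-MAC mapping that needs declaring in the ipxe repository, a plain sysfs listing is enough (a sketch, no extra tooling assumed):

```shell
# Print each network interface with its MAC address; the output pairs
# can be copied into the ipxe interface-name declaration.
for dev in /sys/class/net/*; do
    printf '%s %s\n' "${dev##*/}" "$(cat "$dev/address")"
done
```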
- Author (Owner), resolved by Antoine R. Dumont

  After that adaptation the network detection passes, but the installer now gets stuck in the disk partitioning view: it sees the disks but fails to choose one.
  Maybe because each disk appears under two similar names (one with an `_1` suffix)?

  ```
  ~ # ls /dev/disk/by-id
  nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q
  nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q_1
  nvme-VO001920KYDMT_S70RNN0X400826
  nvme-VO001920KYDMT_S70RNN0X400826_1
  nvme-VO001920KYDMT_S70RNN0X400909
  nvme-VO001920KYDMT_S70RNN0X400909_1
  nvme-VO001920KYDMT_S70RNN0X400910
  nvme-VO001920KYDMT_S70RNN0X400910_1
  nvme-VO001920KYDMT_S70RNN0X402639
  nvme-VO001920KYDMT_S70RNN0X402639_1
  nvme-VO001920KYDMT_S70RNN0X402848
  nvme-VO001920KYDMT_S70RNN0X402848_1
  nvme-VO001920KYDMT_S70RNN0X402854
  nvme-VO001920KYDMT_S70RNN0X402854_1
  nvme-eui.0050434ef1e70001
  nvme-eui.37305230584008260025384e00000002
  nvme-eui.37305230584009090025384e00000002
  nvme-eui.37305230584009100025384e00000002
  nvme-eui.37305230584026390025384e00000002
  nvme-eui.37305230584028480025384e00000002
  nvme-eui.37305230584028540025384e00000002
  ~ # ls /dev/disk/by-id/*_Boot_Controller_*
  /dev/disk/by-id/nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q
  /dev/disk/by-id/nvme-HPE_NS204i-u_Gen11_Boot_Controller_PXTYE0ARHK429Q_1
  ```
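One way to check whether the `_1`-suffixed names are just extra links to the same block devices is to resolve each symlink. A sketch (`list_disk_links` is a hypothetical helper name, not an existing tool):

```shell
# Print every by-id entry alongside the device node it resolves to,
# so duplicate links (e.g. the "_1" variants) become obvious.
list_disk_links() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -e "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
list_disk_links
```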
- Antoine R. Dumont mentioned in commit ipxe@b5cdf95a
- Antoine R. Dumont mentioned in commit ipxe@bb8dd0e7
- Antoine R. Dumont marked the checklist item Install the OS as completed
- Author (Owner)

  [x] Install the OS

  ```
  root@chaillot:~# uname -a
  Linux chaillot 6.1.0-30-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.124-1 (2025-01-12) x86_64 GNU/Linux
  root@chaillot:~# lsb_release -c
  No LSB modules are available.
  Codename: bookworm
  root@chaillot:~# uptime
   11:19:32 up 5 min,  1 user,  load average: 0.00, 0.05, 0.02
  root@chaillot:~# cat /etc/hosts
  127.0.0.1       localhost
  ::1             localhost ip6-localhost ip6-loopback
  ff02::1         ip6-allnodes
  ff02::2         ip6-allrouters
  127.0.1.1       chaillot chaillot.internal.softwareheritage.org
  ```
- Author (Owner)

  I've done a basic Puppet run so we can connect to the machine. Since pergamon got upgraded, the certificate exchange triggered by `puppet agent --test` worked immediately.
- Antoine R. Dumont marked the checklist item Declare the servers in the inventory as completed
- Nicolas Dandrimont assigned to @olasd
- Nicolas Dandrimont mentioned in commit swh/infra/puppet/puppet-swh-site@005e2b15
- Nicolas Dandrimont mentioned in commit swh/infra/puppet/puppet-swh-site@005e2b15
- Owner
Next step is integrating the machine in proxmox.
Following the instructions on: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_12_Bookworm
- Owner
- edited /etc/hosts to resolve to the actual node IP address
- added chaillot's IP to the "Ceph on proxmox" firewall alias
- installed proxmox kernel, rebooted
After installing the proxmox kernel, unbound failed to start due to an apparmor issue (apparmor was preventing it from binding its control socket).
After investigating, I've figured out the following:
- Debian 12 ships Apparmor 3 with a "vanilla kernel" profile
- The Proxmox kernel is based on the Ubuntu 24.04 (6.8) kernel, with the apparmor 4 patches applied
- The Apparmor 3 userspace (built for a vanilla kernel) is incompatible with the Apparmor 4 kernel patches, specifically because those patches introduce unix socket enforcement
- Backporting the Apparmor 4 userspace from Ubuntu 24.04 (straight build with a new changelog entry for bookworm) fixes the issue
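The mismatch described above can be checked mechanically. A sketch using `sort -V` (3.0.8 is the apparmor userspace version Debian 12 ships; 4.0 stands for the Ubuntu 24.04 series mentioned above):

```shell
# Succeeds when the first version sorts strictly before the second;
# used here to flag an apparmor userspace older than the 4.x series.
version_lt() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

installed=3.0.8   # apparmor userspace in Debian 12 (bookworm)
if version_lt "$installed" 4.0; then
    echo "apparmor userspace $installed predates the apparmor 4 kernel patches"
fi
```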
I'm looking into getting in touch with Proxmox to report this issue; I'm surprised none of their paying customers has reported it yet...
- Owner
Reported upstream: https://bugzilla.proxmox.com/show_bug.cgi?id=6101
- Owner
After chatting with upstream, it seems that we hit this because we rebooted to the new kernel before pve-lxc (which fixes the apparmor config) was installed. Reverting to apparmor 3.x once that package was installed made things work fine.
(Which explains why the issue doesn't appear on nodes that were upgraded from PVE 7)
- Owner
Updated the networking config to use /etc/network/interfaces rather than systemd-networkd (for compatibility with the proxmox UI)
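For reference, a minimal `/etc/network/interfaces` of the shape the Proxmox UI expects. The physical interface name, bridge name, and gateway below are illustrative assumptions; only the address matches chaillot's internal IP from the summary:

```
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.100.35/24
    gateway 192.168.100.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
```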
- Owner
Added the node to the proxmox cluster, following instructions on https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_join_node_to_cluster (using mucem as bootstrap node)
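The join itself boils down to a single `pvecm` invocation on the new node. The commands are printed here for review rather than executed (assumption: mucem resolves under the same internal domain as chaillot):

```shell
# Print the cluster-join commands to run as root on chaillot.
# pvecm is Proxmox's cluster manager CLI; mucem is the bootstrap node.
cat <<'EOF'
pvecm add mucem.internal.softwareheritage.org
pvecm status
EOF
```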
- Owner
- Owner
Installed Ceph via the GUI (selecting the no-subscription repository and Quincy, the version currently running in the rest of the cluster)
- Owner
Configuring ceph on chaillot:
- add 1 mon instance on chaillot
- add 1 mgr instance on chaillot
- add 4 mds instances on chaillot (`chaillot{,-1,-2,-3}`)
- decommission mon and mgr from hypervisor3, update config for rook on kubernetes clusters
- added one osd on one of the nvme drives
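A hedged sketch of the steps above using Proxmox's `pveceph` CLI, printed for review rather than executed. The OSD device path is a placeholder, and the exact flags for naming the extra mds instances should be double-checked against the `pveceph` manual:

```shell
# Print the pveceph commands corresponding to the list above.
cat <<'EOF'
pveceph mon create
pveceph mgr create
pveceph mds create
pveceph osd create /dev/nvme0n1
EOF
```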