Upgrade bullseye (Debian 11) systems to bookworm (Debian 12)
We should upgrade our Debian 11 (bullseye) systems, whose regular Debian support ended last month (LTS support is expected to end in 2026), to Debian 12 (bookworm), which is supported until 2026 (LTS until 2028).
The most impactful system upgrades will be:
- Python 3.9 to Python 3.11
- Default JRE upgraded from 11 to 17 (17 is already available in bullseye as non-default, 11 is removed in bookworm)
- Puppet Agent 5.5 to 7.x
- Puppet Server migrated to the Clojure implementation (vs. the current Ruby/Passenger implementation)
- Proxmox VE 7 to 8 (PVE 7 is already out of support)
We have upgraded a couple of canary hosts, and so far it seems that:
- the base system works fine
- puppet 7 runs okay against the 5.5 server
- the puppet certificate generation must happen prior to the Debian dist-upgrade (as long as our puppet server [pergamon] still runs version 5.5)
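A hedged sketch of that pre-upgrade certificate renewal, using the Puppet 5 `puppet cert` CA interface (removed in Puppet 6+). The ssldir path is Debian's packaged default and the FQDN is a made-up example, so the sketch only prints the commands for review rather than executing anything:

```shell
#!/usr/bin/env bash
# Sketch only: prints the renewal steps instead of running them.
# `puppet cert` is the Puppet 5 CA subcommand; /var/lib/puppet/ssl is
# the Debian-packaged ssldir default. Both are assumptions to verify.
FQDN="${FQDN:-esnode1.internal.softwareheritage.org}"  # hypothetical example host

steps=(
  "pergamon#  puppet cert clean ${FQDN}"             # revoke/remove the old cert on the master
  "${FQDN}#  rm -rf /var/lib/puppet/ssl"             # drop the old agent SSL state on the node
  "${FQDN}#  puppet agent --test --waitforcert 60"   # request a fresh certificate
  "pergamon#  puppet cert sign ${FQDN}"              # sign it, unless autosign applies
)
printf '%s\n' "${steps[@]}"
```

Whether a full clean-and-resign is needed, or only a renewal, depends on the state of each node's certificate; the point is that it must happen while both agent and master still speak Puppet 5.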
Hosts/Clusters to upgrade:
1st step: all cluster-related service nodes can be dist-upgraded.
- logging elasticsearch
  - Upgrade to a more recent elasticsearch version
  - esnode[1-3,7]
  - Check puppet certificate expiry
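The "Check puppet certificate expiry" steps in this plan can be scripted. A minimal sketch, assuming `openssl` and GNU `date` are available; the certificate path in the usage comment is the Puppet 5 Debian layout and may differ per node:

```shell
# Print how many whole days remain before a certificate expires.
cert_days_left() {
  local cert="$1" end
  end=$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Usage on a node (path is an assumption for the Puppet 5 Debian layout):
#   cert_days_left "/var/lib/puppet/ssl/certs/$(hostname -f).pem"
```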
- proxmox cluster
  - mucem
  - pompidou
  - hypervisor3
  - branly
  - uffizi
  - Check puppet certificate expiry
- swh-search elasticsearch nodes
  - staging (single node, so downtime)
    - search-esnode0.internal.staging.swh.network
  - production
    - search-esnode4.internal.softwareheritage.org
    - search-esnode5.internal.softwareheritage.org
    - search-esnode6.internal.softwareheritage.org
- kafka
  - staging (single node, so downtime)
    - kafka1.internal.staging.swh.network
  - production
    - kafka1.internal.softwareheritage.org
    - kafka2.internal.softwareheritage.org
    - kafka3.internal.softwareheritage.org
    - kafka4.internal.softwareheritage.org
- cassandra
  - staging
    - cassandra1.internal.staging.swh.network
    - cassandra2.internal.staging.swh.network
    - cassandra3.internal.staging.swh.network
  - production
    - cassandra01.internal.softwareheritage.org
    - cassandra02.internal.softwareheritage.org
    - cassandra03.internal.softwareheritage.org
    - cassandra04.internal.softwareheritage.org
    - cassandra05.internal.softwareheritage.org
    - cassandra06.internal.softwareheritage.org
    - cassandra07.internal.softwareheritage.org
    - cassandra08.internal.softwareheritage.org
    - cassandra09.internal.softwareheritage.org
    - cassandra10.internal.softwareheritage.org
    - cassandra11.internal.softwareheritage.org
    - cassandra12.internal.softwareheritage.org
    - cassandra13.internal.softwareheritage.org
2nd step:
- rancher clusters
  - test-staging-rke2 (vms)
    - rancher-node-test-rke2-mgmt1
    - rancher-node-test-rke2-mgmt2
    - rancher-node-test-rke2-mgmt3
    - rancher-node-test-rke2-worker1
    - rancher-node-test-rke2-worker2
    - rancher-node-test-rke2-worker3
  - staging-rke2 [for stateful services, care must be taken regarding PVs that need replication]
    - db1
    - rancher-node-staging-rke2-metal01
    - rancher-node-staging-rke2-mgmt1
    - rancher-node-staging-rke2-mgmt2
    - rancher-node-staging-rke2-mgmt3
    - rancher-node-staging-rke2-worker1
    - rancher-node-staging-rke2-worker2
    - rancher-node-staging-rke2-worker3
    - rancher-node-staging-rke2-worker4
    - rancher-node-staging-rke2-worker5
    - rancher-node-staging-rke2-worker6
    - storage1
  - admin-rke2
    - rancher-node-admin-rke2-mgmt1
    - rancher-node-admin-rke2-mgmt2
    - rancher-node-admin-rke2-mgmt3
    - rancher-node-admin-rke2-node01
    - rancher-node-admin-rke2-node02
    - rancher-node-admin-rke2-node03
  - production-rke2
    - rancher-node-production-rke2-mgmt1
    - rancher-node-production-rke2-mgmt2
    - rancher-node-production-rke2-mgmt3
    - banco
    - saam
    - rancher-node-metal01
    - rancher-node-metal02
    - rancher-node-metal03
    - rancher-node-metal04
    - rancher-node-metal05 (will be reinstalled as rancher-node-highmem02 [1])
[1] #5543 (closed)
- Current state of the migration [2]
Plan (for each node):
- Ensure idrac/ilo access to the machine (all nodes in the clustered service)
- Check puppet certificate validity range for the fqdn, renew if needed
- Debian upgrade to bookworm
- puppet agent run
- Reboot the machine
- Checks
  - Reboot ok
  - Impacted services start without any issue
  - Fix issues if any (e.g. fix venv, adapt package versions, swap mix-up, ...)
  - apt autoremove
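The per-node steps above can be sketched as a script. All destructive commands are guarded behind `DRY_RUN` (on by default), so the script only echoes what it would do; the paths and flags are the standard Debian ones and should be double-checked per node:

```shell
#!/usr/bin/env bash
# Per-node bullseye -> bookworm sketch. With DRY_RUN=1 (the default),
# every command is only printed, never executed.
set -u
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = 0 ]; then "$@"; fi
}

# Switch apt sources from bullseye to bookworm. Note: bookworm also
# splits firmware into a new non-free-firmware component, which may
# need to be added to the sources by hand.
run sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
run apt update
run apt full-upgrade -y
# Converge configuration (puppet agent exits non-zero when it applied changes).
run puppet agent --test
run systemctl reboot
# After the reboot and the service checks above:
run apt autoremove -y
```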
Current status of the migration (at the point this issue was considered done):
|--------------------+--------------------------+--------------+-------------------------------------------|
| Env | Machine | Distribution | Purpose |
|--------------------+--------------------------+--------------+-------------------------------------------|
| production (metal) | | bullseye | |
| | albertina | // | postgresql |
| | maxxi | // | compression toolbox |
| | massmoca | // | postgresql mirror |
| | banco | bookworm | objstorage legacy |
| | saam | // | // |
| | branly | // | hypervisor |
| | mucem | // | // |
| | chaillot | // | // |
| | pompidou | // | // |
| | uffizi | // | // |
| | hypervisor3 | // | to reinstall for staging kube environment |
| | cassandra[01-13] | // | cassandra |
| | esnode[1-3,7-9] | // | elasticsearch |
| | search-esnode[4-6] | // | // |
| | kafka[1-4] | // | kafka |
| | rancher-metal0[1-4] | // | rancher node |
| | rancher-highmem0[1-2] | // | // |
| | giverny | // | desktop devel lab machine |
| | grand-palais | // | desktop devel lab machine (mirror) |
| | mam | // | provenance computation poc |
|--------------------+--------------------------+--------------+-------------------------------------------|
| production (vm) | | bullseye | |
| | kelvingrove | // | keycloak |
| | moma | bookworm | cache, reverse-proxy |
| | counters1 | // | webapp counters |
| | saatchi | // | rabbitmq |
| | pergamon | // | puppet master, dns server, icinga... |
| | tate | // | old forge, old wikis, ssh access |
| | logstash0 | // | logstash |
| | kibana0 | // | kibana |
| | getty | // | old kafka server, ... |
| | rancher-node-mgmt0[1-3] | // | rancher mgmt node |
| | maven-exporter | // | maven scrape index and expose for lister |
| | thyssen | // | jenkins |
| | jenkins-docker0[1-2] | // | jenkins compute machines |
| | thanos-compact.azure | // | thanos metrics |
| | ns0.euwest.azure | // | dns on azure networks |
| | backup01.euwest.azure | // | ? |
|--------------------+--------------------------+--------------+-------------------------------------------|
| admin (metal) | N/A | N/A | N/A |
|--------------------+--------------------------+--------------+-------------------------------------------|
| admin (vms) | | bullseye | |
| | thanos | // | thanos server |
| | dali | // | postgresql server |
| | bojimans | bookworm | inventory (netbox) |
| | bardo | // | hedgedoc |
| | rp1 | // | reverse-proxy |
| | rancher-admin-node0[1-3] | // | rancher node |
| | rancher-admin-mgmt0[1-3] | // | rancher admin node |
|--------------------+--------------------------+--------------+-------------------------------------------|
| staging (metal) | kafka[1-3] | bookworm | kafka |
| | storage1 | // | objstorage |
| | db1 | // | db + objstorage legacy |
|--------------------+--------------------------+--------------+-------------------------------------------|
| staging (vms) | rancher-test-worker[1,3] | bookworm | rancher nodes |
| | rancher-node-worker[1,6] | // | // |
| | rancher-test-mgmt[1-3] | // | // |
| | rancher-mgmt[1-3] | // | // |
| | search-esnode0 | // | elasticsearch |
| | search-esnode0 ... maven-exporter0 | // | maven scrape index and expose for lister |
| | counters0 | // | webapp counters |
| | rp0 | // | reverse proxy |
| | scheduler0 | // | rabbitmq |
| | runner0 | // | add-forge-now worker |
|--------------------+--------------------------+--------------+-------------------------------------------|
The remaining static services (the VMs still running bullseye above) are not really actionable here, so we split them out of this issue, one per environment. The gist is to check whether it is cost-effective (medium/long run) to move those services into kubernetes instead of managing them manually (puppet):
- Document debian upgrade procedure ~> #5556
- Fallback plan for puppet certificate generation after a dist-upgrade to bookworm has already occurred