Upgrade bullseye (Debian 11) systems to bookworm (Debian 12)

marked this issue as related to #5416 (closed)

marked this issue as related to #5423 (closed)

marked the checklist item logging elasticsearch (should probably use the opportunity to upgrade to a more recent elasticsearch version. maybe wait for the new nodes) as completed

marked this issue as related to #5428

marked this issue as related to #5429

changed the description

marked the checklist item proxmox cluster as completed

changed the description

marked the checklist item Check puppet certificate expiry as completed

logging elasticsearch

Check puppet certificate expiry

All fine and dandy. Communication is fine [1] and certificates got renewed recently enough [2]

[2]

root@pergamon:~# for name in $(puppet cert list -a 2>&1 | awk '/esnode/{print $2}' | tr -d '"'| grep -v search-esnode); do puppet cert print $name; done 2>&1 | grep -A2 "Validity"
        Validity
            Not Before: Apr 12 16:46:15 2023 GMT
            Not After : Apr 11 16:46:15 2028 GMT
--
        Validity
            Not Before: Apr 12 16:47:55 2023 GMT
            Not After : Apr 11 16:47:55 2028 GMT
--
        Validity
            Not Before: Apr 12 16:47:43 2023 GMT
            Not After : Apr 11 16:47:43 2028 GMT
--
        Validity
            Not Before: Mar 22 10:57:44 2023 GMT
            Not After : Mar 21 10:57:44 2028 GMT

[1]

root@pergamon:~# grep "elasticsearch:" /etc/clustershell/groups
elasticsearch: esnode[1-3,7] search-esnode[4-6]
root@pergamon:~# clush -b -w @elasticsearch puppet agent -t --noop
---------------
esnode1
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730732850'
Notice: Applied catalog in 5.09 seconds
---------------
esnode2
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730732850'
Notice: Applied catalog in 4.83 seconds
---------------
esnode3
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730732852'
Notice: Applied catalog in 4.98 seconds
---------------
esnode7
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730732853'
Notice: Applied catalog in 7.18 seconds
---------------
search-esnode4
---------------
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Applying configuration version '1730732847'
Notice: Applied catalog in 6.34 seconds
---------------
search-esnode5
---------------
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Applying configuration version '1730732847'
Notice: Applied catalog in 6.50 seconds
---------------
search-esnode6
---------------
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Applying configuration version '1730732847'
Notice: Applied catalog in 6.13 seconds

marked the checklist item Check puppet certificate expiry as completed

proxmox cluster

Check puppet certificate expiry

Fine too. The current communication is ok [1] But some might need renewal soonish [2]

[2]

root@pergamon:~# for name in $(grep "hypervisors:" /etc/clustershell/groups | cut -d":" -f2); do puppet cert print $name.internal.softwareheritage.org; done 2>&1 | grep -A2 "Validity"
        Validity
            Not Before: Jul 31 14:57:07 2020 GMT
            Not After : Jul 31 14:57:07 2025 GMT
--
        Validity
            Not Before: Jan 18 06:18:16 2024 GMT
            Not After : Jan 17 06:18:16 2029 GMT
--
        Validity
            Not Before: Dec 16 16:17:47 2022 GMT
            Not After : Dec 16 16:17:47 2027 GMT
--
        Validity
            Not Before: Dec 13 16:46:56 2020 GMT
            Not After : Dec 13 16:46:56 2025 GMT
--
        Validity
            Not Before: Dec 13 15:15:29 2020 GMT
            Not After : Dec 13 15:15:29 2025 GMT

[1]

puppet certificate are fine for now

root@pergamon:~# grep "hypervisors:" /etc/clustershell/groups
hypervisors: branly hypervisor3 mucem pompidou uffizi
root@pergamon:~# clush -b -w @hypervisors puppet agent --test --noop
branly: Warning: The current total number of facts: 3864 exceeds the number of facts limit: 2048
mucem: Warning: The current total number of facts: 3992 exceeds the number of facts limit: 2048
pompidou: Warning: The current total number of facts: 2378 exceeds the number of facts limit: 2048
hypervisor3: Warning: The current total number of facts: 3917 exceeds the number of facts limit: 2048
uffizi: Warning: The current total number of facts: 2596 exceeds the number of facts limit: 2048
---------------
branly
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730733249'
Notice: Applied catalog in 6.27 seconds
---------------
hypervisor3
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730733255'
Notice: Applied catalog in 11.11 seconds
---------------
mucem
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730733251'
Notice: Applied catalog in 7.69 seconds
---------------
pompidou
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730733252'
Notice: Applied catalog in 9.51 seconds
---------------
uffizi
---------------
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Applying configuration version '1730733260'
Notice: Applied catalog in 15.24 seconds

The thing to check with puppet certificate expiry, rather than a boolean "expired y/n", is how far in the future it is (and how likely it is that we'll have to do the manual dance until the puppet server is upgraded to 7.x)

Indeed, I realized that as soon as I committed the comments and checked the validity range (probably while you were mentioning it to me). The comments were adapted too to reflect this. ^ And thanks for the heads up.

The soonest to be renewed is for branly on Jul 31 14:57:07 2025 GMT. Then it's uffizi on Dec 13 15:15:29 2025 GMT. That sounds far enough.

changed the description

So I went to start with the kafka1.staging and unfortunately, the ilo is not accessible. So i'm compulsing the list of ilo/idrac i don't have access to so I can open helpdesk ticket(s?) to the dsi.

The basic current status with regards to the machine in the first step. So I actually can't start any upgrade (at least one node per cluster is unreachable through its iDRAC/iLo address )

Preparing the 2 tickets (1 per environment) to the dsi.

[1] staging

|------------------------+------------------|
| staging machine        | idrac/ilo access |
|------------------------+------------------|
| kafka1.staging         | fail             |
|------------------------+------------------|
| cassandra1.staging     | fail             |
| cassandra2.staging     | ok               |
| cassandra3.staging     | fail             |
|------------------------+------------------|

[2] production

|--------------------+------------------|
| production machine | idrac/ilo access |
|--------------------+------------------|
| kafka1             | fail             |
| kafka2             | fail             |
| kafka3             | ok               |
| kafka4             | ok               |
|--------------------+------------------|
| cassandra01        | ok               |
| cassandra02        | fail             |
| cassandra03        | ok               |
| cassandra04        | fail             |
| cassandra05        | ok               |
| cassandra06        | fail             |
| cassandra07        | fail             |
| cassandra08        | fail             |
| cassandra09        | fail             |
| cassandra10        | ok               |
| cassandra11        | ok               |
| cassandra12        | ok               |
|--------------------+------------------|
| search-esnode4     | ok               |
| search-esnode5     | fail             |
| search-esnode6     | ok               |
|--------------------+------------------|

DSI has no visibility of our environments, a single ticket may be simpler (though that's a big list :/)

As exchanged and agreed upon, i'm actually giving the same list with more details (bay, rack position, switch ports, ...) out of our inventory. I'm currently sorting them per bay and rack position.

changed the description

search-esnode0.staging is the sole one that can be migrated for now.

[1] not really needed but i wanted to check the steps:

decommissionned it from puppet master
dropped its /var/lib/puppet/ssl certificates
trigger back a 'puppet agent -t'

# before
root@pergamon:~# check-puppet-cert-validity.sh search-esnode0.internal.staging.swh.network
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
        Validity
            Not Before: Dec  3 10:14:18 2020 GMT
            Not After : Dec  3 10:14:18 2025 GMT
        Subject: CN=search-esnode0.internal.staging.swh.network
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
# now
root@pergamon:~# check-puppet-cert-validity.sh search-esnode0.internal.staging.swh.network
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
        Validity
            Not Before: Nov  7 09:51:48 2024 GMT
            Not After : Nov  7 09:51:48 2029 GMT
        Subject: CN=search-esnode0.internal.staging.swh.network
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption

[3]

root@pergamon:~# cat /usr/local/bin/check-puppet-cert-validity.sh
#!/usr/bin/env bash

FQDN=$1

puppet cert print $FQDN | grep -A2 "Subject:\|Validity"

[4] (maybe it'd be worth puppetizing it so it's installed for the next one without having to install it manually too)

ardumont@search-esnode0:~% cat /usr/local/bin/migrate-to-bookworm.sh
#!/usr/bin/env bash

# Run as root

set -x

pushd /etc/apt

for pattern_file in sources.list sources.list.d/*.list; do
    # Update the main sources.list
    sed -i 's/bullseye/bookworm/gi' $pattern_file
done

git status
git add .
git commit -m "migrate from bullseye to bookworm distribution"

apt update
apt upgrade --without-new-pkgs -y
apt full-upgrade --download-only -y
apt full-upgrade -y

marked the checklist item search-esnode0.internal.staging.swh.network as completed

changed the description

Thierry did a reset and made some other checks on machines.

We have further troubles accessing the following machines' idrac.

|            IP | FQDN Inria              | Associated swh machines           |
|---------------+-------------------------+-----------------------------------|
|  128.93.134.7 | swh11-adm.inria.fr      | rancher-node-metal05 (former met) |
|  128.93.134.8 | swh5-adm.inria.fr       | rancher-node-staging-rke2-metal01 |
| 128.93.134.12 | swh1-das-mgmt1.inria.fr | ?                                 |
| 128.93.134.13 | swh1-das-mgmt2.inria.fr | ?                                 |
| 128.93.134.14 | swh10-adm.inria.fr      | granet                            |
| 128.93.134.15 | swh-kafka1-adm.inria.fr | kafka1                            |
| 128.93.134.16 | swh-kafka2-adm.inria.fr | kafka2                            |
| 128.93.134.19 | swh-hv4-adm.inria.fr    | branly                            |
| 128.93.134.22 | swh-es5-adm.inria.fr    | search-esnode5                    |
| 128.93.134.32 | swh-kube1-adm.inria.fr  | rancher-node-metal01              |

Note: i've filled-in information about swh-kube[1-2]-adm.inria.fr in the inventory.

kafka1.staging]

Check puppet certificate status: ok [1]
migration to bookworm migration
Reboot
Checks

[1] certificate ok

root@pergamon:~# /usr/local/bin/check-puppet-cert-validity.sh kafka1.internal.staging.swh.network
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
        Validity
            Not Before: Mar 28 08:23:04 2023 GMT
            Not After : Mar 27 08:23:04 2028 GMT
        Subject: CN=kafka1.internal.staging.swh.network
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption

[2] Status before upgrade

root@kafka1:~# uptime
 13:16:58 up 602 days,  4:14,  4 users,  load average: 1.08, 0.95, 0.83
root@kafka1:~# uname -a
Linux kafka1 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
root@kafka1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

[3] Status after upgrade and reboot

root@kafka1:~# uptime
 13:42:40 up 1 min,  5 users,  load average: 1.31, 0.47, 0.17
root@kafka1:~# uname -a
Linux kafka1 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux
root@kafka1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm

marked the checklist item kafka1.internal.staging.swh.network as completed

marked the checklist item staging (single node, so downtime) as completed

marked the checklist item cassandra1.internal.staging.swh.network as completed

[x] cassandra1.staging

Check puppet certificate status: ok [1]
Unregister node from the cassandra cluster (nodetool drain) [2]
Monitor cluster from another node [3]
migration to bookworm migration
Reboot
Checks [4] [5]

[1] certificate ok

root@pergamon:~# /usr/local/bin/check-puppet-cert-validity.sh cassandra1.internal.staging.swh.network
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
        Validity
            Not Before: Mar 14 11:04:58 2023 GMT
            Not After : Mar 13 11:04:58 2028 GMT
        Subject: CN=cassandra1.internal.staging.swh.network
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption

[2] Stop node from cassandra the time to upgrade it

root@cassandra1:~# USER=$(cat /etc/cassandra/jmxremote.password | awk '{print $1}')
root@cassandra1:~# PASS=$(cat /etc/cassandra/jmxremote.password | awk '{print $2}')
root@cassandra1:~# /opt/cassandra/bin/nodetool -u $USER --password $PASS drain
root@cassandra1:~# systemctl status cassandra@instance1.service
● cassandra@instance1.service - Cassandra instance1 instance
     Loaded: loaded (/etc/systemd/system/cassandra@.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/cassandra@instance1.service.d
             └─parameters.conf
     Active: active (running) since Mon 2024-09-30 21:11:55 UTC; 1 months 20 days ago
   Main PID: 2911237 (java)
      Tasks: 560 (limit: 154321)
     Memory: 69.9G
        CPU: 1month 2w 1d 13h 14min 30.184s
     CGroup: /system.slice/system-cassandra.slice/cassandra@instance1.service
             └─2911237 /usr/bin/java -ea -da:net.openhft... -XX:+UseThreadPriorities -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:+AlwaysPre>

Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7137] 2024-11-20 14:27:40,315 Memtable.java:489 - Writing>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7137] 2024-11-20 14:27:40,318 Memtable.java:518 - Complet>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7138] 2024-11-20 14:27:40,321 Memtable.java:489 - Writing>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7137] 2024-11-20 14:27:40,326 Memtable.java:489 - Writing>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7137] 2024-11-20 14:27:40,328 Memtable.java:518 - Complet>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7138] 2024-11-20 14:27:40,331 Memtable.java:518 - Complet>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7137] 2024-11-20 14:27:40,334 Memtable.java:489 - Writing>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [PerDiskMemtableFlushWriter_0:7137] 2024-11-20 14:27:40,338 Memtable.java:518 - Complet>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [RMI TCP Connection(162892)-192.168.130.181] 2024-11-20 14:27:40,343 HintsService.java:>
Nov 20 14:27:40 cassandra1 cassandra[2911237]: INFO  [RMI TCP Connection(162892)-192.168.130.181] 2024-11-20 14:27:40,397 StorageService.jav>
root@cassandra1:~# systemctl stop cassandra@instance1.service
root@cassandra1:~# systemctl disable cassandra@instance1.service
Removed /etc/systemd/system/multi-user.target.wants/cassandra@instance1.service.

[3] Cassandra cluster status before upgrade & reboot

root@cassandra3:~# USER=$(cat /etc/cassandra/jmxremote.password | awk '{print $1}')
root@cassandra3:~# PASS=$(cat /etc/cassandra/jmxremote.password | awk '{print $2}')
root@cassandra3:~# period=30; while true; do date; echo; /opt/cassandra/bin/nodetool -u $USER --password $PASS status -r; echo; /opt/cassandra/bin/nodetool -u $USER --password $PASS netstats -H | grep -v 100%; sleep $period; done
Wed 20 Nov 2024 02:24:32 PM UTC

Datacenter: sesi_rocquencourt_staging
=====================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                  Load      Tokens  Owns (effective)  Host ID                               Rack
UN  cassandra2.internal.staging.swh.network  2.79 TiB  16      100.0%            76e43f7a-e1fc-405b-b3ef-361bb7e76b72  rack1
UN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1
UN  cassandra3.internal.staging.swh.network  2.79 TiB  16      100.0%            4f4fddc0-b7ce-468d-80dc-58cbfe23d030  rack1


Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 4174
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0          24496         0
Small messages                  n/a         0     4343976427         0
Gossip messages                 n/a         0       15323938         0

...
|/ State=Normal/Leaving/Joining/Moving
--  Address                                  Load      Tokens  Owns (effective)  Host ID                               Rack
UN  cassandra2.internal.staging.swh.network  2.79 TiB  16      100.0%            76e43f7a-e1fc-405b-b3ef-361bb7e76b72  rack1
DN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1  <- the node is down for now
UN  cassandra3.internal.staging.swh.network  2.79 TiB  16      100.0%            4f4fddc0-b7ce-468d-80dc-58cbfe23d030  rack1


Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 4174
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0          24496         0
Small messages                  n/a         0     4344087506         0
Gossip messages                 n/a       309       15325543        42  <- increase in gossip since one node less in the cluster
Wed 20 Nov 2024 02:33:42 PM UTC

Datacenter: sesi_rocquencourt_staging
=====================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                  Load      Tokens  Owns (effective)  Host ID                               Rack
UN  cassandra2.internal.staging.swh.network  2.79 TiB  16      100.0%            76e43f7a-e1fc-405b-b3ef-361bb7e76b72  rack1
DN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1
UN  cassandra3.internal.staging.swh.network  2.79 TiB  16      100.0%            4f4fddc0-b7ce-468d-80dc-58cbfe23d030  rack1


Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 4174
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0          24496         0
Small messages                  n/a         0     4344093305         0
Gossip messages                 n/a       314       15325639        73  <- another tick, another increase

[4] Status before upgrade

root@cassandra1:~# uptime
 13:58:09 up 365 days, 21:51,  2 users,  load average: 0.58, 0.59, 0.63
root@cassandra1:~# uname -a
Linux cassandra1 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
root@cassandra1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

[3] Status after upgrade and reboot

root@cassandra1:~# uptime
 14:44:33 up 1 min,  3 users,  load average: 0.54, 0.25, 0.10
root@cassandra1:~# uname -a
Linux cassandra1 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux
root@cassandra1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
# The other nodes sees that node as UN too.
root@cassandra1:~# /opt/cassandra/bin/nodetool -u $USER --password $PASS status -r | grep cassandra1
UN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1
root@cassandra2:~# /opt/cassandra/bin/nodetool -u $USER --password $PASS status -r | grep cassandra1
UN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1
root@cassandra3:~# /opt/cassandra/bin/nodetool -u $USER --password $PASS status -r | grep cassandra1
UN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1

# The node has status normal and is completing tasks
root@cassandra1:~# period=30; while true; do date; echo; /opt/cassandra/bin/nodetool -u $USER --password $PASS status -r; echo; /opt/cassandra/bin/nodetool -u $USER --password $PASS netstats -H | grep -v 100%; sleep $period; done
Wed Nov 20 15:00:34 UTC 2024

Datacenter: sesi_rocquencourt_staging
=====================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                  Load      Tokens  Owns (effective)  Host ID                               Rack
UN  cassandra2.internal.staging.swh.network  2.79 TiB  16      100.0%            76e43f7a-e1fc-405b-b3ef-361bb7e76b72  rack1
UN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1
UN  cassandra3.internal.staging.swh.network  2.79 TiB  16      100.0%            4f4fddc0-b7ce-468d-80dc-58cbfe23d030  rack1


Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0              2         0
Small messages                  n/a         0         478142         0
Gossip messages                 n/a         0           2365         0
                                                        ^ up this number goes at each tick

[x] cassandra2.staging

Check puppet certificate status: ok [1]
Unregister node from the cassandra cluster (nodetool drain) [2]
Monitor cluster from another node [3]
migration to bookworm migration
Reboot
Checks [4] [5]

[1] certificate ok

root@pergamon:~# /usr/local/bin/check-puppet-cert-validity.sh cassandra2.internal.staging.swh.network
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
        Validity
            Not Before: Mar 14 11:05:53 2023 GMT
            Not After : Mar 13 11:05:53 2028 GMT
        Subject: CN=cassandra2.internal.staging.swh.network
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption

[2]

root@cassandra2:~# /opt/cassandra/bin/nodetool -u $USER --password $PASS drain
root@cassandra2:~# systemctl stop cassandra@instance1.service
root@cassandra2:~# ps aux | grep -i cassandra
root     1228674  0.0  0.0   7868  2564 pts/1    S+   15:06   0:00 grep -i cassandra

[3] from cassandra1

Datacenter: sesi_rocquencourt_staging
=====================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                  Load      Tokens  Owns (effective)  Host ID                               Rack
DN  cassandra2.internal.staging.swh.network  2.79 TiB  16      100.0%            76e43f7a-e1fc-405b-b3ef-361bb7e76b72  rack1
UN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1
UN  cassandra3.internal.staging.swh.network  2.79 TiB  16      100.0%            4f4fddc0-b7ce-468d-80dc-58cbfe23d030  rack1


Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0              2         0
Small messages                  n/a         0        1629902         0
Gossip messages                 n/a        58           3760         0

[4] Status before upgrade

root@cassandra2:~# uptime
 13:58:43 up 365 days,  5:02,  2 users,  load average: 0.55, 0.50, 0.59
root@cassandra2:~# uname -a
Linux cassandra2 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
root@cassandra2:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

[5] Status after upgrade and reboot

root@cassandra2:~# uname -a
Linux cassandra2 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux
root@cassandra2:~# uptime
 15:34:16 up 0 min,  3 users,  load average: 6.30, 1.86, 0.64
root@cassandra2:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
root@cassandra2:~# /opt/cassandra/bin/nodetool -u $USER --password $PASS status -r
Datacenter: sesi_rocquencourt_staging
=====================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                  Load      Tokens  Owns (effective)  Host ID                               Rack
UN  cassandra2.internal.staging.swh.network  2.79 TiB  16      100.0%            76e43f7a-e1fc-405b-b3ef-361bb7e76b72  rack1
UN  cassandra1.internal.staging.swh.network  2.79 TiB  16      100.0%            91e480eb-8f3c-4f8b-b4d4-1b5fe377e3f7  rack1
UN  cassandra3.internal.staging.swh.network  2.79 TiB  16      100.0%            4f4fddc0-b7ce-468d-80dc-58cbfe23d030  rack1

marked the checklist item cassandra2.internal.staging.swh.network as completed

marked the checklist item cassandra3.internal.staging.swh.network as completed

marked the checklist item staging as completed

mentioned in commit swh/devel/swh-docs@9f815060

mentioned in merge request swh/devel/swh-docs!451 (merged)

mentioned in commit swh/devel/swh-docs@762a2b1d

mentioned in commit swh/devel/swh-docs@1e0a68a3

mentioned in commit swh/devel/swh-docs@9640172b

mentioned in commit swh/devel/swh-docs@78f19407

All Cassandra idrac/ilo access has been restored by the team currently at Rocq and switching old cables. Thanks team!

changed the description

mentioned in commit swh/infra/puppet/puppet-swh-site@598f3176

A script to migrate to bookworm has been installed through puppet (on bare metal node). It will check and prevent the migration if the node is not running bullseye. It's based on the various script i've been iterated since migrating machines from buster (one version of it is referenced here).

Here goes the production search-esnode[4-6] plan:

For each node in a rollout fashion:

Ongoing status:

search-esnode4
search-esnode5
search-esnode6

[1] https://grafana.softwareheritage.org/goto/7gZ_ToSHz?orgId=1

[2]

marked the checklist item search-esnode4.internal.softwareheritage.org as completed

marked the checklist item search-esnode5.internal.softwareheritage.org as completed

Upgrade bullseye (Debian 11) systems to bookworm (Debian 12)

Designs

Child items 0

Activity