With the integration of granet in the production k8s cluster, we've decided to
distinguish the "high memory" / "swh-graph" nodes with a separate naming scheme
(-highmemXX instead of -metalXX).
Following this naming scheme, it would be appropriate to rename rancher-node-metal05
to rancher-node-highmem02.
I believe the following steps should work once the puppet manifests have been updated
to support the new naming scheme (a rough sketch of the main commands follows the list):
- Ensure the current version of the graph on rancher-node-metal05 is backed up
- Disable puppet
- swh-charts: disable the previous graph instances running on rancher-node-metal05
- Drain the node (some other pods were running on it, e.g. deposit-rpc)
- Deprovision the node from the rancher cluster
- (optional) Disable/uninstall the rke2-agent service
- Update the inventory reference
- (blocked) swh-ipxe: reinstall the OS with the fqdn rancher-node-highmem02
- Reboot
- Check the machine has the new hostname (`hostname` and `hostname -f` report the new values)
- Run puppet (should provision a new certificate for the new hostname)
- Decommission the old node in puppet
- Reprovision the node in the rancher cluster
- (canary test) Run the same graph version as rancher-node-highmem01
- Disable the canary instance afterwards
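For reference, a minimal sketch of the commands the drain/hostname/puppet steps map to. This is an assumption-laden outline, not a verified transcript: the drain flags, where each command runs, and the certname suffix would need checking against our setup.

```
# on the node: keep puppet from re-converging mid-operation
root@rancher-node-metal05:~# puppet agent --disable "rename to rancher-node-highmem02"

# from a machine with cluster-admin access: evict the remaining pods (e.g. deposit-rpc)
$ kubectl drain rancher-node-metal05 --ignore-daemonsets --delete-emptydir-data

# (optional) stop the rke2 agent so the node does not rejoin the cluster
root@rancher-node-metal05:~# systemctl disable --now rke2-agent

# after the reinstallation: verify the new identity...
root@rancher-node-highmem02:~# hostname && hostname -f

# ...and run puppet so it requests a certificate for the new name
root@rancher-node-highmem02:~# puppet agent --test

# on the puppet server: clean up the certificate of the old name
# (<internal-domain> is a placeholder for our actual fqdn suffix)
$ puppetserver ca clean --certname rancher-node-metal05.<internal-domain>
```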
Note:
This node was already renamed once without reinstallation, so it may have plenty of dangling files from its previous life.
We deemed it better to go through a full reinstallation instead.
The current graph data on that node is 2024-03-31 [1], which is already present on rancher-node-highmem01 [2],
so there is no need to back it up first.
[1]
```
root@rancher-node-metal05:~# zfs list | grep graph
data/datasets/2024-03-31/compressed  4.86T  26.0T  4.79T  /srv/kubernetes/volumes/pvc-1d95c55e-1fe7-4482-9ea0-c044c4dc4ff3_swh-cassandra_graph-20240331-persistent-pvc/2024-03-31/compressed
```
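Before actually dropping the copy on rancher-node-metal05, a quick cross-check that the same dataset really is on the other node costs nothing; a sketch (the grep pattern is just illustrative):

```
root@rancher-node-highmem01:~# zfs list | grep 2024-03-31
```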
On the 5th tryout, I fetched the files it failed to retrieve and tried to serve them through pergamon, but that hung as well.
As a shot in the dark, since it worked with the previous distribution, I tried with bullseye, but that hangs the same way (expectedly, but you never know ;).
So I'm out of ideas besides renaming that machine again.
On the plus side, ipxe-wise, we can now declare whichever distribution we want to use... (not that I see any reason to install an oldstable one, but heh).
Nope, multiple tryouts (with variations: net1, new ipxe code, ...) and only failures...
In the end, I just used the virtual keyboard to enter the one-time boot menu entry instead of punctually changing the setup (which took forever and a half prior to reboot).
[x] 2. Prepare the data/datasets zfs volume with proper default options (compression, noatime, ...)
Initially:
```
root@rancher-node-highmem02:~# zfs get all data/datasets | grep "time\|xattr\|compression"
data/datasets  compression  off  default
data/datasets  atime        on   default
data/datasets  xattr        on   default
data/datasets  relatime     off  default
root@rancher-node-highmem02:~# zfs set compression=zstd data/datasets
root@rancher-node-highmem02:~# zfs set atime=off data/datasets
root@rancher-node-highmem02:~# zfs set relatime=on data/datasets
root@rancher-node-highmem02:~# zfs set xattr=sa data/datasets
root@rancher-node-highmem02:~# zfs get all data/datasets | grep "time\|xattr\|compression"
data/datasets  compression  zstd  local
data/datasets  atime        off   local
data/datasets  xattr        sa    local
data/datasets  relatime     on    local
root@rancher-node-highmem02:~# zfs create data/datasets/2024-12-06
root@rancher-node-highmem02:~# zfs get all data/datasets/2024-12-06 | grep "time\|xattr\|compression"
data/datasets/2024-12-06  compression  zstd  inherited from data/datasets
data/datasets/2024-12-06  atime        off   inherited from data/datasets
data/datasets/2024-12-06  xattr        sa    inherited from data/datasets
data/datasets/2024-12-06  relatime     on    inherited from data/datasets
```
[x] 3. Prepare ssh connection from one machine to the other (and vice-versa)
```
root@rancher-node-highmem02:~# ssh root@rancher-node-highmem01 date
Mon Jan 27 02:03:41 PM UTC 2025
```
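Only the verification is shown above; for the setup itself, one plausible way to wire it (assuming console or password access for the bootstrap; the key type and paths are illustrative):

```
# generate a dedicated key and authorize it on the peer
root@rancher-node-highmem02:~# ssh-keygen -t ed25519 -N '' -f /root/.ssh/id_ed25519
root@rancher-node-highmem02:~# cat /root/.ssh/id_ed25519.pub | \
    ssh root@rancher-node-highmem01 'cat >> /root/.ssh/authorized_keys'
# and the same in the other direction for the vice-versa part
```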
[x] 4. Transfer the zfs dataset from highmem01 to highmem02 (latest graph)
```
root@rancher-node-highmem02:~# ssh root@rancher-node-highmem01 zfs send -cvL data/datasets/2024-12-06/compressed@20250127T132857 | zfs receive data/datasets/2024-12-06/compressed
full send of data/datasets/2024-12-06/compressed@20250127T132857 estimated size is 4.96T
total estimated size is 4.96T
TIME        SENT   SNAPSHOT data/datasets/2024-12-06/compressed@20250127T132857
14:10:05    372M   data/datasets/2024-12-06/compressed@20250127T132857
...
14:34:51    529G   data/datasets/2024-12-06/compressed@20250127T132857
...
16:25:39   3.12T   data/datasets/2024-12-06/compressed@20250127T132857
...
17:29:32   4.50T   data/datasets/2024-12-06/compressed@20250127T132857
...
18:59:44   6.49T   data/datasets/2024-12-06/compressed@20250127T132857
root@rancher-node-highmem02:~# zfs list -t snapshot | grep -v data/rancher
NAME                                                   USED  AVAIL  REFER  MOUNTPOINT
data/datasets/2023-08-07@test                           56K      -   368K  -
data/datasets/2024-03-31/compressed@graph-2024-03-31  63.5G      -  4.79T  -
data/datasets/2024-12-06/compressed@20250127T132857      0B      -  4.95T  -
```
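Should this dataset ever need a refresh, an incremental send from the snapshot above would avoid re-transferring the ~5T; a sketch, with the new snapshot name as a placeholder:

```
# only the delta since 20250127T132857 is sent (<new-snap> is a placeholder)
root@rancher-node-highmem01:~# zfs snapshot data/datasets/2024-12-06/compressed@<new-snap>
root@rancher-node-highmem02:~# ssh root@rancher-node-highmem01 zfs send -cvL -i @20250127T132857 \
    data/datasets/2024-12-06/compressed@<new-snap> | zfs receive data/datasets/2024-12-06/compressed
```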
[x] 5. Install the zfs graph dataset to the proper path
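No transcript is shown for this step; presumably it amounts to pointing the received dataset's mountpoint at the path the graph pod's volume expects, along the lines of the PVC path in [1]. A sketch, with the pvc directory as a placeholder:

```
# zfs remounts the dataset automatically when the mountpoint property changes
root@rancher-node-highmem02:~# zfs set mountpoint=/srv/kubernetes/volumes/<pvc>/2024-12-06/compressed \
    data/datasets/2024-12-06/compressed
root@rancher-node-highmem02:~# ls /srv/kubernetes/volumes/<pvc>/2024-12-06/compressed | head
```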