From 274a2b96fb90d8b24c8a4e7b9492e99199340645 Mon Sep 17 00:00:00 2001
From: "Antoine R. Dumont (@ardumont)" <ardumont@softwareheritage.org>
Date: Fri, 17 Jan 2025 17:56:57 +0100
Subject: [PATCH] rancher/debian: Add debian upgrade procedure for rancher cluster

Refs. swh/infra/sysadm-environment#5415
---
 docs/sysadm/data-silos/index.rst          |   1 +
 docs/sysadm/data-silos/rancher/debian.rst | 260 ++++++++++++++++++++++
 2 files changed, 261 insertions(+)
 create mode 100644 docs/sysadm/data-silos/rancher/debian.rst

diff --git a/docs/sysadm/data-silos/index.rst b/docs/sysadm/data-silos/index.rst
index 264187a5..a303176c 100644
--- a/docs/sysadm/data-silos/index.rst
+++ b/docs/sysadm/data-silos/index.rst
@@ -9,3 +9,4 @@ Data silos
    kafka/index
    elasticsearch/index
    winery/index
+   rancher/debian

diff --git a/docs/sysadm/data-silos/rancher/debian.rst b/docs/sysadm/data-silos/rancher/debian.rst
new file mode 100644
index 00000000..cf87c7f7
--- /dev/null
+++ b/docs/sysadm/data-silos/rancher/debian.rst
@@ -0,0 +1,260 @@
.. _upgrade-debian-rancher-cluster:

Upgrade Procedure for Debian Nodes in a Rancher Cluster
=======================================================

.. admonition:: Intended audience
   :class: important

   sysadm staff members

Purpose
-------

This page documents the steps to upgrade Debian nodes running in a Rancher cluster. The
upgrade process involves various commands and checks before and after rebooting the
node.

Prerequisites
-------------

+ Familiarity with SSH and CLI-based command execution
+ Out-of-band access to the node (iDRAC/iLO) for the reboot
+ Access to the node through SSH (requires the VPN)

Step 0: Initial Steps
---------------------

For VM nodes
~~~~~~~~~~~~

For VM nodes, take a VM snapshot so the node can be rolled back if something goes wrong
during the migration. Connect to the Proxmox UI, select the node, open the snapshot menu
and hit ``Take snapshot``.

Then switch to the console view to get access to the serial console (useful if something
goes wrong during the reboot).

For bare metal nodes
~~~~~~~~~~~~~~~~~~~~

Ensure the out-of-band access to the machine works. It definitely helps when something
goes wrong during a reboot (disk order or name changes, network issues, ...).

Rancher snapshot
~~~~~~~~~~~~~~~~

To disable the Rancher etcd snapshots, go to ``Cluster Management`` in the
`Rancher UI <https://rancher.euwest.azure.internal.softwareheritage.org/dashboard/>`_
and choose a cluster. There are then two methods:

1. edit the YAML configuration:

   From a cluster dashboard choose ``Edit YAML``.

   Configuration with the S3 snapshot backup disabled (``s3: null``):

   .. code:: yaml

      etcd:
        disableSnapshots: false
        s3: null
        snapshotRetention: 5
        snapshotScheduleCron: <min> */5 * * *

   Configuration with the S3 snapshot backup enabled; the per-cluster ``folder`` and
   ``<min>`` values are listed in the table below:

   .. code:: yaml

      etcd:
        s3:
          bucket: backup-rke2-etcd
          cloudCredentialName: cattle-global-data::<xxx>
          endpoint: minio.admin.swh.network
          folder: <folder>
        snapshotRetention: 5
        snapshotScheduleCron: <min> */5 * * *

   ======================= ===
   folder                  min
   ======================= ===
   archive-production-rke2 00
   cluster-admin-rke2      15
   archive-staging-rke2    30
   test-staging-rke2       45
   ======================= ===

2. edit the configuration:

   - from a cluster dashboard choose ``Edit Config``;
   - in the ``etcd`` section tab, choose disable in the ``Backup Snapshots to S3`` section;
   - if there is a custom configuration (CoreDNS for example), plan/apply the
     ``terraform`` cluster deployment afterwards.

.. admonition:: Edit configuration graphically
   :class: warning

   With ``Edit Config``, all the custom configurations (CoreDNS) will be overwritten.

Check the clusters' leases and configmaps used by the Rancher snapshots:

.. code:: bash

   $ for context in $(kubectx | awk '/-rke2/');do
       echo -e "---\nEtcd leader in cluster $context"
       kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
         -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
            --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
            --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
            endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
            xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
       echo "Leases and configmaps in cluster $context"
       for name in rke2 rke2-etcd;do
         kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
           awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
         kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
           awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
       done
     done
   ---
   Etcd leader in cluster archive-production-rke2
           rancher-node-production-rke2-mgmt1
   Leases and configmaps in cluster archive-production-rke2
           ConfigMap  rke2       "holderIdentity":"rancher-node-production-rke2-mgmt1"
           Lease      rke2       rancher-node-production-rke2-mgmt1
           ConfigMap  rke2-etcd  "holderIdentity":"rancher-node-production-rke2-mgmt1"
           Lease      rke2-etcd  rancher-node-production-rke2-mgmt1
   ---
   Etcd leader in cluster archive-staging-rke2
           rancher-node-staging-rke2-mgmt1
   Leases and configmaps in cluster archive-staging-rke2
           ConfigMap  rke2       "holderIdentity":"rancher-node-staging-rke2-mgmt1"
           Lease      rke2       rancher-node-staging-rke2-mgmt1
           ConfigMap  rke2-etcd  "holderIdentity":"rancher-node-staging-rke2-mgmt1"
           Lease      rke2-etcd  rancher-node-staging-rke2-mgmt1
   ---
   Etcd leader in cluster cluster-admin-rke2
           rancher-node-admin-rke2-mgmt3
   Leases and configmaps in cluster cluster-admin-rke2
           ConfigMap  rke2       "holderIdentity":"rancher-node-admin-rke2-mgmt2"
           Lease      rke2       rancher-node-admin-rke2-mgmt2
           ConfigMap  rke2-etcd  "holderIdentity":"rancher-node-admin-rke2-mgmt2"
           Lease      rke2-etcd  rancher-node-admin-rke2-mgmt2
   ---
   Etcd leader in cluster test-staging-rke2
           rancher-node-test-rke2-mgmt1
   Leases and configmaps in cluster test-staging-rke2
           ConfigMap  rke2       "holderIdentity":"rancher-node-test-rke2-mgmt1"
           Lease      rke2       rancher-node-test-rke2-mgmt1
           ConfigMap  rke2-etcd  "holderIdentity":"rancher-node-test-rke2-mgmt1"
           Lease      rke2-etcd  rancher-node-test-rke2-mgmt1

See also `https://www.suse.com/support/kb/doc/?id=000021447 <https://www.suse.com/support/kb/doc/?id=000021447>`_.

Step 1: Migrate to the next Debian suite
----------------------------------------

Update the Debian version of the node (e.g. bullseye to bookworm) using the following
command:

.. code::

   root@node:~# /usr/local/bin/migrate-to-${NEXT_CODENAME}.sh

Note: the script should already be present on the machine (installed through Puppet).

Step 2: Run Puppet Agent
------------------------

Once the upgrade has completed, run the Puppet agent to apply any necessary configuration
changes (e.g. the ``/etc/apt/sources.list`` update):

.. code::

   root@node:~# puppet agent -t

Step 3: Autoremove and Purge
----------------------------

Run autoremove to get rid of the unnecessary packages left over from the migration:

.. code::

   root@node:~# apt autoremove

Step 4: Set an argocd sync window
---------------------------------

Our deployments are managed by argocd, which keeps all the deployments in sync with their
definitions. We want to temporarily disable this sync.

Go to the argocd UI and switch the sync window from ``allow`` to ``deny``.

We want this so we can scale the deployments down to a minimum. That decreases the
overall number of running pods, hence less churn when pods get moved from one node to
another (the other nodes will eventually have to be migrated too).

Note that, depending on the deployment, either the deployment replica count or the keda
scaled objects (e.g. loader*, replayer, ...) must be adapted; see the sketch below.

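For example, a minimal sketch of both adjustments, assuming an illustrative ``swh``
namespace and ``loader-git`` deployment name (adapt the context, namespace and resource
names to the cluster and deployments at hand):

.. code:: bash

   # Scale a plain deployment down to zero replicas (illustrative names)
   kubectl --context archive-staging-rke2 -n swh scale deployment loader-git --replicas=0

   # If keda is deployed in the namespace, list the scaled objects to spot the ones to adapt
   kubectl --context archive-staging-rke2 -n swh get scaledobjects
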

Step 5: Drain the node
----------------------

Now that the deployments are scaled down, some pods are still running and we want to
keep them running, but not on the node being upgraded.

For this, we must drain the node so its pods are redistributed to the other cluster
nodes.

.. code::

   user@admin-node:~# kubectl --context archive-production-rke2 \
                          drain \
                          --delete-emptydir-data=true \
                          --ignore-daemonsets=true \
                          $NODE_UPGRADING

Wait for the command to return and for the stopped pods to be running again on the other
nodes of the cluster.

Step 6: Reboot the Node
-----------------------

We are finally ready to reboot the node, so just do it:

.. code::

   root@node:~# reboot

You can connect to the serial console of the machine to follow the reboot.

Step 7: Clean up some more
--------------------------

Once the machine has restarted, some cleanup might be necessary.

.. code::

   root@node:~# apt autopurge

In the case of the bullseye to bookworm migration, on some VMs, we needed to uninstall
some packages and clear some failed services.

.. code::

   root@node:~# apt purge -y openipmi
   root@node:~# systemctl reset-failed  # so icinga stops complaining

Step 8: Join back the Rancher cluster
-------------------------------------

After the node reboots, check that it joined back the Rancher cluster.

Then ``uncordon`` the node so the kube scheduler can schedule pods on it again (the node
will be marked as ``Ready``); see the sketch below.

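A minimal sketch of those two operations, reusing the ``$NODE_UPGRADING`` variable from
the drain step:

.. code:: bash

   # The rebooted node should show up again in the node list
   # (as Ready,SchedulingDisabled until it is uncordoned)
   kubectl --context archive-production-rke2 get nodes

   # Allow the kube scheduler to place pods on the node again
   kubectl --context archive-production-rke2 uncordon $NODE_UPGRADING
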

Post cluster migration
----------------------

Once all the nodes of the cluster have been migrated:

- Remove the argocd sync window so the cluster is back to its nominal state.
- Re-enable the Rancher etcd snapshots.
- Check the ``holderIdentity`` value in the ``rke2`` and ``rke2-etcd`` leases and
  configmaps (see the check script in the Rancher snapshot section above).

--
GitLab