Antoine R. Dumont · c9119f1f · 9d6acc6d · 274a2b96
--- a/docs/sysadm/data-silos/rancher/debian.rst 0 → 100644

+ 145

− 0
+++ b/docs/sysadm/data-silos/rancher/debian.rst 0 → 100644

+.. _upgrade-debian-in-rancher:
+
+Upgrade Procedure for Debian Nodes in a Rancher Cluster
+======================================================
+
+.. admonition:: Intended audience
+   :class: important
+
+   sysadm staff members
+
+Purpose
+--------
+
+This page documents the steps to upgrade Debian nodes running in a Rancher cluster. The
+upgrade process involves various commands and checks before and after rebooting the
+node.
+
+Prerequisites
+-------------
+
+- Familiarity with SSH and CLI-based command execution
+- Out-of-band access to the node (iDRAC/iLO) for the reboot
+- Access to the node through SSH (requires the VPN)
+
+Step 0: Initial Steps
+---------------------
+
+For VM nodes
+~~~~~~~~~~~~
+
+For VM nodes, we can take a VM snapshot in case something goes wrong during the
+upgrade. Connect to the Proxmox UI, select the node, open the snapshot menu and click
+``Take snapshot``.
+
+We can then switch to the console view to get access to the serial console (in case
+something goes wrong during the reboot).
+
+For bare metal nodes
+~~~~~~~~~~~~~~~~~~~~
+
+Ensure that out-of-band access to the machine works. It is what lets us recover when
+something goes wrong during a reboot (disk order or names change, network, ...).
+
+Step 1: Migrate to the next Debian suite
+----------------------------------------
+
+Update the Debian version of the node (e.g. bullseye to bookworm) using the following
+command:
+
+.. code::
+
+   root@node:~# /usr/local/bin/migrate-to-${NEXT_CODENAME}.sh
+
+Note: The script should already be present on the machine (it is installed through
+Puppet).
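+
+To confirm which suite the node is on before and after running the script, a small
+check along these lines can help. This is only a sketch: ``debian_codename`` is a
+hypothetical helper name, not part of the migration script, and it assumes that the
+standard ``/etc/os-release`` file defines ``VERSION_CODENAME``.
+
```shell
# Hypothetical helper: print the codename recorded in an os-release file.
# The optional path argument is for illustration; it defaults to /etc/os-release.
debian_codename() {
    (
        # Run in a subshell so the sourced variables do not leak into the caller.
        unset VERSION_CODENAME
        . "${1:-/etc/os-release}" && printf '%s\n' "${VERSION_CODENAME:-unknown}"
    )
}
```
+
+Calling ``debian_codename`` without argument on the node reads ``/etc/os-release``.
+It prints ``unknown`` when the file does not define a codename.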
+
+Step 2: Run Puppet Agent
+-------------------------
+
+Once the upgrade has completed, run the Puppet agent to apply any necessary
+configuration changes (e.g. changes to ``/etc/apt/sources.list``):
+
+.. code::
+
+   root@node:~# puppet agent -t
+
+Step 3: Autoremove and Purge
+-----------------------------
+
+Run ``apt autoremove`` to remove the unnecessary packages left over from the migration:
+
+.. code::
+
+   root@node:~# apt autoremove
+
+Step 4: Set an Argo CD sync window
+----------------------------------
+
+Our deployments are managed by Argo CD, which keeps all of them in sync. We want to
+temporarily disable this sync.
+
+In the Argo CD UI, switch the sync window from ``allow`` to ``deny``.
+
+This lets us scale the deployments down to a minimum without Argo CD reverting the
+change. Fewer running pods means less churn when pods move from one node to another
+(which will eventually have to be upgraded too).
+
+Note that, depending on the deployment, either the deployment scale or the KEDA scaled
+objects must be adapted (e.g.
+loader*, replayer, ...).
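+
+For a plain deployment, the scale-down can be sketched with ``kubectl scale`` as
+below. The helper name ``scale_deployment`` and its arguments are hypothetical, and
+the namespace and deployment names must be replaced with real ones. Deployments
+driven by a KEDA scaled object are not covered by this sketch: for those, the scaled
+object itself has to be adapted.
+
```shell
# Hypothetical helper: scale one deployment to a given number of replicas.
# Arguments: kube context, namespace, deployment name, replica count.
scale_deployment() {
    kubectl --context "$1" --namespace "$2" \
        scale deployment "$3" --replicas="$4"
}
```
+
+For example, ``scale_deployment archive-production-rke2 <namespace> <deployment> 1``
+would keep a single replica running.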
+
+Step 5: Drain the node
+----------------------
+
+Now that the deployments are scaled down, some pods are still running. We want to keep
+them running, but not on the node being upgraded.
+
+To do so, drain the node so that its pods are rescheduled onto the other cluster
+nodes.
+
+.. code::
+
+   user@admin-node:~$ kubectl --context archive-production-rke2 \
+       drain \
+       --delete-emptydir-data=true \
+       --ignore-daemonsets=true \
+       "$NODE_UPGRADING"
+
+Wait for the command to return and for the evicted pods to be running on the other
+nodes of the cluster.
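+
+To see what is still scheduled on the node once the drain has returned, the pods can
+be listed with a field selector on the node name. This is a sketch: ``pods_on_node``
+is a hypothetical helper name. Since the drain ignores daemonsets, the pods they
+manage are expected to still show up in the list.
+
```shell
# Hypothetical helper: list the pods currently scheduled on a given node.
# Arguments: kube context, node name.
pods_on_node() {
    kubectl --context "$1" get pods --all-namespaces \
        --field-selector "spec.nodeName=$2" --output wide
}
```
+
+For example: ``pods_on_node archive-production-rke2 "$NODE_UPGRADING"``.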
+
+Step 6: Reboot the Node
+------------------------
+
+We are finally ready to reboot the node:
+
+.. code::
+
+   root@node:~# reboot
+
+You can connect to the serial console of the machine to follow the reboot.
+
+Step 7: Clean up some more
+--------------------------
+
+Once the machine is restarted, some cleanup might be necessary.
+
+.. code::
+
+   root@node:~# apt autopurge
+
+In the case of the bullseye-bookworm migration, on some VMs, we needed to uninstall a
+package and reset the failed state of some newly failing services.
+
+.. code::
+
+   root@node:~# apt purge -y openipmi
+   root@node:~# systemctl reset-failed   # so icinga stops complaining
+
+
+Step 8: Rejoin the Rancher cluster
+----------------------------------
+
+After the node reboots, check that it has rejoined the Rancher cluster.
+
+Then ``uncordon`` the node so the kube scheduler can schedule pods on it
+again (the node will be marked as ``ready``).
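+
+These two checks can be sketched as follows. ``rejoin_node`` is a hypothetical helper
+name and the 300 second timeout is an arbitrary assumption. It waits for the node to
+report the ``Ready`` condition, then uncordons it.
+
```shell
# Hypothetical helper: wait until a node is Ready, then make it schedulable again.
# Arguments: kube context, node name.
rejoin_node() {
    # The timeout value is an assumption; adjust it to the node's boot time.
    kubectl --context "$1" wait --for=condition=Ready "node/$2" --timeout=300s &&
        kubectl --context "$1" uncordon "$2"
}
```
+
+For example: ``rejoin_node archive-production-rke2 "$NODE_UPGRADING"``. The node is
+only uncordoned if the wait succeeds.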