Merge request !461 (merged): debian upgrade: Document elasticsearch procedure
ardumont/swh-docs:mr/document-debian-upgrade-per-data-silos into master
Refs. swh/infra/sysadm-environment#5556
docs/sysadm/data-silos/elasticsearch/debian-upgrade.rst (new file, +168 −0)
.. _upgrade-debian-elasticsearch-cluster:
Upgrade Procedure for Debian Nodes in an Elasticsearch Cluster
==============================================================
.. admonition:: Intended audience
   :class: important

   sysadm staff members
Purpose
--------
This page documents the steps to upgrade Debian nodes running in an Elasticsearch
cluster. The procedure is applied one node at a time and involves various commands and
checks before and after rebooting each node.
Prerequisites
-------------
+ Familiarity with SSH and CLI-based command execution
+ Out-of-band access to the node (iDRAC/iLO) for the reboot
+ Access to the node through SSH (requires the VPN)
Step 0: Initial Steps
---------------------
The elasticsearch nodes only run on bare metal machines, so first make sure that
out-of-band access to the machine works. This helps considerably when something goes
wrong during a reboot (disk order or name changes, network issues, ...).
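As a quick sanity check, verify beforehand that the management interface answers (the
``-mgmt`` hostname below is a placeholder; adapt it to the actual out-of-band naming
convention):

.. code::

   admin@laptop:~$ ping -c 3 esnode1-mgmt.internal.softwareheritage.org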
Step 1: Migrate to the next Debian suite
----------------------------------------
Update the Debian version of the node (e.g. bullseye to bookworm) using the following
command:

.. code::

   root@node:~# /usr/local/bin/migrate-to-${NEXT_CODENAME}.sh

Note: The script should already be present on the machine (installed through Puppet).
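Once the script has finished, a quick way to confirm the node is on the expected suite
is to check ``/etc/os-release``, for example:

.. code::

   root@node:~# grep VERSION_CODENAME /etc/os-release
   VERSION_CODENAME=bookworm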
Step 2: Run Puppet Agent
-------------------------
Once the upgrade has completed, run the Puppet agent to apply any necessary
configuration changes (e.g. the /etc/apt/sources.list update, etc.):

.. code::

   root@node:~# puppet agent -t
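To double-check that the apt configuration now points at the new suite, you can grep
for the new codename (a minimal sketch, reusing the bullseye-to-bookworm example):

.. code::

   root@node:~# grep -r bookworm /etc/apt/sources.list /etc/apt/sources.list.d/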
Step 3: Stop Puppet Agent
-------------------------
As we are about to stop the elasticsearch service, we don't want the agent to start it
back up again:

.. code::

   root@node:~# puppet agent --disable "Ongoing debian upgrade"
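To confirm the agent is really disabled, Puppet stores the disable reason in a lock
file whose path can be queried (the output shown is illustrative):

.. code::

   root@node:~# cat "$(puppet config print agent_disabled_lockfile)"
   {"disabled_message":"Ongoing debian upgrade"}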
Step 4: Autoremove and Purge
-----------------------------
Run autoremove to remove the unnecessary packages left over from the migration:

.. code::

   root@node:~# apt autoremove
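If you prefer to review what would be removed first, apt can simulate the operation:

.. code::

   root@node:~# apt --dry-run autoremove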
Step 5: Stop the elasticsearch service
--------------------------------------
The cluster can tolerate one unresponsive node, so it is safe to stop the service.
The cluster's status should stay green after the elasticsearch service is stopped:

.. code::

   root@node:~# systemctl stop elasticsearch
   root@node:~# curl -s $server/_cluster/health | jq .status
   "green"

Note: ``$server`` is of the form ``hostname:9200``, with ``hostname`` a cluster node
other than the one currently being upgraded.
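For instance, ``$server`` could be set as follows (``esnode2`` is just an example of
another node in the cluster):

.. code::

   root@node:~# server=esnode2.internal.softwareheritage.org:9200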
Step 6: Reboot the node
-----------------------
We are now ready to reboot the node:

.. code::

   root@node:~# reboot

You can connect to the serial console of the machine to follow the reboot.
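Depending on the hardware, the serial console can be reached through the iDRAC/iLO web
interface or over IPMI serial-over-LAN; for example (hostname and user are
placeholders):

.. code::

   admin@laptop:~$ ipmitool -I lanplus -H esnode1-mgmt.internal.softwareheritage.org \
       -U root sol activate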
Step 7: Clean up some more
--------------------------
Once the machine is restarted, some cleanup might be necessary:

.. code::

   root@node:~# apt autopurge
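This is also a good moment to confirm the node came back up on the new suite's kernel
(the version shown is illustrative):

.. code::

   root@node:~# uname -r
   6.1.0-30-amd64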
Step 8: Re-enable the Puppet agent
----------------------------------
Re-enable the Puppet agent and trigger a run. This will start the elasticsearch service
again:

.. code::

   root@node:~# puppet agent --enable && puppet agent --test
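You can then verify that the service is indeed running again:

.. code::

   root@node:~# systemctl is-active elasticsearch
   active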
Step 9: Join back the cluster
-----------------------------
After the service has restarted, check that the node has rejoined the cluster:

.. code::

   root@node:~# curl -s $server/_cat/allocation?v\&s=node
   root@node:~# curl -s $server/_cluster/health | jq .number_of_nodes
For example:

.. code::

   root@esnode1:~# server=http://esnode1.internal.softwareheritage.org:9200; date; \
       curl -s $server/_cat/allocation?v\&s=node; echo; \
       curl -s $server/_cluster/health | jq
   Wed Jan 29 09:57:01 UTC 2025
   shards shards.undesired write_load.forecast disk.indices.forecast disk.indices disk.used disk.avail disk.total disk.percent host ip node node.role
   638 0 0.0 5.6tb 5.6tb 5.6tb 1.1tb 6.8tb 82 192.168.100.61 192.168.100.61 esnode1 cdfhilmrstw
   634 0 0.0 5.7tb 5.7tb 5.7tb 1tb 6.8tb 84 192.168.100.62 192.168.100.62 esnode2 cdfhilmrstw
   639 0 0.0 5.6tb 5.6tb 5.6tb 1.1tb 6.8tb 82 192.168.100.63 192.168.100.63 esnode3 cdfhilmrstw
   644 0 0.0 5.6tb 5.6tb 5.6tb 8.2tb 13.8tb 40 192.168.100.64 192.168.100.64 esnode7 cdfhilmrstw
   645 0 0.0 5.5tb 5.5tb 5.5tb 5.9tb 11.4tb 48 192.168.100.65 192.168.100.65 esnode8 cdfhilmrstw
   666 0 0.0 5.1tb 5.1tb 5.1tb 6.3tb 11.4tb 44 192.168.100.66 192.168.100.66 esnode9 cdfhilmrstw
   {
     "cluster_name": "swh-logging-prod",
     "status": "green",
     "timed_out": false,
     "number_of_nodes": 6,
     "number_of_data_nodes": 6,
     "active_primary_shards": 1933,
     "active_shards": 3866,
     "relocating_shards": 0,
     "initializing_shards": 0,
     "unassigned_shards": 0,
     "delayed_unassigned_shards": 0,
     "number_of_pending_tasks": 0,
     "number_of_in_flight_fetch": 0,
     "task_max_waiting_in_queue_millis": 0,
     "active_shards_percent_as_number": 100
   }
Post cluster migration
----------------------
As the cluster should have stayed green throughout the migration (we checked its status
after upgrading each node), there is nothing more to verify.