Verified Commit 408fb430 authored by Vincent Sellier

sysadm/winery: Document secondary restoration procedures

Related to swh/infra/sysadm-environment#5271
parent 7ef0bcda
Pipeline #7563 passed
......@@ -46,7 +46,7 @@ network.
The details of the installation are available in the
`internal inventory <https://inventory.internal.admin.swh.network/dcim/rack-elevations/?site_id=7>`_.
The network is composed several ip ranges:
The network is composed of several ip ranges:
============ ====================== ============================ ======== ==== ==== ======== =======
Range Description VLAN Frontend MONs OSDs Switches Bastion
......@@ -59,7 +59,7 @@ X.X.X.X/28 Uplink vlan Uplink - Management / ID CEA
10.25.3.0/24 Management addresses None X X X X
============ ====================== ============================ ======== ==== ==== ======== =======
Inside each range, the addresses are dispatched according these rules:
Inside each range, the addresses are dispatched according to these rules:
========= =========
Type Range
......
.. _winery-proc-frontends:
Procedures related to the winery frontends
==========================================
Frontends procedures
====================
.. admonition:: Intended audience
:class: important
sysadm staff members
.. todo::
This page is a work in progress.
Pacemaker maintenance mode
--------------------------
In maintenance mode, pacemaker does not attempt to manage the services or move the IP addresses
from one node to another.
.. _winery-pacemaker-maintenance:
- Force the maintenance mode
.. code-block:: shell
crm_attribute --name maintenance-mode --update true
- Go back to the nominal mode
.. code-block:: shell
crm_attribute --name maintenance-mode --delete
- Check the status
Nominal mode:
.. code-block:: shell
root@gloin001:~# crm status
Status of pacemakerd: 'Pacemaker is running' (last updated 2024-03-06 18:45:31 +01:00)
Cluster Summary:
* Stack: corosync
* Current DC: gloin001 (version 2.1.5-a3f44794f94) - MIXED-VERSION partition with quorum
* Last updated: Wed Mar 6 18:45:31 2024
* Last change: Wed Mar 6 18:45:27 2024 by root via crm_attribute on gloin001
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ gloin001 gloin002 ]
Full List of Resources:
* r_vip_pub (ocf:heartbeat:IPaddr2): Started gloin001
* r_vip_ha (ocf:heartbeat:IPaddr2): Started gloin001
* Clone Set: ha_postgresql [r_postgresql] (promotable):
* Promoted: [ gloin001 ]
* Unpromoted: [ gloin002 ]
..
In maintenance:
.. code-block:: shell
root@gloin001:~# crm status
Status of pacemakerd: 'Pacemaker is running' (last updated 2024-03-06 18:43:58 +01:00)
Cluster Summary:
* Stack: corosync
* Current DC: gloin001 (version 2.1.5-a3f44794f94) - MIXED-VERSION partition with quorum
* Last updated: Wed Mar 6 18:43:58 2024
* Last change: Wed Mar 6 18:41:47 2024 by root via crm_attribute on gloin001
* 2 nodes configured
* 4 resource instances configured
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Node List:
* Online: [ gloin001 gloin002 ]
Full List of Resources:
* r_vip_pub (ocf:heartbeat:IPaddr2): Started gloin001 (unmanaged)
* r_vip_ha (ocf:heartbeat:IPaddr2): Started gloin001 (unmanaged)
* Clone Set: ha_postgresql [r_postgresql] (promotable, unmanaged):
* r_postgresql (ocf:heartbeat:pgsqlms): Unpromoted gloin002 (unmanaged)
* r_postgresql (ocf:heartbeat:pgsqlms): Promoted gloin001 (unmanaged)
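The two outputs above differ mainly by the ``Resource management is DISABLED`` banner. As a minimal sketch, assuming that banner text is stable for the pacemaker version in use (it is copied from the sample output above and may be worded differently elsewhere), a small helper can report the current mode. ``cluster_mode`` is a hypothetical name, not an existing tool.

```shell
# Hypothetical helper (not part of the original procedure): reads the output
# of `crm status` on stdin and prints the detected cluster mode.
cluster_mode() {
    # The banner only appears when maintenance mode is active
    # (text taken from the sample output above).
    if grep -q 'Resource management is DISABLED'; then
        echo maintenance
    else
        echo nominal
    fi
}
```

It could be used as ``crm status | cluster_mode`` before and after changing the ``maintenance-mode`` attribute.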
Clear the pacemaker error status of a resource
----------------------------------------------
For example:
.. code-block:: shell
crm_resource -r r_postgresql -H gloin002 -C
Restore a postgresql secondary from the primary
-----------------------------------------------
- Activate the :ref:`pacemaker maintenance mode <winery-pacemaker-maintenance>`
- Stop postgresql via pacemaker (here the postgresql on gloin002)
.. code-block:: shell
crm --wait resource ban r_postgresql gloin002
Check the postgresql logs to confirm that the server has stopped.
If postgresql doesn't stop, it can be forced with:
.. code-block:: shell
export VERSION=<version>
sudo -u postgres /usr/lib/postgresql/$VERSION/bin/pg_ctl -D /var/lib/postgresql/$VERSION/main stop
- Delete or move the content of the postgresql data directory in ``/var/lib/postgresql/<version>/main``
- Launch the restoration from the primary
.. code-block:: shell
sudo -u postgres pg_basebackup -h 10.25.1.1 -D /var/lib/postgresql/16/main/ -P -U replicator --wal-method=fetch
- Restore the :ref:`nominal pacemaker mode <winery-pacemaker-maintenance>`
Postgresql should restart and recover its lag.
- Check the pacemaker status once the secondary is up to date
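The steps of this procedure can be collected into a dry-run plan for review before acting. The sketch below only prints the commands documented above and runs nothing. ``restore_plan`` is a hypothetical helper, and moving the old data directory aside with a ``.old`` suffix is an assumption (the procedure only says to delete or move its content).

```shell
# Sketch only: prints the documented commands for review, runs nothing.
# Usage: restore_plan <postgresql version> <secondary node> <primary address>
restore_plan() {
    version=$1
    node=$2
    primary=$3
    datadir=/var/lib/postgresql/$version/main

    # Enter pacemaker maintenance mode, then stop postgresql on the secondary
    echo "crm_attribute --name maintenance-mode --update true"
    echo "crm --wait resource ban r_postgresql $node"
    # Assumption: the old data directory is moved aside with a .old suffix
    echo "mv $datadir $datadir.old"
    # Rebuild the secondary from the primary
    echo "sudo -u postgres pg_basebackup -h $primary -D $datadir/ -P -U replicator --wal-method=fetch"
    # Go back to the nominal pacemaker mode
    echo "crm_attribute --name maintenance-mode --delete"
}

# Values taken from the examples above
restore_plan 16 gloin002 10.25.1.1
```

Printing the plan first makes it easier to double check the node name and the data directory before removing anything on the secondary.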
......@@ -16,4 +16,4 @@ This page documents the different procedures related to the winery production en
vpn
switches
frontends
cephs
ceph
pandoc -f markdown -t rst tables.md -o /tmp/tables.rst --columns=120
| Range | Description | VLAN | Frontend | MONs | OSDs | Switches | Bastion |
| ------------ | ---------------------- | ---------------------------- | :------: | :---: | :---: | :------: | :-----: |
| X.X.X.X/28 | Uplink vlan | Uplink - Frontend / ID CEA | X | | | | |
| X.X.X.X/28 | Uplink vlan | Uplink - Management / ID CEA | | | | | X |
| 10.25.6.0/24 | Default / installation | Default / 1 | X | X | X | | |
| 10.25.1.0/24 | VLAN for ceph access | Ceph clients / 2 | X | X | X | | |
| 10.25.2.0/24 | VLAN for ceph internal | Ceph cluster / 3 | | | X | | |
| 10.25.3.0/24 | Management addresses | None | X | X | X | X | |

| Type | Range |
| -------- | --------- |
| Frontend | .1-.10 |
| MONs | .11-.20 |
| OSDs | .21-.100 |
| switches | .240-.253 |
| GW | .254 |
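
As an illustration of the dispatch rules in the second table, the sketch below derives a host address from a range prefix, a node type and an index. `node_address` is a hypothetical helper, and it assumes that hosts are numbered sequentially from the start of each type's range, which the tables do not state.

```shell
# Hypothetical helper, not part of the original tooling.
# Usage: node_address <first three octets> <type> <index>
# Assumption: hosts are numbered sequentially from the start of each range.
node_address() {
    prefix=$1
    type=$2
    index=$3
    # Last-octet ranges copied from the "Type / Range" table
    case "$type" in
        frontend) first=1;   last=10  ;;
        mon)      first=11;  last=20  ;;
        osd)      first=21;  last=100 ;;
        switch)   first=240; last=253 ;;
        gw)       first=254; last=254 ;;
        *) echo "unknown type: $type" >&2; return 1 ;;
    esac
    host=$((first + index - 1))
    # Reject indexes that fall outside the range reserved for this type
    if [ "$index" -lt 1 ] || [ "$host" -gt "$last" ]; then
        echo "index $index is outside the $type range" >&2
        return 1
    fi
    echo "$prefix.$host"
}

node_address 10.25.1 frontend 1   # prints 10.25.1.1
node_address 10.25.3 osd 3        # prints 10.25.3.23
```

Under that assumption, the first frontend on the ceph clients VLAN is `10.25.1.1`.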