Recover from failed postgresql replica @ CEA
gloin001's postgresql partition ran out of disk space on 2024-04-26 at 06:53 UTC.
- automatic postgresql failover didn't work:
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: Result of monitor operation for r_postgresql on gloin001: error (Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1))
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: r_postgresql_monitor_15000@gloin001 output [ ocf-exit-reason:Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)\nRedundant argument in sprintf at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functi...
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: Transition 5443 action 12 (r_postgresql_monitor_15000 on gloin001): expected 'promoted' but got 'error'
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 26 08:53:11 gloin001 pacemaker-attrd[5422]: notice: Setting last-failure-r_postgresql#monitor_15000[gloin001]: (unset) -> 1714114391
Apr 26 08:53:11 gloin001 pacemaker-attrd[5422]: notice: Setting fail-count-r_postgresql#monitor_15000[gloin001]: (unset) -> 1
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for monitor of r_postgresql:1 on gloin001 at Apr 26 08:53:11 2024
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Recover r_postgresql:1 ( Promoted gloin001 )
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Calculated transition 5725, saving inputs in /var/lib/pacemaker/pengine/pe-input-5.bz2
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for monitor of r_postgresql:1 on gloin001 at Apr 26 08:53:11 2024
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Recover r_postgresql:1 ( Promoted gloin001 )
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Calculated transition 5726, saving inputs in /var/lib/pacemaker/pengine/pe-input-6.bz2
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_demote_0 on gloin002
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_demote_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of notify operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pgsqlms(r_postgresql)[1920158]: INFO: If this instance is in warm standby, this resource agent only supports hot standby
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of notify operation for r_postgresql on gloin001: ok
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating demote operation r_postgresql_demote_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of demote operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of demote operation for r_postgresql on gloin001: error (Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1))
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: r_postgresql_demote_0@gloin001 output [ ocf-exit-reason:Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)\nRedundant argument in sprintf at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.p...
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 aborted by operation r_postgresql_demote_0 'modify' on gloin001: Event failed
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 action 10 (r_postgresql_demote_0 on gloin001): expected 'ok' but got 'error'
Apr 26 08:53:12 gloin001 pacemaker-attrd[5422]: notice: Setting last-failure-r_postgresql#demote_0[gloin001]: (unset) -> 1714114392
Apr 26 08:53:12 gloin001 pacemaker-attrd[5422]: notice: Setting fail-count-r_postgresql#demote_0[gloin001]: (unset) -> 1
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_post_notify_demote_0 on gloin002
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_post_notify_demote_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of notify operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 aborted by status-1-last-failure-r_postgresql.demote_0 doing create last-failure-r_postgresql#demote_0=1714114392: Transient attribute change
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of notify operation for r_postgresql on gloin001: ok
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 (Complete=12, Pending=0, Fired=0, Skipped=2, Incomplete=31, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Stopped
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for demote of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for demote of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Stop r_vip_ha ( gloin001 ) due to required ha_postgresql notified (blocked)
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Recover r_postgresql:1 ( Unpromoted -> Promoted gloin001 )
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: Calculated transition 5727, saving inputs in /var/lib/pacemaker/pengine/pe-input-7.bz2
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_stop_0 on gloin002
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_stop_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of notify operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of notify operation for r_postgresql on gloin001: ok
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating stop operation r_postgresql_stop_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of stop operation for r_postgresql on gloin001
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Result of stop operation for r_postgresql on gloin001: error (Unexpected state for instance "r_postgresql" (returned 1))
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: r_postgresql_stop_0@gloin001 output [ ocf-exit-reason:Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)\nRedundant argument in sprintf at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm ...
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 aborted by operation r_postgresql_stop_0 'modify' on gloin001: Event failed
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 action 2 (r_postgresql_stop_0 on gloin001): expected 'ok' but got 'error'
Apr 26 08:53:13 gloin001 pacemaker-attrd[5422]: notice: Setting last-failure-r_postgresql#stop_0[gloin001]: (unset) -> 1714114393
Apr 26 08:53:13 gloin001 pacemaker-attrd[5422]: notice: Setting fail-count-r_postgresql#stop_0[gloin001]: (unset) -> INFINITY
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_post_notify_stop_0 on gloin002
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 aborted by status-1-last-failure-r_postgresql.stop_0 doing create last-failure-r_postgresql#stop_0=1714114393: Transient attribute change
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=21, Source=/var/lib/pacemaker/pengine/pe-input-7.bz2): Stopped
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Unexpected state for instance "r_postgresql" (returned 1)) was recorded for stop of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Unexpected state for instance "r_postgresql" (returned 1)) was recorded for stop of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: ha_postgresql cannot run on gloin001 due to reaching migration threshold (clean up resource to allow again)
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: ha_postgresql cannot run on gloin001 due to reaching migration threshold (clean up resource to allow again)
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: notice: Unmanaged instance r_postgresql:1 will be left promoted on gloin001
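The "controldata is not consistent with pg_isready" error comes from the PAF resource agent (pgsqlms, visible in the log above) cross-checking pg_controldata's cluster state against a pg_isready probe: with the disk full, postgres stopped answering while controldata still claimed the instance was a running primary. The two checks can be reproduced by hand, roughly as follows (paths and port are assumptions for a Debian-style PostgreSQL 16 layout, not taken from the cluster config):

```shell
# What the pgsqlms monitor effectively compares (sketch, paths assumed):
sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main
# "Database cluster state: in production" means controldata believes it is a primary

pg_isready -h localhost -p 5432
# exit 1 (rejecting connections) or 2 (no response) while controldata still says
# "in production" is exactly the inconsistency the monitor reported
```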
- manual postgresql failover (banning postgresql from gloin001 with: crm_resource --ban --master --resource ha_postgresql) didn't work
- (forcefully) stopping and disabling corosync & pacemaker on gloin001 (to fence the node) made them crash on gloin002
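The ban likely had no effect because the failed stop had already pushed fail-count-r_postgresql#stop_0 to INFINITY (the scheduler's "clean up resource to allow again" hint above), leaving the instance unmanaged rather than demotable. Once gloin001 has disk space again, a possible cleanup sequence is (to be verified against the installed pacemaker version):

```shell
# Clear the recorded failures so pacemaker is allowed to manage the
# resource on gloin001 again:
crm_resource --cleanup --resource ha_postgresql --node gloin001

# Remove any leftover ban/move constraint placed earlier:
crm_resource --clear --resource ha_postgresql
```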
- restarting corosync & pacemaker on gloin002 made the resource recovery succeed
Current status:
- gloin001: disk full, corosync & pacemaker disabled
- gloin002: postgresql promoted to primary; corosync & pacemaker running, but with no cluster peer
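To get back to a two-node cluster, gloin001 will need to be rebuilt as a standby of gloin002 after its disk is freed up. A minimal sketch (hostnames and paths from this incident; the replication user and service names are assumptions):

```shell
# On gloin001, once disk space is available again (sketch, verify before running):
systemctl stop postgresql@16-main            # ensure nothing holds the data directory
sudo -u postgres rm -rf /var/lib/postgresql/16/main

# Re-seed from the current primary; -R writes standby.signal + primary_conninfo
# so the instance comes back up as a streaming replica:
sudo -u postgres pg_basebackup -h gloin002 -U replication \
    -D /var/lib/postgresql/16/main -X stream -R -P

# Then rejoin the cluster and let pacemaker start postgresql:
systemctl start corosync pacemaker
```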
Issues to solve:
- deficient postgresql logging (no logs to be found, need to update postgresql.conf)
- WALs were accumulating on gloin001, why? (postgresql reports no WALs ready / no archival failures, but /var/lib/postgresql/16/main/pg_wal is full; at least WALs aren't accumulating on gloin002)
- postgresql was still running on gloin002 and could have been used for reads, even if writes were failing (move winery reader to the local postgresql instance)
- maybe we should add a third node to corosync?
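On the "WALs accumulating on gloin001" question: when archiving reports no failures but pg_wal still fills up, a common cause is a replication slot whose consumer stopped, since a primary retains every WAL segment some slot still needs. Worth checking on both nodes (plain catalog queries, nothing installation-specific assumed; pg_current_wal_lsn() only works on a primary):

```shell
sudo -u postgres psql -x -c "
  SELECT slot_name, slot_type, active, restart_lsn,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
           AS retained_wal
  FROM pg_replication_slots;"
# an inactive slot with a large retained_wal is pinning segments in pg_wal;
# wal_keep_size and archiver lag are the other usual suspects
```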
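For the "deficient postgresql logging" item, a minimal postgresql.conf fragment that enables the stock logging collector (values below are suggestions, not what the cluster currently runs):

```shell
# postgresql.conf (suggested starting point, requires a restart for logging_collector)
logging_collector = on
log_directory = 'log'                 # relative to the data directory
log_filename = 'postgresql-%a.log'    # one file per weekday, rotated in place
log_rotation_age = 1d
log_truncate_on_rotation = on
log_min_messages = warning
log_checkpoints = on
log_autovacuum_min_duration = 0
```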