Recover from failed postgresql replica @ CEA
gloin001's postgresql partition ran out of disk space on 2024-04-26 at 06:53 UTC.
- automatic postgresql failover didn't work:
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: Result of monitor operation for r_postgresql on gloin001: error (Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1))
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: r_postgresql_monitor_15000@gloin001 output [ ocf-exit-reason:Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)\nRedundant argument in sprintf at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functi...
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: Transition 5443 action 12 (r_postgresql_monitor_15000 on gloin001): expected 'promoted' but got 'error'
Apr 26 08:53:11 gloin001 pacemaker-controld[5424]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 26 08:53:11 gloin001 pacemaker-attrd[5422]: notice: Setting last-failure-r_postgresql#monitor_15000[gloin001]: (unset) -> 1714114391
Apr 26 08:53:11 gloin001 pacemaker-attrd[5422]: notice: Setting fail-count-r_postgresql#monitor_15000[gloin001]: (unset) -> 1
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for monitor of r_postgresql:1 on gloin001 at Apr 26 08:53:11 2024
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Recover r_postgresql:1 ( Promoted gloin001 )
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Calculated transition 5725, saving inputs in /var/lib/pacemaker/pengine/pe-input-5.bz2
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for monitor of r_postgresql:1 on gloin001 at Apr 26 08:53:11 2024
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Recover r_postgresql:1 ( Promoted gloin001 )
Apr 26 08:53:11 gloin001 pacemaker-schedulerd[5423]: notice: Calculated transition 5726, saving inputs in /var/lib/pacemaker/pengine/pe-input-6.bz2
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_demote_0 on gloin002
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_demote_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of notify operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pgsqlms(r_postgresql)[1920158]: INFO: If this instance is in warm standby, this resource agent only supports hot standby
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of notify operation for r_postgresql on gloin001: ok
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating demote operation r_postgresql_demote_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of demote operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of demote operation for r_postgresql on gloin001: error (Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1))
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: r_postgresql_demote_0@gloin001 output [ ocf-exit-reason:Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)\nRedundant argument in sprintf at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.p...
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 aborted by operation r_postgresql_demote_0 'modify' on gloin001: Event failed
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 action 10 (r_postgresql_demote_0 on gloin001): expected 'ok' but got 'error'
Apr 26 08:53:12 gloin001 pacemaker-attrd[5422]: notice: Setting last-failure-r_postgresql#demote_0[gloin001]: (unset) -> 1714114392
Apr 26 08:53:12 gloin001 pacemaker-attrd[5422]: notice: Setting fail-count-r_postgresql#demote_0[gloin001]: (unset) -> 1
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_post_notify_demote_0 on gloin002
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_post_notify_demote_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of notify operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 aborted by status-1-last-failure-r_postgresql.demote_0 doing create last-failure-r_postgresql#demote_0=1714114392: Transient attribute change
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of notify operation for r_postgresql on gloin001: ok
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Transition 5726 (Complete=12, Pending=0, Fired=0, Skipped=2, Incomplete=31, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Stopped
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for demote of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)) was recorded for demote of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Stop r_vip_ha ( gloin001 ) due to required ha_postgresql notified (blocked)
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: Actions: Recover r_postgresql:1 ( Unpromoted -> Promoted gloin001 )
Apr 26 08:53:12 gloin001 pacemaker-schedulerd[5423]: notice: Calculated transition 5727, saving inputs in /var/lib/pacemaker/pengine/pe-input-7.bz2
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_stop_0 on gloin002
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_pre_notify_stop_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of notify operation for r_postgresql on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Result of notify operation for r_postgresql on gloin001: ok
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Initiating stop operation r_postgresql_stop_0 locally on gloin001
Apr 26 08:53:12 gloin001 pacemaker-controld[5424]: notice: Requesting local execution of stop operation for r_postgresql on gloin001
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Result of stop operation for r_postgresql on gloin001: error (Unexpected state for instance "r_postgresql" (returned 1))
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: r_postgresql_stop_0@gloin001 output [ ocf-exit-reason:Instance "r_postgresql" controldata is not consistent with pg_isready (returned: 1)\nRedundant argument in sprintf at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm ...
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 aborted by operation r_postgresql_stop_0 'modify' on gloin001: Event failed
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 action 2 (r_postgresql_stop_0 on gloin001): expected 'ok' but got 'error'
Apr 26 08:53:13 gloin001 pacemaker-attrd[5422]: notice: Setting last-failure-r_postgresql#stop_0[gloin001]: (unset) -> 1714114393
Apr 26 08:53:13 gloin001 pacemaker-attrd[5422]: notice: Setting fail-count-r_postgresql#stop_0[gloin001]: (unset) -> INFINITY
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Initiating notify operation r_postgresql_post_notify_stop_0 on gloin002
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 aborted by status-1-last-failure-r_postgresql.stop_0 doing create last-failure-r_postgresql#stop_0=1714114393: Transient attribute change
Apr 26 08:53:13 gloin001 pacemaker-controld[5424]: notice: Transition 5727 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=21, Source=/var/lib/pacemaker/pengine/pe-input-7.bz2): Stopped
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: notice: On loss of quorum: Ignore
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Unexpected state for instance "r_postgresql" (returned 1)) was recorded for stop of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: Unexpected result (error: Unexpected state for instance "r_postgresql" (returned 1)) was recorded for stop of r_postgresql:1 on gloin001 at Apr 26 08:53:12 2024
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: ha_postgresql cannot run on gloin001 due to reaching migration threshold (clean up resource to allow again)
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: warning: ha_postgresql cannot run on gloin001 due to reaching migration threshold (clean up resource to allow again)
Apr 26 08:53:13 gloin001 pacemaker-schedulerd[5423]: notice: Unmanaged instance r_postgresql:1 will be left promoted on gloin001
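The "controldata is not consistent with pg_isready" error comes from the PAF resource agent (pgsqlms, visible in the log above) cross-checking pg_controldata's cluster state against a pg_isready probe: with the disk full, postgres stopped answering while controldata still claimed the instance was a running primary. The two checks can be reproduced by hand, roughly as follows (paths and port are assumptions for a Debian-style PostgreSQL 16 layout, not taken from the cluster config):

```shell
# What the pgsqlms monitor effectively compares (sketch, paths assumed):
sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main
# "Database cluster state: in production" means controldata believes it is a primary

pg_isready -h localhost -p 5432
# exit 1 (rejecting connections) or 2 (no response) while controldata still says
# "in production" is exactly the inconsistency the monitor reported
```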
- manual postgresql failover (banning postgresql from gloin001 with: crm_resource --ban --master --resource ha_postgresql) didn't work
- (forcefully) stopping and disabling corosync & pacemaker on gloin001 (to fence the node) made them crash on gloin002
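The ban likely had no effect because the failed stop had already pushed fail-count-r_postgresql#stop_0 to INFINITY (the scheduler's "clean up resource to allow again" hint above), leaving the instance unmanaged rather than demotable. Once gloin001 has disk space again, a possible cleanup sequence is (to be verified against the installed pacemaker version):

```shell
# Clear the recorded failures so pacemaker is allowed to manage the
# resource on gloin001 again:
crm_resource --cleanup --resource ha_postgresql --node gloin001

# Remove any leftover ban/move constraint placed earlier:
crm_resource --clear --resource ha_postgresql
```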
- restarting corosync & pacemaker on gloin002 made the resource recovery succeed
Current status:
- gloin001: disk full, corosync & pacemaker disabled
- gloin002: postgresql promoted to primary; corosync & pacemaker running, but with no cluster peer
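To get back to a two-node cluster, gloin001 will need to be rebuilt as a standby of gloin002 after its disk is freed up. A minimal sketch (hostnames and paths from this incident; the replication user and service names are assumptions):

```shell
# On gloin001, once disk space is available again (sketch, verify before running):
systemctl stop postgresql@16-main            # ensure nothing holds the data directory
sudo -u postgres rm -rf /var/lib/postgresql/16/main

# Re-seed from the current primary; -R writes standby.signal + primary_conninfo
# so the instance comes back up as a streaming replica:
sudo -u postgres pg_basebackup -h gloin002 -U replication \
    -D /var/lib/postgresql/16/main -X stream -R -P

# Then rejoin the cluster and let pacemaker start postgresql:
systemctl start corosync pacemaker
```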
Issues to solve:
- deficient postgresql logging (no logs to be found, need to update postgresql.conf)
- WALs were accumulating on gloin001, why? (postgresql reports no WALs ready / no archival failures, but /var/lib/postgresql/16/main/pg_wal is full; at least WALs aren't accumulating on gloin002)
- postgresql was still running on gloin002 and could have been used for reads, even if writes were failing (move winery reader to the local postgresql instance)
- maybe we should add a third node to corosync?
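On the "WALs accumulating on gloin001" question: when archiving reports no failures but pg_wal still fills up, a common cause is a replication slot whose consumer stopped, since a primary retains every WAL segment some slot still needs. Worth checking on both nodes (plain catalog queries, nothing installation-specific assumed; pg_current_wal_lsn() only works on a primary):

```shell
sudo -u postgres psql -x -c "
  SELECT slot_name, slot_type, active, restart_lsn,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
           AS retained_wal
  FROM pg_replication_slots;"
# an inactive slot with a large retained_wal is pinning segments in pg_wal;
# wal_keep_size and archiver lag are the other usual suspects
```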
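For the "deficient postgresql logging" item, a minimal postgresql.conf fragment that enables the stock logging collector (values below are suggestions, not what the cluster currently runs):

```shell
# postgresql.conf (suggested starting point, requires a restart for logging_collector)
logging_collector = on
log_directory = 'log'                 # relative to the data directory
log_filename = 'postgresql-%a.log'    # one file per weekday, rotated in place
log_rotation_age = 1d
log_truncate_on_rotation = on
log_min_messages = warning
log_checkpoints = on
log_autovacuum_min_duration = 0
```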