components/alerting: Create an alert for unreachable cassandra nodes
1 unresolved thread
Related to swh/infra/sysadm-environment#5026 (closed)
This template deploys a PrometheusRule in staging and production that triggers an alert when a Cassandra cluster node has been unreachable for more than 5 minutes.
ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml cassandra-alerting . | grep -A 12 groups
groups:
- name: critical-cassandra-service.rules
  rules:
  - alert: Cassandra_Service_Degraded_In_Production
    annotations:
      description: "The {{ $labels.instance }} node is unreachable for more than 5 minutes. This node seems down."
      summary: "The {{ $labels.service }} is degraded. Please check the {{ $labels.instance }} status."
    expr: up{service="cassandra-servers-svc"} == 0
    for: 5m
    labels:
      severity: critical
      namespace: cattle-monitoring-system
ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml cassandra-alerting . | \
grep -A 12 groups | promtool check rules
Checking standard input
SUCCESS: 1 rules found
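`promtool check rules` only confirms that the rules parse. The firing behaviour could additionally be exercised with a promtool unit test. The following is a minimal sketch, not part of this merge request: it assumes the rendered groups were saved to a file named `cassandra-rules.yaml`, and the instance label `cassandra01:7070` is a made-up value for illustration.

```yaml
# cassandra-rules-test.yaml (hypothetical file name)
rule_files:
  - cassandra-rules.yaml   # assumed: the rendered "groups:" output saved to disk

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: the node is up at 0m and 1m, then down from 2m on.
      # The instance value is a made-up example.
      - series: 'up{service="cassandra-servers-svc", instance="cassandra01:7070"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # At 4m the node has been down for only 2 minutes: the alert is still
      # pending, so no firing alert is expected.
      - eval_time: 4m
        alertname: Cassandra_Service_Degraded_In_Production
      # At 8m the "for: 5m" clause is satisfied and the alert should fire.
      - eval_time: 8m
        alertname: Cassandra_Service_Degraded_In_Production
        exp_alerts:
          - exp_labels:
              severity: critical
              namespace: cattle-monitoring-system
              service: cassandra-servers-svc
              instance: cassandra01:7070
            exp_annotations:
              description: "The cassandra01:7070 node is unreachable for more than 5 minutes. This node seems down."
              summary: "The cassandra-servers-svc is degraded. Please check the cassandra01:7070 status."
```

Under those assumptions, the test would be run with `promtool test rules cassandra-rules-test.yaml`.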
Activity
assigned to @guillaume
- Resolved by Antoine R. Dumont
Nice.
I'm wondering whether this should be attached to the cassandra templates (in the swh chart) instead?
Like we did for the scheduler [1]
@vsellier what do you think (as you already mentioned in [2])?
[1] !162 (merged)
- Resolved by Guillaume Samson
- Resolved by Antoine R. Dumont
- Resolved by Guillaume Samson
- Resolved by Guillaume Samson
added 1 commit
- 1ab6c384 - components/alerting: Create an alert for unreachable cassandra nodes
added 1 commit
- a871c9d3 - components/alerting: Create an alert for unrepaired table size
added 4 commits
- a871c9d3...58cf5d91 - 2 commits from branch production
- dcd233f9 - components/alerting: Create an alert for unreachable cassandra nodes
- bb2cb443 - components/alerting: Create an alert for unrepaired table size