swh: Bootstrap alerts for rabbitmq
What
This installs one alert per production-like cluster (staging, production). It triggers a warning when the number of ready messages in a queue (maximum over 5 minutes) stays above 100k for 30 minutes.
It is deactivated for now, as we still need to think about the alertmanager in the static infra. The alert cannot currently show up in the elastic infra alertmanager.
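To make the timing of the rule easier to reason about, here is a minimal Python sketch of how the expression and the "for: 30m" clause interact. It is not how Prometheus is implemented. It assumes one sample and one rule evaluation per minute, and it approximates the [5m] range as the last five samples. The function name alert_states is made up for this illustration.

```python
from collections import deque

THRESHOLD = 100_000  # threshold used in the rule expression
WINDOW_MIN = 5       # [5m] range of max_over_time (approximated as 5 samples)
FOR_MIN = 30         # "for: 30m" clause of the rule


def alert_states(samples):
    """Return one state per minute: 'inactive', 'pending' or 'firing'.

    `samples` holds one queue-size value per minute (assumption: scrape
    and evaluation intervals are both one minute).
    """
    window = deque(maxlen=WINDOW_MIN)
    active_since = None
    states = []
    for minute, value in enumerate(samples):
        window.append(value)
        if max(window) > THRESHOLD:
            # The condition holds: remember when it started holding.
            if active_since is None:
                active_since = minute
            held = minute - active_since
            states.append("firing" if held >= FOR_MIN else "pending")
        else:
            # Any evaluation below the threshold resets the pending timer.
            active_since = None
            states.append("inactive")
    return states


states = alert_states([150_000] * 40)
print(states[29], states[30])  # pending firing
```

One consequence visible in the sketch: a single one-minute spike keeps the condition true for the whole 5-minute window, but never long enough to fire.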
Tests
- make swh-helm-diff is happy [1]
- minikube installs the rule without issues (make swh-minikube)
- The query used within the template actually returns data in grafana
- The template looks sensible and matches what we have in the k8s-cluster-config repository (where the alerts are statically installed for now).
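The query check above was done by hand in grafana. As a possible alternative, the same expression could be sent to the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only builds the request URL. The host prometheus.example:9090 and the helper name instant_query_url are assumptions for illustration, not something defined in this repository.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url, environment, threshold=100_000):
    """Build a Prometheus instant-query URL for the alert expression."""
    expr = (
        "max_over_time(rabbitmq_queue_messages_ready"
        f'{{environment="{environment}"}}[5m]) > {threshold}'
    )
    # urlencode takes care of escaping braces, quotes and spaces.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Hypothetical host: replace with the real Prometheus endpoint.
url = instant_query_url("http://prometheus.example:9090", "staging")
# Decode the query string again to show the expression that would be sent.
print(parse_qs(urlsplit(url).query)["query"][0])
```

An empty result vector for this query only means that no queue is currently above the threshold. Dropping the "> 100000" comparison is the way to confirm that the metric itself returns data.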
[1]
$ make swh-helm-diff
./helm-diff.sh swh
[swh] Comparing changes between branches production and add-basic-alert-on-rabbitmq...
Switched to branch 'production'
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for swh/values/default.yaml...
[swh] Generate config in production branch for swh/values/minikube.yaml...
[swh] Generate config in production branch for swh/values/production-cassandra.yaml...
[swh] Generate config in production branch for swh/values/production.yaml...
[swh] Generate config in production branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in production branch for swh/values/staging.yaml...
Switched to branch 'add-basic-alert-on-rabbitmq'
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/default.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/minikube.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging.yaml...
------------- diff for swh/values/default.yaml -------------
No differences
------------- diff for swh/values/minikube.yaml -------------
No differences
------------- diff for swh/values/production-cassandra.yaml -------------
No differences
------------- diff for swh/values/production.yaml -------------
--- /tmp/swh-chart.swh.pJtDF6xj/production.yaml.before 2023-10-03 10:18:27.259407817 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/production.yaml.after 2023-10-03 10:18:28.631408068 +0200
@@ -24520,20 +24520,43 @@
# Source: swh/templates/pod-priority/priority.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: swh-tools
namespace: swh
value: 50
globalDefault: false
description: Tooling helper (swh-toolbox)
---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ labels:
+ app: swh-alerts
+ name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+ namespace: swh
+spec:
+ groups:
+ - name: rabbitmq-too-many-messages-in-queue.rules
+ rules:
+ - alert: RabbitmqTooManyMessagesInQueue
+ expr: |-
+ max_over_time(rabbitmq_queue_messages_ready{environment="production"}[5m]) > 100000
+ annotations:
+ description: "production: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+ summary: "A queue exceeds a given threshold in environment production"
+ for: 30m
+ labels:
+ severity: warning
+ namespace: cattle-monitoring-system
+---
# Source: swh/templates/checker-deposit/keda-autoscaling.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: checker-deposit-operators
namespace: swh
spec:
scaleTargetRef:
apiVersion: apps/v1 # Optional. Default: apps/v1
kind: Deployment # Optional. Default: Deployment
------------- diff for swh/values/staging-cassandra.yaml -------------
No differences
------------- diff for swh/values/staging.yaml -------------
--- /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.before 2023-10-03 10:18:27.791407914 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.after 2023-10-03 10:18:29.163408165 +0200
@@ -21619,20 +21619,43 @@
# Source: swh/templates/pod-priority/priority.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: swh-tools
namespace: swh
value: 50
globalDefault: false
description: Tooling helper (swh-toolbox)
---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ labels:
+ app: swh-alerts
+ name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+ namespace: swh
+spec:
+ groups:
+ - name: rabbitmq-too-many-messages-in-queue.rules
+ rules:
+ - alert: RabbitmqTooManyMessagesInQueue
+ expr: |-
+ max_over_time(rabbitmq_queue_messages_ready{environment="staging"}[5m]) > 100000
+ annotations:
+ description: "staging: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+ summary: "A queue exceeds a given threshold in environment staging"
+ for: 30m
+ labels:
+ severity: warning
+ namespace: cattle-monitoring-system
+---
# Source: swh/templates/checker-deposit/keda-autoscaling.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: checker-deposit-operators
namespace: swh
spec:
scaleTargetRef:
apiVersion: apps/v1 # Optional. Default: apps/v1
kind: Deployment # Optional. Default: Deployment
Edited by Antoine R. Dumont