
swh: Bootstrap alerts for rabbitmq

Antoine R. Dumont requested to merge add-basic-alert-on-rabbitmq into production

What

This installs one alert per production-like cluster (staging, production). The alert raises a warning when a queue holds more than 100k ready messages for 30 minutes.

The alert is deactivated for now: we still need to think about the alertmanager in the static infra, and this alert cannot currently show up in the elastic infra alertmanager.
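
To make the trigger condition concrete, here is a minimal Python sketch of how the rendered rule in [1] is expected to behave: the expression takes the peak of rabbitmq_queue_messages_ready over a 5 minute window, and the alert only fires once that peak has stayed above the threshold for 30 minutes. This is an illustration only, not Prometheus code. The function names, the sample layout and the 60 second evaluation interval used in the usage note are assumptions, and the window boundary handling is simplified.

```python
from typing import List, Optional, Tuple

# (timestamp in seconds, number of ready messages); hypothetical layout
Sample = Tuple[float, float]

WINDOW_S = 5 * 60     # the [5m] range in max_over_time(...)
HOLD_S = 30 * 60      # the rule's "for: 30m"
THRESHOLD = 100_000   # the "> 100000" comparison


def max_over_time(samples: List[Sample], now: float,
                  window: float = WINDOW_S) -> Optional[float]:
    """Peak value among samples in (now - window, now], or None if empty."""
    in_window = [value for ts, value in samples if now - window < ts <= now]
    return max(in_window) if in_window else None


def first_firing_time(samples: List[Sample], eval_times: List[float],
                      threshold: float = THRESHOLD,
                      hold: float = HOLD_S) -> Optional[float]:
    """Return the first evaluation time at which the alert would fire.

    The alert is pending while the windowed peak exceeds the threshold,
    and fires once it has been pending for at least `hold` seconds.
    Any evaluation at or below the threshold resets the pending state.
    """
    pending_since: Optional[float] = None
    for now in eval_times:
        peak = max_over_time(samples, now)
        if peak is not None and peak > threshold:
            if pending_since is None:
                pending_since = now
            if now - pending_since >= hold:
                return now
        else:
            pending_since = None
    return None
```

With one sample and one evaluation every 60 seconds, a queue that sits at 150k messages fires 30 minutes after the first evaluation. A dip below the threshold that lasts for a single sample does not reset the pending state, because the 5 minute window still contains higher samples. That smoothing is presumably why the expression uses max_over_time rather than the raw gauge.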

Tests

  • make swh-helm-diff is happy [1]
  • minikube installs the rule without issues (make swh-minikube)
  • The query used within the template actually returns data in grafana
  • The template looks sensible and matches what we have in the k8s-cluster-config repository (where the alerts are statically installed for now).
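
As the diff in [1] shows, the rule rendered for staging and production differs only in the environment name. The Python sketch below mirrors that parametrization for illustration. It is not the chart's Helm template (which is not shown in this description); render_rabbitmq_alert and its parameters are hypothetical names, and the field values are copied from the rendered output.

```python
def render_rabbitmq_alert(environment: str, threshold: int = 100000) -> dict:
    """Build the alert rule entry for one environment (illustrative only)."""
    expr = (
        'max_over_time(rabbitmq_queue_messages_ready{environment="%s"}[5m]) > %d'
        % (environment, threshold)
    )
    # The {{ $labels.* }} placeholders are left as-is: alerting expands them
    # at notification time, not when the manifest is rendered.
    description = (
        environment
        + ": High number of messages in rabbitmq queue "
        + "<{{ $labels.name }}> (server: <{{ $labels.server }}>)"
    )
    return {
        "alert": "RabbitmqTooManyMessagesInQueue",
        "expr": expr,
        "annotations": {
            "description": description,
            "summary": "A queue exceeds a given threshold in environment "
            + environment,
        },
        "for": "30m",
        "labels": {
            "severity": "warning",
            "namespace": "cattle-monitoring-system",
        },
    }


# One rule per production-like cluster, as in the diff.
rules = {env: render_rabbitmq_alert(env) for env in ("staging", "production")}
```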

[1]

$ make swh-helm-diff
./helm-diff.sh swh
[swh] Comparing changes between branches production and add-basic-alert-on-rabbitmq...
Switched to branch 'production'
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for swh/values/default.yaml...
[swh] Generate config in production branch for swh/values/minikube.yaml...
[swh] Generate config in production branch for swh/values/production-cassandra.yaml...
[swh] Generate config in production branch for swh/values/production.yaml...
[swh] Generate config in production branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in production branch for swh/values/staging.yaml...
Switched to branch 'add-basic-alert-on-rabbitmq'
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/default.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/minikube.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging.yaml...


------------- diff for swh/values/default.yaml -------------

No differences


------------- diff for swh/values/minikube.yaml -------------

No differences


------------- diff for swh/values/production-cassandra.yaml -------------

No differences


------------- diff for swh/values/production.yaml -------------

--- /tmp/swh-chart.swh.pJtDF6xj/production.yaml.before  2023-10-03 10:18:27.259407817 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/production.yaml.after   2023-10-03 10:18:28.631408068 +0200
@@ -24520,20 +24520,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="production"}[5m]) > 100000
+      annotations:
+        description: "production: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment production"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion:    apps/v1     # Optional. Default: apps/v1
     kind:          Deployment  # Optional. Default: Deployment


------------- diff for swh/values/staging-cassandra.yaml -------------

No differences


------------- diff for swh/values/staging.yaml -------------

--- /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.before     2023-10-03 10:18:27.791407914 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.after      2023-10-03 10:18:29.163408165 +0200
@@ -21619,20 +21619,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="staging"}[5m]) > 100000
+      annotations:
+        description: "staging: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment staging"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion:    apps/v1     # Optional. Default: apps/v1
     kind:          Deployment  # Optional. Default: Deployment

Refs. swh/infra/sysadm-environment#5048 (closed)

Edited by Antoine R. Dumont