swh: Bootstrap alerts for rabbitmq

Merged Antoine R. Dumont requested to merge add-basic-alert-on-rabbitmq into production

What

This installs one alert per production-like cluster (staging, production). Each alert raises a warning when a RabbitMQ queue holds more than 100k ready messages (5-minute maximum) for 30 minutes.

It is deactivated for now, as we still need to think about how the alertmanager works in the static infra. The alert cannot currently show up in the elastic infra alertmanager.

Tests

  • make swh-helm-diff runs cleanly [1]
  • minikube installs the single rule without issues (make swh-minikube)
  • The query used in the template returns data when run in Grafana
  • The template looks sensible and matches what we have in the k8s-cluster-config repository (where the alerts are statically installed for now).
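
As a possible complement to the manual checks above, the rule could also be exercised with a promtool unit test. The following is only a sketch under assumptions: the file name `rabbitmq-rules.yaml` is hypothetical and is assumed to contain the `groups:` list taken from the rendered staging PrometheusRule `spec` (promtool reads plain rule files, not the CRD wrapper), and the queue and server label values are made up for illustration.

```yaml
# Hypothetical promtool test file, e.g. rabbitmq-rules.test.yaml
# Run with: promtool test rules rabbitmq-rules.test.yaml
rule_files:
  # Assumed file holding the `groups:` list from the rendered rule's spec.
  - rabbitmq-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One queue stuck at 150k ready messages for 40 minutes.
      # The `name` and `server` values are illustrative only.
      - series: 'rabbitmq_queue_messages_ready{environment="staging", name="example_queue", server="example_server"}'
        values: '150000x40'
    alert_rule_test:
      # After 10 minutes the alert is still pending (for: 30m), so nothing fires.
      - eval_time: 10m
        alertname: RabbitmqTooManyMessagesInQueue
        exp_alerts: []
      # After 35 minutes the threshold has been exceeded for longer than 30m.
      - eval_time: 35m
        alertname: RabbitmqTooManyMessagesInQueue
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: cattle-monitoring-system
              environment: staging
              name: example_queue
              server: example_server
            exp_annotations:
              description: "staging: High number of messages in rabbitmq queue <example_queue> (server: <example_server>)"
              summary: "A queue exceeds a given threshold in environment staging"
```

A test like this would mainly guard the interaction between the 100000 threshold and the `for: 30m` delay; it does not replace checking the query against real data in Grafana.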

[1]

$ make swh-helm-diff
./helm-diff.sh swh
[swh] Comparing changes between branches production and add-basic-alert-on-rabbitmq...
Switched to branch 'production'
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for swh/values/default.yaml...
[swh] Generate config in production branch for swh/values/minikube.yaml...
[swh] Generate config in production branch for swh/values/production-cassandra.yaml...
[swh] Generate config in production branch for swh/values/production.yaml...
[swh] Generate config in production branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in production branch for swh/values/staging.yaml...
Switched to branch 'add-basic-alert-on-rabbitmq'
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/default.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/minikube.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging.yaml...


------------- diff for swh/values/default.yaml -------------

No differences


------------- diff for swh/values/minikube.yaml -------------

No differences


------------- diff for swh/values/production-cassandra.yaml -------------

No differences


------------- diff for swh/values/production.yaml -------------

--- /tmp/swh-chart.swh.pJtDF6xj/production.yaml.before  2023-10-03 10:18:27.259407817 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/production.yaml.after   2023-10-03 10:18:28.631408068 +0200
@@ -24520,20 +24520,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="production"}[5m]) > 100000
+      annotations:
+        description: "production: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment production"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion:    apps/v1     # Optional. Default: apps/v1
     kind:          Deployment  # Optional. Default: Deployment


------------- diff for swh/values/staging-cassandra.yaml -------------

No differences


------------- diff for swh/values/staging.yaml -------------

--- /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.before     2023-10-03 10:18:27.791407914 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.after      2023-10-03 10:18:29.163408165 +0200
@@ -21619,20 +21619,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="staging"}[5m]) > 100000
+      annotations:
+        description: "staging: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment staging"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion:    apps/v1     # Optional. Default: apps/v1
     kind:          Deployment  # Optional. Default: Deployment

Refs. swh/infra/sysadm-environment#5048 (closed)

Edited by Antoine R. Dumont

Activity

  • added 1 commit

    • 4e0f6775 - swh: Bootstrap alerts for rabbitmq

  • Antoine R. Dumont resolved all threads

  • Antoine R. Dumont changed the description

  • Antoine R. Dumont added 3 commits

    • b4133a8c - 1 commit from branch production
    • 63025d49 - v116: Release swh.web v0.2.38
    • fd4fbc56 - swh: Bootstrap alerts for rabbitmq

  • Antoine R. Dumont added 8 commits

    • fd4fbc56...3dc2ca45 - 7 commits from branch production
    • 1328f59e - swh: Add alert on rabbitmq when queue exceeds a given nb of messages

  • Antoine R. Dumont changed the description

  • added 1 commit

    • 22471e29 - swh: Add alert on rabbitmq when queue exceeds a given nb of messages

  • added 1 commit

    • 186a4112 - swh: Add alert on rabbitmq when queue exceeds a given nb of messages

  • Antoine R. Dumont changed the description

  • Antoine R. Dumont mentioned in merge request !207 (merged)
