Merged Antoine R. Dumont requested to merge add-basic-alert-on-rabbitmq into production 1 year ago

What

This installs one alert per production-like cluster (staging, production). Each alert triggers a warning when a RabbitMQ queue stays above 100k ready messages for 30 minutes.

It's deactivated for now as we need to think about the alertmanager in the static infra. This alert cannot currently show up in the elastic infra alertmanager.
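
To make the rule's timing concrete, here is a minimal Python sketch of how the expression and the `for: 30m` clause interact. It assumes one scrape and one rule evaluation per minute, which is a simplification and not the real Prometheus configuration. The function name and constants are illustrative; only the threshold, the 5m window and the 30m duration come from the rule in this merge request.

```python
from typing import List, Optional

THRESHOLD = 100_000  # from the rule: > 100000
WINDOW_MIN = 5       # from the rule: [5m]
FOR_MIN = 30         # from the rule: for: 30m


def first_firing_minute(samples: List[int]) -> Optional[int]:
    """Return the minute at which the alert would start firing, or None.

    samples[i] is the queue's ready-message count at minute i
    (assumption: one sample and one rule evaluation per minute).
    """
    pending_since: Optional[int] = None
    for minute in range(len(samples)):
        # max_over_time(...[5m]): highest sample in the last 5 minutes.
        window = samples[max(0, minute - WINDOW_MIN + 1): minute + 1]
        if max(window) > THRESHOLD:
            if pending_since is None:
                pending_since = minute  # alert becomes pending
            if minute - pending_since >= FOR_MIN:
                return minute  # condition held for 30 minutes: firing
        else:
            pending_since = None  # condition broke: pending state resets
    return None


if __name__ == "__main__":
    print(first_firing_minute([150_000] * 40))
    print(first_firing_minute([150_000] * 10 + [0] * 40))
```

Because of `max_over_time`, a dip below the threshold that lasts less than the 5-minute window does not reset the pending state in this model, while a short spike never reaches the 30-minute mark.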

Tests

make swh-helm-diff happy [1]
minikube installs the 1 rule without issues (make swh-minikube)
query in grafana used within the template actually returns data
The template looks sensible and matches what we have in the k8s-cluster-config repository (where the alerts are statically installed for now).
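
The "query returns data" check above was done in Grafana. As a hedged sketch, the same check could be scripted against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL is a placeholder and the helper names are hypothetical; only the metric name, the `environment` label and the 5m window come from the rule.

```python
import json
import urllib.parse
import urllib.request

# Metric name taken from the rule in this merge request.
METRIC = "rabbitmq_queue_messages_ready"


def build_query_url(base_url: str, environment: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    promql = f'max_over_time({METRIC}{{environment="{environment}"}}[5m])'
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def query_returns_data(base_url: str, environment: str) -> bool:
    """Return True if the query yields at least one series."""
    url = build_query_url(base_url, environment)
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("status") == "success" and bool(payload["data"]["result"])


if __name__ == "__main__":
    # Placeholder host: replace with a reachable Prometheus instance.
    print(build_query_url("http://prometheus.example:9090", "staging"))
```

An empty result would suggest that the metric or the `environment` label is not present on the queried Prometheus instance, in which case the alert could never fire.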

[1]

$ make swh-helm-diff
./helm-diff.sh swh
[swh] Comparing changes between branches production and add-basic-alert-on-rabbitmq...
Switched to branch 'production'
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for swh/values/default.yaml...
[swh] Generate config in production branch for swh/values/minikube.yaml...
[swh] Generate config in production branch for swh/values/production-cassandra.yaml...
[swh] Generate config in production branch for swh/values/production.yaml...
[swh] Generate config in production branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in production branch for swh/values/staging.yaml...
Switched to branch 'add-basic-alert-on-rabbitmq'
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/default.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/minikube.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging.yaml...


------------- diff for swh/values/default.yaml -------------

No differences


------------- diff for swh/values/minikube.yaml -------------

No differences


------------- diff for swh/values/production-cassandra.yaml -------------

No differences


------------- diff for swh/values/production.yaml -------------

--- /tmp/swh-chart.swh.pJtDF6xj/production.yaml.before  2023-10-03 10:18:27.259407817 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/production.yaml.after   2023-10-03 10:18:28.631408068 +0200
@@ -24520,20 +24520,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="production"}[5m]) > 100000
+      annotations:
+        description: "production: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment production"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion:    apps/v1     # Optional. Default: apps/v1
     kind:          Deployment  # Optional. Default: Deployment


------------- diff for swh/values/staging-cassandra.yaml -------------

No differences


------------- diff for swh/values/staging.yaml -------------

--- /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.before     2023-10-03 10:18:27.791407914 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.after      2023-10-03 10:18:29.163408165 +0200
@@ -21619,20 +21619,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="staging"}[5m]) > 100000
+      annotations:
+        description: "staging: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment staging"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion:    apps/v1     # Optional. Default: apps/v1
     kind:          Deployment  # Optional. Default: Deployment

Refs. swh/infra/sysadm-environment#5048 (closed)

Edited 1 year ago by Antoine R. Dumont

Activity

Antoine R. Dumont added 1 commit 1 year ago

9640ac44 - cluster-components: Bootstrap alerts for rabbitmq

Antoine R. Dumont changed the description 1 year ago

Antoine R. Dumont marked this merge request as ready 1 year ago

Vincent Sellier @vsellier · 1 year ago

Resolved 1 year ago by Antoine R. Dumont

I'm wondering whether it should instead be declared in the swh chart, with the scheduler runner configuration or something like that.

That way we won't have to bother with the different environments, and it will be easily available to everybody who uses the chart.

Last reply by Antoine R. Dumont 1 year ago
Antoine R. Dumont added 4 commits 1 year ago

9640ac44...afe249da - 3 commits from branch production

c86952ae - cluster-components: Bootstrap alerts for rabbitmq

Antoine R. Dumont added 1 commit 1 year ago

eed2e2be - swh: Bootstrap alerts for rabbitmq

Antoine R. Dumont changed title from cluster-components: Bootstrap alerts for rabbitmq to swh: Bootstrap alerts for rabbitmq 1 year ago

Antoine R. Dumont changed the description 1 year ago

Vincent Sellier @vsellier started a thread on an old version of the diff 1 year ago

Resolved 1 year ago by Antoine R. Dumont
Last reply by Antoine R. Dumont 1 year ago

Vincent Sellier @vsellier started a thread on an old version of the diff 1 year ago

Resolved 1 year ago by Antoine R. Dumont

Vincent Sellier @vsellier started a thread on an old version of the diff 1 year ago

Resolved 1 year ago by Antoine R. Dumont

Vincent Sellier @vsellier started a thread on an old version of the diff 1 year ago

Resolved 1 year ago by Antoine R. Dumont

Vincent Sellier @vsellier started a thread on an old version of the diff 1 year ago

Resolved 1 year ago by Antoine R. Dumont

Antoine R. Dumont added 1 commit 1 year ago

4e0f6775 - swh: Bootstrap alerts for rabbitmq

Antoine R. Dumont resolved all threads 1 year ago

Antoine R. Dumont changed the description 1 year ago

Antoine R. Dumont added 3 commits 1 year ago

b4133a8c - 1 commit from branch production
63025d49 - v116: Release swh.web v0.2.38
fd4fbc56 - swh: Bootstrap alerts for rabbitmq

Antoine R. Dumont mentioned in merge request swh/devel/swh-scheduler!354 (merged) 1 year ago

Antoine R. Dumont added 8 commits 1 year ago

fd4fbc56...3dc2ca45 - 7 commits from branch production
1328f59e - swh: Add alert on rabbitmq when queue exceeds a given nb of messages

Antoine R. Dumont changed the description 1 year ago

Antoine R. Dumont added 1 commit 1 year ago

22471e29 - swh: Add alert on rabbitmq when queue exceeds a given nb of messages

Antoine R. Dumont added 1 commit 1 year ago

186a4112 - swh: Add alert on rabbitmq when queue exceeds a given nb of messages

Antoine R. Dumont changed the description 1 year ago

Antoine R. Dumont merged 1 year ago

Antoine R. Dumont mentioned in merge request !207 (merged) 1 year ago
