swh: Bootstrap alerts for rabbitmq
What
This installs one alert per production-like cluster (staging, production). It raises a warning when a queue holds more than 100k ready messages for 30 minutes.
It's deactivated for now, as we still need to think about the alertmanager in the static infra. This alert cannot currently show up in the elastic infra alertmanager.
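To make the firing condition concrete, here is a rough Python sketch of how the rule is expected to behave. It is not how Prometheus is implemented: the 1-minute evaluation step, the helper names (max_over_window, first_firing_time) and the sample data are assumptions for illustration, and the window boundaries are simplified. Only the threshold, the 5m range and the 30m hold come from the rule itself.

```python
from typing import Optional, Sequence, Tuple

THRESHOLD = 100_000   # from the rule expression: `> 100000`
WINDOW_S = 5 * 60     # from the rule expression: `[5m]`
FOR_S = 30 * 60       # from the rule: `for: 30m`
STEP_S = 60           # assumed evaluation interval (not stated in this MR)

Sample = Tuple[int, float]  # (timestamp in seconds, number of ready messages)


def max_over_window(samples: Sequence[Sample], now: int) -> Optional[float]:
    """Largest value seen in (now - WINDOW_S, now], or None without samples."""
    values = [v for t, v in samples if now - WINDOW_S < t <= now]
    return max(values) if values else None


def first_firing_time(samples: Sequence[Sample], start: int, end: int) -> Optional[int]:
    """Evaluate every STEP_S seconds and return when the alert would first fire."""
    pending_since: Optional[int] = None
    for now in range(start, end + 1, STEP_S):
        peak = max_over_window(samples, now)
        if peak is not None and peak > THRESHOLD:
            if pending_since is None:
                pending_since = now          # condition just became true
            if now - pending_since >= FOR_S:
                return now                   # true for the whole hold period
        else:
            pending_since = None             # condition broke, reset the hold
    return None


# Hypothetical data: a queue stuck at 150k messages for one hour.
stuck = [(t, 150_000.0) for t in range(0, 3601, 60)]
# Hypothetical data: a 20-minute burst that then drains.
burst = [(t, 150_000.0 if t <= 1200 else 10.0) for t in range(0, 3601, 60)]

print(first_firing_time(stuck, 0, 3600))  # 1800
print(first_firing_time(burst, 0, 3600))  # None
```

The point of the 30m hold is visible in the second case: a burst that drains within the half hour resets the pending state and never triggers the warning.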
Tests
- make swh-helm-diff is happy [1]
- minikube installs the rule without issues (make swh-minikube)
- the query used in the template actually returns data in grafana
- the template looks sensible and matches what we have in the k8s-cluster-config repository (where the alerts are statically installed for now)
[1]
$ make swh-helm-diff
./helm-diff.sh swh
[swh] Comparing changes between branches production and add-basic-alert-on-rabbitmq...
Switched to branch 'production'
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for swh/values/default.yaml...
[swh] Generate config in production branch for swh/values/minikube.yaml...
[swh] Generate config in production branch for swh/values/production-cassandra.yaml...
[swh] Generate config in production branch for swh/values/production.yaml...
[swh] Generate config in production branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in production branch for swh/values/staging.yaml...
Switched to branch 'add-basic-alert-on-rabbitmq'
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/default.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/minikube.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/production.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging-cassandra.yaml...
[swh] Generate config in add-basic-alert-on-rabbitmq branch for swh/values/staging.yaml...
------------- diff for swh/values/default.yaml -------------
No differences
------------- diff for swh/values/minikube.yaml -------------
No differences
------------- diff for swh/values/production-cassandra.yaml -------------
No differences
------------- diff for swh/values/production.yaml -------------
--- /tmp/swh-chart.swh.pJtDF6xj/production.yaml.before 2023-10-03 10:18:27.259407817 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/production.yaml.after 2023-10-03 10:18:28.631408068 +0200
@@ -24520,20 +24520,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="production"}[5m]) > 100000
+      annotations:
+        description: "production: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment production"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion: apps/v1 # Optional. Default: apps/v1
     kind: Deployment # Optional. Default: Deployment
------------- diff for swh/values/staging-cassandra.yaml -------------
No differences
------------- diff for swh/values/staging.yaml -------------
--- /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.before 2023-10-03 10:18:27.791407914 +0200
+++ /tmp/swh-chart.swh.pJtDF6xj/staging.yaml.after 2023-10-03 10:18:29.163408165 +0200
@@ -21619,20 +21619,43 @@
 # Source: swh/templates/pod-priority/priority.yaml
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
   name: swh-tools
   namespace: swh
 value: 50
 globalDefault: false
 description: Tooling helper (swh-toolbox)
 ---
+# Source: swh/templates/scheduler/alert-rabbitmq.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: rabbitmq-too-many-messages-in-queue-alertmanager.rules
+  namespace: swh
+spec:
+  groups:
+  - name: rabbitmq-too-many-messages-in-queue.rules
+    rules:
+    - alert: RabbitmqTooManyMessagesInQueue
+      expr: |-
+        max_over_time(rabbitmq_queue_messages_ready{environment="staging"}[5m]) > 100000
+      annotations:
+        description: "staging: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "A queue exceeds a given threshold in environment staging"
+      for: 30m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion: apps/v1 # Optional. Default: apps/v1
     kind: Deployment # Optional. Default: Deployment
Edited by Antoine R. Dumont
Activity
added 1 commit
- 9640ac44 - cluster-components: Bootstrap alerts for rabbitmq
- Resolved by Antoine R. Dumont
I'm wondering whether this should instead be declared in the swh chart, alongside the scheduler runner configuration or something like that.
That way we won't have to bother with the different environments, and it will be readily available to everybody who uses the chart.
added 4 commits
- 9640ac44...afe249da - 3 commits from branch production
- c86952ae - cluster-components: Bootstrap alerts for rabbitmq
- Resolved by Antoine R. Dumont
mentioned in merge request swh/devel/swh-scheduler!354 (merged)
added 8 commits
- fd4fbc56...3dc2ca45 - 7 commits from branch production
- 1328f59e - swh: Add alert on rabbitmq when queue exceeds a given nb of messages
added 1 commit
- 22471e29 - swh: Add alert on rabbitmq when queue exceeds a given nb of messages
added 1 commit
- 186a4112 - swh: Add alert on rabbitmq when queue exceeds a given nb of messages
mentioned in merge request !207 (merged)