staging/scheduler: Activate alerts for stale lister tasks
- Feb 09, 2024
Antoine R. Dumont authored
Refs. swh/infra/sysadm-environment#5213
Maybe that will work.
[swh] Comparing changes between branches production and add-alert-on-stale-scheduler-tasks (per environment)...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment staging, namespace swh...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra-next-version...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment staging...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment staging...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment staging...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment production, namespace swh...
[swh] Generate config in production branch for environment production, namespace swh-cassandra...
[swh] Generate config in production branch for environment production, namespace swh-cassandra-next-version...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment production...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment production...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment production...
------------- diff for environment staging namespace swh -------------
--- /tmp/swh-chart.swh.Im6ero7A/staging-swh.before 2024-02-08 17:14:54.803529872 +0100
+++ /tmp/swh-chart.swh.Im6ero7A/staging-swh.after 2024-02-08 17:14:55.539529019 +0100
@@ -25527,20 +25527,53 @@
expr: |-
max_over_time(rabbitmq_queue_messages_ready{environment="staging", instance=~"scheduler0.*"}[5m]) > 4000
description: "staging: High number of messages in rabbitmq queue <{{ $ }}> (server: <{{ $labels.server }}>)"
summary: "A queue exceeds a given threshold in environment <staging>, rabbitmq instance <scheduler0>"
for: 30m
severity: warning
namespace: cattle-monitoring-system
+# Source: swh/templates/scheduler/alert-scheduler-lister-tasks.yaml
+kind: PrometheusRule
+ labels:
+ app: swh-alerts
+ name: scheduler-stale-recurring-tasks.rules
+ namespace: swh
+ groups:
+ - name: scheduler-stale-recurring-tasks.rules
+ rules:
+ - alert: SchedulerStaleRecurringTask-full
+ expr: |-
+ histogram_quantile(0.1, sum(sql_swh_scheduler_delay{environment="staging", policy="recurring",current_interval="90 days",status="next_run_scheduled"}) by (le)) > 7.776e+06
+ annotations:
+ description: "staging: Stale scheduler full lister tasks in scheduler <{{ $ }}> (server: <{{ $labels.server }}>)"
+ summary: "Existing lister tasks in stale state in environment <staging>"
+ for: 30m
+ labels:
+ severity: critical
+ namespace: cattle-monitoring-system
+ - alert: SchedulerStaleRecurringTask-incremental
+ expr: |-
+ histogram_quantile(0.1, sum(sql_swh_scheduler_delay{environment="staging", policy="recurring",current_interval="1 day",status="next_run_scheduled"}) by (le)) > 172800
+ annotations:
+ description: "staging: Stale scheduler incremental lister tasks in scheduler <{{ $ }}> (server: <{{ $labels.server }}>)"
+ summary: "Existing lister tasks in stale state in environment <staging>"
+ for: 30m
+ labels:
+ severity: critical
+ namespace: cattle-monitoring-system
# Source: swh/templates/checker-deposit/keda-autoscaling.yaml
kind: ScaledObject
name: checker-deposit-operators
namespace: swh
apiVersion: apps/v1 # Optional. Default: apps/v1
kind: Deployment # Optional. Default: Deployment
------------- diff for environment staging namespace swh-cassandra -------------
No differences
------------- diff for environment staging namespace swh-cassandra-next-version -------------
No differences
------------- diff for environment production namespace swh -------------
No differences
------------- diff for environment production namespace swh-cassandra -------------
No differences
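A note on the two thresholds in the staging diff above: assuming `sql_swh_scheduler_delay` is measured in seconds (the diff does not state the unit), the small sketch below converts them back to durations. Under that assumption the full-lister threshold equals the 90-day interval, and the incremental-lister threshold is twice the 1-day interval. The constant and function names are illustrative and not part of the chart.

```python
from datetime import timedelta

# Thresholds copied from the two alert expressions in the staging diff.
# Assumption: the metric is expressed in seconds.
FULL_THRESHOLD_S = 7.776e+06      # current_interval="90 days"
INCREMENTAL_THRESHOLD_S = 172800  # current_interval="1 day"


def as_interval(seconds: float) -> timedelta:
    """Convert a threshold in seconds to a timedelta for readability."""
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    print(as_interval(FULL_THRESHOLD_S))         # 90 days, 0:00:00
    print(as_interval(INCREMENTAL_THRESHOLD_S))  # 2 days, 0:00:00
```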