staging/scheduler: Activate alerts for stale lister tasks

Antoine R. Dumont requested to merge add-alert-on-stale-scheduler-tasks into production

Maybe that will work.

helm diff
[swh] Comparing changes between branches production and add-alert-on-stale-scheduler-tasks (per environment)...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment staging, namespace swh...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra-next-version...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment staging...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment staging...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment staging...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment production, namespace swh...
[swh] Generate config in production branch for environment production, namespace swh-cassandra...
[swh] Generate config in production branch for environment production, namespace swh-cassandra-next-version...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment production...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment production...
[swh] Generate config in add-alert-on-stale-scheduler-tasks branch for environment production...


------------- diff for environment staging namespace swh -------------

--- /tmp/swh-chart.swh.Im6ero7A/staging-swh.before      2024-02-08 17:14:54.803529872 +0100
+++ /tmp/swh-chart.swh.Im6ero7A/staging-swh.after       2024-02-08 17:14:55.539529019 +0100
@@ -25527,20 +25527,53 @@
       expr: |-
         max_over_time(rabbitmq_queue_messages_ready{environment="staging", instance=~"scheduler0.*"}[5m]) > 4000
       annotations:
         description: "staging: High number of messages in rabbitmq queue <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
         summary: "A queue exceeds a given threshold in environment <staging>, rabbitmq instance <scheduler0>"
       for: 30m
       labels:
         severity: warning
         namespace: cattle-monitoring-system
 ---
+# Source: swh/templates/scheduler/alert-scheduler-lister-tasks.yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    app: swh-alerts
+  name: scheduler-stale-recurring-tasks.rules
+  namespace: swh
+spec:
+  groups:
+  - name: scheduler-stale-recurring-tasks.rules
+    rules:
+    - alert: SchedulerStaleRecurringTask-full
+      expr: |-
+        histogram_quantile(0.1, sum(sql_swh_scheduler_delay{environment="staging", policy="recurring",current_interval="90 days",status="next_run_scheduled"}) by (le)) > 7.776e+06
+      annotations:
+        description: "staging: Stale scheduler full lister tasks in scheduler <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "Existing lister tasks in stale state in environment <staging>"
+      for: 30m
+      labels:
+        severity: critical
+        namespace: cattle-monitoring-system
+    - alert: SchedulerStaleRecurringTask-incremental
+      expr: |-
+        histogram_quantile(0.1, sum(sql_swh_scheduler_delay{environment="staging", policy="recurring",current_interval="1 day",status="next_run_scheduled"}) by (le)) > 172800
+      annotations:
+        description: "staging: Stale scheduler incremental lister tasks in scheduler <{{ $labels.name }}> (server: <{{ $labels.server }}>)"
+        summary: "Existing lister tasks in stale state in environment <staging>"
+      for: 30m
+      labels:
+        severity: critical
+        namespace: cattle-monitoring-system
+---
 # Source: swh/templates/checker-deposit/keda-autoscaling.yaml
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: checker-deposit-operators
   namespace: swh
 spec:
   scaleTargetRef:
     apiVersion:    apps/v1     # Optional. Default: apps/v1
     kind:          Deployment  # Optional. Default: Deployment


------------- diff for environment staging namespace swh-cassandra -------------

No differences


------------- diff for environment staging namespace swh-cassandra-next-version -------------

No differences


------------- diff for environment production namespace swh -------------

No differences


------------- diff for environment production namespace swh-cassandra -------------

No differences
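
For reviewers checking the numbers in the two new rules: the thresholds are plain second counts. `7.776e+06` is 90 days (one full `current_interval`), and `172800` is 2 days (twice the `1 day` interval). The sketch below redoes that arithmetic and illustrates, in a simplified way, what a `histogram_quantile(0.1, ...)` comparison means: the rule matches once the 10th percentile of the delay exceeds the threshold, i.e. roughly 90% of the matching tasks are later than that. The bucket boundaries and counts are hypothetical (the real buckets of `sql_swh_scheduler_delay` are not shown in this diff), and the function only approximates the PromQL behaviour.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def days_to_seconds(days):
    """Convert a number of days to seconds."""
    return days * SECONDS_PER_DAY


# Thresholds as they appear in the two rules of the diff.
FULL_THRESHOLD = days_to_seconds(90)        # 7776000, rendered as 7.776e+06
INCREMENTAL_THRESHOLD = days_to_seconds(2)  # 172800


def histogram_quantile(q, buckets):
    """Simplified approximation of PromQL's histogram_quantile.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the (math.inf, total) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical incremental-lister histograms (bounds in seconds).
stale = [(86400, 0), (172800, 5), (604800, 50), (math.inf, 100)]
healthy = [(86400, 80), (172800, 95), (math.inf, 100)]

stale_p10 = histogram_quantile(0.1, stale)
healthy_p10 = histogram_quantile(0.1, healthy)

print(stale_p10 > INCREMENTAL_THRESHOLD)    # the rule expression would match
print(healthy_p10 > INCREMENTAL_THRESHOLD)  # the rule expression would not match
```

With the hypothetical `stale` histogram, the condition would additionally have to hold for the rule's `for: 30m` before the alert fires.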

Refs. swh/infra/sysadm-environment#5213 (closed)
