Kube cronjobs alerting

Guillaume Samson requested to merge kube_cronjobs_alerting into production

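These changes add two alerts (severity: warning) that fire when:

  • a cronjob has been suspended, and so is no longer executed, for more than 5 minutes;
  • a cronjob is allowed to run concurrent jobs, detected as a concurrency policy other than Forbid, for more than 15 minutes.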

All rules, including the existing Cassandra ones, are now defined in a single group named swh.<environment>.rules.
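
For reference, the two CronJob fields these alerts watch could be set as in the minimal sketch below. The cronjob name, schedule, image and command are hypothetical and do not come from this merge request. Only `concurrencyPolicy` and `suspend` matter here. Since Kubernetes defaults `concurrencyPolicy` to `Allow`, a cronjob that does not set it explicitly would be expected to trigger the concurrency alert.

```yaml
# Hypothetical CronJob that should trigger neither alert.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob          # hypothetical name
spec:
  schedule: "0 * * * *"          # hypothetical schedule (hourly)
  concurrencyPolicy: Forbid      # skip a run while the previous job is still active
  suspend: false                 # true stops scheduling and would raise the suspension alert
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: busybox:1.36            # hypothetical image
              command: ["sh", "-c", "date"]
```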

Diff production
ᐅ diff -u kube-production kube-cronjobs-production
--- kube-production     2023-11-21 14:58:25.604870534 +0100
+++ kube-cronjobs-production    2023-11-21 14:57:51.296344215 +0100
@@ -45,17 +45,15 @@
                 key: password
                 name: alertmanager-irc-relay-config
 ---
-# Source: cluster-config/templates/alerting/cassandra-alerting.yaml
+# Source: cluster-config/templates/alerting/swh-alerting.yaml
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  labels:
-    app: cassandra
-  name: cassandra-service.rules
+  name: swh.production.rules
   namespace: cattle-monitoring-system
 spec:
   groups:
-  - name: cassandra-service.rules
+  - name: swh.production.rules
     rules:
     - alert: Cassandra_Degraded_Service_In_Production
       annotations:
@@ -75,3 +73,21 @@
       labels:
         severity: critical
         namespace: cattle-monitoring-system
+    - alert: Concurrent_Cronjob_Is_Allowed_In_Production
+      annotations:
+        description: "The concurrency_policy of cronjob {{ $labels.cronjob }} is {{ $labels.concurrency_policy }}."
+        summary: "Please set the concurrency_policy of cronjob {{ $labels.cronjob }} to 'Forbid' on cluster {{ $labels.cluster_name }}."
+      expr: kube_cronjob_info{concurrency_policy!="Forbid"}
+      for: 15m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+    - alert: Cronjob_Is_Suspended_In_Production
+      annotations:
+        description: "The cronjob {{ $labels.cronjob }} is suspended for more than 5 minutes."
+        summary: "Please set the suspension status of cronjob  {{ $labels.cronjob }} to 'false' on cluster {{ $labels.cluster_name }}."
+      expr: kube_cronjob_spec_suspend > 0
+      for: 5m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
Diff staging
ᐅ diff -u kube-staging kube-cronjobs-staging      
--- kube-staging        2023-11-21 14:58:53.629285154 +0100
+++ kube-cronjobs-staging       2023-11-21 14:58:08.764614846 +0100
@@ -210,17 +210,15 @@
                 key: password
                 name: alertmanager-irc-relay-config
 ---
-# Source: cluster-config/templates/alerting/cassandra-alerting.yaml
+# Source: cluster-config/templates/alerting/swh-alerting.yaml
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
-  labels:
-    app: cassandra
-  name: cassandra-service.rules
+  name: swh.staging.rules
   namespace: cattle-monitoring-system
 spec:
   groups:
-  - name: cassandra-service.rules
+  - name: swh.staging.rules
     rules:
     - alert: Cassandra_Degraded_Service_In_Staging
       annotations:
@@ -240,3 +238,21 @@
       labels:
         severity: critical
         namespace: cattle-monitoring-system
+    - alert: Concurrent_Cronjob_Is_Allowed_In_Staging
+      annotations:
+        description: "The concurrency_policy of cronjob {{ $labels.cronjob }} is {{ $labels.concurrency_policy }}."
+        summary: "Please set the concurrency_policy of cronjob {{ $labels.cronjob }} to 'Forbid' on cluster {{ $labels.cluster_name }}."
+      expr: kube_cronjob_info{concurrency_policy!="Forbid"}
+      for: 15m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
+    - alert: Cronjob_Is_Suspended_In_Staging
+      annotations:
+        description: "The cronjob {{ $labels.cronjob }} is suspended for more than 5 minutes."
+        summary: "Please set the suspension status of cronjob  {{ $labels.cronjob }} to 'false' on cluster {{ $labels.cluster_name }}."
+      expr: kube_cronjob_spec_suspend > 0
+      for: 5m
+      labels:
+        severity: warning
+        namespace: cattle-monitoring-system
Check rules
~/_swh_src/sysadm-environment/swh-charts/cluster-components (kube_cronjobs_alerting ✔) ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting .  | \
grep groups -A 38
  groups:
  - name: swh.production.rules
    rules:
    - alert: Cassandra_Degraded_Service_In_Production
      annotations:
        description: "The {{ $labels.instance }} node is unreachable for more than 15 minutes. This node seems down."
        summary: "The {{ $labels.service }} is degraded. Please check the {{ $labels.instance }} status."
      expr: up{service="cassandra-servers-svc"} == 0
      for: 15m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
    - alert: Cassandra_Unrepaired_Table_In_Production
      annotations:
        description: "The unrepaired bytes of table {{ $labels.table }} is more than 200 Gb."
        summary: "Please trigger a repair on the table {{ $labels.table }} in keyspace {{ $labels.keyspace }}."
      expr: sum by (keyspace, table) (cassandra_table_bytesunrepaired{table!="",job="cassandra-servers-svc"}) > 2.147483648e+11
      for: 5m
      labels:
        severity: critical
        namespace: cattle-monitoring-system
    - alert: Concurrent_Cronjob_Is_Allowed_In_Production
      annotations:
        description: "The concurrency_policy of cronjob {{ $labels.cronjob }} is {{ $labels.concurrency_policy }}."
        summary: "Please set the concurrency_policy of cronjob {{ $labels.cronjob }} to 'Forbid' on cluster {{ $labels.cluster_name }}."
      expr: kube_cronjob_info{concurrency_policy!="Forbid"}
      for: 15m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
    - alert: Cronjob_Is_Suspended_In_Production
      annotations:
        description: "The cronjob {{ $labels.cronjob }} is suspended for more than 5 minutes."
        summary: "Please set the suspension status of cronjob  {{ $labels.cronjob }} to 'false' on cluster {{ $labels.cluster_name }}."
      expr: kube_cronjob_spec_suspend > 0
      for: 5m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
~/_swh_src/sysadm-environment/swh-charts/cluster-components (kube_cronjobs_alerting ✔) ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting .  | \
grep groups -A 38 | promtool check rules
Checking standard input
  SUCCESS: 4 rules found
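
Beyond the syntax check, the firing behaviour could also be exercised with `promtool test rules`. The sketch below is not part of this merge request. It assumes the rendered groups were saved as a plain Prometheus rule file named `rules.yaml` (hypothetical file name), and it uses made-up label values for the input series. The expected annotations are copied from the rule text, including the double space in the summary.

```yaml
# cronjob-alerts-test.yaml (hypothetical file name)
rule_files:
  - rules.yaml                   # assumed to contain the rendered "groups:" section

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A cronjob reported as suspended from t=0 to t=15m (made-up labels).
      - series: 'kube_cronjob_spec_suspend{cronjob="nightly-backup", namespace="default", cluster_name="example-cluster"}'
        values: '1+0x15'
    alert_rule_test:
      # The rule uses "for: 5m", so the alert should be firing at 10m.
      - eval_time: 10m
        alertname: Cronjob_Is_Suspended_In_Production
        exp_alerts:
          - exp_labels:
              severity: warning
              # The rule's own namespace label replaces the one from the series.
              namespace: cattle-monitoring-system
              cronjob: nightly-backup
              cluster_name: example-cluster
            exp_annotations:
              description: "The cronjob nightly-backup is suspended for more than 5 minutes."
              summary: "Please set the suspension status of cronjob  nightly-backup to 'false' on cluster example-cluster."
```

Under these assumptions it would be run as `promtool test rules cronjob-alerts-test.yaml`. A similar test case with a `kube_cronjob_info` series and an evaluation time past 15 minutes would cover the concurrency alert.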
