components/alerting: Create a custom KubeHpaMaxedOut rule

Guillaume Samson requested to merge prometheus_rules_hpamaxedout into production

These modifications will deploy a rule identical to the default rule in kube-prometheus-stack, excluding Keda HPAs.
The default KubeHpaMaxedOut rule needs to be disabled (this is already done in the staging environment).
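
For reference, a minimal sketch of how the default rule could be switched off. It assumes the deployed chart exposes the upstream kube-prometheus-stack `defaultRules.disabled` map; the values file this belongs in, and any wrapping key a parent chart might require, are not shown in this merge request and would need to be checked.

```yaml
# Hypothetical values fragment (assumption: the chart version in use
# supports the kube-prometheus-stack `defaultRules.disabled` map).
defaultRules:
  disabled:
    # Turn off only the stock alert; the custom rule added by this
    # merge request replaces it and ignores keda-hpa-* autoscalers.
    KubeHpaMaxedOut: true
```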

Helm diff
[cluster-components] Comparing changes between branches production and prometheus_rules_hpamaxedout...
Your branch is up to date with 'origin/production'.
[cluster-components] Generate config in production branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/minikube.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/test-staging-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/minikube.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/test-staging-rke2.yaml...


------------- diff for cluster-components/values/admin-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/admin-rke2.yaml.before, 29 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/admin-rke2.yaml.after, 29 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/archive-production-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/archive-production-rke2.yaml.before, 15 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/archive-production-rke2.yaml.after, 15 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned one difference
        |___/

spec.groups.swh-production.rules.rules  (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-production.rules)
  + one list entry added:
    - alert: HPA_Maxed_Out_In_Production
      annotations:
        description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes."
        runbook_url: "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout"
        summary: "HPA is running at max replicas"
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
          ==
        kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
      for: 15m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
    
  



------------- diff for cluster-components/values/archive-staging-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/archive-staging-rke2.yaml.before, 15 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/archive-staging-rke2.yaml.after, 15 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned one difference
        |___/

spec.groups.swh-staging.rules.rules  (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-staging.rules)
  + one list entry added:
    - alert: HPA_Maxed_Out_In_Staging
      annotations:
        description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes."
        runbook_url: "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout"
        summary: "HPA is running at max replicas"
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
          ==
        kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
      for: 15m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
    
  



------------- diff for cluster-components/values/gitlab-production.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-production.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-production.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/gitlab-staging.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-staging.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-staging.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/minikube.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/minikube.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/minikube.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/rancher.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/rancher.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/rancher.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/test-staging-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.JiFwlfd4/test-staging-rke2.yaml.before, four documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.JiFwlfd4/test-staging-rke2.yaml.after, four documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/

Rules validity check
cd cluster-components
ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting . | \
grep groups -A 61 | promtool check rules
Checking standard input
  SUCCESS: 6 rules found
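
`promtool check rules` only validates syntax. A possible complement is a `promtool test rules` unit test that checks the KEDA exclusion itself. The sketch below is illustrative, not part of this merge request: `rules.yaml` is a hypothetical file holding the extracted `groups:` section, and the HPA names `web` and `keda-hpa-loader` and the namespace `swh` are made up.

```yaml
# Hypothetical unit test; run with: promtool test rules hpa_test.yaml
rule_files:
  - rules.yaml   # assumption: the rendered `groups:` section saved to a file
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # A regular HPA pinned at its maximum: should alert.
      - series: 'kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="web", job="kube-state-metrics", namespace="swh"}'
        values: '5+0x20'
      - series: 'kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="web", job="kube-state-metrics", namespace="swh"}'
        values: '5+0x20'
      # A KEDA-managed HPA pinned at its maximum: should be ignored.
      - series: 'kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="keda-hpa-loader", job="kube-state-metrics", namespace="swh"}'
        values: '5+0x20'
      - series: 'kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="keda-hpa-loader", job="kube-state-metrics", namespace="swh"}'
        values: '5+0x20'
    alert_rule_test:
      - eval_time: 16m   # past the 15m `for` window
        alertname: HPA_Maxed_Out_In_Production
        exp_alerts:
          # Only the non-KEDA HPA is expected to fire.
          - exp_labels:
              severity: warning
              # The rule's static label replaces the series namespace.
              namespace: cattle-monitoring-system
              horizontalpodautoscaler: web
              job: kube-state-metrics
            exp_annotations:
              description: "HPA swh/web has been running at max replicas for longer than 15 minutes."
              runbook_url: "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout"
              summary: "HPA is running at max replicas"
```

The test lists a single expected alert, so it fails if the `keda-hpa-loader` series also fires.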
