
components/alerting: Add ingress slow down alert

Guillaume Samson requested to merge ingress_requests_slow_down into production

Related to swh/infra/sysadm-environment#5352 (closed)

These modifications create an alert that fires when an ingress has 90% of its requests lasting 10 seconds or more within the last 2 minutes, in both the staging and the production environments.

Helm-diff
[cluster-components] Comparing changes between branches production and ingress_requests_slow_down...
Your branch is up to date with 'origin/production'.
[cluster-components] Generate config in production branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/default.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/local-cluster.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/test-staging-rke2.yaml...
Your branch is up to date with 'origin/ingress_requests_slow_down'.
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/default.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/local-cluster.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in ingress_requests_slow_down branch for cluster-components/values/test-staging-rke2.yaml...


------------- diff for cluster-components/values/admin-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/admin-rke2.yaml.before, 32 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/admin-rke2.yaml.after, 32 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/archive-production-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/archive-production-rke2.yaml.before, 23 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/archive-production-rke2.yaml.after, 23 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned one difference
        |___/

spec.groups.swh-production.rules.rules  (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-production.rules)
  + one list entry added:
    - alert: Ingress_Slow_Down_In_Production
      annotations:
        description: "Ingress {{ $labels.exported_namespace }}/{{ $labels.ingress }} requests with 10s duration or more are 90% of total requests within the last 2 minutes."
        summary: "90% of requests have 10s duration or more."
      expr: |
        sum((delta(nginx_ingress_controller_request_duration_seconds_count[2m]) > 10000)) by (ingress, exported_namespace)
        /
        sum((delta(nginx_ingress_controller_request_duration_seconds_count[2m]))) by (ingress, exported_namespace)
        * 100 > 90
      for: 5m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
    
  



------------- diff for cluster-components/values/archive-staging-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/archive-staging-rke2.yaml.before, 59 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/archive-staging-rke2.yaml.after, 59 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned one difference
        |___/

spec.groups.swh-staging.rules.rules  (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-staging.rules)
  + one list entry added:
    - alert: Ingress_Slow_Down_In_Staging
      annotations:
        description: "Ingress {{ $labels.exported_namespace }}/{{ $labels.ingress }} requests with 10s duration or more are 90% of total requests within the last 2 minutes."
        summary: "90% of requests have 10s duration or more."
      expr: |
        sum((delta(nginx_ingress_controller_request_duration_seconds_count[2m]) > 10000)) by (ingress, exported_namespace)
        /
        sum((delta(nginx_ingress_controller_request_duration_seconds_count[2m]))) by (ingress, exported_namespace)
        * 100 > 90
      for: 5m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
    
  



------------- diff for cluster-components/values/default.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/default.yaml.before, six documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/default.yaml.after, six documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/gitlab-production.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/gitlab-production.yaml.before, seven documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/gitlab-production.yaml.after, seven documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/gitlab-staging.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/gitlab-staging.yaml.before, seven documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/gitlab-staging.yaml.after, seven documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/local-cluster.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/local-cluster.yaml.before, eight documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/local-cluster.yaml.after, eight documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/rancher.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/rancher.yaml.before, seven documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/rancher.yaml.after, seven documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/test-staging-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.gAu6CW4W/test-staging-rke2.yaml.before, 18 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.gAu6CW4W/test-staging-rke2.yaml.after, 18 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/

Prometheus rules check
~/_swh_src/sysadm-environment/swh-charts/cluster-components (ingress_requests_slow_down ✔) ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting . | \
grep "^[[:space:]]*groups:" -A 100 | promtool check rules
Checking standard input
  SUCCESS: 9 rules found

To understand the value of ingress.slowDownPeriod, note that the metric nginx_ingress_controller_request_duration_seconds_count is expressed in milliseconds:

Screenshot_from_2024-08-01_14-49-50
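
For reviewers who want to check the arithmetic of the rule's `expr` by hand, here is a minimal Python sketch of a single evaluation, as I read the expression: per (ingress, exported_namespace) group, the sum of the 2-minute deltas above 10000 is divided by the sum of all 2-minute deltas, multiplied by 100, and compared to 90. The function name, the sample series and the `status` label are hypothetical and only serve the illustration. The sketch does not model the `for: 5m` clause, which additionally requires the condition to hold for 5 minutes before the alert fires.

```python
from collections import defaultdict

# Both thresholds are copied from the rule's expr; everything else is illustrative.
DELTA_THRESHOLD = 10000   # right-hand side of "> 10000" in the numerator
PERCENT_THRESHOLD = 90    # right-hand side of "* 100 > 90"


def slow_down_candidates(deltas):
    """Mimic one evaluation of the alert expression.

    `deltas` is a list of (labels, delta) pairs, one per time series, where
    `delta` stands for delta(<metric>[2m]) of that series.
    Returns {(ingress, exported_namespace): percent} for groups over 90%.
    """
    above = defaultdict(float)  # numerator: sum of deltas above the threshold
    total = defaultdict(float)  # denominator: sum of all deltas
    for labels, delta in deltas:
        key = (labels["ingress"], labels["exported_namespace"])
        total[key] += delta
        if delta > DELTA_THRESHOLD:
            above[key] += delta

    result = {}
    # Only groups present in the numerator survive the division,
    # as with one-to-one vector matching in PromQL.
    for key, numerator in above.items():
        if total[key] == 0:
            # Simplification: PromQL would yield Inf or NaN here.
            continue
        percent = numerator / total[key] * 100
        if percent > PERCENT_THRESHOLD:
            result[key] = percent
    return result


# Hypothetical sample: label values and deltas are made up.
sample = [
    ({"ingress": "web", "exported_namespace": "swh", "status": "200"}, 12000.0),
    ({"ingress": "web", "exported_namespace": "swh", "status": "500"}, 300.0),
    ({"ingress": "api", "exported_namespace": "swh", "status": "200"}, 800.0),
]

result = slow_down_candidates(sample)
for (ingress, namespace), percent in result.items():
    print(f"{namespace}/{ingress}: {percent:.2f}%")
```

With this sample, only the `web` ingress passes the condition, because the `api` ingress has no series whose delta exceeds the threshold and therefore never appears in the numerator.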
