components/alerting: Add ingress error rate alert
Related to swh/infra/sysadm-environment#5353 (closed)
This change adds an alert that fires when more than 10% of an ingress's requests return errors (HTTP 4xx or 5xx) over the last 15 minutes, in both the staging and the production environments.
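To make the threshold concrete, here is a minimal Python sketch of the ratio the alert is meant to evaluate for one ingress. The function name `error_percentage` and the request counts are hypothetical and for illustration only; the real rule works on per-second rates of `nginx_ingress_controller_requests`, not on raw counts.

```python
from typing import Dict

ALERT_THRESHOLD_PERCENT = 10.0  # same threshold as the alert expression


def error_percentage(requests_by_status: Dict[str, float]) -> float:
    """Return the share of 4xx/5xx requests, in percent, for one ingress."""
    total = sum(requests_by_status.values())
    if total == 0:
        # No traffic at all: report 0% rather than dividing by zero.
        return 0.0
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if status.startswith(("4", "5"))  # mirrors status=~"[45].."
    )
    return 100.0 * errors / total


# Hypothetical request counts for a single ingress over the window.
sample_counts = {"200": 850.0, "404": 100.0, "502": 50.0}
sample_percent = error_percentage(sample_counts)
would_fire = sample_percent > ALERT_THRESHOLD_PERCENT
```

With these made-up counts, 150 of 1000 requests are errors, so the value is above the 10% threshold.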
Helm-diff
[cluster-components] Comparing changes between branches production and ingress_errors_alert...
Your branch is up to date with 'origin/production'.
[cluster-components] Generate config in production branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/default.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/local-cluster.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/test-staging-rke2.yaml...
Your branch is up to date with 'origin/ingress_errors_alert'.
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/default.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/local-cluster.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in ingress_errors_alert branch for cluster-components/values/test-staging-rke2.yaml...
------------- diff for cluster-components/values/admin-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/admin-rke2.yaml.before, 32 documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/admin-rke2.yaml.after, 32 documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/archive-production-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/archive-production-rke2.yaml.before, 23 documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/archive-production-rke2.yaml.after, 23 documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned one difference
|___/
spec.groups.swh-production.rules.rules (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-production.rules)
+ one list entry added:
- alert: Ingress_Errors_In_Production
annotations:
description: "Ingress {{ $labels.exported_namespace }}/{{ $labels.ingress }} has {{ $value }}% errors within the last 15 minutes. The host is {{ $labels.host }}."
summary: "Ingress {{ $labels.ingress }} has more than 10% errors."
expr: |
sum(irate(nginx_ingress_controller_requests{status=~"[45].."}[15m])) by (ingress)
/
sum(irate(nginx_ingress_controller_requests[5m])) by (ingress) * 100 > 10
for: 15m
labels:
severity: warning
namespace: cattle-monitoring-system
------------- diff for cluster-components/values/archive-staging-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/archive-staging-rke2.yaml.before, 59 documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/archive-staging-rke2.yaml.after, 59 documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned one difference
|___/
spec.groups.swh-staging.rules.rules (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-staging.rules)
+ one list entry added:
- alert: Ingress_Errors_In_Staging
annotations:
description: "Ingress {{ $labels.exported_namespace }}/{{ $labels.ingress }} has {{ $value }}% errors within the last 15 minutes. The host is {{ $labels.host }}."
summary: "Ingress {{ $labels.ingress }} has more than 10% errors."
expr: |
sum(irate(nginx_ingress_controller_requests{status=~"[45].."}[15m])) by (ingress)
/
sum(irate(nginx_ingress_controller_requests[5m])) by (ingress) * 100 > 10
for: 15m
labels:
severity: warning
namespace: cattle-monitoring-system
------------- diff for cluster-components/values/default.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/default.yaml.before, six documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/default.yaml.after, six documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/gitlab-production.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/gitlab-production.yaml.before, seven documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/gitlab-production.yaml.after, seven documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/gitlab-staging.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/gitlab-staging.yaml.before, seven documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/gitlab-staging.yaml.after, seven documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/local-cluster.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/local-cluster.yaml.before, eight documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/local-cluster.yaml.after, eight documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/rancher.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/rancher.yaml.before, seven documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/rancher.yaml.after, seven documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/test-staging-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.F6a4UQgG/test-staging-rke2.yaml.before, 18 documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.F6a4UQgG/test-staging-rke2.yaml.after, 18 documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
Checking the rules
~/_swh_src/sysadm-environment/swh-charts/cluster-components (ingress_errors_alert ✔) ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting . | \
grep "^[[:space:]]*groups:" -A 86 | promtool check rules
Checking standard input
SUCCESS: 8 rules found
A 10% error threshold will create a large number of alerts.
I don't know whether we should use the rate or the irate function:
- rate
- irate
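As a rough aid for the rate versus irate question, the following Python sketch contrasts the two on a hypothetical counter. It is a simplification: `simple_rate` and `simple_irate` are illustrative names, and the sketch ignores the counter-reset handling and window extrapolation that Prometheus applies. The Prometheus documentation describes `rate` as the per-second average increase over the window and `irate` as an instant rate based on the last two samples, and it recommends `rate` for alerting.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical error counter scraped every 60 s over a 15-minute window:
# flat for 14 minutes, then 60 errors arrive in the final minute.
samples: List[Sample] = [(60.0 * i, 0.0) for i in range(15)] + [(900.0, 60.0)]

window_average = simple_rate(samples)   # burst is averaged over 900 s
last_interval = simple_irate(samples)   # only the last 60 s are considered
```

In this made-up series the short burst dominates the `irate`-style value, while the `rate`-style value smooths it over the whole window, which is one reason an `irate`-based alert may be noisier.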