cluster-components: Add alert for save code now pending requests

Antoine R. Dumont requested to merge mr/add-alert-on-savecodenow into production

This alert follows from the board [1], which shows the peak we had this Saturday.

[1] https://grafana.softwareheritage.org/goto/b7cggP-Iz?orgId=1

helm diff
[cluster-components] Comparing changes between branches production and mr/add-alert-on-savecodenow...
Switched to branch 'production'
Your branch is up to date with 'origin/production'.
[cluster-components] Generate config in production branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/minikube.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/test-staging-rke2.yaml...
Switched to branch 'mr/add-alert-on-savecodenow'
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/minikube.yaml...
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in mr/add-alert-on-savecodenow branch for cluster-components/values/test-staging-rke2.yaml...


------------- diff for cluster-components/values/admin-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/admin-rke2.yaml.before, 29 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/admin-rke2.yaml.after, 29 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/archive-production-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/archive-production-rke2.yaml.before, 15 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/archive-production-rke2.yaml.after, 15 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned one difference
        |___/

spec.groups.swh-production.rules.rules  (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-production.rules)
  + one list entry added:
    - alert: SaveCodeNow_Is_Stale_In_Production
    │ annotations:
    │ │ description: "The save code now request status may be lagging for more than 5 minutes."
    │ │ summary: "Please check the svix server on cluster {{ $labels.cluster_name }}."
    │ expr: "avg_over_time(swh_web_submitted_save_requests{environment="production",status="pending"}[1h]) > 10"
    │ for: 5m
    │ labels:
    │ │ severity: warning
    │ │ namespace: cattle-monitoring-system



------------- diff for cluster-components/values/archive-staging-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/archive-staging-rke2.yaml.before, 15 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/archive-staging-rke2.yaml.after, 15 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned one difference
        |___/

spec.groups.swh-staging.rules.rules  (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-staging.rules)
  + one list entry added:
    - alert: SaveCodeNow_Is_Stale_In_Staging
    │ annotations:
    │ │ description: "The save code now request status may be lagging for more than 5 minutes."
    │ │ summary: "Please check the svix server on cluster {{ $labels.cluster_name }}."
    │ expr: "avg_over_time(swh_web_submitted_save_requests{environment="staging",status="pending"}[1h]) > 5"
    │ for: 5m
    │ labels:
    │ │ severity: warning
    │ │ namespace: cattle-monitoring-system
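
The two added rules differ only in the `environment` matcher and the threshold (10 in production, 5 in staging). As a rough sketch, the production entry might look like this once rendered as a full manifest. It is reconstructed from the diff output above, not copied from the chart. The group name `swh-production.rules` is an assumption inferred from the dyff path. The single quotes around `expr` are also a choice made here so that the double quotes in the label matchers stay valid YAML.

```yaml
# Sketch only: reconstructed from the diff output, the chart's actual template may differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: swh-production.rules
  namespace: cattle-monitoring-system
spec:
  groups:
    # Assumption: group name inferred from the path "spec.groups.swh-production.rules.rules".
    - name: swh-production.rules
      rules:
        - alert: SaveCodeNow_Is_Stale_In_Production
          annotations:
            description: "The save code now request status may be lagging for more than 5 minutes."
            summary: "Please check the svix server on cluster {{ $labels.cluster_name }}."
          # Single-quoted so the inner double quotes need no escaping.
          # Condition: the 1-hour average of pending save requests exceeds 10.
          expr: 'avg_over_time(swh_web_submitted_save_requests{environment="production",status="pending"}[1h]) > 10'
          # The condition must hold continuously for 5 minutes before the alert fires.
          for: 5m
          labels:
            severity: warning
            namespace: cattle-monitoring-system
```

Because `expr` averages over a one-hour window, a short spike of pending requests is smoothed out. The `for: 5m` clause then only delays firing until that averaged value has stayed above the threshold for five consecutive minutes.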



------------- diff for cluster-components/values/gitlab-production.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/gitlab-production.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/gitlab-production.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/gitlab-staging.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/gitlab-staging.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/gitlab-staging.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/minikube.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/minikube.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/minikube.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/rancher.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/rancher.yaml.before
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/rancher.yaml.after
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/



------------- diff for cluster-components/values/test-staging-rke2.yaml -------------

     _        __  __
   _| |_   _ / _|/ _|  between /tmp/swh-chart.cluster-components.fnCaeg3n/test-staging-rke2.yaml.before, 13 documents
 / _' | | | | |_| |_       and /tmp/swh-chart.cluster-components.fnCaeg3n/test-staging-rke2.yaml.after, 13 documents
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned no differences
        |___/

Refs. swh/infra/sysadm-environment#5275 (closed)

Edited by Antoine R. Dumont