Monitoring/alerting solution for container-based services
Currently our static services are monitored through Icinga via Puppet-collected resources. We need to find an alerting solution for container-based services.
Non-exhaustive list of possible alerts:
- Service not responding (based on blackbox-exporter metrics)
- cert-manager certificate expiration, abnormal number of certificate issuances
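As a starting point, a minimal sketch of these two alerts as Prometheus alerting rules, assuming the standard blackbox-exporter `probe_success` metric and cert-manager's `certmanager_certificate_expiration_timestamp_seconds`; alert names, durations and the 14-day threshold are placeholders to adapt:

```yaml
groups:
  - name: availability
    rules:
      # Probe target has been unreachable for 5 minutes.
      - alert: ServiceNotResponding
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not responding to blackbox probes"
      # Certificate expires in less than 14 days (threshold to adapt).
      - alert: CertificateExpiringSoon
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in less than 14 days"
```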
kubernetes
- Abnormal number of restarts of a pod (sketched below)
- Abnormal load on a cluster (memory / CPU / ...)
- Abnormal pending pod count
- Abnormal pod request / limit ratio
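A sketch for the restart and pending-pod items, assuming kube-state-metrics is deployed (it provides the `kube_pod_*` series); the 3-restart threshold and the durations are arbitrary and should be calibrated:

```yaml
groups:
  - name: kubernetes
    rules:
      # A container restarted more than 3 times in the last hour.
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too often"
      # Pods stuck in Pending for 15 minutes (often unschedulable).
      - alert: AbnormalPendingPodCount
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pending pods in namespace {{ $labels.namespace }}"
```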
Functional alerts
- TBD
- Cassandra / Reaper statistics
- Autoscaling limit reached (sketched below)
- Node statistics
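For "Autoscaling limit reached", a possible rule, assuming kube-state-metrics v2 metric names (older versions expose these as `kube_hpa_*` instead):

```yaml
groups:
  - name: autoscaling
    rules:
      # The HPA has been pegged at its maximum replica count for 15
      # minutes, i.e. it can no longer scale out.
      - alert: AutoscalingLimitReached
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
```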
Minio (available metrics: http://<minio service>:9000/minio/v2/metrics/node)
- S3 API availability
- Console availability
- Abnormal error rate (minio_s3_requests_errors_total, sketched below)
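A sketch for the error-rate alert based on the metric above; the two availability checks would go through blackbox-exporter probes (MinIO exposes a `/minio/health/live` liveness endpoint). The 1 req/s error threshold is a placeholder to calibrate on a normal run:

```yaml
groups:
  - name: minio
    rules:
      # Sustained S3 API error rate over 10 minutes.
      - alert: MinioS3ErrorRate
        expr: sum(rate(minio_s3_requests_errors_total[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MinIO S3 API error rate is abnormally high"
```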
reaper
- sum(io_cassandrareaper_service_SegmentRunner_postpone) indicates a timeout on a segment or an error to schedule a segment's repair (sketched below)
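A corresponding sketch reusing the expression above; whether the raw value or its increase is the right signal depends on how the Reaper metric is exported (counter vs. gauge), so this is to validate against the actual series:

```yaml
groups:
  - name: reaper
    rules:
      # Segments are being postponed: repair timeouts or scheduling
      # errors on the Cassandra side.
      - alert: ReaperSegmentsPostponed
        expr: sum(io_cassandrareaper_service_SegmentRunner_postpone) > 0
        for: 30m
        labels:
          severity: warning
```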
Cassandra
- Size of unrepaired data (raise an alert at a given threshold; 20GB? to adapt according to the values observed during a normal run)
  - or percentage repaired: cassandra_table_percentrepaired (sketched below)
- Oversized mutation (alert -> missing data)
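A sketch based on the percentage variant, which avoids hard-coding an absolute size; the 80% threshold is a placeholder (the metric is assumed to range 0-100) and should be adapted to values observed during a normal run:

```yaml
groups:
  - name: cassandra
    rules:
      # Repaired fraction of a table has stayed below threshold for a
      # full day (threshold to calibrate against a normal run).
      - alert: CassandraTableUnderRepaired
        expr: cassandra_table_percentrepaired < 80
        for: 1d
        labels:
          severity: warning
```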
ArgoCD
- Application stats (sketched below)
- ArgoCD internal status
- ArgoCD website responding
- <any other ideas>
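For application stats/status, ArgoCD exposes `argocd_app_info` with `health_status` and `sync_status` labels; website responsiveness would again be a blackbox-exporter probe. A sketch:

```yaml
groups:
  - name: argocd
    rules:
      # An application has not been Healthy for 15 minutes.
      - alert: ArgoCDAppUnhealthy
        expr: argocd_app_info{health_status!="Healthy"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} is {{ $labels.health_status }}"
      # An application has drifted from its desired state.
      - alert: ArgoCDAppOutOfSync
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 15m
        labels:
          severity: warning
```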
swh
- #5048 (closed): Alert when too many messages in queue event occurs
- Alert on increasing swh_web_save_requests_delay_seconds
- Alert on swh_web_accepted_save_requests (sketched below)
  - not-yet-scheduled count growing while none are running(?)
  - failure rate too high compared to the success rate
  - scn scheduled entries too high(?)
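A sketch for the two swh_web items; the delay metric is assumed to be a gauge, and the `status` label values on swh_web_accepted_save_requests are an assumption to check against the actual exporter labels:

```yaml
groups:
  - name: swh-web
    rules:
      # Save-request delay keeps growing (gauge assumed; deriv gives
      # its per-second trend over the last 30 minutes).
      - alert: SaveRequestsDelayIncreasing
        expr: deriv(swh_web_save_requests_delay_seconds[30m]) > 0
        for: 1h
        labels:
          severity: warning
      # Failures outpace successes ("status" label values are an
      # assumption, to check against the actual metric).
      - alert: SaveRequestsFailureRate
        expr: |
          sum(rate(swh_web_accepted_save_requests{status="failed"}[1h]))
            > sum(rate(swh_web_accepted_save_requests{status="succeeded"}[1h]))
        for: 1h
        labels:
          severity: warning
```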
Migrated from T4525 (view on Phabricator)