Monitoring/alerting solution for container-based services
Currently our static services are monitored through Icinga via Puppet-collected resources. We need to find an alerting solution for container-based services.
Non-exhaustive list of possible alerts (rough rule sketches for the marked items, assuming a Prometheus-style stack, follow the list):

- Service not responding (based on blackbox-exporter metrics) (see rule sketch below)
- cert-manager certificate expiration, abnormal number of certificate issues (see rule sketch below)
- kubernetes (see rule sketch below)
  - abnormal number of restarts of a pod
  - abnormal load on a cluster (memory / cpu / ...)
  - abnormal pending pod count
  - abnormal pod request / limit ratio
- Functional alerts
  - TBD
- Cassandra / Reaper statistics
- Autoscaling limit reached
- Node statistics
- Minio (available metrics: `http://<minio service>:9000/minio/v2/metrics/node`) (see rule sketch below)
  - S3 API availability
  - Console availability
  - `minio_s3_requests_errors_total`
- Reaper (see rule sketch below)
  - `sum(io_cassandrareaper_service_SegmentRunner_postpone)`: indicates a timeout on a segment or an error while scheduling a segment's repair
- Cassandra (see rule sketch below)
  - size of unrepaired data (raise an alert at a given threshold; 20 GB? to be adapted according to the values observed during a normal run)
  - or percentage repaired: `cassandra_table_percentrepaired`
  - oversized mutation (alert -> missing data)
- ArgoCD (see rule sketch below)
  - application stats
  - ArgoCD internal status
  - ArgoCD website responding
- <any other ideas>
- swh (see rule sketch below)
  - #5048 (closed): alert when too many messages in queue event occurs
  - alert on increasing `swh_web_save_requests_delay_seconds`
  - alert on `swh_web_accepted_save_requests`
    - not yet scheduled count and no running(?)
    - failed rate too high compared to success rate
    - scn scheduled entries too high(?)
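If we end up with a Prometheus / Alertmanager style stack, the items marked above could translate into alerting rules roughly like the sketches that follow. These are starting points only: thresholds, `for:` durations and label matchers are assumptions to validate against the metrics we actually scrape. For "service not responding", blackbox-exporter exposes `probe_success` (1 when the probe succeeds), so a minimal sketch could be:

```yaml
groups:
  - name: blackbox
    rules:
      - alert: ServiceNotResponding
        # probe_success is 0 when the blackbox probe fails; the 3m delay avoids flapping on a single failed scrape
        expr: probe_success == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Blackbox probe failing for {{ $labels.instance }}"
```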
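For cert-manager, assuming its controller metrics are scraped and expose `certmanager_certificate_expiration_timestamp_seconds` (metric and label names to confirm on our deployment), a certificate-expiration sketch:

```yaml
groups:
  - name: cert-manager
    rules:
      - alert: CertificateExpiringSoon
        # fires when a managed certificate expires in less than 14 days (threshold to adjust)
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in less than 14 days"
```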
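For the kubernetes items, kube-state-metrics provides `kube_pod_container_status_restarts_total` and `kube_pod_status_phase`; a sketch for the pod-restart and pending-pod cases (both thresholds are guesses to tune per cluster):

```yaml
groups:
  - name: kubernetes
    rules:
      - alert: PodRestartingTooOften
        # more than 3 container restarts over the last hour
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
      - alert: TooManyPendingPods
        # pods stuck in Pending cluster-wide
        expr: sum(kube_pod_status_phase{phase="Pending"}) > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} pods are pending"
```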
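For Minio, using the `minio_s3_requests_errors_total` counter already listed above (the error-rate threshold is a placeholder), while S3 API and Console availability can reuse the blackbox probe sketch:

```yaml
groups:
  - name: minio
    rules:
      - alert: MinioS3ErrorsIncreasing
        # raw error rate over 5 minutes; could also be expressed as a ratio against the total request rate
        expr: rate(minio_s3_requests_errors_total[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Minio S3 API error rate is elevated on {{ $labels.instance }}"
```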
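For Reaper, assuming `io_cassandrareaper_service_SegmentRunner_postpone` is exported as a counter (to confirm; it may end up with a different suffix once the Dropwizard metrics are scraped):

```yaml
groups:
  - name: reaper
    rules:
      - alert: ReaperSegmentsPostponed
        # any postponed segment in the last 30 minutes indicates timeouts or scheduling errors on repairs
        expr: sum(increase(io_cassandrareaper_service_SegmentRunner_postpone[30m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cassandra Reaper is postponing repair segments"
```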
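For Cassandra, using `cassandra_table_percentrepaired` from the list above (the 80% threshold and the label names used in the summary are assumptions):

```yaml
groups:
  - name: cassandra
    rules:
      - alert: CassandraTableUnderRepaired
        # percent repaired dropping below a threshold roughly mirrors the "size of unrepaired data" idea
        expr: cassandra_table_percentrepaired < 80
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Table {{ $labels.keyspace }}.{{ $labels.table }} is less than 80% repaired"
```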
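For ArgoCD, assuming the standard `argocd_app_info` metric with its `health_status` / `sync_status` labels (to confirm on our instance); the website check itself can reuse the blackbox probe:

```yaml
groups:
  - name: argocd
    rules:
      - alert: ArgoCDAppUnhealthy
        # application health as reported by ArgoCD itself
        expr: argocd_app_info{health_status!~"Healthy|Progressing"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} is {{ $labels.health_status }}"
      - alert: ArgoCDAppOutOfSync
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} is out of sync"
```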
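For the swh items, the exact type and labels of `swh_web_save_requests_delay_seconds` and `swh_web_accepted_save_requests` need to be checked first; assuming the delay metric is a gauge (or an already-aggregated value), a first sketch:

```yaml
groups:
  - name: swh
    rules:
      - alert: SaveRequestsDelayIncreasing
        # the 1h delay threshold is a placeholder; a deriv()-based rule may fit better if the value only grows
        expr: swh_web_save_requests_delay_seconds > 3600
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Save code now requests are delayed by more than 1 hour"
```

The not-yet-scheduled / failed-vs-success breakdown of `swh_web_accepted_save_requests` depends on which status labels that metric actually carries, so it is left out of the sketch.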
Migrated from T4525 (view on Phabricator)