Monitoring/alerting solution for container-based services
Currently our static services are monitored through Icinga via Puppet-collected resources. We need to find an alerting solution for container-based services.
Non-exhaustive list of possible alerts:
- Service not responding (based on blackbox-exporter metrics)
- cert-manager certificate expiration, abnormal number of certificate issuances
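As a starting point, a minimal sketch of these two alerts as Prometheus alerting rules, assuming the standard blackbox-exporter `probe_success` metric and cert-manager's `certmanager_certificate_expiration_timestamp_seconds`; alert names, durations and the 14-day threshold are placeholders to adapt:

```yaml
groups:
  - name: availability
    rules:
      # Probe target has been unreachable for 5 minutes.
      - alert: ServiceNotResponding
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not responding to blackbox probes"
      # Certificate expires in less than 14 days (threshold to adapt).
      - alert: CertificateExpiringSoon
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in less than 14 days"
```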
kubernetes
- Abnormal number of restarts of a pod (sketched below)
- Abnormal load on a cluster (memory / CPU / ...)
- Abnormal pending pod count
- Abnormal pod request / limit ratio
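A sketch for the restart and pending-pod items, assuming kube-state-metrics is deployed (it provides the `kube_pod_*` series); the 3-restart threshold and the durations are arbitrary and should be calibrated:

```yaml
groups:
  - name: kubernetes
    rules:
      # A container restarted more than 3 times in the last hour.
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too often"
      # Pods stuck in Pending for 15 minutes (often unschedulable).
      - alert: AbnormalPendingPodCount
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pending pods in namespace {{ $labels.namespace }}"
```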
Functional alerts
- TBD
- Cassandra / Reaper statistics
- Autoscaling limit reached (sketched below)
- Node statistics
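For "Autoscaling limit reached", a possible rule, assuming kube-state-metrics v2 metric names (older versions expose these as `kube_hpa_*` instead):

```yaml
groups:
  - name: autoscaling
    rules:
      # The HPA has been pegged at its maximum replica count for 15
      # minutes, i.e. it can no longer scale out.
      - alert: AutoscalingLimitReached
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
```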
Minio (available metrics: http://<minio service>:9000/minio/v2/metrics/node)
- S3 API availability
- Console availability
- Abnormal error rate (minio_s3_requests_errors_total, sketched below)
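A sketch for the error-rate alert based on the metric above; the two availability checks would go through blackbox-exporter probes (MinIO exposes a `/minio/health/live` liveness endpoint). The 1 req/s error threshold is a placeholder to calibrate on a normal run:

```yaml
groups:
  - name: minio
    rules:
      # Sustained S3 API error rate over 10 minutes.
      - alert: MinioS3ErrorRate
        expr: sum(rate(minio_s3_requests_errors_total[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MinIO S3 API error rate is abnormally high"
```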
reaper
- sum(io_cassandrareaper_service_SegmentRunner_postpone) indicates a timeout on a segment or an error to schedule a segment's repair (sketched below)
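A corresponding sketch reusing the expression above; whether the raw value or its increase is the right signal depends on how the Reaper metric is exported (counter vs. gauge), so this is to validate against the actual series:

```yaml
groups:
  - name: reaper
    rules:
      # Segments are being postponed: repair timeouts or scheduling
      # errors on the Cassandra side.
      - alert: ReaperSegmentsPostponed
        expr: sum(io_cassandrareaper_service_SegmentRunner_postpone) > 0
        for: 30m
        labels:
          severity: warning
```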
Cassandra
- Size of unrepaired data (raise an alert at a given threshold; 20GB? to adapt according to the values observed during a normal run)
  - or percentage repaired: cassandra_table_percentrepaired (sketched below)
- Oversized mutation (alert -> missing data)
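A sketch based on the percentage variant, which avoids hard-coding an absolute size; the 80% threshold is a placeholder (the metric is assumed to range 0-100) and should be adapted to values observed during a normal run:

```yaml
groups:
  - name: cassandra
    rules:
      # Repaired fraction of a table has stayed below threshold for a
      # full day (threshold to calibrate against a normal run).
      - alert: CassandraTableUnderRepaired
        expr: cassandra_table_percentrepaired < 80
        for: 1d
        labels:
          severity: warning
```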
ArgoCD
- Application stats (sketched below)
- ArgoCD internal status
- ArgoCD website responding
- <any other ideas>
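For application stats/status, ArgoCD exposes `argocd_app_info` with `health_status` and `sync_status` labels; website responsiveness would again be a blackbox-exporter probe. A sketch:

```yaml
groups:
  - name: argocd
    rules:
      # An application has not been Healthy for 15 minutes.
      - alert: ArgoCDAppUnhealthy
        expr: argocd_app_info{health_status!="Healthy"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} is {{ $labels.health_status }}"
      # An application has drifted from its desired state.
      - alert: ArgoCDAppOutOfSync
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 15m
        labels:
          severity: warning
```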
swh
- #5048 (closed): Alert when too many messages in queue event occurs
- Alert on increasing swh_web_save_requests_delay_seconds
- Alert on swh_web_accepted_save_requests (sketched below)
  - not-yet-scheduled count growing while none are running(?)
  - failure rate too high compared to the success rate
  - scn scheduled entries too high(?)
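A sketch for the two swh_web items; the delay metric is assumed to be a gauge, and the `status` label values on swh_web_accepted_save_requests are an assumption to check against the actual exporter labels:

```yaml
groups:
  - name: swh-web
    rules:
      # Save-request delay keeps growing (gauge assumed; deriv gives
      # its per-second trend over the last 30 minutes).
      - alert: SaveRequestsDelayIncreasing
        expr: deriv(swh_web_save_requests_delay_seconds[30m]) > 0
        for: 1h
        labels:
          severity: warning
      # Failures outpace successes ("status" label values are an
      # assumption, to check against the actual metric).
      - alert: SaveRequestsFailureRate
        expr: |
          sum(rate(swh_web_accepted_save_requests{status="failed"}[1h]))
            > sum(rate(swh_web_accepted_save_requests{status="succeeded"}[1h]))
        for: 1h
        labels:
          severity: warning
```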
Migrated from T4525 (view on Phabricator)