
Monitoring/alerting solution for container-based services

Currently, our static services are monitored through Icinga via Puppet-collected resources. We need to find an alerting solution for container-based services.

Non-exhaustive list of possible alerts (a few example Prometheus alerting rules are sketched after the list):

  • Service not responding (based on blackbox-exporter metrics)
  • cert-manager certificate expiration, abnormal number of certificate issues
  • Kubernetes
    • Abnormal number of restarts of a pod
    • Abnormal load on a cluster (memory / CPU / ...)
    • Abnormal pending pod count
    • Abnormal pod request / limit ratio
  • Functional alerts
    • TBD
  • Cassandra / Reaper statistics
  • Autoscaling limit reached
  • Node statistics
  • MinIO (available metrics: http://<minio service>:9000/minio/v2/metrics/node)
    • S3 api availability
    • Console availability
    • minio_s3_requests_errors_total
  • Reaper
    • sum(io_cassandrareaper_service_SegmentRunner_postpone) indicates a timeout on a segment or an error scheduling a segment's repair
  • Cassandra
    • Size of unrepaired data (raise an alert at a given threshold) (20 GB? to be adjusted according to the values observed during a normal run)
    • or percentage of repaired data: cassandra_table_percentrepaired
    • Oversized mutation (alert -> missing data)
  • ArgoCD
    • Application stats
    • ArgoCD internal status
    • ArgoCD website responding
  • <any other ideas>
  • swh
    • #5048 (closed): Alert when too many messages in queue event occurs
    • Alert on increasing swh_web_save_requests_delay_seconds
    • Alert on swh_web_accepted_save_requests
      • not yet scheduled count and no running(?)
      • failed rate too high compared to success rate
      • scn scheduled entries too high(?)
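
As a first sketch (assuming Prometheus and Alertmanager end up being the alerting backend for container-based services), the platform-level items above could be expressed as Prometheus alerting rules like the ones below. The metric names are the exporters' usual defaults (probe_success from blackbox-exporter, certmanager_certificate_expiration_timestamp_seconds from cert-manager, kube_pod_* from kube-state-metrics); the thresholds, durations and label names are placeholders to be tuned against our actual setup.

```yaml
groups:
  - name: container-services-platform
    rules:
      - alert: ServiceNotResponding
        # blackbox-exporter probe has been failing for 5 minutes
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Blackbox probe failing for {{ $labels.instance }}"

      - alert: CertificateExpiresSoon
        # cert-manager certificate expires in less than 14 days
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in less than 14 days"

      - alert: PodRestartingTooOften
        # more than 3 container restarts within one hour (kube-state-metrics)
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"

      - alert: PodsPendingTooLong
        # pods stuck in Pending state for more than 15 minutes
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pending pods in namespace {{ $labels.namespace }}"
```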
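
Similarly, a rough sketch for the service-specific items (MinIO, Reaper, Cassandra), reusing the metric names listed above. The thresholds, durations and label names (e.g. the table label, or whether cassandra_table_percentrepaired is a 0-100 percentage) are assumptions to validate against what the deployments actually expose.

```yaml
groups:
  - name: container-services-applications
    rules:
      - alert: MinioS3RequestErrors
        # S3 API error count on MinIO (metric from /minio/v2/metrics/node)
        expr: sum by (instance) (increase(minio_s3_requests_errors_total[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MinIO {{ $labels.instance }} is returning S3 API errors"

      - alert: ReaperSegmentsPostponed
        # segment repairs postponed (timeout or scheduling error), per the expression above
        expr: sum(io_cassandrareaper_service_SegmentRunner_postpone) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Reaper is postponing segment repairs"

      - alert: CassandraTableRepairLagging
        # repaired percentage below threshold, assuming a 0-100 scale (threshold to be tuned)
        expr: cassandra_table_percentrepaired < 80
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Table {{ $labels.table }} is less than 80% repaired"
```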

Migrated from T4525 (view on Phabricator)
