
Monitoring/alerting solution for container-based services

Currently, our static services are monitored through Icinga via Puppet-collected resources. We need to find an alerting solution for container-based services.

Non-exhaustive list of possible alerts (a few example Prometheus alerting rules are sketched after the list):

  • Service not responding (based on blackbox-exporter metrics)
  • cert-manager certificate expiration, abnormal number of certificate issues
  • Kubernetes
    • Abnormal number of restarts of a pod
    • Abnormal load on a cluster (memory / CPU / ...)
    • Abnormal pending pod count
    • Abnormal pod request / limit ratio
  • Functional alerts
    • TBD
  • Cassandra / Reaper statistics
  • Autoscaling limit reached
  • Node statistics
  • MinIO (available metrics: http://<minio service>:9000/minio/v2/metrics/node)
    • S3 api availability
    • Console availability
    • minio_s3_requests_errors_total
  • Reaper
    • sum(io_cassandrareaper_service_SegmentRunner_postpone) indicates a timeout on a segment or an error scheduling a segment's repair
  • Cassandra
    • Size of unrepaired data (raise an alert at a given threshold) (20 GB? to be adjusted according to the values observed during a normal run)
    • or percentage of repaired data: cassandra_table_percentrepaired
    • Oversized mutation (alert -> missing data)
  • ArgoCD
    • Application stats
    • ArgoCD internal status
    • ArgoCD website responding
  • <any other ideas>
  • swh
    • #5048 (closed): Alert when too many messages in queue event occurs
    • Alert on increasing swh_web_save_requests_delay_seconds
    • Alert on swh_web_accepted_save_requests
      • not yet scheduled count and no running(?)
      • failed rate too high compared to success rate
      • scn scheduled entries too high(?)
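
As a first sketch (assuming Prometheus and Alertmanager end up being the alerting backend for container-based services), the platform-level items above could be expressed as Prometheus alerting rules like the ones below. The metric names are the exporters' usual defaults (probe_success from blackbox-exporter, certmanager_certificate_expiration_timestamp_seconds from cert-manager, kube_pod_* from kube-state-metrics); the thresholds, durations and label names are placeholders to be tuned against our actual setup.

```yaml
groups:
  - name: container-services-platform
    rules:
      - alert: ServiceNotResponding
        # blackbox-exporter probe has been failing for 5 minutes
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Blackbox probe failing for {{ $labels.instance }}"

      - alert: CertificateExpiresSoon
        # cert-manager certificate expires in less than 14 days
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in less than 14 days"

      - alert: PodRestartingTooOften
        # more than 3 container restarts within one hour (kube-state-metrics)
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"

      - alert: PodsPendingTooLong
        # pods stuck in Pending state for more than 15 minutes
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pending pods in namespace {{ $labels.namespace }}"
```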
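
Similarly, a rough sketch for the service-specific items (MinIO, Reaper, Cassandra), reusing the metric names listed above. The thresholds, durations and label names (e.g. the table label, or whether cassandra_table_percentrepaired is a 0-100 percentage) are assumptions to validate against what the deployments actually expose.

```yaml
groups:
  - name: container-services-applications
    rules:
      - alert: MinioS3RequestErrors
        # S3 API error count on MinIO (metric from /minio/v2/metrics/node)
        expr: sum by (instance) (increase(minio_s3_requests_errors_total[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MinIO {{ $labels.instance }} is returning S3 API errors"

      - alert: ReaperSegmentsPostponed
        # segment repairs postponed (timeout or scheduling error), per the expression above
        expr: sum(io_cassandrareaper_service_SegmentRunner_postpone) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Reaper is postponing segment repairs"

      - alert: CassandraTableRepairLagging
        # repaired percentage below threshold, assuming a 0-100 scale (threshold to be tuned)
        expr: cassandra_table_percentrepaired < 80
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Table {{ $labels.table }} is less than 80% repaired"
```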

Migrated from T4525 (view on Phabricator)
