[cassandra] holes in the monitoring
- Vincent Sellier added Metrics/monitoring cassandra labels
- Vincent Sellier assigned to @vsellier
- Owner
or some long-running process triggered regularly with a sufficiently high load, preventing metrics from being gathered (e.g. GC)?
- Author Owner
indeed, it could be something like that.
Only the JMX metrics seem impacted. The point is that all the metrics of all the servers are lost during the black hole, which makes me think something could also be misbehaving on Prometheus, the Prometheus agents, the Cassandra service definition, or ... ;)
- Author Owner
the number of metrics exposed by cassandra is quite huge:
vsellier@cassandra01 ~ % curl -s http://localhost:7070 | grep -v "^#" | wc -l
24566
- Author Owner
The default scrape_timeout of a service is 10s:
- job_name: serviceMonitor/cassandra/cassandra-jmx-exporter/0
  honor_labels: true
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
The last scrapes of the metrics took between 4s and 6s; it's possible they reach the timeout duration at some point. (from [1])
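If the timeout turns out to be the culprit, one possible mitigation is to raise the scrape timeout on the ServiceMonitor itself. A sketch assuming the prometheus-operator ServiceMonitor CRD; the name, namespace and port name below are illustrative, not the actual manifest:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cassandra-jmx-exporter   # illustrative name
  namespace: cassandra           # illustrative namespace
spec:
  endpoints:
    - port: jmx-exporter         # assumed port name
      interval: 30s
      scrapeTimeout: 25s         # raised from the 10s default; must stay below interval
```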
During my investigation, the target
serviceMonitor/cassandra/cassandra-jmx-exporter/0
sometimes completely disappears from the targets. There are no Prometheus logs or Kubernetes events explaining that.
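One way to correlate both symptoms from the Prometheus side is to watch these two series (PromQL; the job label is taken from the scrape config above):

```
# scrape duration of the cassandra jmx-exporter targets, to see
# how close each scrape gets to the 10s timeout
scrape_duration_seconds{job="serviceMonitor/cassandra/cassandra-jmx-exporter/0"}

# `up` drops to 0 when a scrape fails or times out, and the series
# disappears entirely while the target is no longer discovered
up{job="serviceMonitor/cassandra/cassandra-jmx-exporter/0"}
```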
- Author Owner
- Vincent Sellier mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@46ee60ac
- Vincent Sellier mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@472b2301
- Vincent Sellier mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@8cd52754
- Vincent Sellier mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@646b76de
- Author Owner
The recommended way to declare endpoints manually is now to use an EndpointSlice object, but unfortunately, it seems Prometheus doesn't discover the Cassandra nodes in this case: https://kubernetes.io/docs/concepts/services-networking/service/#services-without-selectors
The point could be that this kind of service should not declare any selector, so as not to trigger the Kubernetes endpoint auto-discovery. The last commit tries that; let's see if it improves anything.
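For reference, the pattern described on the linked page is a Service declared without any selector, paired with a manually maintained Endpoints object of the same name. A sketch with illustrative object names and placeholder IPs, not the actual cluster manifests:

```yaml
# Service without a selector: Kubernetes will not auto-manage its endpoints
apiVersion: v1
kind: Service
metadata:
  name: cassandra-jmx-exporter   # illustrative name
spec:
  ports:
    - name: jmx-exporter
      port: 7070
      targetPort: 7070
---
# Manually maintained Endpoints object; must share the Service's name
apiVersion: v1
kind: Endpoints
metadata:
  name: cassandra-jmx-exporter
subsets:
  - addresses:
      - ip: 192.0.2.10           # cassandra01, placeholder IP
      - ip: 192.0.2.11           # cassandra02, placeholder IP
    ports:
      - name: jmx-exporter
        port: 7070
```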
- Vincent Sellier mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@25d45132
mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@25d45132
- Author Owner
The monitoring is now stable without the service selector.
The root cause of the flip-flop seems to be the kubelet / API manager regularly restarting on both the staging and production clusters.
I will create another issue to address this problem.
- Vincent Sellier closed
- Vincent Sellier added 1h of time spent at 2023-05-10
- Vincent Sellier added 2h of time spent at 2023-05-08
- Vincent Sellier mentioned in issue #4876 (closed)