Federate Prometheus instances through Thanos
Thanos is the Swiss Army knife for Prometheus federation, HA, and clustering.
It provides a global query view over multiple, potentially redundant, Prometheus data stores: Prometheus instances push their data to centralised object stores, and query frontends are then provided for each of these data stores.
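To make the "global view" idea more concrete, here is a minimal conceptual sketch in Python. It is not Thanos code and it simplifies the real deduplication logic. The sample data, the `replica` label name, and the `global_view` helper are all assumptions made for illustration. It merges samples for the same series from several stores and keeps a single sample per timestamp when redundant replicas overlap.

```python
from collections import defaultdict

# Hypothetical samples: (labels, timestamp, value). Two redundant Prometheus
# replicas scraped the same target, so they differ only by the "replica" label.
STORE_A = [
    ({"job": "node", "replica": "a"}, 100, 1.0),
    ({"job": "node", "replica": "a"}, 115, 2.0),
]
STORE_B = [
    ({"job": "node", "replica": "b"}, 100, 1.0),
    ({"job": "node", "replica": "b"}, 130, 3.0),
]


def global_view(stores, replica_label="replica"):
    """Merge samples from all stores, one sample per series and timestamp."""
    merged = defaultdict(dict)
    for store in stores:
        for labels, ts, value in store:
            # Series identity ignores the replica label.
            key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
            # The first store wins when two replicas report the same timestamp.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


view = global_view([STORE_A, STORE_B])
for series, points in view.items():
    print(dict(series), points)
```

The merged series covers the gaps of each individual replica, which is the property that makes redundant Prometheus instances useful behind a single query frontend.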
Plan:
- Install manual thanos services in mmca (temporary provenance server)
- Push historical data from mmca to a thanos datastore bucket
- Push historical data from pergamon to a thanos datastore bucket
- infra/swh-sysadmin-provisioning!80: Provision thanos query dedicated node (+ inventory update)
- D8092: Expose a thanos query service to read from those datastores
- infra/puppet/puppet-swh-site!534: Expose thanos gateway service to access historical data
- [ ] Expose thanos gateway on mmca (historical data access) -> will make it run on thanos node
- infra/puppet/puppet-swh-site!534: Update thanos query to read from those gateways as well
- Fix communication between thanos and pergamon nodes (firewall)
- Fix communication between thanos and mmca nodes (certs)
- infra/puppet/puppet-swh-site!532: Drop mmca's prometheus federation from puppet
- mmca: Drop history on Prometheus server (/var/lib/prometheus/metrics2) [3]
- mmca: Clean up historical data from bucket mmca-metrics-0 [3]
- Switch grafana datasource from pergamon's prometheus to the thanos query service
- Instantiate thanos sidecar service in staging cluster (then reference it to thanos node)
- [ ] Instantiate prometheus/thanos services in staging environment: no more need for it since #4540 (closed)
- Instantiate prometheus/thanos services in archive-staging environment
- Instantiate prometheus/thanos services in archive-production environment
- Instantiate prometheus/thanos services in admin environment
- Instantiate prometheus/thanos services in azure environment
- Instantiate prometheus/thanos services in gitlab staging environment
- Instantiate prometheus/thanos services in gitlab production environment
- Instantiate prometheus/thanos services in rancher environment
- Federate it through thanos (puppet run on thanos node should add their grpc entries)
- Drop pergamon's prometheus
- Document
A draft note can be found in the HedgeDoc document [2].
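For the Grafana datasource switch in the plan, it may help to recall that the thanos query service exposes a Prometheus-compatible HTTP API, which is why it can stand in for pergamon's prometheus as a datasource. The sketch below only builds an instant-query URL and sends nothing over the network. The host name `thanos.example.org` is hypothetical, and port 10902 is assumed to be the HTTP port (the upstream default); neither comes from this ticket.

```python
from urllib.parse import urlencode

# Hypothetical address of the thanos query service; not taken from the ticket.
THANOS_QUERY_URL = "http://thanos.example.org:10902"


def instant_query_url(expr, dedup=True, base_url=THANOS_QUERY_URL):
    """Build an instant-query URL for a Prometheus-compatible /api/v1/query endpoint."""
    params = {
        "query": expr,
        # Thanos-specific parameter: ask the querier to deduplicate replicas.
        "dedup": "true" if dedup else "false",
    }
    return f"{base_url}/api/v1/query?{urlencode(params)}"


url = instant_query_url("up")
print(url)
```

Comparing the results of the same expression with `dedup=True` and `dedup=False` is one way to check whether several stores return overlapping data before switching dashboards over.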
- [2] https://hedgedoc.softwareheritage.org/X1henrmkT8yL6_W9R0YpGg?both
- [3] A trial switch to the thanos query service showed that metrics were doubled, because pergamon and mmca both hold the historical data. mmca's copy is no longer needed, since pergamon already has it through the old federation, so it can be dropped now.
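The doubling described in [3] can be illustrated with a small sketch. It assumes, hypothetically, that the same historical sample is stored once per data store and that the two copies differ only by a label the query layer does not treat as a replica label, so they count as distinct series. The `origin` label, the metric name, and the values are made up for illustration.

```python
# Hypothetical copies of one historical sample, one per data store.
samples = [
    {"metric": "swh_example_total", "origin": "pergamon", "value": 42.0},
    {"metric": "swh_example_total", "origin": "mmca", "value": 42.0},
]


def aggregate_sum(samples, ignore_labels=()):
    """Sum values per series after dropping the given labels; identical series count once."""
    seen = {}
    for sample in samples:
        key = tuple(sorted(
            (k, v) for k, v in sample.items()
            if k != "value" and k not in ignore_labels
        ))
        seen[key] = sample["value"]
    return sum(seen.values())


doubled = aggregate_sum(samples)                                  # both copies are counted
deduplicated = aggregate_sum(samples, ignore_labels=("origin",))  # copies collapse into one
print(doubled, deduplicated)
```

Dropping one of the two copies, as the plan does for mmca, removes the duplication at the source instead of relying on the query layer to collapse it.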
Migrated from T4385 (view on Phabricator)
Edited by Vincent Sellier