Deploy svix server in production
Plan:

- swh/infra/ci-cd/swh-charts!367 (closed): Try rabbitmq as the messaging queue system ~> fail
- Do we keep redis as backend? Yes, then.
- Deploy
  - Create a swh-svix database in albertina
  - Create the svix-server namespace
  - test-staging-rke2: Use a clustered redis within the svix stack.
  - test-staging-rke2: swh/infra/ci-cd/swh-charts!378 (merged): Adapt the deployment so svix-server is able to use a clustered redis.
  - archive-staging-rke2: Migrate the staging deployment to the same setup as test-staging-rke2
  - Checks: Is the archive-staging-rke2 svix instance still functional? Status: it is functional, up to failures where svix gets lost.
  - Test redis-standalone with a cephfs mount: the cephfs volume refuses to mount (probably because the ceph cluster is not in great shape [2])
  - swh/infra/ci-cd/k8s-clusters-conf!42 (closed): Create a redis standalone in the svix-server namespace (data in a local volume attached to a node, redis monitoring exporter, ...)
  - swh/infra/ci-cd/swh-charts!369 (closed): swh-charts/cluster-components: Deploy svix through the chart
  - Update network policies (the charts were adapted according to the tests and previous developments above)
  - Configure the svix server
  - Add a dns entry in the pergamon dns
  - swh/infra/ci-cd/swh-charts!385 (merged): Add a blackbox configuration to monitor the svix server availability
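The blackbox check from the last plan item could be sketched roughly as follows. This is a minimal prometheus-blackbox-exporter module for an HTTP 2xx probe; the module name and any target URL wired to it are assumptions for illustration, not taken from swh-charts!385:

```yaml
# Hypothetical blackbox-exporter module: probe the svix server over HTTP
# and consider it available on any 2xx response.
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: []  # empty defaults to 2xx
```

A Probe custom resource (or a prometheus scrape job) would then point this module at the svix server's public endpoint, so availability shows up in grafana/alertmanager like the other blackbox targets.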
Note: Just a point of information. The retention policy is configured at 14 days on the staging instance. While the 'messagecontent' table is indeed cleaned up accordingly (the 'payload' part of messages), the 'message' table (and the related 'messagedestination' and 'messageattempt' tables) apparently grows indefinitely [1] (they are not cleaned up in staging). So we may need an extra cleaner process for this too.
[1]

```sql
swh-svix=# select now(), count(*) from message;
              now              |  count
-------------------------------+---------
 2024-03-05 10:26:17.714142+00 | 7476509
(1 row)

swh-svix=# select now(), count(*) from message where expiration < now();
              now              |  count
-------------------------------+---------
 2024-03-05 10:26:22.225471+00 | 7001008
(1 row)
```
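The extra cleaner process mentioned in the note could be as simple as a periodic delete of expired rows, e.g. run from a kubernetes CronJob. A minimal sketch, assuming the `expiration` column is the right retention criterion for `message` and that rows in `messagedestination`/`messageattempt` are removed along with their parent message (both assumptions to verify against the actual svix schema before running anything):

```sql
-- Hypothetical cleanup query, to be scheduled periodically.
-- Verify against the real svix schema that deleting from "message"
-- also takes care of messagedestination/messageattempt rows
-- (e.g. via ON DELETE CASCADE) before using this in anger.
delete from message
where expiration < now();
```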
[2] That's still a no-go: if we can't rely on the cephfs volume to be resilient and to remount automatically when ceph comes back, that will block svix too, so no cephfs for now.
Refs. #5105 (closed)
Edited by Antoine R. Dumont