production/storage-rpc: Use tcp liveness probe

Antoine R. Dumont requested to merge stabilize-storage-liveness-probe into staging

This switches the liveness probe to a TCP probe [1], which should avoid a cascading effect. When the workers are too busy to answer the probe, the HTTP liveness probe fails and the pod gets restarted, which in effect kills the ongoing requests.

[1] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-tcp-liveness-probe
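
To illustrate why the two probe types behave differently under load, here is a minimal Python sketch. It is not the kubelet's implementation: `tcp_probe`, `http_probe` and `busy_server` are hypothetical helpers, and the "busy" server is simulated by a listening socket that never calls accept(), assuming that is roughly what saturated workers look like from the outside. The kernel still completes the TCP handshake for queued connections, so the TCP check passes while the HTTP check times out.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Roughly what a tcpSocket probe checks: can a connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_probe(host, port, timeout=1.0):
    """Roughly what an httpGet probe on '/' checks: a 2xx/3xx answer in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"GET / HTTP/1.0\r\nHost: probe\r\n\r\n")
            status_line = conn.recv(64)  # raises a timeout error if nobody answers
            parts = status_line.split()
            return len(parts) >= 2 and 200 <= int(parts[1]) < 400
    except (OSError, ValueError):
        return False


def busy_server():
    """Listening socket that never accepts: stands in for saturated workers."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(8)               # connections queue in the kernel backlog
    return server


if __name__ == "__main__":
    server = busy_server()
    host, port = server.getsockname()
    print("tcp probe ok: ", tcp_probe(host, port))
    print("http probe ok:", http_probe(host, port, timeout=0.5))
    server.close()
```

The trade-off is that a TCP probe no longer detects a process that is listening but permanently stuck; in the rendered manifests below, the readiness probe keeps its httpGet check.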

If this works well, we should probably do the same for the remaining rpc services (in another diff). (Another commit in this MR does the same for the webapp template.)

(This is currently being tested, without issue so far, on both the storage rpc and the webapp instances.)

make swh-helm-diff
[swh] Comparing changes between branches production and stabilize-storage-liveness-probe (per environment)...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment staging, namespace swh...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra-next-version...
[swh] Generate config in stabilize-storage-liveness-probe branch for environment staging...
[swh] Generate config in stabilize-storage-liveness-probe branch for environment staging...
[swh] Generate config in stabilize-storage-liveness-probe branch for environment staging...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment production, namespace swh...
[swh] Generate config in production branch for environment production, namespace swh-cassandra...
[swh] Generate config in production branch for environment production, namespace swh-cassandra-next-version...
[swh] Generate config in stabilize-storage-liveness-probe branch for environment production...
[swh] Generate config in stabilize-storage-liveness-probe branch for environment production...
[swh] Generate config in stabilize-storage-liveness-probe branch for environment production...


------------- diff for environment staging namespace swh -------------

--- /tmp/swh-chart.swh.3D6eXKBU/staging-swh.before      2024-01-25 13:22:22.935902853 +0100
+++ /tmp/swh-chart.swh.3D6eXKBU/staging-swh.after       2024-01-25 13:22:23.959902479 +0100
@@ -23918,22 +23918,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: THREADS
@@ -24057,22 +24056,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: THREADS


------------- diff for environment staging namespace swh-cassandra -------------

--- /tmp/swh-chart.swh.3D6eXKBU/staging-swh-cassandra.before    2024-01-25 13:22:23.375902692 +0100
+++ /tmp/swh-chart.swh.3D6eXKBU/staging-swh-cassandra.after     2024-01-25 13:22:24.223902383 +0100
@@ -22398,22 +22398,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: THREADS


------------- diff for environment staging namespace swh-cassandra-next-version -------------

--- /tmp/swh-chart.swh.3D6eXKBU/staging-swh-cassandra-next-version.before       2024-01-25 13:22:23.639902596 +0100
+++ /tmp/swh-chart.swh.3D6eXKBU/staging-swh-cassandra-next-version.after        2024-01-25 13:22:24.439902304 +0100
@@ -20894,22 +20894,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: THREADS


------------- diff for environment production namespace swh -------------

--- /tmp/swh-chart.swh.3D6eXKBU/production-swh.before   2024-01-25 13:22:24.835902160 +0100
+++ /tmp/swh-chart.swh.3D6eXKBU/production-swh.after    2024-01-25 13:22:25.411901950 +0100
@@ -32070,22 +32070,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: THREADS
@@ -32435,22 +32434,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: THREADS


------------- diff for environment production namespace swh-cassandra -------------

--- /tmp/swh-chart.swh.3D6eXKBU/production-swh-cassandra.before 2024-01-25 13:22:25.019902093 +0100
+++ /tmp/swh-chart.swh.3D6eXKBU/production-swh-cassandra.after  2024-01-25 13:22:25.643901865 +0100
@@ -14245,22 +14245,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: STATSD_HOST
@@ -14602,22 +14601,21 @@
             - containerPort: 5002
               name: rpc
           readinessProbe:
             httpGet:
               path: /
               port: rpc
             initialDelaySeconds: 15
             failureThreshold: 30
             periodSeconds: 5
           livenessProbe:
-            httpGet:
-              path: /
+            tcpSocket:
               port: rpc
             initialDelaySeconds: 10
             periodSeconds: 5
           command:
           - /bin/bash
           args:
           - -c
           - /opt/swh/entrypoint.sh
           env:
             - name: STATSD_HOST

Refs. swh/infra/sysadm-environment#5215 (closed)

Edited by Antoine R. Dumont