production/web: Align archive instance with previous instance moma
Compared to moma, the current deployment is undersized: 2 replicas x 2 workers x 2 threads, so only 8 requests can be handled concurrently. As no gunicorn configuration is provided, the defaults come from the values set in the Docker image's Dockerfile.
On moma, the webapp was configured with 32 workers (and a timeout of 3600s). So we bump the number of replicas to 4, and the number of workers to 4 as well. That makes 4 replicas x 4 workers x 2 threads = 32 concurrently handled requests, which matches what moma used to do.
Since we doubled the number of workers, the memory request is doubled as well (current usage is near 95% of the request). This also aligns the request timeout to 3600s.
Another commit adapts the web template so it uses the same defaults as the Dockerfile declares. This makes the current setup explicit through environment variables (with no impact on the deployed configuration).
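As a sketch, the change amounts to something like the following values excerpt (the key names are illustrative, inferred from the WORKERS/THREADS/TIMEOUT environment variables in the diff below; the actual chart layout may differ):

```yaml
# Hypothetical production values excerpt -- key names are assumptions.
web:
  replicas: 4          # was 2
  gunicorn:
    workers: 4         # was 2 (Dockerfile default)
    threads: 2         # unchanged
    timeout: 3600      # aligned with moma
  resources:
    requests:
      memory: 6144Mi   # doubled along with the workers
      cpu: 350m
```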
make swh-helm-diff
[swh] Comparing changes between branches production and align-web-configuration-with-moma (per environment)...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment staging, namespace swh...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra...
[swh] Generate config in production branch for environment staging, namespace swh-cassandra-next-version...
[swh] Generate config in align-web-configuration-with-moma branch for environment staging...
[swh] Generate config in align-web-configuration-with-moma branch for environment staging...
[swh] Generate config in align-web-configuration-with-moma branch for environment staging...
Your branch is up to date with 'origin/production'.
[swh] Generate config in production branch for environment production, namespace swh...
[swh] Generate config in production branch for environment production, namespace swh-cassandra...
[swh] Generate config in production branch for environment production, namespace swh-cassandra-next-version...
[swh] Generate config in align-web-configuration-with-moma branch for environment production...
[swh] Generate config in align-web-configuration-with-moma branch for environment production...
[swh] Generate config in align-web-configuration-with-moma branch for environment production...
------------- diff for environment staging namespace swh -------------
--- /tmp/swh-chart.swh.eDzE0c3T/staging-swh.before 2024-01-16 18:07:44.478771192 +0100
+++ /tmp/swh-chart.swh.eDzE0c3T/staging-swh.after 2024-01-16 18:07:45.110766624 +0100
@@ -24621,20 +24621,26 @@
value: webapp-postgresql.internal.staging.swh.network
initialDelaySeconds: 3
periodSeconds: 10
timeoutSeconds: 30
command:
- /bin/bash
args:
- -c
- /opt/swh/entrypoint.sh
env:
+ - name: WORKERS
+ value: "2"
+ - name: THREADS
+ value: "2"
+ - name: TIMEOUT
+ value: "3600"
- name: STATSD_HOST
value: prometheus-statsd-exporter
- name: STATSD_PORT
value: "9125"
- name: LOG_LEVEL
value: "INFO"
- name: SWH_CONFIG_FILENAME
value: /etc/swh/config.yml
- name: SWH_SENTRY_ENVIRONMENT
value: staging
------------- diff for environment staging namespace swh-cassandra -------------
--- /tmp/swh-chart.swh.eDzE0c3T/staging-swh-cassandra.before 2024-01-16 18:07:44.682769718 +0100
+++ /tmp/swh-chart.swh.eDzE0c3T/staging-swh-cassandra.after 2024-01-16 18:07:45.314765150 +0100
@@ -23105,20 +23105,26 @@
value: webapp.staging.swh.network
initialDelaySeconds: 3
periodSeconds: 10
timeoutSeconds: 30
command:
- /bin/bash
args:
- -c
- /opt/swh/entrypoint.sh
env:
+ - name: WORKERS
+ value: "2"
+ - name: THREADS
+ value: "2"
+ - name: TIMEOUT
+ value: "3600"
- name: STATSD_HOST
value: prometheus-statsd-exporter
- name: STATSD_PORT
value: "9125"
- name: LOG_LEVEL
value: "INFO"
- name: SWH_CONFIG_FILENAME
value: /etc/swh/config.yml
- name: SWH_SENTRY_ENVIRONMENT
value: staging
------------- diff for environment staging namespace swh-cassandra-next-version -------------
--- /tmp/swh-chart.swh.eDzE0c3T/staging-swh-cassandra-next-version.before 2024-01-16 18:07:44.878768301 +0100
+++ /tmp/swh-chart.swh.eDzE0c3T/staging-swh-cassandra-next-version.after 2024-01-16 18:07:45.506763762 +0100
@@ -21282,20 +21282,26 @@
value: webapp-cassandra-next-version.internal.staging.swh.network
initialDelaySeconds: 3
periodSeconds: 10
timeoutSeconds: 30
command:
- /bin/bash
args:
- -c
- /opt/swh/entrypoint.sh
env:
+ - name: WORKERS
+ value: "2"
+ - name: THREADS
+ value: "2"
+ - name: TIMEOUT
+ value: "3600"
- name: STATSD_HOST
value: prometheus-statsd-exporter
- name: STATSD_PORT
value: "9125"
- name: LOG_LEVEL
value: "INFO"
- name: SWH_CONFIG_FILENAME
value: /etc/swh/config.yml
- name: SWH_SENTRY_ENVIRONMENT
value: staging
------------- diff for environment production namespace swh -------------
--- /tmp/swh-chart.swh.eDzE0c3T/production-swh.before 2024-01-16 18:07:45.778761796 +0100
+++ /tmp/swh-chart.swh.eDzE0c3T/production-swh.after 2024-01-16 18:07:46.210758674 +0100
@@ -29902,20 +29902,26 @@
value: webapp1.internal.softwareheritage.org
initialDelaySeconds: 3
periodSeconds: 10
timeoutSeconds: 30
command:
- /bin/bash
args:
- -c
- /opt/swh/entrypoint.sh
env:
+ - name: WORKERS
+ value: "2"
+ - name: THREADS
+ value: "2"
+ - name: TIMEOUT
+ value: "3600"
- name: STATSD_HOST
value: prometheus-statsd-exporter
- name: STATSD_PORT
value: "9125"
- name: LOG_LEVEL
value: "INFO"
- name: SWH_CONFIG_FILENAME
value: /etc/swh/config.yml
- name: SWH_SENTRY_ENVIRONMENT
value: production
@@ -29985,21 +29991,21 @@
# Source: swh/templates/web/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
namespace: swh
name: web-archive
labels:
app: web-archive
spec:
revisionHistoryLimit: 2
- replicas: 2
+ replicas: 4
selector:
matchLabels:
app: web-archive
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
template:
metadata:
labels:
@@ -30115,21 +30121,21 @@
args:
- -c
- cp -r $PWD/.local/share/swh/web/static/ /usr/share/swh/web/static/
volumeMounts:
- name: static
mountPath: /usr/share/swh/web/static
containers:
- name: web-archive
resources:
requests:
- memory: 3072Mi
+ memory: 6144Mi
cpu: 350m
image: container-registry.softwareheritage.org/swh/infra/swh-apps/web:20240111.1
imagePullPolicy: IfNotPresent
ports:
- containerPort: 5004
name: webapp
readinessProbe:
httpGet:
path: /
port: webapp
@@ -30149,20 +30155,26 @@
value: archive.softwareheritage.org
initialDelaySeconds: 3
periodSeconds: 10
timeoutSeconds: 30
command:
- /bin/bash
args:
- -c
- /opt/swh/entrypoint.sh
env:
+ - name: WORKERS
+ value: "4"
+ - name: THREADS
+ value: "2"
+ - name: TIMEOUT
+ value: "3600"
- name: STATSD_HOST
value: prometheus-statsd-exporter
- name: STATSD_PORT
value: "9125"
- name: LOG_LEVEL
value: "INFO"
- name: SWH_CONFIG_FILENAME
value: /etc/swh/config.yml
- name: SWH_SENTRY_ENVIRONMENT
value: production
------------- diff for environment production namespace swh-cassandra -------------
--- /tmp/swh-chart.swh.eDzE0c3T/production-swh-cassandra.before 2024-01-16 18:07:45.938760640 +0100
+++ /tmp/swh-chart.swh.eDzE0c3T/production-swh-cassandra.after 2024-01-16 18:07:46.366757547 +0100
@@ -14907,20 +14907,26 @@
value: webapp-cassandra.internal.softwareheritage.org
initialDelaySeconds: 3
periodSeconds: 10
timeoutSeconds: 30
command:
- /bin/bash
args:
- -c
- /opt/swh/entrypoint.sh
env:
+ - name: WORKERS
+ value: "2"
+ - name: THREADS
+ value: "2"
+ - name: TIMEOUT
+ value: "3600"
- name: STATSD_HOST
value: prometheus-statsd-exporter
- name: STATSD_PORT
value: "9125"
- name: LOG_LEVEL
value: "INFO"
- name: SWH_CONFIG_FILENAME
value: /etc/swh/config.yml
- name: SWH_SENTRY_ENVIRONMENT
value: production
Merge request reports
Activity
FWIW, 32 workers * 5 threads * 2 replicas is 320 worker threads, which is 10 times what it was on moma (32 workers * 1 thread * 1 replica). The current deployment (5 workers * 5 threads * 2 replicas) is already larger than moma was.
We should probably add some statsd instrumentation to our gunicorn instances to check if they're really being overwhelmed, and ideally turn that into a keda prometheus autoscaler.
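A KEDA-based setup along those lines could look like the following sketch. Everything here is an assumption: the Prometheus query supposes a gunicorn busy-workers metric exported via the statsd instrumentation mentioned above, and the server address and thresholds are placeholders.

```yaml
# Hypothetical KEDA ScaledObject for the web-archive deployment.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-archive
  namespace: swh
spec:
  scaleTargetRef:
    name: web-archive
  minReplicaCount: 2
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        # Placeholder address; assumed gunicorn metric name.
        serverAddress: http://prometheus.example:9090
        query: sum(gunicorn_busy_workers{app="web-archive"})
        threshold: "12"
```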
Edited by Nicolas Dandrimont
- Resolved by Antoine R. Dumont
After looking a bit more, the current deployment looks like 2 worker processes and 2 threads per worker process over 2 replicas (so, 2*2*2 = 8 worker threads overall), which is indeed smaller than moma was (and doesn't seem to match what you said is supposed to be the default).
It does make sense to bump that (but going all the way to 320 is a bit much!)
Edited by Nicolas Dandrimont
- Resolved by Antoine R. Dumont
So either:
- 8 workers
- 1 thread
- 2 replicas
or:
- 4 workers
- 4 threads
- 2 replicas
I don't know what's most sensible here; any ideas?
- Resolved by Antoine R. Dumont
To unconfuse the default value situation, changing the template from
value: {{ $web_config.gunicorn.threads | default 5 | quote }}
to
value: {{ dig "gunicorn" "threads" 5 $web_config | quote }}
(and dropping the
if $web_config.gunicorn
block) might make sense.
Edited by Nicolas Dandrimont
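For reference, the point of `dig` here: with `default`, a nil `$web_config.gunicorn` map makes the `.threads` lookup error out, which is why the template needs the `if` guard; Sprig's `dig` walks the nested keys and falls back to the given default on any missing level. A minimal sketch of what the simplified template could look like (the defaults for workers and timeout are assumptions, mirroring the Dockerfile):

```yaml
env:
  # dig returns 5 if either "gunicorn" or "threads" is missing,
  # so no `if $web_config.gunicorn` guard is needed.
  - name: THREADS
    value: {{ dig "gunicorn" "threads" 5 $web_config | quote }}
  - name: WORKERS
    value: {{ dig "gunicorn" "workers" 2 $web_config | quote }}
  - name: TIMEOUT
    value: {{ dig "gunicorn" "timeout" 3600 $web_config | quote }}
```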
added 1 commit
- 89f6ab94 - production/web: Align archive instance with previous instance moma
- Resolved by Antoine R. Dumont
added 4 commits
- ab92184a...de774880 - 2 commits from branch staging
- 396a5fbb - production/web: Align archive instance with previous instance moma
- bae2c6d2 - template/web: Simplify the gunicorn setup