Migrate scheduler services to elastic infra
This should ease the deployment of new listers and loaders, and reduce the dependency on the Debian package.
It will also start the scaffolding of the scheduler in swh-charts.
Charts ready (development):
- Build swh-scheduler apps image
- swh/infra/ci-cd/swh-charts!132 (merged): swh-scheduler-schedule-recurrent
- swh/infra/ci-cd/swh-charts!134 (merged): swh-scheduler-listener
- swh/infra/ci-cd/swh-charts!134 (merged): swh-scheduler-runner
- swh/infra/ci-cd/swh-charts!134 (merged): swh-scheduler-runner-priority
- swh/infra/ci-cd/swh-charts!132 (merged): gunicorn-swh-service [ #4780 (closed) ]
- swh/infra/ci-cd/swh-charts!135 (merged): swh-scheduler-update-metrics.{timer,service}
- swh/infra/ci-cd/swh-charts!135 (merged): swh-scheduler-journal-client
Plan:
- swh-apps: Add swh-scheduler image
- swh-charts: Reference image ^
- Create template for scheduler service(s)
- Tests in minikube
- Tests in staging
  - Label nodes [1]
  - Checks -> fail
  - Loop to fix issues
  - All good [2]
- Stop services temporarily on scheduler0.staging to check that the elastic staging deployment is able to keep up
  - scheduler0.staging: puppet agent --disable; systemctl stop swh-scheduler-schedule-recurrent
  - After more than 1 day of running, queues are still being filled regularly [0] and the pod is fine [0']
  - scheduler0.staging: systemctl stop swh-scheduler-runner
  - Deploy scheduler-runner pod
  - scheduler0.staging: systemctl stop swh-scheduler-runner-priority
  - Deploy scheduler-runner-priority pod
  - scheduler0.staging: systemctl stop swh-scheduler-listener
  - Deploy scheduler-listener
  - Deploy journal client
  - Deploy update-metrics
- Deploy scheduler rpc in staging (requires an ingress)
- Migrate elastic staging services relying on the static scheduler rpc to the elastic scheduler rpc
- Status on scheduler services running in staging
- Deploy scheduler services to production
- Deactivate most services in saatchi (all except rpc)
- Deactivate rpc in saatchi
- puppet: Clean up scheduler nodes to drop all migrated scheduler services (all except rpc) -> at the end of this, the static scheduler node should only run rabbitmq
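
The staging cut-over in the plan stops one systemd unit at a time on scheduler0.staging and brings up the matching pod before moving on. A minimal sketch of that order follows. It only prints the steps and runs nothing; the function name `print_cutover_steps` is hypothetical, and the pod names are assumed to be the unit names without the `swh-` prefix. How the pods are actually deployed is not stated in the plan, so that step is left as a comment line.

```shell
#!/bin/sh
set -eu

# Print the cut-over steps for scheduler0.staging in plan order.
# Nothing is executed; the output is meant to be read, not piped to a shell.
print_cutover_steps() {
  # Disable puppet first, as in the plan, before stopping any unit.
  echo "puppet agent --disable"
  for unit in \
    swh-scheduler-schedule-recurrent \
    swh-scheduler-runner \
    swh-scheduler-runner-priority \
    swh-scheduler-listener
  do
    echo "systemctl stop $unit"
    # Assumption: the pod is named like the unit, minus the "swh-" prefix.
    echo "# deploy ${unit#swh-} pod"
  done
}

print_cutover_steps
```

Keeping the units in one ordered list makes it harder to skip a service or to stop two of them before the first replacement pod has been checked.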
[0] https://grafana.softwareheritage.org/goto/dRfcY_kIk?orgId=1
[0'] https://grafana.softwareheritage.org/goto/Qplk8_zIk?orgId=1
[1]
$ kubectl --context archive-staging-rke2 label --overwrite node rancher-node-staging-rke2-worker6 swh/scheduler=true
node/rancher-node-staging-rke2-worker6 labeled
$ kubectl --context archive-staging-rke2 label --overwrite node rancher-node-staging-rke2-worker1 swh/scheduler=true
node/rancher-node-staging-rke2-worker1 labeled
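
The two labelling commands in [1] can be folded into a loop when more nodes are involved. A minimal sketch, reusing the context, label and node names shown above; the `DRY_RUN` switch and the function name `label_scheduler_nodes` are hypothetical conveniences, not part of any existing tooling. By default it only prints the commands.

```shell
#!/bin/sh
set -eu

CONTEXT="archive-staging-rke2"
LABEL="swh/scheduler=true"
# Hypothetical safety switch: 1 (default) prints the commands, 0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# Label every node given as argument so scheduler pods can be placed on it.
label_scheduler_nodes() {
  for node in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "kubectl --context $CONTEXT label --overwrite node $node $LABEL"
    else
      kubectl --context "$CONTEXT" label --overwrite node "$node" "$LABEL"
    fi
  done
}

label_scheduler_nodes \
  rancher-node-staging-rke2-worker6 \
  rancher-node-staging-rke2-worker1
```

Afterwards, `kubectl --context archive-staging-rke2 get nodes -l swh/scheduler=true` lists the nodes that carry the label.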
[2]
scheduler-schedule-recurrent Running swh command --log-level INFO scheduler --config-file /etc/swh/config.yml schedule-recurrent
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type content with policy already_visited_order_by_lag: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type content with policy never_visited_oldest_update_first: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type content with policy origins_without_last_update: fetched 1.0, requested 0.2
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:content: 1 visits scheduled in queue swh.loader.core.tasks.LoadContent
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git-checkout with policy already_visited_order_by_lag: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git-checkout with policy never_visited_oldest_update_first: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type git-checkout with policy origins_without_last_update: fetched 1.0, requested 0.2
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:git-checkout: 66 visits scheduled in queue swh.loader.git.tasks.LoadGitCheckout
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type tarball-directory with policy already_visited_order_by_lag: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type tarball-directory with policy never_visited_oldest_update_first: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type tarball-directory with policy origins_without_last_update: fetched 1.0, requested 0.2
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:tarball-directory: 200 visits scheduled in queue swh.loader.core.tasks.LoadTarballDirectory
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type tarball-directory with policy already_visited_order_by_lag: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type tarball-directory with policy never_visited_oldest_update_first: fetched 0.0, requested 0.4
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:Skewed fetch for visit type tarball-directory with policy origins_without_last_update: fetched 1.0, requested 0.2
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:tarball-directory: 77 visits scheduled in queue swh.loader.core.tasks.LoadTarballDirectory
Stream closed EOF for swh/scheduler-schedule-recurrent-5d8644f5d8-f7gbw (prepare-configuration)
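
To check at a glance that queues are being filled, the "visits scheduled" lines of such a log can be totalled per queue. A minimal sketch, assuming the log line layout shown in [2]; the function name `summarize_scheduled_visits` is hypothetical.

```shell
#!/bin/sh
set -eu

# Read schedule-recurrent log lines on stdin and print "<queue> <total>"
# for every "N visits scheduled in queue <queue>" line.
summarize_scheduled_visits() {
  # $NF is the queue name; the visit count sits five fields before it.
  awk '/ visits scheduled in queue / { count[$NF] += $(NF - 5) }
       END { for (q in count) print q, count[q] }' | LC_ALL=C sort
}

# Demo on the four summary lines from the log above.
summarize_scheduled_visits <<'EOF'
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:content: 1 visits scheduled in queue swh.loader.core.tasks.LoadContent
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:git-checkout: 66 visits scheduled in queue swh.loader.git.tasks.LoadGitCheckout
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:tarball-directory: 200 visits scheduled in queue swh.loader.core.tasks.LoadTarballDirectory
scheduler-schedule-recurrent INFO:swh.scheduler.celery_backend.recurrent_visits:tarball-directory: 77 visits scheduled in queue swh.loader.core.tasks.LoadTarballDirectory
EOF
```

In practice the function would be fed from the pod's log stream rather than a here-document.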