Commit b98f6f92 authored by Antoine R. Dumont

celery_backend/runner: Switch write order to rabbitmq then postgresql

Messages are now sent to rabbitmq first, then written to postgresql.

In the nominal case where all writes succeed, this changes nothing compared to
the previous implementation (postgresql first, then rabbitmq).

In degraded conditions, though, the new order should behave better (see the
sketch after these notes):

1. If we cannot write to rabbitmq, then we do not write to postgresql either:
this function raises and stops.

2. If we can write to rabbitmq first, then the messages will be consumed
independently of this function. If we then fail to write to postgresql (for
whatever reason), we merely lose the record that the task was already sent.
This means the same task will be rescheduled and sent again. As such tasks are
supposed to be idempotent, that should not be a major issue for their upstream.

Also, those tasks are mostly listers now, and they have state management of
their own, so a rescheduled run should mostly be a no-op (if the ingestion from
the previous run went fine). Edge-case scenarios such as a site being down
behave as before.
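
To make the new control flow concrete, here is a condensed sketch of what the
function does after this change. The tuple shape, the `app.send_task` call and
the `mass_schedule_task_runs` call are taken from the diff below; the exact
signature of `write_to_backends` is elided there, so the one used here is an
assumption:

    import logging

    logger = logging.getLogger(__name__)

    def write_to_backends(backend, app, backend_tasks, celery_tasks):
        # Send to rabbitmq first: if this raises, nothing has been written to
        # postgresql yet, so the tasks remain schedulable (point 1).
        for with_priority, backend_name, backend_id, args, kwargs in celery_tasks:
            app.send_task(backend_name, task_id=backend_id, args=args, kwargs=kwargs)
        logger.debug("Sent %s celery tasks", len(celery_tasks))
        # Only then record the runs in postgresql: if this raises, the same
        # tasks get rescheduled and re-sent later, which idempotent tasks
        # tolerate (point 2).
        backend.mass_schedule_task_runs(backend_tasks)
        logger.debug("Written %s celery tasks", len(backend_tasks))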

Refs. swh/infra/sysadm-environment#5512
parent 41f6156b
@@ -37,21 +37,27 @@ def write_to_backends(
     """Utility function to unify the writing to rabbitmq and the scheduler backends in a
     consistent way (towards transaction-like).

-    In the current state of affairs, the messages are written in postgresql first and
-    then sent to rabbitmq.
+    Messages are sent to rabbitmq first, then written to postgresql.

-    That could pose an issue in case we can write to postgresql and then fail to write
-    in the rabbitmq backend. As those tasks are then marked as "next_run_scheduled",
-    they won't be scheduled again. And as those messages never hit rabbitmq in the first
-    place, that leaves records in a pending state in the postgresql backend (history
-    log).
+    In the nominal case where all writes succeed, this changes nothing compared to the
+    previous implementation (postgresql first, then rabbitmq).

-    It's not an issue if we cannot write to postgresql, because then this code raises
-    and we do not send the messages to rabbitmq.
+    In degraded conditions, though, the new order should behave better:
+
+    1. If we cannot write to rabbitmq, then we do not write to postgresql either: this
+    function raises and stops.
+
+    2. If we can write to rabbitmq first, then the messages will be consumed
+    independently of this function. If we then fail to write to postgresql (for
+    whatever reason), we merely lose the record that the task was already sent. This
+    means the same task will be rescheduled and sent again. As such tasks are supposed
+    to be idempotent, that should not be a major issue for their upstream.
+
+    Also, those tasks are mostly listers now, and they have state management of their
+    own, so a rescheduled run should mostly be a no-op (if the ingestion from the
+    previous run went fine). Edge-case scenarios such as a site being down behave as
+    before.
     """
-    backend.mass_schedule_task_runs(backend_tasks)
-    logger.debug("Written %s celery tasks", len(backend_tasks))
     for with_priority, backend_name, backend_id, args, kwargs in celery_tasks:
         kw = dict(
             task_id=backend_id,
@@ -62,6 +68,8 @@ def write_to_backends(
             kw["queue"] = f"save_code_now:{backend_name}"
         app.send_task(backend_name, **kw)
     logger.debug("Sent %s celery tasks", len(backend_tasks))
+    backend.mass_schedule_task_runs(backend_tasks)
+    logger.debug("Written %s celery tasks", len(backend_tasks))


 def run_ready_tasks(
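
The guarantee described in point 1 of the commit message can be checked with a
small test sketch based on unittest.mock. The import path is inferred from
"celery_backend/runner" in the commit title and the keyword arguments from the
sketch above, so both are assumptions:

    from unittest import mock

    # Module path assumed from the commit title; adjust if it differs.
    from swh.scheduler.celery_backend.runner import write_to_backends

    def test_rabbitmq_failure_skips_postgresql():
        backend = mock.Mock()
        app = mock.Mock()
        # Simulate rabbitmq being unreachable.
        app.send_task.side_effect = RuntimeError("rabbitmq down")
        celery_tasks = [(False, "some.task.name", "task-id", (), {})]
        try:
            write_to_backends(backend, app, backend_tasks=[], celery_tasks=celery_tasks)
        except RuntimeError:
            pass
        # The failure propagated before any postgresql write happened.
        backend.mass_schedule_task_runs.assert_not_called()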