scheduler/runner: Write first to rabbitmq then to postgresql
The first commit makes the current behavior explicit, along with what could go wrong. It writes first to postgresql, then to rabbitmq. If anything prevents the write to rabbitmq, tasks can be left dangling in the 'next_run_scheduled' status. They are then never rescheduled unless a cron job or an operator unsticks them manually (with an SQL query).
The second and third commits ensure that the write to postgresql happens only after the write to rabbitmq. Inspecting the logs shows that the bad scenario described above does currently happen. This change should definitely reduce the number of occurrences but won't solve all cases: if a rabbitmq message is sent and picked up by a consumer but not fully processed, the task can still get stuck. So we'll still need a cleanup process. The change is still worth considering, as it adds a bit more resilience.
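To illustrate the ordering argument, here is a minimal sketch using in-memory stand-ins rather than the real scheduler code. `InMemoryBroker`, `InMemoryDb`, `BrokerUnavailable` and `schedule` are hypothetical names, and `next_run_not_scheduled` is assumed to be the status a task has before it is sent. It only shows that, when the broker write comes first, a failed send leaves the status untouched, so a later run can pick the task up again.

```python
from dataclasses import dataclass, field


class BrokerUnavailable(Exception):
    """Raised by the hypothetical broker when a message cannot be sent."""


@dataclass
class InMemoryBroker:
    """Stand-in for rabbitmq; `up` toggles simulated availability."""

    up: bool = True
    messages: list = field(default_factory=list)

    def send(self, task_id: int) -> None:
        if not self.up:
            raise BrokerUnavailable(f"cannot send task {task_id}")
        self.messages.append(task_id)


@dataclass
class InMemoryDb:
    """Stand-in for postgresql: maps a task id to its status."""

    status: dict = field(default_factory=dict)

    def mark_scheduled(self, task_ids) -> None:
        for task_id in task_ids:
            self.status[task_id] = "next_run_scheduled"


def schedule(task_ids, broker, db):
    """Send tasks to the broker first, then record only those actually sent."""
    sent = []
    for task_id in task_ids:
        try:
            broker.send(task_id)
        except BrokerUnavailable:
            # Stop here: unsent tasks keep their previous status,
            # so a later run will try them again.
            break
        sent.append(task_id)
    db.mark_scheduled(sent)
    return sent
```

With the opposite order (database first), the same broker failure would leave every task marked 'next_run_scheduled' without any message in the queue, which is the stuck state described above.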
If this proposal is accepted, what needs to happen:
- Tag scheduler
- Deploy this new version ^
- manually unstick the stuck "tasks" (in both the staging and production environments)
- monitor whether stuck tasks still pop up regularly (maybe craft a prometheus alert [or something] able to count the tasks in the stuck state)
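As a rough sketch of what "counting the tasks in the stuck state" could mean for the monitoring step, the function below counts tasks that have stayed in 'next_run_scheduled' longer than a threshold. The function name `count_stuck`, the `(status, since)` row shape and the 24-hour default are all assumptions made for illustration; the real query and threshold would depend on the scheduler schema.

```python
from datetime import datetime, timedelta, timezone


def count_stuck(tasks, now, max_age=timedelta(hours=24)):
    """Count tasks in 'next_run_scheduled' for longer than `max_age`.

    `tasks` is an iterable of (status, since) pairs, where `since` is the
    timezone-aware time at which the task entered its current status.
    """
    return sum(
        1
        for status, since in tasks
        if status == "next_run_scheduled" and now - since > max_age
    )
```

The resulting number is the kind of value an alert could watch: a count that stays above zero over several checks would suggest tasks are still getting stuck.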
Activity
Jenkins job DSCH/gitlab-builds #322 succeeded in 4 min 37 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
added 1 commit
- 48c68c1d - celery_backend/runner: Switch write order to rabbitmq then postgresql
Jenkins job DSCH/gitlab-builds #323 succeeded in 4 min 11 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
Jenkins job DSCH/gitlab-builds #324 succeeded in 4 min 14 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
mentioned in issue swh/infra/sysadm-environment#5512 (closed)
- Resolved by Antoine R. Dumont
added 1 commit
- b5dcb980 - runner: Update task status only after sending the tasks to rabbitmq
Jenkins job DSCH/gitlab-builds #325 failed in 4 min 14 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
added 1 commit
- d4e9af17 - runner: Update task status only after sending the tasks to rabbitmq
Jenkins job DSCH/gitlab-builds #326 failed in 4 min 18 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
added 1 commit
- 9412c28c - runner: Update task status only after sending the tasks to rabbitmq
Jenkins job DSCH/gitlab-builds #327 failed in 4 min 10 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
added 1 commit
- cbdb7c4d - runner: Update task status only after sending the tasks to rabbitmq
Jenkins job DSCH/gitlab-builds #328 failed in 4 min 16 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
Build failure is unrelated to this [1] [2]
[1] https://jenkins.softwareheritage.org/job/DSCH/job/gitlab-builds/328/consoleFull
[2]
14:28:47 swh/scheduler/tests/test_simulator.py:65:
14:28:47 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
14:28:47 swh/scheduler/simulator/common.py:139: in format
14:28:47     plot = self.metrics_plot()
14:28:47 swh/scheduler/simulator/common.py:96: in metrics_plot
14:28:47     return figure.show(legend=True)
14:28:47 .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:459: in show
14:28:47     xmin, xmax = self.x_limits()
14:28:47 .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:192: in x_limits
14:28:47     return self._limits(self._x_min, self._x_max, False)
14:28:47 .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:242: in _limits
14:28:47     return _choose(low, high, low_set, high_set)
14:28:47 .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:581: in _choose
14:28:47     diff = _diff(low, high)
14:28:47 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
14:28:47
14:28:47 low = datetime.datetime(2025, 3, 13, 13, 28, 43, 500065, tzinfo=datetime.timezone.utc)
14:28:47 high = datetime.datetime(2025, 3, 13, 13, 28, 43, 500065, tzinfo=datetime.timezone.utc)
14:28:47
14:28:47     def _diff(low, high):
14:28:47         if low == high:
14:28:47             if low == 0:
14:28:47                 return 0.5
14:28:47             else:
14:28:47 >               return abs(low * 0.1)
14:28:47 E               TypeError: unsupported operand type(s) for *: 'datetime.datetime' and 'float'
14:28:47
14:28:47 .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:545: TypeError
@jenkins retry build
Jenkins job DSCH/gitlab-builds #329 failed in 4 min 17 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
Jenkins job DSCH/gitlab-builds #330 failed in 4 min 20 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
- Resolved by Antoine R. Dumont
Inverting the backend writes does indeed seem more logical; looks good to me.
The broken test is caused by these changes, though.
added 5 commits
- cbdb7c4d...c4f108cf - 2 commits from branch swh/devel:master
- 41f6156b - celery_backend/runner: Extract a write to backends function
- b98f6f92 - celery_backend/runner: Switch write order to rabbitmq then postgresql
- 593be3d8 - runner: Update task status only after sending the tasks to rabbitmq
Jenkins job DSCH/gitlab-builds #338 failed in 4 min 13 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
added 1 commit
- 9fca500c - runner: Update task status only after sending the tasks to rabbitmq
Jenkins job DSCH/gitlab-builds #339 failed in 4 min 9 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
added 1 commit
- 163f5526 - runner: Update task status only after sending the tasks to rabbitmq
Jenkins job DSCH/gitlab-builds #340 failed in 4 min 10 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
Jenkins job DSCH/gitlab-builds #341 failed in 4 min 11 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
- Resolved by Antoine R. Dumont
- Resolved by Antoine R. Dumont
Jenkins job DSCH/gitlab-builds #343 failed in 4 min 35 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
Jenkins job DSCH/gitlab-builds #344 failed in 4 min 15 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
Jenkins job DSCH/gitlab-builds #345 succeeded in 4 min 18 sec.
See Console Output, Blue Ocean and Coverage Report for more details.
- Resolved by Antoine R. Dumont