scheduler/runner: Write first to rabbitmq then to postgresql

The first commit makes the current behavior explicit and shows where it can go wrong. The runner currently writes to postgresql first, then to rabbitmq. If anything prevents the write to rabbitmq, tasks are left dangling in the 'next_run_scheduled' state. They are then never rescheduled unless a cron job or an operator unsticks them manually (with an SQL query).

The second and third commits ensure that the write to postgresql happens only after the write to rabbitmq has gone through. Going through the logs confirms that the bad scenario described above does happen in practice. This change should definitely reduce the number of occurrences, but it won't cover every case: if a rabbitmq message is sent and picked up by a consumer but not fully processed, a task can still end up stuck. So we'll still need a cleanup process. The change is still worth considering, as it adds a bit more resilience.
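
To make the ordering argument concrete, here is a minimal sketch with in-memory stand-ins. `FakeBroker`, `FakeDb`, both `schedule_*` functions and the initial status value are hypothetical names chosen for illustration; they are not the scheduler's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBroker:
    """Hypothetical stand-in for rabbitmq; `fail` simulates an unreachable broker."""
    fail: bool = False
    sent: list = field(default_factory=list)

    def send(self, task_id):
        if self.fail:
            raise ConnectionError("broker unreachable")
        self.sent.append(task_id)


@dataclass
class FakeDb:
    """Hypothetical stand-in for the postgresql task table."""
    status: dict = field(default_factory=dict)

    def mark_scheduled(self, task_id):
        self.status[task_id] = "next_run_scheduled"


def schedule_db_first(task_id, broker, db):
    # Old order: the status changes even if the send below fails.
    db.mark_scheduled(task_id)
    broker.send(task_id)


def schedule_broker_first(task_id, broker, db):
    # New order: the status changes only once the send has succeeded.
    broker.send(task_id)
    db.mark_scheduled(task_id)


def run(schedule, task_id=1):
    """Run one scheduling attempt against a failing broker, return the final status."""
    broker = FakeBroker(fail=True)
    db = FakeDb({task_id: "not_scheduled"})  # hypothetical initial status
    try:
        schedule(task_id, broker, db)
    except ConnectionError:
        pass
    return db.status[task_id]
```

With a failing broker, `run(schedule_db_first)` leaves the task in 'next_run_scheduled' although nothing was sent, while `run(schedule_broker_first)` leaves the status untouched so the task can be picked up again.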

If this proposal is accepted, here is what needs to happen:

  • Tag scheduler
  • Deploy this new version ^
  • Manually unstick the stuck tasks (in both the staging and production environments)
  • Monitor whether stuck tasks still pop up regularly (maybe craft a prometheus alert [or something similar] that counts the tasks in the stuck state)
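
As a rough sketch of what such monitoring could count, assuming each task is available as a `(status, since)` pair: `count_stuck`, the pair layout and the 24-hour threshold are all assumptions made for illustration, not part of the scheduler.

```python
from datetime import datetime, timedelta, timezone


def count_stuck(tasks, now, max_age=timedelta(hours=24)):
    """Count tasks that have been in 'next_run_scheduled' for longer than max_age.

    `tasks` is an iterable of (status, since) pairs, where `since` is the
    timezone-aware datetime at which the task entered that status.
    """
    return sum(
        1
        for status, since in tasks
        if status == "next_run_scheduled" and now - since > max_age
    )


now = datetime(2025, 3, 14, tzinfo=timezone.utc)
tasks = [
    ("next_run_scheduled", now - timedelta(days=2)),   # old enough: counted
    ("next_run_scheduled", now - timedelta(hours=1)),  # recent: not counted
    ("other", now - timedelta(days=3)),                # different status: not counted
]
stuck = count_stuck(tasks, now)
```

The resulting number could then be exposed as a gauge and alerted on when it stays above zero.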

[1] Refs. swh/infra/sysadm-environment#5512 (closed)

Edited by Antoine R. Dumont

Activity

  • Antoine R. Dumont marked this merge request as draft

  • It's currently not enough, as the runner calls the grab* tasks, which also update the task status to 'next_run_scheduled'. So another commit is needed to improve on that.

  • added 1 commit

    • b5dcb980 - runner: Update task status only after sending the tasks to rabbitmq

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #325 failed in 4 min 14 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • d4e9af17 - runner: Update task status only after sending the tasks to rabbitmq

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #326 failed in 4 min 18 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • 9412c28c - runner: Update task status only after sending the tasks to rabbitmq

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #327 failed in 4 min 10 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • cbdb7c4d - runner: Update task status only after sending the tasks to rabbitmq

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #328 failed in 4 min 16 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • Build failure is unrelated to this [1] [2]

    [1] https://jenkins.softwareheritage.org/job/DSCH/job/gitlab-builds/328/consoleFull

    [2]

    14:28:47  swh/scheduler/tests/test_simulator.py:65: 
    14:28:47  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    14:28:47  swh/scheduler/simulator/common.py:139: in format
    14:28:47      plot = self.metrics_plot()
    14:28:47  swh/scheduler/simulator/common.py:96: in metrics_plot
    14:28:47      return figure.show(legend=True)
    14:28:47  .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:459: in show
    14:28:47      xmin, xmax = self.x_limits()
    14:28:47  .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:192: in x_limits
    14:28:47      return self._limits(self._x_min, self._x_max, False)
    14:28:47  .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:242: in _limits
    14:28:47      return _choose(low, high, low_set, high_set)
    14:28:47  .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:581: in _choose
    14:28:47      diff = _diff(low, high)
    14:28:47  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    14:28:47  
    14:28:47  low = datetime.datetime(2025, 3, 13, 13, 28, 43, 500065, tzinfo=datetime.timezone.utc)
    14:28:47  high = datetime.datetime(2025, 3, 13, 13, 28, 43, 500065, tzinfo=datetime.timezone.utc)
    14:28:47  
    14:28:47      def _diff(low, high):
    14:28:47          if low == high:
    14:28:47              if low == 0:
    14:28:47                  return 0.5
    14:28:47              else:
    14:28:47  >               return abs(low * 0.1)
    14:28:47  E               TypeError: unsupported operand type(s) for *: 'datetime.datetime' and 'float'
    14:28:47  
    14:28:47  .tox/py311/lib/python3.11/site-packages/plotille/_figure.py:545: TypeError
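
The `TypeError` in the traceback above can be reproduced outside the test suite. The sketch below only shows why the multiplication fails and one possible way around it (working on float timestamps); it is an assumption for illustration and not necessarily what the later "simulator: Fix timestamps manipulation" commits do.

```python
from datetime import datetime, timezone

# Same value as `low` and `high` in the traceback.
low = datetime(2025, 3, 13, 13, 28, 43, 500065, tzinfo=timezone.utc)

# A datetime cannot be scaled by a float, which is what the failing branch attempts.
try:
    low * 0.1
    error = None
except TypeError as exc:
    error = str(exc)

# A POSIX timestamp is a plain float, so the same arithmetic is well defined.
margin = abs(low.timestamp() * 0.1)
```
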
  • @jenkins retry build

  • Jenkins job DSCH/gitlab-builds #329 failed in 4 min 17 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • Antoine R. Dumont changed the description

  • Jenkins job DSCH/gitlab-builds #330 failed in 4 min 20 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 5 commits

    • cbdb7c4d...c4f108cf - 2 commits from branch swh/devel:master
    • 41f6156b - celery_backend/runner: Extract a write to backends function
    • b98f6f92 - celery_backend/runner: Switch write order to rabbitmq then postgresql
    • 593be3d8 - runner: Update task status only after sending the tasks to rabbitmq

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #338 failed in 4 min 13 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • Antoine R. Dumont marked this merge request as ready

  • Antoine R. Dumont changed title from Draft: proposal: scheduler/runner: Invert backend write orders to (proposal) scheduler/runner: Invert backend write orders

  • Antoine R. Dumont changed the description

  • added 1 commit

    • 9fca500c - runner: Update task status only after sending the tasks to rabbitmq

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #339 failed in 4 min 9 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • 163f5526 - runner: Update task status only after sending the tasks to rabbitmq

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #340 failed in 4 min 10 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • cc7dc416 - simulator: Try to fix the unclear error

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #341 failed in 4 min 11 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • 77df447a - simulator: Fix timestamps manipulation

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #343 failed in 4 min 35 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • 0a87a71b - simulator: Fix timestamps manipulation

    Compare with previous version

  • Jenkins job DSCH/gitlab-builds #344 failed in 4 min 15 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • added 1 commit

    • 4df7695c - simulator: Fix timestamps manipulation

    Compare with previous version

  • Antoine R. Dumont changed title from (proposal) scheduler/runner: Invert backend write orders to scheduler/runner: Write first to rabbitmq then to postgresql

  • Antoine R. Dumont changed the description

  • Jenkins job DSCH/gitlab-builds #345 succeeded in 4 min 18 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • Antoine Lambert approved this merge request
