We regularly end up with stuck scheduler 'task' entries.
It may be because, during deployment (or for some other reason), a listing task takes
too long to finish and gets terminated (when exceeding the 1h cool down threshold).
Since we've been hit by this forever (or since the scheduler's inception, whichever came first),
it might be time to add a process in charge of resetting the task state to 'next_run_not_scheduled'
when the next-run exceeds the 'current-interval' plus some delay to be determined (like 10 days?).
If that turns out not to be enough, we may end up adding an extra job to reset the stuck tasks.
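As a rough sketch of what that reset process could run (the 10-day grace period and the restriction to recurring lister tasks are assumptions to be tuned):
-- hypothetical reset: put overdue 'next_run_scheduled' tasks back in the schedulable state
update task set status = 'next_run_not_scheduled' where status = 'next_run_scheduled' and policy = 'recurring' and type ~ '^list-' and next_run + current_interval + interval '10 days' < now();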
[1] Writing first in pgsql and then in rabbitmq (and failing to do the latter) can result in such entries (that's the current behavior). This can be improved upon by making the scheduler code write first to rabbitmq (then in pg, see [1]). That's not a silver bullet though: this situation can still happen if a deployment occurs and the rabbitmq message (without ack_late) is consumed but exceeds the 3600s warm shutdown (for various reasons: deployment of a new scheduler backend, too many queries overwhelming the rpc backend, ...).
It turns out the related issue only shows one instance of the problem.
We've got at least 190 stuck incremental listers (and we probably have the same problem for full listers too).
So that probably bumps up the priority of this one (I won't spend my time chasing and unsticking those manually any more).
2024-12-11 09:47:48 softwareheritage-scheduler@albertina:5432 λ select count(*) from task where type~'list-*' and priority is null and policy='recurring' and current_interval='1 day' and status='next_run_scheduled' and next_run < '2024-12-01';
+-------+
| count |
+-------+
|   190 |
+-------+
(1 row)

Time: 335.232 ms
I think a simple cronjob running a SQL query that would reset/retry tasks that are recorded as "next_run_scheduled" for more than a couple of days would be sufficient to work around the issue at hand.
If we're to go the swh-scheduler cli route, I'd prefer that we take the time to design a proper "lister task scheduler" component, which would reduce the number of (buggy) legacy tasks once more.
I think a simple cronjob running a SQL query that would reset/retry tasks that are recorded as "next_run_scheduled" for more than a couple of days would be sufficient to work around the issue at hand.
That's definitely simpler to implement and deploy, and quite a bit faster.
If we're to go the swh-scheduler cli route, I'd prefer that we take the time to design a proper "lister task scheduler" component, which would reduce the number of (buggy) legacy tasks once more.
Given this, adding simple subcommands on the scheduler side should be helpful.
Something like:
one to check whether there are stuck tasks (that count should converge to zero)
and another to unstick them if the first call finds issues
And then evolve the scheduler deployment (chart) to declare a cronjob (like we do for the webapp, deposit or storage) which calls them regularly.
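A minimal sketch of the check query such a subcommand could run (the unstick one would then apply the reset sketched earlier to those same rows); an empty result is what it should converge to:
-- hypothetical check: task types with entries stuck in 'next_run_scheduled' past their interval
select type, count(*) from task where status = 'next_run_scheduled' and priority is null and next_run + current_interval < now() group by type;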
AFAIUI, there will be some refactoring happening in the 2025 roadmap for the scheduler, so the new "lister task scheduler" should be part of that plan.
I don't think starting this redesign now with the internal reorganization on the verge of starting is a reasonable approach (if we want to fix/work around it).
[1]
2025-03-07 09:51:20 softwareheritage-scheduler@albertina:5432 λ select type, count(*) from task where status='next_run_scheduled' and type ~ 'list-*' and next_run + current_interval < now() group by type;
+------------------------------+-------+
| type                         | count |
+------------------------------+-------+
| list-bower                   |     1 |
| list-cgit                    |     1 |
| list-cran                    |     1 |
| list-debian-distribution     |     1 |
| list-gitiles                 |    22 |
| list-gitlab-full             |     1 |
| list-gitlab-incremental      |   165 |
| list-gitweb                  |    19 |
| list-launchpad-incremental   |     1 |
| list-npm-full                |     1 |
| list-sourceforge-incremental |     1 |
+------------------------------+-------+
(11 rows)

Time: 6540.148 ms (00:06.540)
I'm really not a fan of implementing such a hack in the charts.
If it's not too complicated to integrate a small command line in the scheduler, I think it would be better to add it there. IMO, it will probably generate less technical debt there.
Another point is that we already have all the tools and config to call a scheduler command, instead of crafting an ad hoc psql authentication configuration.
If a scheduler refactoring occurs later, it will also keep this workaround visible there.
I thought we had settled on the quick sql hack as a cronjob;
I only agreed it'd be simpler to implement (fewer cogs to touch, only a cronjob
to add in the chart, attached to the scheduler template or something).
But I was uneasy about pushing this into the deployment part, as it's a behavior bug in
the scheduler code and not a real deployment problem. And I realise only now that I
failed to mention as much...
It seems the scheduler runner had selected a list of tasks by calling the postgresql function which is in charge of updating the task status from next_run_not_scheduled to next_run_scheduled.
Once a task was selected and updated, the scheduler runner tried to push the corresponding message to rabbitmq but encountered this error:
2025-01-02T00:04:32.167470611Z stderr F urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='rabbitmq.internal.softwareheritage.org', port=15672): Max retries exceeded with url: /api/queues/%2F/swh.loader.mercurial.tasks.LoadMercurialCheckout (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fd027c1b550>: Failed to resolve 'rabbitmq.internal.softwareheritage.org' ([Errno -3] Temporary failure in name resolution)"))
So the task ended up stuck: marked as scheduled in postgresql but never actually sent to rabbitmq.
I'm wondering if, instead of adding a workaround, the queries in charge of selecting the tasks (in swh_scheduler_peek_no_priority_tasks and swh_scheduler_peek_any_ready_priority_tasks) could not be updated from:
select * from task where next_run <= ts and type = task_type and status = 'next_run_not_scheduled' and priority is null order by next_run limit num_tasks;
to something with the same criteria as your update, embracing the possibility that a task's scheduling state can be wrong somewhere at some point:
select * from task where next_run <= ts and type = task_type and ((status = 'next_run_not_scheduled' and priority is null) or (status = 'next_run_scheduled' and next_run + current_interval < now())) order by next_run limit num_tasks;
That's definitely interesting, that could indeed explain that next-run-scheduled
stuckness.
And why the previous scheduler-listener implementation change did not change that
behavior. I recall that the last time we checked, we only assumed [0] the
scheduler-listener (the one that flushes the tasks' state at the end of the scheduler
task processing) was either missing messages or had some wrong conditions somewhere
in its code.
[0] I don't recall whether we had any hard evidence on this, nor where we last discussed
this.
I'd evolve the last suggested sql query to keep the priority is null filter though
[1]
AFAIR it's for the basic tasks with no priority (e.g. task types list-*, deposit,
vault-cook), i.e. what's scheduled by the scheduler-runner process.
We probably have the same behavior for the tasks with priority though, the ones
scheduled by the scheduler-runner-priority process (e.g. save-code-now today). I
recall it calls equivalent but different functions dedicated to dealing with non-null
priorities.
[1]
select * from task where next_run <= ts and type = task_type and ((status = 'next_run_not_scheduled' or (status = 'next_run_scheduled' and next_run + current_interval < now())) and priority is null) order by next_run limit num_tasks;
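The priority path would presumably need the equivalent change; a minimal sketch, assuming swh_scheduler_peek_any_ready_priority_tasks uses the same criteria with a non-null priority filter (its actual body may differ):
-- hypothetical priority variant of the peek query
select * from task where next_run <= ts and type = task_type and ((status = 'next_run_not_scheduled' or (status = 'next_run_scheduled' and next_run + current_interval < now())) and priority is not null) order by next_run limit num_tasks;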
Note: I gather the timeout error message is a wrong paste or something (it's about a
mercurial task and not the list-gitweb lister task you mentioned first).
Antoine R. Dumont changed title from "scheduler: Implement a cleaning process in charge of resetting 'next_run_not_scheduled' regularly" to "scheduler: Investigate stuck tasks in 'next_run_scheduled' status"
Investigating through logs and metrics, we noticed that there are sometimes peaks on the scheduler rpc which exceed what the worker replicas can absorb.
So we've bumped the number of replicas of the scheduler from 4 to 6.
We noticed we don't have any logs on the rpc service, so we'll temporarily activate logging there to see a wee bit more.
We've activated the DEBUG log level for the scheduler runner too.
The absence of logs in the rpc service did not help when trying to troubleshoot the tasks which get stuck, as we fall short of seeing what's actually happening (even when parsing through elasticsearch logs).
We've activated the DEBUG log level for the scheduler runner too.
As a side note, I noticed the runner currently does lots of queries to
non-existent queues. That's because, since we don't provide the task types to
schedule, it uses all the task types referenced in the scheduler backend.
So there are plenty of http queries done by the runner on rabbitmq which are
useless: that's the query to determine the current number of messages in the
associated queue. There is a fallback number if nothing is found in rabbit.
Then the runner goes on to query (sql) the scheduler backend to peek at the
tasks of those types (which won't return anything).
That's a lot of work done for noops. If we were to drop those unnecessary
queries, that would decrease the overall number of queries hitting the
ingress (and the gunicorn underneath), and thus the 502s that happen once
in a while when there are too many parallel queries.
Possible changes are:
scheduler code (swh/devel/swh-scheduler!409 (closed)): change the behavior to ignore a task type once we
detect its queue does not exist yet
scheduler deployment: make explicit the task types the runner should schedule
(the recent refactoring on the runner allows this now)
I don't think we should ignore task types for queues that don't exist, as I believe the task scheduler is practically our only mechanism for creating new queues when they don't exist yet (as celery / rabbitmq will create the queue once a message gets sent to it).
And I'm not sure we should have an allow-list of task types either, we've spent a lot of effort to move towards automatic registration of tasks to avoid having to remember all the places where stuff needs to be turned on, it feels like a manual list of allowed tasks would be a regression.
I don't think we should ignore task types for queues that don't exist, as I believe the task scheduler is practically our only mechanism for creating new queues when they don't exist yet (as celery / rabbitmq will create the queue once a message gets sent to it).
Indeed, that's the point I made in the proposed implementation for this ^.
And I'm not sure we should have an allow-list of task types either, we've spent a lot of effort to move towards automatic registration of tasks to avoid having to remember all the places where stuff needs to be turned on, it feels like a manual list of allowed tasks would be a regression.
Yes, I agree with the feeling of regression. But I don't see anything better (and simpler) to do.
On one side, we've got some 502s happening because of too many requests or some other operational maintenance (deployment or whatnot). So having the means to reduce unnecessary queries sounds like the way to go.
And that ends up leaving dangling scheduled tasks which won't recover, probably induced by the 502s.
There is still some resilience missing on the scheduler side... and some of it could be fixed by the proposal [1].
Any further suggestions to improve the situation are welcome.
For the second part ^, I see one way that avoids a hard-coded list (which I refrained from doing so far because I find it complex).
We could add more intelligence to the swh chart: the scheduler template becomes aware
of the lister, cooker and deposit templates and, for the extra-service runner, checks the enabled state of those deployments, ...
And something equivalent for the schedule-recurrent runner and the loaders...
It hit me too that the runner (without priority) should no longer be scheduling loading tasks [1].
It's the job of the schedule-recurrent runner now.
So we could also enforce this in the scheduler runner code.
That'd be a simpler change than the other things discussed before.
And that'd also reduce the number of useless requests.
So we could also enforce this in the scheduler runner code.
That'd be a simpler change than the other things discussed before.
And that'd also reduce the number of useless requests.
Something like this should do the trick ^ [1]
I went a bit smarter than just hard-coding things in the scheduler by
allowing some more flags which can then be driven by the cli call.
The idea being to adapt our deployment chart values to specify which
patterns we want to allow to be scheduled [2].
We've got [1] 38 types (out of 96 for now) which are not relevant for the runner to schedule, because it's the schedule-recurrent runner's job to schedule those. Only load-deposit should be scheduled by the runner. So the [2] proposal (from the previous comment) sounds about right.
[1]
2025-03-17 16:39:49 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type;
+-------+
| count |
+-------+
|    96 |
+-------+
(1 row)

Time: 5.280 ms

2025-03-17 14:28:14 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'cook-%';
+-------+
| count |
+-------+
|     2 |
+-------+
(1 row)

Time: 5.085 ms

2025-03-17 14:28:20 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'list-%';
+-------+
| count |
+-------+
|    54 |
+-------+
(1 row)

Time: 8.103 ms

2025-03-17 14:31:38 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'check-%';
+-------+
| count |
+-------+
|     1 |
+-------+
(1 row)

Time: 8.329 ms

2025-03-17 14:31:40 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'load-%';
+-------+
| count |
+-------+
|    39 |
+-------+
(1 row)

Time: 18.151 ms
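As a quick cross-check of the 38 figure above (assuming the non-relevant types are exactly the load-* ones minus load-deposit, which schedule-recurrent handles), something like this should return 38:
-- hypothetical cross-check: 39 load-* task types minus load-deposit
select count(*) from task_type where type like 'load-%' and type != 'load-deposit';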