We regularly end up with stuck scheduler 'task' entries.
It may be because, during deployment (or for some other reason), a listing task takes
too long to finish and gets terminated (when exceeding the 1h cool down threshold).
Since we've been hit by this forever (or since the scheduler's inception, whichever came first),
it might be time to add a process in charge of resetting the task state to 'next_run_not_scheduled'
when the next-run exceeds the 'current-interval' plus some delay to be determined (like 10 days?).
If that turns out not to be enough, we may end up adding an extra job to reset the stuck tasks.
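As a rough sketch of what that reset process could run (the 10-day grace period and the restriction to recurring lister tasks are assumptions to be tuned):
-- hypothetical reset: put overdue 'next_run_scheduled' tasks back in the schedulable state
update task set status = 'next_run_not_scheduled' where status = 'next_run_scheduled' and policy = 'recurring' and type ~ '^list-' and next_run + current_interval + interval '10 days' < now();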
[1] Writing first in pgsql and then in rabbitmq (and failing to do the latter) can result in such entries (that's the current behavior). This can be improved upon by making the scheduler code write first to rabbitmq (then in pg, see [1]). That's not a silver bullet though: this situation can still happen if a deployment occurs and the rabbitmq message (without ack_late) is consumed but exceeds the 3600s warm shutdown (for various reasons: deployment of a new scheduler backend, too many queries overwhelming the rpc backend, ...).
It turns out the related issue only shows one instance of the problem.
We've got at least 190 stuck incremental listers (and we probably have the same problem for full listers too).
So that probably bumps up the priority of this one (I won't spend my time chasing and unsticking those manually any more).
2024-12-11 09:47:48 softwareheritage-scheduler@albertina:5432 λ select count(*) from task where type~'list-*' and priority is null and policy='recurring' and current_interval='1 day' and status='next_run_scheduled' and next_run < '2024-12-01';
+-------+
| count |
+-------+
|   190 |
+-------+
(1 row)

Time: 335.232 ms
I think a simple cronjob running a SQL query that would reset/retry tasks that are recorded as "next_run_scheduled" for more than a couple of days would be sufficient to work around the issue at hand.
If we're to go the swh-scheduler cli route, I'd prefer that we take the time to design a proper "lister task scheduler" component, which would reduce the number of (buggy) legacy tasks once more.
I think a simple cronjob running a SQL query that would reset/retry tasks that are recorded as "next_run_scheduled" for more than a couple of days would be sufficient to work around the issue at hand.
That's definitely simpler to implement and deploy, and quite a bit faster.
If we're to go the swh-scheduler cli route, I'd prefer that we take the time to design a proper "lister task scheduler" component, which would reduce the number of (buggy) legacy tasks once more.
Given this, adding simple subcommands on the scheduler side should be helpful.
Something like:
one to check whether there are stuck tasks (that count should converge to zero)
and another to unstick them if the first call finds issues
And then evolve the scheduler deployment (chart) to declare a cronjob (like we do for the webapp, deposit or storage) which calls them regularly.
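A minimal sketch of the check query such a subcommand could run (the unstick one would then apply the reset sketched earlier to those same rows); an empty result is what it should converge to:
-- hypothetical check: task types with entries stuck in 'next_run_scheduled' past their interval
select type, count(*) from task where status = 'next_run_scheduled' and priority is null and next_run + current_interval < now() group by type;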
AFAIUI, there will be some refactoring happening in the 2025 roadmap for the scheduler, so the new "lister task scheduler" should be part of that plan.
I don't think starting this redesign now with the internal reorganization on the verge of starting is a reasonable approach (if we want to fix/work around it).
[1]
2025-03-07 09:51:20 softwareheritage-scheduler@albertina:5432 λ select type, count(*) from task where status='next_run_scheduled' and type ~ 'list-*' and next_run + current_interval < now() group by type;
+------------------------------+-------+
| type                         | count |
+------------------------------+-------+
| list-bower                   |     1 |
| list-cgit                    |     1 |
| list-cran                    |     1 |
| list-debian-distribution     |     1 |
| list-gitiles                 |    22 |
| list-gitlab-full             |     1 |
| list-gitlab-incremental      |   165 |
| list-gitweb                  |    19 |
| list-launchpad-incremental   |     1 |
| list-npm-full                |     1 |
| list-sourceforge-incremental |     1 |
+------------------------------+-------+
(11 rows)

Time: 6540.148 ms (00:06.540)
I'm really not a fan of implementing such a hack in the charts.
If it's not too complicated to integrate a small command line in the scheduler, I think it would be better to add it there. IMO, it will probably generate less technical debt there.
Another point is that we already have all the tools and config to call a scheduler command, instead of crafting an ad hoc psql authentication configuration.
If a scheduler refactoring occurs later, it will also keep this workaround visible there.
I thought we had settled on the quick sql hack as a cronjob;
I only agreed it'd be simpler to implement (fewer cogs to touch, only a cronjob
to add in the chart, attached to the scheduler template or something).
But I was uneasy about pushing this into the deployment part, as it's a behavior bug in
the scheduler code and not a real deployment problem. And I realise only now that I
failed to mention as much...
It seems the scheduler runner had selected a list of tasks by calling the postgresql function which is in charge of updating the task status from next_run_not_scheduled to next_run_scheduled.
Once a task was selected and updated, the scheduler runner tried to push the corresponding message to rabbitmq but encountered this error:
2025-01-02T00:04:32.167470611Z stderr F urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='rabbitmq.internal.softwareheritage.org', port=15672): Max retries exceeded with url: /api/queues/%2F/swh.loader.mercurial.tasks.LoadMercurialCheckout (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fd027c1b550>: Failed to resolve 'rabbitmq.internal.softwareheritage.org' ([Errno -3] Temporary failure in name resolution)"))
So the task ended up stuck: marked as scheduled in postgresql but never actually sent to rabbitmq.
I'm wondering if, instead of adding a workaround, the queries in charge of selecting the tasks (in swh_scheduler_peek_no_priority_tasks and swh_scheduler_peek_any_ready_priority_tasks) could not be updated from:
select * from task where next_run <= ts and type = task_type and status = 'next_run_not_scheduled' and priority is null order by next_run limit num_tasks;
to something with the same criteria as your update, embracing the possibility that a task's scheduling state can be wrong somewhere at some point:
select * from task where next_run <= ts and type = task_type and ((status = 'next_run_not_scheduled' and priority is null) or (status = 'next_run_scheduled' and next_run + current_interval < now())) order by next_run limit num_tasks;
That's definitely interesting, that could indeed explain that next-run-scheduled
stuckness.
And why the previous scheduler-listener implementation change did not change that
behavior. I recall that the last time we checked, we only assumed [0] the
scheduler-listener (the one that flushes the tasks' state at the end of the scheduler
task processing) was either missing messages or had some wrong conditions somewhere
in its code.
[0] I don't recall whether we had any hard evidence on this, nor where we last discussed
this.
I'd evolve the last suggested sql query to keep the priority is null filter though
[1]
AFAIR it's for the basic tasks with no priority (e.g. task types list-*, deposit,
vault-cook), i.e. what's scheduled by the scheduler-runner process.
We probably have the same behavior for the tasks with priority though, the ones
scheduled by the scheduler-runner-priority process (e.g. save-code-now today). I
recall it calls equivalent but different functions dedicated to dealing with non-null
priorities.
[1]
select * from task where next_run <= ts and type = task_type and ((status = 'next_run_not_scheduled' or (status = 'next_run_scheduled' and next_run + current_interval < now())) and priority is null) order by next_run limit num_tasks;
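The priority path would presumably need the equivalent change; a minimal sketch, assuming swh_scheduler_peek_any_ready_priority_tasks uses the same criteria with a non-null priority filter (its actual body may differ):
-- hypothetical priority variant of the peek query
select * from task where next_run <= ts and type = task_type and ((status = 'next_run_not_scheduled' or (status = 'next_run_scheduled' and next_run + current_interval < now())) and priority is not null) order by next_run limit num_tasks;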
Note: I gather the timeout error message is a wrong paste or something (it's about a
mercurial task and not the list-gitweb lister task you mentioned first).
Antoine R. Dumont changed title from "scheduler: Implement a cleaning process in charge of resetting 'next_run_not_scheduled' regularly" to "scheduler: Investigate stuck tasks in 'next_run_scheduled' status"
Investigating through logs and metrics, we noticed that there are sometimes peaks on the scheduler rpc which exceed what the worker replicas can absorb.
So we've bumped the number of replicas of the scheduler from 4 to 6.
We noticed we don't have any logs on the rpc service, so we'll temporarily activate logging there to see a wee bit more.
We've activated the DEBUG log level for the scheduler runner too.
The absence of logs in the rpc service did not help when trying to troubleshoot the tasks which get stuck, as we fall short of seeing what's actually happening (even when parsing through elasticsearch logs).
We've activated the DEBUG log level for the scheduler runner too.
As a side note, I noticed the runner currently does lots of queries to
non-existent queues. That's because, since we don't provide the task types to
schedule, it uses all the task types referenced in the scheduler backend.
So there are plenty of http queries done by the runner on rabbitmq which are
useless: that's the query to determine the current number of messages in the
associated queue. There is a fallback number if nothing is found in rabbit.
Then the runner goes on to query (sql) the scheduler backend to peek at the
tasks of those types (which won't return anything).
That's a lot of work done for noops. If we were to drop those unnecessary
queries, that would decrease the overall number of queries hitting the
ingress (and the gunicorn underneath), and thus the 502s that happen once
in a while when there are too many parallel queries.
Possible changes are:
scheduler code (swh/devel/swh-scheduler!409 (closed)): change the behavior to ignore a task type once we
detect its queue does not exist yet
scheduler deployment: make explicit the task types the runner should schedule
(the recent refactoring on the runner allows this now)
I don't think we should ignore task types for queues that don't exist, as I believe the task scheduler is practically our only mechanism for creating new queues when they don't exist yet (as celery / rabbitmq will create the queue once a message gets sent to it).
And I'm not sure we should have an allow-list of task types either, we've spent a lot of effort to move towards automatic registration of tasks to avoid having to remember all the places where stuff needs to be turned on, it feels like a manual list of allowed tasks would be a regression.
I don't think we should ignore task types for queues that don't exist, as I believe the task scheduler is practically our only mechanism for creating new queues when they don't exist yet (as celery / rabbitmq will create the queue once a message gets sent to it).
Indeed, that's the point I made in the proposed implementation for this ^.
And I'm not sure we should have an allow-list of task types either, we've spent a lot of effort to move towards automatic registration of tasks to avoid having to remember all the places where stuff needs to be turned on, it feels like a manual list of allowed tasks would be a regression.
Yes, I agree with the feeling of regression. But I don't see anything better (and simpler) to do.
On one side, we've got some 502s happening because of too many requests or some other operational maintenance (deployment or whatnot). So having the means to reduce unnecessary queries sounds like the way to go.
And that ends up leaving dangling scheduled tasks which won't recover, probably induced by the 502s.
There is still some resilience missing on the scheduler side... and some of it could be fixed by the proposal [1].
Any further suggestions to improve the situation are welcome.
For the second part ^, I see one way that avoids a hard-coded list (which I refrained from doing so far because I find it complex).
We could add more intelligence to the swh chart: the scheduler template becomes aware
of the lister, cooker and deposit templates and, for the extra-service runner, checks the enabled state of those deployments, ...
And something equivalent for the schedule-recurrent runner and the loaders...
It hit me too that the runner (without priority) should no longer be scheduling loading tasks [1].
It's the job of the schedule-recurrent runner now.
So we could also enforce this in the scheduler runner code.
That'd be a simpler change than the other things discussed before.
And that'd also reduce the number of useless requests.
So we could also enforce this in the scheduler runner code.
That'd be a simpler change than the other things discussed before.
And that'd also reduce the number of useless requests.
Something like this should do the trick ^ [1]
I went a bit smarter than just hard-coding things in the scheduler by
allowing some more flags which can then be driven by the cli call.
The idea being to adapt our deployment chart values to specify which
patterns we want to allow to be scheduled [2].
We've got [1] 38 types (out of 96 for now) which are not relevant for the runner to schedule, because it's the schedule-recurrent runner's job to schedule those. Only load-deposit should be scheduled by the runner. So the [2] proposal (from the previous comment) sounds about right.
[1]
2025-03-17 16:39:49 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type;
+-------+
| count |
+-------+
|    96 |
+-------+
(1 row)

Time: 5.280 ms

2025-03-17 14:28:14 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'cook-%';
+-------+
| count |
+-------+
|     2 |
+-------+
(1 row)

Time: 5.085 ms

2025-03-17 14:28:20 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'list-%';
+-------+
| count |
+-------+
|    54 |
+-------+
(1 row)

Time: 8.103 ms

2025-03-17 14:31:38 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'check-%';
+-------+
| count |
+-------+
|     1 |
+-------+
(1 row)

Time: 8.329 ms

2025-03-17 14:31:40 softwareheritage-scheduler@albertina:5432 λ select count(*) from task_type where type like 'load-%';
+-------+
| count |
+-------+
|    39 |
+-------+
(1 row)

Time: 18.151 ms
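As a quick cross-check of the 38 figure above (assuming the non-relevant types are exactly the load-* ones minus load-deposit, which schedule-recurrent handles), something like this should return 38:
-- hypothetical cross-check: 39 load-* task types minus load-deposit
select count(*) from task_type where type like 'load-%' and type != 'load-deposit';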