scheduler v2.3.0: AttributeError: 'dict' object has no attribute 'type'

changed the description

@anlambert jsyk, I've deployed the new scheduler version v2.3.0 on staging and those errors are popping up. They seem related to the recent typing change in that version.

Note: The scheduler-listener is impacted by those related issues (it's crashlooping on the literal issue [3]).

[3] https://sentry.softwareheritage.org/organizations/swh/issues/110578/?project=7&query=is%3Aunresolved&referrer=issue-stream&stream_index=1

It looks like the remote scheduler service is still returning dicts instead of objects, likely due to client updated prior backend ?

I don't know.

But we do not enforce order when deploying the dynamic stack.

The scheduler rpc service also displays such issue but it's more resilient so it does not crash (or does it?) [3] [4].

[3] https://sentry.softwareheritage.org/organizations/swh/issues/110578/?project=7&query=is%3Aunresolved&referrer=issue-stream&stream_index=1

[4] inline the error here

LiteralError
Value of status None is not any of the literals specified ['scheduled', 'started', 'eventful', 'uneventful', 'failed', 'permfailed', 'lost']

^ Looks like row has not been inserted in scheduler database due to the first reported error and thus the scheduler listener retrieves a null row => boom.

I do not have a clear view on how scheduler related services are upgraded on kube: are the services stopped, upgraded then restarted ?

Anyway, this is clearly due to an API mismtach between client and server.

I do not have a clear view on how scheduler related services are upgraded on kube: are the services stopped, upgraded then restarted ?

We have multiple pods (docker image running specific frozen swh version of our program) running the scheduler rpc (same goes for runner, listener, schedule-recurrent, ...).

Unless specified otherwise (manually), everything is running all the time. And upgrades happen in a rolling fashion. So, per service, one pod is stopped, and started with the new version. When that is done, it's looping with the next pod one at a time. So, you might end up with 2 versions running concurrently at some point.

There is no relation between listener, runner, etc... or the rpc service. So what i've said is true for each of those services independently from one another.

Hence why I mentioned in some meetings, we'd eventually need to develop (kube) operators for doing more subtle upgrades.

Anyway, this is clearly due to an API mismtach between client and server.

They are now running the same versions [1]. So what's the way forward to unstuck the situation?

Cleaning up the celery listener queue?

[1] and the sentry report for the crashing listener mentions that version too

swh@scheduler-rpc-5f66665c47-kv527:~$ grep swh.scheduler requirements-frozen.txt
swh.scheduler==2.3.0
swh@scheduler-rpc-5f66665c47-lb59n:~$ grep swh.scheduler requirements-frozen.txt
swh.scheduler==2.3.0
swh@scheduler-runner-57488f5477-zvnfx:~$ grep swh.scheduler requirements-frozen.txt
swh.scheduler==2.3.0
swh@scheduler-runner-priority-59ffb6647-4td4m:~$ grep swh.scheduler requirements-frozen.txt
swh.scheduler==2.3.0
swh@scheduler-schedule-recurrent-bf8ffd667-8rpkc:~$ grep swh.scheduler requirements-frozen.txt
swh.scheduler==2.3.0

The server and the clients are in the same versions but it's not guaranteed during the deployment as there is no guaranty in the deployment order. This is the reason why the backward compatibility should be assured as far as possible.

Can the messages in the queue be the source of the error?

like loader vN finished it's work and send a message VN in rabbitq and it's the listener v2 who received the message.

It could be great have a vision of the impacted components before a deployment to avoid such impacts.

The question is now how we can solve the situation:

Full cleaning of the queue as @ardumont proposed
Temporary redeploy the previous version of the scheduler and listener until they deal with all the vN messages and crash when the encounter a vN+1 message. If I understood correctly but it will break all the other services trying to interact with the scheduler
Remove one message by one message through the rabbitmq UI to remove the vN messages from the queue Anything else ?

mentioned in merge request swh/infra/ci-cd/swh-charts!391 (merged)

Can the messages in the queue be the source of the error? Full cleaning of the queue as @ardumont proposed

Tested on staging and it does not help. Same behavior is displays, crashloop on the same error.

Temporary redeploy the previous version of the scheduler and listener until they deal with all the vN messages and crash when the encounter a vN+1 message. If I understood correctly but it will break all the other services trying to interact with the scheduler

yes. The new scheduler version brings types and most services interacting with it got updated for it and deployed as well. So that'd be a massive revert all over the places...

Sorry for the mess ...

If I understood correctly, at some point all pods will use the same swh-scheduler version as it is a rolling upgrade and such error would no longer appear ?

As error happens before tasks are sent to celery, we should still be able to schedule those once all components properly upgraded ?

The celery task runners cache the result of the get_task_types call, so if they restart before the RPC service did, they'll be using the result with the old data type (dict instead of SchedulerModelObject).

They also retry on exceptions, so they'd never restart by themselves. Restarting them has made them happy.

The other remaining issue is that of the scheduler listener. The SQL query to mark a task run as started or ended returns a row of NULLs.

The rows are supposed to be created when the celery tasks are sent (to map the swh task id with the celery task uuid, which is the only thing that the listener receives).

However: the recurrent task schedulers do not create entries in the task_run table, but still generate celery messages that the listener has to consume (and ignore). This ignoring used to happen coindcidentally (the backend returned a dict full of None values, which the listener happily discarded), but now SchedulerModelObjects can't be full of Nones so the ignoring fails.

My best idea is to make start_task_run / end_task_run return Optional[TaskRun] if the returned row is full of Nones.

mentioned in merge request !382 (merged)

However: the recurrent task schedulers do not create entries in the task_run table, but still generate celery messages that the listener has to consume (and ignore). This ignoring used to happen coindcidentally (the backend returned a dict full of None values, which the listener happily discarded), but now SchedulerModelObjects can't be full of Nones so the ignoring fails.

My best idea is to make start_task_run / end_task_run return Optional[TaskRun] if the returned row is full of Nones.

I see, working on it.

Thanks for !382 (merged) !

closed with merge request !382 (merged)

scheduler v2.3.0: AttributeError: 'dict' object has no attribute 'type'

Designs

Child items ...

Activity