A SWH user just reported on the swh-devel mailing list that they could not download a tarball that should have been cooked by the vault:
Dear Sir/Madam,

I'm trying to download some repositories, but my request always remains in "status": "new". I've tried using both the APIs and the web interface, but the result doesn't change. I'm attaching the curl command I'm using, along with its return. Currently, I'm making anonymous requests without logging in.

curl -X GET https://archive.softwareheritage.org/api/1/vault/flat/swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44/

{
  "fetch_url": "https://archive.softwareheritage.org/api/1/vault/flat/swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44/raw/",
  "progress_message": null,
  "id": null,
  "status": "new",
  "swhid": "swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44"
}

Can you provide any guidance on the cause?

Thanking you in advance,
Giacomo Corridori
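For context, the expected vault workflow is: POST a cooking request for the SWHID, poll the same endpoint until "status" becomes "done", then download the archive from "fetch_url". A minimal sketch of that flow in Python (the endpoint is the one from the report above; the requests-based client, polling interval and output filename are assumptions, not the user's exact setup):

import time
import requests

SWHID = "swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44"
API = f"https://archive.softwareheritage.org/api/1/vault/flat/{SWHID}/"

# Request cooking of the flat bundle for this directory.
requests.post(API).raise_for_status()

# Poll until cooking completes; a healthy request should leave "new" once a
# cooking task has actually been scheduled, which is what failed here.
while True:
    status = requests.get(API).json()
    if status["status"] == "done":
        break
    time.sleep(30)

# Download the cooked tarball.
with open("bundle.tar.gz", "wb") as f:
    f.write(requests.get(status["fetch_url"]).content)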
Looking at the associated Sentry report, it seems the vault cannot connect to the scheduler service.
Looks like we missed a firewall rule to let vangogh speak to the new scheduler RPC service, and the vault doesn't know how to recover from that.
@ardumont opened the firewall and manually scheduled the missing task.
I went ahead and rescheduled the remaining missing tasks on vangogh:
In [29]: with db.transaction() as cur:
    ...:     cur.execute("select distinct type, swhid from vault_bundle where task_status='new' and task_id is null")
    ...:     for bundle in list(cur.fetchall()):
    ...:         task_id = vault._send_task(bundle['type'], bundle['swhid'])
    ...:         cur.execute("update vault_bundle set task_id = %s where task_status='new' and task_id is null and type=%s and swhid=%s", (task_id, bundle['type'], bundle['swhid']))
    ...:         print(f"task {task_id} scheduled for {bundle['type']}@{bundle['swhid']}")
    ...:

task 415810868 scheduled for flat@swh:1:dir:06b779b24d2d7d5b40740bb165564b2c10243983
task 415810869 scheduled for flat@swh:1:dir:0970e5d747f6605d48ebfabd0d705e97b4be311b
task 415810870 scheduled for flat@swh:1:dir:0b193a928ccaec5117102a13df00bc67bddc20c5
task 415810871 scheduled for flat@swh:1:dir:0c81650780e57acbfd672567549b3dc007e1ac88
task 415810872 scheduled for flat@swh:1:dir:0de50af11ab8d59ebff9b3fdec8b6375abeeb6cc
task 415810873 scheduled for flat@swh:1:dir:11567f02da7d8514eca2b625f6fa252c98321b40
task 415810874 scheduled for flat@swh:1:dir:1606700da8f5d714a3cef65390b835c20e7faa7f
task 415810875 scheduled for flat@swh:1:dir:1bb32e7f7eb427fb6ca37d477145c188774a14fa
task 415810876 scheduled for flat@swh:1:dir:25354235d9002a4f0b922bf5226d49d3eec097e4
task 415810877 scheduled for flat@swh:1:dir:25bd23eb790be24beb9d172c482b6df05541d8ba
task 415810878 scheduled for flat@swh:1:dir:3687c6127d1d86d8f18e17319e976c5da88a4acc
task 415810879 scheduled for flat@swh:1:dir:370837369f918c7632d5dd3b3ee02fa09748a619
task 415810880 scheduled for flat@swh:1:dir:3cd80e6c2798b9d84b03a5d90e656b6ecd4ac595
task 415810881 scheduled for flat@swh:1:dir:48d538ebcf5a06bf40e4eb67295ae1a08631a667
task 415810882 scheduled for flat@swh:1:dir:4a8401c54eb893578fe931fda064e29cabffaa9b
task 415810883 scheduled for flat@swh:1:dir:4c28bf005a7fcb61524b699ff2b8415d4dc108ea
task 415810884 scheduled for flat@swh:1:dir:540436a77da521ad2ba23037a3c16714f4ef738a
task 415810885 scheduled for flat@swh:1:dir:5758fe5af9630364cc1190b8f37da8e050d90b61
task 415810886 scheduled for flat@swh:1:dir:5791cc43f43e34c19bb57363cc0eb8ab154f4a57
task 415810887 scheduled for flat@swh:1:dir:5a188b8a815e5cd737ede40eafb34b4c32283121
task 415810888 scheduled for flat@swh:1:dir:649c8451d9ca448d995670b43b6b278f0dee6262
task 415810889 scheduled for flat@swh:1:dir:653126189adefa22303d5c2de9ea6d773b3ae276
task 415810890 scheduled for flat@swh:1:dir:74396939bb577e3b6306c8b0c137d3bbe0439cfb
task 415810891 scheduled for flat@swh:1:dir:825d430f3c17cf4c4f7bbc9b7bd0b01e4589ca1d
task 415810892 scheduled for flat@swh:1:dir:8e6a9d9b83e6073f5d60d8cc0cc573081cde57f5
task 415810893 scheduled for flat@swh:1:dir:956d9dd3c3566e8fc1e8f15a03a7a54997ab9e5d
task 415810894 scheduled for flat@swh:1:dir:a5718a97c161bad6b80ecb0b1f29f82135463526
task 415810895 scheduled for flat@swh:1:dir:a57e8c47adb86116734cccaaa2587fba7245533f
task 415810896 scheduled for flat@swh:1:dir:a82c2a31caefbac468bb5b928fad14ccc9d06e82
task 415810897 scheduled for flat@swh:1:dir:a9c025d069acbd1b14950ac1d69e0e30d3ce7c69
task 415810898 scheduled for flat@swh:1:dir:b53e05fb336e8b992ba4197742b8420d198ae661
task 415810899 scheduled for flat@swh:1:dir:bf5ae4f82b045ec0ef9b51e53115399004e445c6
task 415810900 scheduled for flat@swh:1:dir:c2848a2e8560bde21106d047def39af16789c513
task 415810901 scheduled for flat@swh:1:dir:c4a57f25ef64e31a599c6eb1047c3d2e05253096
task 415810902 scheduled for flat@swh:1:dir:ccc65ca6bac0009168d60348463bee1d59e8b1f8
task 415810903 scheduled for flat@swh:1:dir:e00b0265408e5ce186a8cf84225300bb437e0467
task 415810904 scheduled for flat@swh:1:dir:e66e071e6caaea77e2e6889d209a059fdcbbbbeb
task 415810905 scheduled for flat@swh:1:dir:ee3dc6898b6cb7bdd4262b4a965bcada2f699754
task 415810906 scheduled for flat@swh:1:dir:efd41424e7c5a5a7c224360eafc20da86391fffd
task 415810907 scheduled for flat@swh:1:dir:f3afd8a33f6f29195d9dde3898c9c695dfc26ded
task 415810908 scheduled for flat@swh:1:dir:ff4a73403f93442579c421705b382c6d6c559c1a
task 415810909 scheduled for git_bare@swh:1:rev:bb184bd0a8f91beec3a00718759e96c7828853de
They're being processed now.
(We had an Icinga warning on the vault since we had moved the backend service, but as it first triggered on a Saturday, it looks like it went unnoticed...)
We probably need to rethink how the vault schedules tasks (and give it a better chance of recovering if there's an error in the communication with the scheduler). Maybe sticking my five lines of python in a cronjob would be good enough?
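For reference, a minimal sketch of what such a cron-driven reconciliation could look like, reusing the exact query and calls from the session above; how the vault backend object and its database handle are obtained is deployment-specific and left as parameters here (an assumption, not the actual vault configuration API):

#!/usr/bin/env python3
# Hypothetical cron job: re-submit vault bundles whose cooking task was never
# created, e.g. because the scheduler RPC was unreachable when they came in.

def resubmit_orphan_bundles(vault, db):
    """Re-send cooking tasks for bundles stuck in task_status='new' with no task_id."""
    with db.transaction() as cur:
        cur.execute(
            "select distinct type, swhid from vault_bundle"
            " where task_status='new' and task_id is null"
        )
        for bundle in list(cur.fetchall()):
            task_id = vault._send_task(bundle["type"], bundle["swhid"])
            cur.execute(
                "update vault_bundle set task_id = %s"
                " where task_status='new' and task_id is null and type=%s and swhid=%s",
                (task_id, bundle["type"], bundle["swhid"]),
            )
            print(f"task {task_id} scheduled for {bundle['type']}@{bundle['swhid']}")

That would paper over lost tasks, but it would not remove the underlying need for the vault to retry (or at least surface) scheduler RPC failures.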
> @ardumont opened the firewall and manually scheduled the missing task.
Yes, I actually rescheduled all the tasks that were stuck in next_run_not_scheduled or next_run_scheduled from the past (around 27).
Then I updated the aliases in the firewall (where the rules already existed), once I noticed I could not trigger a Save Code Now request properly.
The alias was missing the new elastic scheduler ingress.
Then I manually re-submitted the user's directory for cooking.
That's about it.
My manual check from last week, when I deployed the RPC, passed though...
The only thing I can think of that made me believe it was OK is that the previous scheduler RPC was still running when I triggered my cooking tasks. By chance, they hit the old RPC and not the new one... so I (wrongly, as it turns out) thought it was OK.
And indeed I did not see the Icinga check that triggered on Saturday...
2023-09-28 10:20:04 softwareheritage-scheduler@belvedere:5432 λ select count(*) from task where type in ('cook-vault-bundle', 'cook-vault-bundle-batch') and status in ('next_run_scheduled', 'next_run_not_scheduled');
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)

Time: 410.066 ms
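Relatedly, since the earlier Icinga warning went unnoticed, a probe built on that same count could make this kind of regression louder. A minimal Nagios/Icinga-style sketch (the connection string and thresholds are placeholders, not the actual production check):

#!/usr/bin/env python3
# Hypothetical monitoring probe: warn if vault cooking tasks pile up in the
# scheduler without being run. DSN and thresholds are placeholders.
import sys
import psycopg2

DSN = "service=swh-scheduler"  # placeholder connection string
WARNING, CRITICAL = 5, 20      # placeholder thresholds

def main():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # Same query as the manual check above.
        cur.execute(
            "select count(*) from task"
            " where type in ('cook-vault-bundle', 'cook-vault-bundle-batch')"
            " and status in ('next_run_scheduled', 'next_run_not_scheduled')"
        )
        (count,) = cur.fetchone()
    if count >= CRITICAL:
        print(f"CRITICAL: {count} pending vault cooking tasks")
        return 2
    if count >= WARNING:
        print(f"WARNING: {count} pending vault cooking tasks")
        return 1
    print(f"OK: {count} pending vault cooking tasks")
    return 0

if __name__ == "__main__":
    sys.exit(main())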