A SWH user just reported on the swh-devel mailing list that they could not download a tarball that should have been cooked by the vault:
Dear Sir/Madam,

I'm trying to download some repositories, but my request always remains in "status": "new". I've tried using both the APIs and the web interface, but the result doesn't change. I'm attaching the curl command I'm using, along with its return. Currently, I'm making anonymous requests without logging in.

curl -X GET https://archive.softwareheritage.org/api/1/vault/flat/swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44/

{
  "fetch_url": "https://archive.softwareheritage.org/api/1/vault/flat/swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44/raw/",
  "progress_message": null,
  "id": null,
  "status": "new",
  "swhid": "swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44"
}

Can you provide any guidance on the cause?

Thanking you in advance,
Giacomo Corridori
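For context, the expected vault workflow is: POST a cooking request for the SWHID, poll the same endpoint until "status" becomes "done", then download the archive from "fetch_url". A minimal sketch of that flow in Python (the endpoint is the one from the report above; the requests-based client, polling interval and output filename are assumptions, not the user's exact setup):

import time
import requests

SWHID = "swh:1:dir:4b08ee87eff2025999004c30da9cdf91da60cb44"
API = f"https://archive.softwareheritage.org/api/1/vault/flat/{SWHID}/"

# Request cooking of the flat bundle for this directory.
requests.post(API).raise_for_status()

# Poll until cooking completes; a healthy request should leave "new" once a
# cooking task has actually been scheduled, which is what failed here.
while True:
    status = requests.get(API).json()
    if status["status"] == "done":
        break
    time.sleep(30)

# Download the cooked tarball.
with open("bundle.tar.gz", "wb") as f:
    f.write(requests.get(status["fetch_url"]).content)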
Looking at the associated Sentry report, it seems the vault cannot connect to the scheduler service.
Looks like we missed a firewall rule to let vangogh speak to the new scheduler RPC service, and the vault doesn't know how to recover from that.
@ardumont opened the firewall and manually scheduled the missing task.
I went ahead and rescheduled the remaining missing tasks on vangogh:
In [29]: with db.transaction() as cur:
    ...:     cur.execute("select distinct type, swhid from vault_bundle where task_status='new' and task_id is null")
    ...:     for bundle in list(cur.fetchall()):
    ...:         task_id = vault._send_task(bundle['type'], bundle['swhid'])
    ...:         cur.execute("update vault_bundle set task_id = %s where task_status='new' and task_id is null and type=%s and swhid=%s", (task_id, bundle['type'], bundle['swhid']))
    ...:         print(f"task {task_id} scheduled for {bundle['type']}@{bundle['swhid']}")
    ...:

task 415810868 scheduled for flat@swh:1:dir:06b779b24d2d7d5b40740bb165564b2c10243983
task 415810869 scheduled for flat@swh:1:dir:0970e5d747f6605d48ebfabd0d705e97b4be311b
task 415810870 scheduled for flat@swh:1:dir:0b193a928ccaec5117102a13df00bc67bddc20c5
task 415810871 scheduled for flat@swh:1:dir:0c81650780e57acbfd672567549b3dc007e1ac88
task 415810872 scheduled for flat@swh:1:dir:0de50af11ab8d59ebff9b3fdec8b6375abeeb6cc
task 415810873 scheduled for flat@swh:1:dir:11567f02da7d8514eca2b625f6fa252c98321b40
task 415810874 scheduled for flat@swh:1:dir:1606700da8f5d714a3cef65390b835c20e7faa7f
task 415810875 scheduled for flat@swh:1:dir:1bb32e7f7eb427fb6ca37d477145c188774a14fa
task 415810876 scheduled for flat@swh:1:dir:25354235d9002a4f0b922bf5226d49d3eec097e4
task 415810877 scheduled for flat@swh:1:dir:25bd23eb790be24beb9d172c482b6df05541d8ba
task 415810878 scheduled for flat@swh:1:dir:3687c6127d1d86d8f18e17319e976c5da88a4acc
task 415810879 scheduled for flat@swh:1:dir:370837369f918c7632d5dd3b3ee02fa09748a619
task 415810880 scheduled for flat@swh:1:dir:3cd80e6c2798b9d84b03a5d90e656b6ecd4ac595
task 415810881 scheduled for flat@swh:1:dir:48d538ebcf5a06bf40e4eb67295ae1a08631a667
task 415810882 scheduled for flat@swh:1:dir:4a8401c54eb893578fe931fda064e29cabffaa9b
task 415810883 scheduled for flat@swh:1:dir:4c28bf005a7fcb61524b699ff2b8415d4dc108ea
task 415810884 scheduled for flat@swh:1:dir:540436a77da521ad2ba23037a3c16714f4ef738a
task 415810885 scheduled for flat@swh:1:dir:5758fe5af9630364cc1190b8f37da8e050d90b61
task 415810886 scheduled for flat@swh:1:dir:5791cc43f43e34c19bb57363cc0eb8ab154f4a57
task 415810887 scheduled for flat@swh:1:dir:5a188b8a815e5cd737ede40eafb34b4c32283121
task 415810888 scheduled for flat@swh:1:dir:649c8451d9ca448d995670b43b6b278f0dee6262
task 415810889 scheduled for flat@swh:1:dir:653126189adefa22303d5c2de9ea6d773b3ae276
task 415810890 scheduled for flat@swh:1:dir:74396939bb577e3b6306c8b0c137d3bbe0439cfb
task 415810891 scheduled for flat@swh:1:dir:825d430f3c17cf4c4f7bbc9b7bd0b01e4589ca1d
task 415810892 scheduled for flat@swh:1:dir:8e6a9d9b83e6073f5d60d8cc0cc573081cde57f5
task 415810893 scheduled for flat@swh:1:dir:956d9dd3c3566e8fc1e8f15a03a7a54997ab9e5d
task 415810894 scheduled for flat@swh:1:dir:a5718a97c161bad6b80ecb0b1f29f82135463526
task 415810895 scheduled for flat@swh:1:dir:a57e8c47adb86116734cccaaa2587fba7245533f
task 415810896 scheduled for flat@swh:1:dir:a82c2a31caefbac468bb5b928fad14ccc9d06e82
task 415810897 scheduled for flat@swh:1:dir:a9c025d069acbd1b14950ac1d69e0e30d3ce7c69
task 415810898 scheduled for flat@swh:1:dir:b53e05fb336e8b992ba4197742b8420d198ae661
task 415810899 scheduled for flat@swh:1:dir:bf5ae4f82b045ec0ef9b51e53115399004e445c6
task 415810900 scheduled for flat@swh:1:dir:c2848a2e8560bde21106d047def39af16789c513
task 415810901 scheduled for flat@swh:1:dir:c4a57f25ef64e31a599c6eb1047c3d2e05253096
task 415810902 scheduled for flat@swh:1:dir:ccc65ca6bac0009168d60348463bee1d59e8b1f8
task 415810903 scheduled for flat@swh:1:dir:e00b0265408e5ce186a8cf84225300bb437e0467
task 415810904 scheduled for flat@swh:1:dir:e66e071e6caaea77e2e6889d209a059fdcbbbbeb
task 415810905 scheduled for flat@swh:1:dir:ee3dc6898b6cb7bdd4262b4a965bcada2f699754
task 415810906 scheduled for flat@swh:1:dir:efd41424e7c5a5a7c224360eafc20da86391fffd
task 415810907 scheduled for flat@swh:1:dir:f3afd8a33f6f29195d9dde3898c9c695dfc26ded
task 415810908 scheduled for flat@swh:1:dir:ff4a73403f93442579c421705b382c6d6c559c1a
task 415810909 scheduled for git_bare@swh:1:rev:bb184bd0a8f91beec3a00718759e96c7828853de
They're being processed now.
(We had an Icinga warning on the vault since we had moved the backend service, but as it first triggered on a Saturday, it looks like it went unnoticed...)
We probably need to rethink how the vault schedules tasks (and give it a better chance of recovering if there's an error in the communication with the scheduler). Maybe sticking my five lines of python in a cronjob would be good enough?
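For reference, a minimal sketch of what such a cron-driven reconciliation could look like, reusing the exact query and calls from the session above; how the vault backend object and its database handle are obtained is deployment-specific and left as parameters here (an assumption, not the actual vault configuration API):

#!/usr/bin/env python3
# Hypothetical cron job: re-submit vault bundles whose cooking task was never
# created, e.g. because the scheduler RPC was unreachable when they came in.

def resubmit_orphan_bundles(vault, db):
    """Re-send cooking tasks for bundles stuck in task_status='new' with no task_id."""
    with db.transaction() as cur:
        cur.execute(
            "select distinct type, swhid from vault_bundle"
            " where task_status='new' and task_id is null"
        )
        for bundle in list(cur.fetchall()):
            task_id = vault._send_task(bundle["type"], bundle["swhid"])
            cur.execute(
                "update vault_bundle set task_id = %s"
                " where task_status='new' and task_id is null and type=%s and swhid=%s",
                (task_id, bundle["type"], bundle["swhid"]),
            )
            print(f"task {task_id} scheduled for {bundle['type']}@{bundle['swhid']}")

That would paper over lost tasks, but it would not remove the underlying need for the vault to retry (or at least surface) scheduler RPC failures.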
> @ardumont opened the firewall and manually scheduled the missing task.
Yes, I actually rescheduled all the tasks that were stuck in next_run_not_scheduled or next_run_scheduled from the past (around 27).
Then I updated the aliases in the firewall (where the rules already existed), once I noticed I could not trigger a Save Code Now request properly.
The alias was missing the new elastic scheduler ingress.
Then I manually re-submitted the user's directory for cooking.
That's about it.
My manual check from last week, when I deployed the RPC, passed though...
The only thing I can think of that made me believe it was OK is that the previous scheduler RPC was still running when I triggered my cooking tasks. By chance, they hit the old RPC and not the new one... so I (wrongly, as it turns out) thought it was OK.
And indeed I did not see the Icinga check that triggered on Saturday...
2023-09-28 10:20:04 softwareheritage-scheduler@belvedere:5432 λ select count(*) from task where type in ('cook-vault-bundle', 'cook-vault-bundle-batch') and status in ('next_run_scheduled', 'next_run_not_scheduled');
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)

Time: 410.066 ms
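Relatedly, since the earlier Icinga warning went unnoticed, a probe built on that same count could make this kind of regression louder. A minimal Nagios/Icinga-style sketch (the connection string and thresholds are placeholders, not the actual production check):

#!/usr/bin/env python3
# Hypothetical monitoring probe: warn if vault cooking tasks pile up in the
# scheduler without being run. DSN and thresholds are placeholders.
import sys
import psycopg2

DSN = "service=swh-scheduler"  # placeholder connection string
WARNING, CRITICAL = 5, 20      # placeholder thresholds

def main():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # Same query as the manual check above.
        cur.execute(
            "select count(*) from task"
            " where type in ('cook-vault-bundle', 'cook-vault-bundle-batch')"
            " and status in ('next_run_scheduled', 'next_run_not_scheduled')"
        )
        (count,) = cur.fetchone()
    if count >= CRITICAL:
        print(f"CRITICAL: {count} pending vault cooking tasks")
        return 2
    if count >= WARNING:
        print(f"WARNING: {count} pending vault cooking tasks")
        return 1
    print(f"OK: {count} pending vault cooking tasks")
    return 0

if __name__ == "__main__":
    sys.exit(main())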