Migrate Azure worker VMs to cheaper and more efficient VMs

Reduce Azure cost: change workers to 'b2ms' VMs (the current 'ds2v2' ones are underused and costly)
Plan:

- Reasoning: https://hedgedoc.softwareheritage.org/0_eK1R3iSFmMWxwHDQfqOw?edit
- Provision vault-worker[01-02] as b2ms (terraform; see the sketch after this list)
- Decommission worker13
- Check the vault workers are doing their job [1]
- Decommission worker[11-12]
- Adapt the puppet manifests to the fqdn changes ^ and deploy
- Provision indexer-worker[01-02] as b2ms (terraform)
- Check everything is fine ^ (a firewall rule needs editing to allow the connection)
- Decommission the ds2v2 worker[07-10]
- Provision indexer-worker[03-06] as b2ms (terraform)
- Decommission the remaining ds2v2 worker[03-06]
- Update the firewall rule + alias
- Update the inventory with the VMs and network interfaces according to ^
- Keep worker[01-02] for now (so they finish their current jobs consuming the old queue messages) [2]
- Clean up the old oneshot tasks related to ^ [4]
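
For reference, a minimal terraform sketch of what the b2ms provisioning and firewall steps could look like with the azurerm provider. This is not the actual swh provisioning code: every name, the resource group, subnet, image, admin user, and NSG are illustrative assumptions; only the VM size (Standard_B2ms) comes from this plan, and the 5672/tcp port from the amqp connection visible in the logs of [1].

```hcl
# Illustrative sketch only; see the hedges above.

variable "subnet_id" {
  description = "id of the existing workers subnet (hypothetical)"
  type        = string
}

resource "azurerm_network_interface" "vault_worker01" {
  name                = "vault-worker01-nic"
  location            = "westeurope"
  resource_group_name = "euwest-workers" # hypothetical

  ip_configuration {
    name                          = "internal"
    subnet_id                     = var.subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_linux_virtual_machine" "vault_worker01" {
  name                  = "vault-worker01"
  location              = "westeurope"
  resource_group_name   = "euwest-workers" # hypothetical
  size                  = "Standard_B2ms"  # the cheaper burstable size this plan switches to
  admin_username        = "swhadmin"       # hypothetical
  network_interface_ids = [azurerm_network_interface.vault_worker01.id]

  admin_ssh_key {
    username   = "swhadmin"                # hypothetical
    public_key = file("~/.ssh/id_rsa.pub") # hypothetical
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference { # assuming a Debian 11 marketplace image
    publisher = "Debian"
    offer     = "debian-11"
    sku       = "11"
    version   = "latest"
  }
}

# Counterpart of the "update firewall rule" step: allow the workers to reach
# rabbitmq on 5672/tcp (the amqp port seen in [1]); all other values are
# hypothetical.
resource "azurerm_network_security_rule" "workers_to_rabbitmq" {
  name                        = "allow-workers-to-rabbitmq"
  priority                    = 200
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "5672"
  source_address_prefix       = "VirtualNetwork"
  destination_address_prefix  = "*"
  resource_group_name         = "euwest-workers" # hypothetical
  network_security_group_name = "workers-nsg"    # hypothetical
}
```

The indexer-worker[01-06] resources would duplicate this (e.g. via count or for_each); the terraform side of decommissioning the old ds2v2 workers is the reverse, deleting their resource blocks and applying.
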
Note:

- This talks about the worker*.euwest.azure nodes
- Decommissioning means deleting the node, then removing references to it from the puppet master, then updating the inventory

[1]

```
Jul 18 14:45:59 vault-worker01 python3[2648]: [2022-07-18 14:45:59,239: INFO/MainProcess] vault_cooker@vault-worker01.euwest.azure.internal.softwareheritage.org ready.
Jul 18 14:58:49 vault-worker01 python3[2648]: [2022-07-18 14:58:49,852: INFO/MainProcess] Received task: swh.vault.cooking_tasks.SWHCookingTask[a3c95ae7-4256-4231-bca7-d3224a9149ce]
Jul 18 14:58:54 vault-worker01 python3[2670]: [2022-07-18 14:58:54,821: INFO/ForkPoolWorker-16] Task swh.vault.cooking_tasks.SWHCookingTask[a3c95ae7-4256-4231-bca7-d3224a9149ce] succeeded in 4.852631129999963s: None
Jul 18 15:01:58 vault-worker02 python3[617]: [2022-07-18 15:01:58,023: INFO/MainProcess] Connected to amqp://swhconsumer:**@rabbitmq:5672//
Jul 18 15:01:58 vault-worker02 python3[617]: [2022-07-18 15:01:58,293: INFO/MainProcess] vault_cooker@vault-worker02.euwest.azure.internal.softwareheritage.org ready.
Jul 18 15:02:59 vault-worker02 python3[617]: [2022-07-18 15:02:59,734: INFO/MainProcess] Received task: swh.vault.cooking_tasks.SWHCookingTask[e3649dcc-9d53-4d88-8245-2543e97d584a]
Jul 18 15:03:19 vault-worker02 python3[997]: [2022-07-18 15:03:19,915: INFO/ForkPoolWorker-16] Task swh.vault.cooking_tasks.SWHCookingTask[e3649dcc-9d53-4d88-8245-2543e97d584a] succeeded in 20.0749026s: None
```

[2] There is too much lag, which would take a long time to subside with only 2 VMs. Instead, since the new VMs will work on the reset topics and go over the missing data [3], we can just scrap those two now after all.

[3] #4282 (closed)

[4]

```
11:50:47 softwareheritage-scheduler@belvedere:5432=> select now(), status, count(*) from task where type = 'index-origin-metadata' group by status;
+-------------------------------+------------------------+---------+
|              now              |         status         |  count  |
+-------------------------------+------------------------+---------+
| 2022-07-19 09:50:55.403248+00 | next_run_not_scheduled | 9802941 |
| 2022-07-19 09:50:55.403248+00 | next_run_scheduled     |    5263 |
| 2022-07-19 09:50:55.403248+00 | completed              | 3225591 |
| 2022-07-19 09:50:55.403248+00 | disabled               |    5736 |
+-------------------------------+------------------------+---------+
(4 rows)

Time: 27451.213 ms (00:27.451)

softwareheritage-scheduler=# update task set status='disabled' where type = 'index-origin-metadata' and status in ('next_run_scheduled', 'next_run_not_scheduled');
UPDATE 9808204

12:28:16 softwareheritage-scheduler@belvedere:5432=> select now(), status, count(*) from task where type = 'index-origin-metadata' group by status;
+-------------------------------+-----------+---------+
|              now              |  status   |  count  |
+-------------------------------+-----------+---------+
| 2022-07-19 10:28:26.489037+00 | completed | 3225591 |
| 2022-07-19 10:28:26.489037+00 | disabled  | 9813940 |
+-------------------------------+-----------+---------+
(2 rows)

Time: 32793.481 ms (00:32.793)
```

(ongoing ^)
Migrated from T4395 (view on Phabricator)