The scrubbers of the secondary database are still scrubbing somerset, which has been decommissioned for a couple of months now.
The configuration must be updated to scrub massmoca instead.
A config-id (generated after the datastore init) should also be added to the configuration to avoid/fix the current issue with the systemd services (related to swh/devel/swh-scrubber#4696 (closed)).
Alternatively, update the current check-config-* entries in place with the new datastore id.
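To illustrate why a generated config-id helps here: resolving a check configuration by a stable id avoids the ambiguity of name-based lookups once a datastore is replaced. This is a minimal hypothetical sketch (the dict and helper names are made up for illustration, not the actual swh-scrubber API):

```python
# Hypothetical in-memory stand-in for the check_config table shown below.
CHECK_CONFIGS = {
    13: {"name": "check-config-directory", "datastore": 165895},
    14: {"name": "check-config-revision", "datastore": 165895},
}

def config_by_id(config_id: int) -> dict:
    """Lookup by a stable, generated id: unambiguous even if a
    datastore is decommissioned and its config names are reused."""
    return CHECK_CONFIGS[config_id]

def config_by_name(name: str) -> dict:
    """Name-based lookup: breaks (or picks the wrong row) as soon as
    two datastores end up with check configs of the same name."""
    matches = [c for c in CHECK_CONFIGS.values() if c["name"] == name]
    assert len(matches) == 1, f"ambiguous or missing config: {name}"
    return matches[0]
```

With an id baked into the systemd unit parameters, a datastore swap would not require renaming or repointing anything by hand.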
Which is what I did to unstick this; I just updated the configuration in the db:

```
2023-08-24 13:20:25 swh-scrubber@belvedere:5432 λ select * from datastore;
+--------+---------+------------+-------------------------------------------------------------------------------------------------------+
| id     | package | class      | instance                                                                                              |
+--------+---------+------------+-------------------------------------------------------------------------------------------------------+
| 1      | storage | postgresql | user=guest password=xxx dbname=softwareheritage host=db.internal.softwareheritage.org port=5432       |
| 4997   | storage | postgresql | user=guest password=xxx dbname=softwareheritage host=somerset.internal.softwareheritage.org port=5432 |
| 165895 | storage | postgresql | user=guest password=xxx dbname=softwareheritage host=massmoca.internal.softwareheritage.org port=5432 |
+--------+---------+------------+-------------------------------------------------------------------------------------------------------+
(3 rows)

Time: 11.720 ms
2023-08-24 13:21:40 swh-scrubber@belvedere:5432 λ update check_config set datastore=165895 where datastore=4997 and name like 'check-%';
UPDATE 4
Time: 4.790 ms
2023-08-24 13:21:44 *swh-scrubber@belvedere:5432 λ select * from check_config where name like 'check-%' and datastore!=1;
+----+-----------+-------------+---------------+------------------------+---------+
| id | datastore | object_type | nb_partitions | name                   | comment |
+----+-----------+-------------+---------------+------------------------+---------+
| 13 | 165895    | directory   | 16384         | check-config-directory | (null)  |
| 14 | 165895    | revision    | 16384         | check-config-revision  | (null)  |
| 15 | 165895    | release     | 4096          | check-config-release   | (null)  |
| 16 | 165895    | snapshot    | 16384         | check-config-snapshot  | (null)  |
+----+-----------+-------------+---------------+------------------------+---------+
(4 rows)

Time: 6.435 ms
```
It does not seem to be enough to fully unstick this though...
After a puppet run, the failing services are restarted, but they still fail at some point [1] with the following issue [2]:
[2]
```
root@scrubber1:~# systemctl status swh-scrubber-checker-postgres@secondary-directory-0.service
● swh-scrubber-checker-postgres@secondary-directory-0.service - Software Heritage Scrubber Checker Postgres secondary-directory-0
   Loaded: loaded (/etc/systemd/system/swh-scrubber-checker-postgres@.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/swh-scrubber-checker-postgres@secondary-directory-0.service.d
           └─parameters.conf
   Active: failed (Result: exit-code) since Thu 2023-08-24 11:28:03 UTC; 2min 35s ago
  Process: 1378082 ExecStart=/usr/bin/swh scrubber check storage $SWH_SCRUBBER_CLI_EXTRA_ARGS (code=exited, status=1/FAILURE)
 Main PID: 1378082 (code=exited, status=1/FAILURE)

Aug 24 11:28:03 scrubber1 swh[1378082]:     result = fn(*args, **kwargs)
Aug 24 11:28:03 scrubber1 swh[1378082]:   File "/usr/lib/python3/dist-packages/swh/scrubber/storage_checker.py", line 244, in _check_partition
Aug 24 11:28:03 scrubber1 swh[1378082]:     with self.statsd.timed(
Aug 24 11:28:03 scrubber1 swh[1378082]:   File "/usr/lib/python3/dist-packages/swh/scrubber/storage_checker.py", line 166, in statsd
Aug 24 11:28:03 scrubber1 swh[1378082]:     "datastore_package": self.datastore.package,
Aug 24 11:28:03 scrubber1 swh[1378082]:   File "/usr/lib/python3/dist-packages/swh/scrubber/storage_checker.py", line 155, in datastore
Aug 24 11:28:03 scrubber1 swh[1378082]:     assert self.config.datastore_id == datastore_id
Aug 24 11:28:03 scrubber1 swh[1378082]: AssertionError
Aug 24 11:28:03 scrubber1 systemd[1]: swh-scrubber-checker-postgres@secondary-directory-0.service: Main process exited, code=exited, status=1/FAILURE
Aug 24 11:28:03 scrubber1 systemd[1]: swh-scrubber-checker-postgres@secondary-directory-0.service: Failed with result 'exit-code'.
```
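For context, the assertion in the traceback compares the datastore id recorded in the check_config row loaded at startup against the id of the datastore the checker actually resolves at runtime. A simplified sketch of that invariant (hypothetical stand-in types, not the real code in `swh/scrubber/storage_checker.py`):

```python
from dataclasses import dataclass


@dataclass
class CheckConfig:
    """Simplified stand-in for one row of the check_config table."""
    name: str
    datastore_id: int


def check_datastore(config: CheckConfig, runtime_datastore_id: int) -> None:
    # Mirrors the invariant asserted at storage_checker.py line 155 in
    # the traceback above: the datastore id stored in the check_config
    # row must match the datastore the checker is actually talking to.
    assert config.datastore_id == runtime_datastore_id


# Stale row: still points at somerset (id 4997) while the checker now
# resolves massmoca (id 165895) -> AssertionError, the failure above.
stale = CheckConfig("check-config-directory", datastore_id=4997)

# After `update check_config set datastore=165895 ...` the ids agree
# and the check passes.
fixed = CheckConfig("check-config-directory", datastore_id=165895)
check_datastore(fixed, 165895)
```

This also explains why the SQL update alone was not enough: running services that had already loaded the stale row keep failing until they are restarted with a configuration pointing at the new datastore.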
Deployment of the fix is ongoing. It's been manually patched and the services are already running [1] (and icinga is happy [2]).
I'm aiming for the proper deployment fix though ;)
[1]
```
root@scrubber1:~# systemctl status swh-scrubber-checker-postgres@secondary-* | grep active
   Active: active (running) since Thu 2023-08-24 12:44:32 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:30 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:31 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:30 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:41:21 UTC; 12min ago
   Active: active (running) since Thu 2023-08-24 12:44:32 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:30 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:31 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:34 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:34 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:34 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:31 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:33 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:33 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:33 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:32 UTC; 9min ago
```
[2]
```
14:45 <+swhbot> icinga RECOVERY: service check_systemd on scrubber1.internal.softwareheritage.org is OK: SYSTEMD OK - all
```
```
root@scrubber1:~# dpkg -l python3-swh.scrubber | grep ii
ii  python3-swh.scrubber  2.0.3-1~swh2~bpo10+1  all  Software Heritage Datastore Scrubber
root@scrubber1:~# systemctl list-units | grep swh | grep secondary | awk '{print $1}' | xargs systemctl status | grep active
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
```