The scrubbers of the secondary database are still scrubbing somerset, which has been decommissioned for a couple of months now.
The configuration must be updated to scrub massmoca instead.
A config-id (generated after the datastore init) should also be added to the configuration to avoid/fix the current issue with the systemd services (related to swh/devel/swh-scrubber#4696 (closed)).
Alternatively, update the current check-config-* entries in place with the new datastore id.
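To illustrate why a generated config-id helps here: resolving a check configuration by a stable id avoids the ambiguity of name-based lookups once a datastore is replaced. This is a minimal hypothetical sketch (the dict and helper names are made up for illustration, not the actual swh-scrubber API):

```python
# Hypothetical in-memory stand-in for the check_config table shown below.
CHECK_CONFIGS = {
    13: {"name": "check-config-directory", "datastore": 165895},
    14: {"name": "check-config-revision", "datastore": 165895},
}

def config_by_id(config_id: int) -> dict:
    """Lookup by a stable, generated id: unambiguous even if a
    datastore is decommissioned and its config names are reused."""
    return CHECK_CONFIGS[config_id]

def config_by_name(name: str) -> dict:
    """Name-based lookup: breaks (or picks the wrong row) as soon as
    two datastores end up with check configs of the same name."""
    matches = [c for c in CHECK_CONFIGS.values() if c["name"] == name]
    assert len(matches) == 1, f"ambiguous or missing config: {name}"
    return matches[0]
```

With an id baked into the systemd unit parameters, a datastore swap would not require renaming or repointing anything by hand.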
Which is what I did to unstick this; I just updated the configuration in the db:

```
2023-08-24 13:20:25 swh-scrubber@belvedere:5432 λ select * from datastore;
+--------+---------+------------+-------------------------------------------------------------------------------------------------------+
| id     | package | class      | instance                                                                                              |
+--------+---------+------------+-------------------------------------------------------------------------------------------------------+
| 1      | storage | postgresql | user=guest password=xxx dbname=softwareheritage host=db.internal.softwareheritage.org port=5432       |
| 4997   | storage | postgresql | user=guest password=xxx dbname=softwareheritage host=somerset.internal.softwareheritage.org port=5432 |
| 165895 | storage | postgresql | user=guest password=xxx dbname=softwareheritage host=massmoca.internal.softwareheritage.org port=5432 |
+--------+---------+------------+-------------------------------------------------------------------------------------------------------+
(3 rows)

Time: 11.720 ms
2023-08-24 13:21:40 swh-scrubber@belvedere:5432 λ update check_config set datastore=165895 where datastore=4997 and name like 'check-%';
UPDATE 4
Time: 4.790 ms
2023-08-24 13:21:44 *swh-scrubber@belvedere:5432 λ select * from check_config where name like 'check-%' and datastore!=1;
+----+-----------+-------------+---------------+------------------------+---------+
| id | datastore | object_type | nb_partitions | name                   | comment |
+----+-----------+-------------+---------------+------------------------+---------+
| 13 | 165895    | directory   | 16384         | check-config-directory | (null)  |
| 14 | 165895    | revision    | 16384         | check-config-revision  | (null)  |
| 15 | 165895    | release     | 4096          | check-config-release   | (null)  |
| 16 | 165895    | snapshot    | 16384         | check-config-snapshot  | (null)  |
+----+-----------+-------------+---------------+------------------------+---------+
(4 rows)

Time: 6.435 ms
```
It does not seem to be enough to fully unstick this though...
After a puppet run, the failing services are restarted, but they still fail at some point [1] with the following issue [2]:
[2]
```
root@scrubber1:~# systemctl status swh-scrubber-checker-postgres@secondary-directory-0.service
● swh-scrubber-checker-postgres@secondary-directory-0.service - Software Heritage Scrubber Checker Postgres secondary-directory-0
   Loaded: loaded (/etc/systemd/system/swh-scrubber-checker-postgres@.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/swh-scrubber-checker-postgres@secondary-directory-0.service.d
           └─parameters.conf
   Active: failed (Result: exit-code) since Thu 2023-08-24 11:28:03 UTC; 2min 35s ago
  Process: 1378082 ExecStart=/usr/bin/swh scrubber check storage $SWH_SCRUBBER_CLI_EXTRA_ARGS (code=exited, status=1/FAILURE)
 Main PID: 1378082 (code=exited, status=1/FAILURE)

Aug 24 11:28:03 scrubber1 swh[1378082]:     result = fn(*args, **kwargs)
Aug 24 11:28:03 scrubber1 swh[1378082]:   File "/usr/lib/python3/dist-packages/swh/scrubber/storage_checker.py", line 244, in _check_partition
Aug 24 11:28:03 scrubber1 swh[1378082]:     with self.statsd.timed(
Aug 24 11:28:03 scrubber1 swh[1378082]:   File "/usr/lib/python3/dist-packages/swh/scrubber/storage_checker.py", line 166, in statsd
Aug 24 11:28:03 scrubber1 swh[1378082]:     "datastore_package": self.datastore.package,
Aug 24 11:28:03 scrubber1 swh[1378082]:   File "/usr/lib/python3/dist-packages/swh/scrubber/storage_checker.py", line 155, in datastore
Aug 24 11:28:03 scrubber1 swh[1378082]:     assert self.config.datastore_id == datastore_id
Aug 24 11:28:03 scrubber1 swh[1378082]: AssertionError
Aug 24 11:28:03 scrubber1 systemd[1]: swh-scrubber-checker-postgres@secondary-directory-0.service: Main process exited, code=exited, status=1/FAILURE
Aug 24 11:28:03 scrubber1 systemd[1]: swh-scrubber-checker-postgres@secondary-directory-0.service: Failed with result 'exit-code'.
```
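For context, the assertion in the traceback compares the datastore id recorded in the check_config row loaded at startup against the id of the datastore the checker actually resolves at runtime. A simplified sketch of that invariant (hypothetical stand-in types, not the real code in `swh/scrubber/storage_checker.py`):

```python
from dataclasses import dataclass


@dataclass
class CheckConfig:
    """Simplified stand-in for one row of the check_config table."""
    name: str
    datastore_id: int


def check_datastore(config: CheckConfig, runtime_datastore_id: int) -> None:
    # Mirrors the invariant asserted at storage_checker.py line 155 in
    # the traceback above: the datastore id stored in the check_config
    # row must match the datastore the checker is actually talking to.
    assert config.datastore_id == runtime_datastore_id


# Stale row: still points at somerset (id 4997) while the checker now
# resolves massmoca (id 165895) -> AssertionError, the failure above.
stale = CheckConfig("check-config-directory", datastore_id=4997)

# After `update check_config set datastore=165895 ...` the ids agree
# and the check passes.
fixed = CheckConfig("check-config-directory", datastore_id=165895)
check_datastore(fixed, 165895)
```

This also explains why the SQL update alone was not enough: running services that had already loaded the stale row keep failing until they are restarted with a configuration pointing at the new datastore.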
Deployment of the fix is ongoing. It's been manually patched and the services are already running [1] (and icinga is happy [2]).
I'm aiming for the proper deployment fix though ;)
[1]
```
root@scrubber1:~# systemctl status swh-scrubber-checker-postgres@secondary-* | grep active
   Active: active (running) since Thu 2023-08-24 12:44:32 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:30 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:31 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:30 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:41:21 UTC; 12min ago
   Active: active (running) since Thu 2023-08-24 12:44:32 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:30 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:31 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:34 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:34 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:34 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:31 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:33 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:33 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:33 UTC; 9min ago
   Active: active (running) since Thu 2023-08-24 12:44:32 UTC; 9min ago
```
[2]
```
14:45 <+swhbot> icinga RECOVERY: service check_systemd on scrubber1.internal.softwareheritage.org is OK: SYSTEMD OK - all
```
```
root@scrubber1:~# dpkg -l python3-swh.scrubber | grep ii
ii  python3-swh.scrubber  2.0.3-1~swh2~bpo10+1  all  Software Heritage Datastore Scrubber
root@scrubber1:~# systemctl list-units | grep swh | grep secondary | awk '{print $1}' | xargs systemctl status | grep active
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 56s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:51 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
   Active: active (running) since Thu 2023-08-24 13:32:50 UTC; 55s ago
```