One of the SSDs on belvedere has shown too many read errors and has been kicked out of the ZFS pool.
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 661
  class: statechange
  state: FAULTED
   host: belvedere.internal.softwareheritage.org
   time: 2020-12-15 13:33:01+0000
  vpath: /dev/disk/by-id/wwn-0x500003965c898968-part1
  vphys: pci-0000:3b:00.0-scsi-0:0:20:0
  vguid: 0x400555B66F317456
  devid: scsi-3500003965c898968-part1
   pool: 0xA4C042D6C48B4A15
The read errors don't look constrained to a single area of the disk.
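For the record, a hedged sketch of how the error distribution can be checked (the exact commands used aren't in this log; the device path is taken from the ZED event above):

# look at the drive's SMART attributes/error log and at the kernel messages
# to see whether the read errors cluster around a single LBA range
smartctl -a /dev/disk/by-id/wwn-0x500003965c898968
dmesg | grep -i 'i/o error'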
We don't have spare disks in this pool, so evacuate the data from the mirror pair onto other disks:
root@belvedere:~# zpool remove data mirror-3
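The evacuation runs in the background; progress shows up under the "remove:" section of zpool status, e.g.:

# follow the evacuation progress (see the "remove:" section in the output below)
watch -n 60 zpool status data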
Stopping the indexer workers to lessen the load on the server a little:
olasd@pergamon $ sudo clush -w @azure-workers 'puppet agent --disable "storage upgrade"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@indexer*; do systemctl disable $unit; done; systemctl stop swh-worker@indexer*'
Current output of zpool status:
root@belvedere:~# zpool status
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 03:39:19 with 0 errors on Wed Dec  2 04:03:21 2020
remove: Evacuation of mirror in progress since Tue Dec 15 13:53:16 2020
        69.6G copied out of 1.23T at 112M/s, 5.52% done, 3h2m to go
        2.62M memory used for removed device mappings
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000397bc8969c1  ONLINE       0     0     0
            wwn-0x50000396dc8a2191  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x500003965c898904  ONLINE       0     0     0
            wwn-0x500003965c898910  ONLINE       0     0     0
          mirror-3                  DEGRADED     0     0     0
            wwn-0x500003965c89895c  ONLINE       0     0     0
            wwn-0x500003965c898968  FAULTED     16     0     0  too many errors
          mirror-5                  ONLINE       0     0     0
            wwn-0x50000396dc8a2161  ONLINE       0     0     0
            wwn-0x500003972c8a44d1  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x50000396dc8a2185  ONLINE       0     0     0
            wwn-0x50000397bc8969b1  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            wwn-0x5002538c409a8258  ONLINE       0     0     0
            wwn-0x5002538c40ba6713  ONLINE       0     0     0
          mirror-8                  ONLINE       0     0     0
            wwn-0x5002538c409a825a  ONLINE       0     0     0
            wwn-0x5002538c40ba66ab  ONLINE       0     0     0
          mirror-9                  ONLINE       0     0     0
            wwn-0x5002538c409a8252  ONLINE       0     0     0
            wwn-0x5002538c409a8256  ONLINE       0     0     0
          mirror-10                 ONLINE       0     0     0
            wwn-0x5002538c40ba6716  ONLINE       0     0     0
            wwn-0x5002538c409a8253  ONLINE       0     0     0

errors: No known data errors
After the evacuation of the mirror pair containing the dead disk completed, I reintroduced the surviving disk of the pair as a spare, and started a zpool scrub to check data integrity across the whole pool.
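A hedged sketch of those two steps (the exact invocations weren't recorded; the device is the surviving member of mirror-3 from the status output above):

# re-add the surviving disk of the removed mirror as a hot spare,
# then re-read and verify every block in the pool
zpool add data spare wwn-0x500003965c89895c
zpool scrub data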
The scrub threw a lot of checksum errors, with apparently identical counts on both disks of several mirror pairs (indicating actual data corruption, since both copies are affected).
I stopped all workers at this point to reduce the load on the server.
Unfortunately, after zfs decided to use the spare disk to replace one of the disks it considered faulty and finished resilvering its data, it looks like the checksum errors got cleared.
The quantity of checksum errors was suspicious, considering that the automatic monthly scrub had not shown any integrity issues. I've started another scrub to double check last night's results...
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 02:49:03 with 4 errors on Wed Dec 16 11:32:55 2020
remove: Removal of vdev 3 copied 1.23T in 5h35m, completed on Tue Dec 15 19:28:52 2020
        88.3M memory used for removed device mappings
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000397bc8969c1  ONLINE       0     0     4
            wwn-0x50000396dc8a2191  ONLINE       0     0     4
          mirror-1                  ONLINE       0     0     0
            wwn-0x500003965c898904  ONLINE       0     0     2
            wwn-0x500003965c898910  ONLINE       0     0     2
          mirror-5                  ONLINE       0     0     0
            wwn-0x50000396dc8a2161  ONLINE       0     0     0
            wwn-0x500003972c8a44d1  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x50000396dc8a2185  ONLINE       0     0     4
            wwn-0x50000397bc8969b1  ONLINE       0     0     4
          mirror-7                  ONLINE       0     0     0
            wwn-0x5002538c409a8258  ONLINE       0     0     0
            wwn-0x5002538c40ba6713  ONLINE       0     0     0
          mirror-8                  ONLINE       0     0     0
            wwn-0x5002538c409a825a  ONLINE       0     0     0
            wwn-0x5002538c40ba66ab  ONLINE       0     0     0
          mirror-9                  ONLINE       0     0     0
            wwn-0x5002538c409a8252  ONLINE       0     0     0
            wwn-0x5002538c409a8256  ONLINE       0     0     0
          mirror-10                 ONLINE       0     0     0
            wwn-0x5002538c40ba6716  ONLINE       0     0     0
            wwn-0x5002538c409a8253  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /srv/softwareheritage/postgres/12/main/base/16401/11390005.9
        /srv/softwareheritage/postgres/12/main/base/16401/5682137.167
        /srv/softwareheritage/postgres/12/main/base/16401/17165.3
        /srv/softwareheritage/postgres/12/main/base/16401/5682137.102
This is really confusing, because this set of corrupted files is much, much, much smaller than what was reported yesterday. I feel that this points at a systemic issue rather than actual data corruption.
(I have some bits of yesterday's error logs in my paste history:
errors: Permanent errors have been detected in the following files:

        /srv/softwareheritage/postgres/12/main/base/16401/5682148.85
        /srv/softwareheritage/postgres/12/main/base/16401/1497181971.3
        /srv/softwareheritage/postgres/12/main/base/16401/5682148.433
        /srv/softwareheritage/postgres/12/main/base/16401/5682148.67
        /srv/softwareheritage/postgres/12/main/base/16401/5682148.210
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.65
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.66
I've now (manually) switched the read-only users of the database (moma, storage01.euwest.azure) over to the replica on somerset, so we can dig into this further without more visible downtime...
At this point, the four files referenced by the postgres errors do seem to actually be corrupted: zfs refuses to read some blocks of these files, and (apparently) only these blocks.
Considering the small extent of the actual damage (until proven otherwise), it's worth investigating restoring these database objects specifically rather than restoring the full database from backup.
Identifying the affected postgres objects
PostgreSQL database files are named after the oid (database object id) of the object they contain; to find out which object is stored in a given file, we can query the pg_class catalog table:
select oid, relname from pg_class where oid in (17165, 5682137, 11390005);
   oid    │      relname
──────────┼───────────────────
 11390005 │ origin_visit_pkey
  5682137 │ directory_pkey
    17165 │ person
(3 rows)
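For completeness, the 16401 and 16402 components of the file paths are database oids, which can be resolved the same way (a hedged example, not taken from the original session):

# map the "base/<oid>" path components to database names
sudo -u postgres psql -c "select oid, datname from pg_database where oid in (16401, 16402);"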
Fixing the issue
The two indexes (origin_visit_pkey, directory_pkey) can be rebuilt with the REINDEX command.
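A minimal sketch of that step, assuming it was run against the main database (the exact invocation isn't in the log; REINDEX ... CONCURRENTLY is available on this PostgreSQL 12 cluster and avoids holding an exclusive lock during the rebuild):

# rebuild the two indexes whose underlying files were corrupted
sudo -u postgres psql softwareheritage -c 'reindex index concurrently origin_visit_pkey;'
sudo -u postgres psql softwareheritage -c 'reindex index concurrently directory_pkey;'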
The person table is complete on the replica (somerset), checked with select max(id) from person, so we should be able to dump it from there and restore it on belvedere.
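A hedged sketch of the dump side (run on somerset; the resulting file then gets transferred to belvedere and loaded in the session below):

# dump the person table from the somerset replica to a flat file
sudo -u postgres psql softwareheritage -c "\copy person to '/tmp/person'"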
In the meantime, I restored the person table's data from somerset:
softwareheritage=# begin;
BEGIN
softwareheritage=# alter table release drop CONSTRAINT release_author_fkey;
ALTER TABLE
softwareheritage=# alter table revision drop CONSTRAINT revision_author_fkey;
ALTER TABLE
softwareheritage=# alter table revision drop CONSTRAINT revision_committer_fkey;
ALTER TABLE
softwareheritage=# truncate table person;
TRUNCATE TABLE
softwareheritage=# \copy person from '/tmp/person'
COPY 40318974
softwareheritage=# alter table release add constraint release_author_fkey foreign key (author) references person(id) not valid;
ALTER TABLE
softwareheritage=# alter table release validate constraint release_author_fkey;
ALTER TABLE
softwareheritage=# savepoint after_release;
SAVEPOINT
softwareheritage=# alter table revision add constraint revision_author_fkey foreign key (author) references person(id), add constraint revision_committer_fkey foreign key (committer) references person(id);
ALTER TABLE
softwareheritage=# commit;
COMMIT
After this was done, I restarted a zpool scrub. When it was at 80% with no corruption detected, I restarted a worker.
lo and behold:
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 02:45:53 with 1692 errors on Thu Dec 17 00:01:33 2020
remove: Removal of vdev 3 copied 1.23T in 5h35m, completed on Tue Dec 15 19:28:52 2020
        88.3M memory used for removed device mappings
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x50000397bc8969c1  DEGRADED     0     0   380  too many errors
            wwn-0x50000396dc8a2191  DEGRADED     0     0   380  too many errors
          mirror-1                  DEGRADED     0     0     0
            wwn-0x500003965c898904  DEGRADED     0     0   188  too many errors
            wwn-0x500003965c898910  DEGRADED     0     0   188  too many errors
          mirror-5                  DEGRADED     0     0     0
            wwn-0x50000396dc8a2161  DEGRADED     0     0   462  too many errors
            wwn-0x500003972c8a44d1  DEGRADED     0     0   462  too many errors
          mirror-6                  DEGRADED     0     0     0
            wwn-0x50000396dc8a2185  DEGRADED     0     0   344  too many errors
            wwn-0x50000397bc8969b1  DEGRADED     0     0   344  too many errors
          mirror-7                  DEGRADED     0     0     0
            wwn-0x5002538c409a8258  DEGRADED     0     0   980  too many errors
            wwn-0x5002538c40ba6713  DEGRADED     0     0   980  too many errors
          mirror-8                  DEGRADED     0     0     0
            wwn-0x5002538c409a825a  DEGRADED     0     0   286  too many errors
            wwn-0x5002538c40ba66ab  DEGRADED     0     0   286  too many errors
          mirror-9                  DEGRADED     0     0     0
            wwn-0x5002538c409a8252  DEGRADED     0     0   298  too many errors
            wwn-0x5002538c409a8256  DEGRADED     0     0   298  too many errors
          mirror-10                 DEGRADED     0     0     0
            wwn-0x5002538c40ba6716  DEGRADED     0     0   470  too many errors
            wwn-0x5002538c409a8253  DEGRADED     0     0   470  too many errors

errors: Permanent errors have been detected in the following files:

        /srv/softwareheritage/postgres/12/main/base/16401/17157.12
        /srv/softwareheritage/postgres/12/secondary/base/16404/1039026315.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.4
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.7
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.4
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.11
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.7
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.12
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.10
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.8
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.70
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527481.77
        /srv/softwareheritage/postgres/12/secondary/base/16402/12351813.83
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.4
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.10
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.13
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.8
        /srv/softwareheritage/postgres/12/secondary/base/16404/1039026315.30
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.132
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.9
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.8
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.9
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.7
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.4
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.71
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527453.8
None of these files look corrupted (pv -E $file reports no errors, while it did report some errors on the leftover files from yesterday's scrub).
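A minimal sketch of that check, assuming the list of flagged paths has been saved to a text file (hypothetical name files.txt, one path per line):

# read every flagged file in full; pv -E skips over read errors and
# reports them instead of aborting, so any unreadable block shows up here
while read -r f; do
    echo "== $f"
    pv -E "$f" > /dev/null
done < files.txt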
These files look active (i.e. they're files for tables postgres is actively writing to).
This points to either:
a disk controller issue (a race condition between writes and reads due to caching?)
a memory (RAM) issue
I've double-checked and all the files referenced can be read without issue; this means that the checksums of the data blocks are, in fact, okay.
The RAM is all ECC and I don't see any memory errors flagged anywhere, so I very much doubt it's at fault.
This leaves a controller issue. The disks are set as non-RAID, but there's very likely a write-back cache enabled by the controller. Using the Dell OpenManage tools, there doesn't seem to be a way to confirm that this cache applies to non-RAID disks (the controller does report 2048 MB of on-board cache), nor a way to disable it ("command not supported").
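For reference, a hedged sketch of the kind of OpenManage queries involved (the controller id 0 is an assumption; the exact commands aren't in the log):

# the controller reports its on-board cache size here, but exposes no
# obvious per-disk write-cache toggle for non-RAID disks
omreport storage controller controller=0
omreport storage pdisk controller=0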
I guess the next step is using the controller in HBA mode instead of RAID mode. That means having to transfer the OS out of the current RAID1.
To switch the controller to HBA mode instead of RAID mode, we needed to clear all virtual disks first. Thankfully we had one spare SSD in the system, so I did the following:
boot on live OS
make filesystem on temp ssd
rsync data from old RAID1 to temp ssd
once that's done, delete the virtual disk from the storage config (via iDRAC) and switch the disks to non-RAID mode
re-partition the two disks: ESP (1 GB), RAID partition (rest of the disk); see the sketch after this list
create an mdadm RAID10 (far2 layout) array on the two RAID partitions: `mdadm --create /dev/md0 --level 10 --layout f2 --raid-devices 2 /dev/sdr2 /dev/sdt2`
create a vfat filesystem for the ESP: mkfs.vfat -F32 /dev/sdr1
mount the new partitions, rsync the data from temp ssd to new partitions
update /etc/fstab with new partition UUIDs (use blkid to find them)
update-initramfs -k all -u to update mdadm config in the initramfs
update-grub to update the grub config
grub-install dummy to update the boot manager setup (efi variables)
efibootmgr -v to check that the new entry for "Debian" points at the new EFI system partition (compare the filesystem UUID with the output of blkid)
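A hedged sketch of the re-partitioning step referenced above (device names and the choice of sgdisk are assumptions):

# wipe and re-partition both SSDs: a 1 GB EFI system partition and a
# Linux RAID partition spanning the rest of the disk
for disk in /dev/sdr /dev/sdt; do
    sgdisk --zap-all "$disk"
    sgdisk --new=1:0:+1G --typecode=1:ef00 "$disk"
    sgdisk --new=2:0:0   --typecode=2:fd00 "$disk"
done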
After all that was done, I switched the controller to HBA mode in the iDRAC and set the change to apply at the next reboot, then unmounted all partitions and rebooted out of the live system.
The system came back up from the (new software) RAID.
All workers have been restarted now and zfs is not seeing any more issues. We'll need to order replacements for the oldest disks in the array, which can happen in a follow-up task.