One of the SSDs on belvedere has shown too many read errors and has been kicked out of the ZFS pool.
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 661
  class: statechange
  state: FAULTED
   host: belvedere.internal.softwareheritage.org
   time: 2020-12-15 13:33:01+0000
  vpath: /dev/disk/by-id/wwn-0x500003965c898968-part1
  vphys: pci-0000:3b:00.0-scsi-0:0:20:0
  vguid: 0x400555B66F317456
  devid: scsi-3500003965c898968-part1
   pool: 0xA4C042D6C48B4A15
The read errors don't look constrained to a single area of the disk.
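For the record, a hedged sketch of how the error distribution can be checked (the exact commands used aren't in this log; the device path is taken from the ZED event above):

# look at the drive's SMART attributes/error log and at the kernel messages
# to see whether the read errors cluster around a single LBA range
smartctl -a /dev/disk/by-id/wwn-0x500003965c898968
dmesg | grep -i 'i/o error'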
We don't have spare disks in this pool, so evacuate the data from the mirror pair onto other disks:
root@belvedere:~# zpool remove data mirror-3
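The evacuation runs in the background; progress shows up under the "remove:" section of zpool status, e.g.:

# follow the evacuation progress (see the "remove:" section in the output below)
watch -n 60 zpool status data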
Stopping the indexer workers to lessen the load on the server a little:
olasd@pergamon $ sudo clush -w @azure-workers 'puppet agent --disable "storage upgrade"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@indexer*; do systemctl disable $unit; done; systemctl stop swh-worker@indexer*'
Current output of zpool status:
root@belvedere:~# zpool status
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 03:39:19 with 0 errors on Wed Dec  2 04:03:21 2020
remove: Evacuation of mirror in progress since Tue Dec 15 13:53:16 2020
        69.6G copied out of 1.23T at 112M/s, 5.52% done, 3h2m to go
        2.62M memory used for removed device mappings
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000397bc8969c1  ONLINE       0     0     0
            wwn-0x50000396dc8a2191  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x500003965c898904  ONLINE       0     0     0
            wwn-0x500003965c898910  ONLINE       0     0     0
          mirror-3                  DEGRADED     0     0     0
            wwn-0x500003965c89895c  ONLINE       0     0     0
            wwn-0x500003965c898968  FAULTED     16     0     0  too many errors
          mirror-5                  ONLINE       0     0     0
            wwn-0x50000396dc8a2161  ONLINE       0     0     0
            wwn-0x500003972c8a44d1  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x50000396dc8a2185  ONLINE       0     0     0
            wwn-0x50000397bc8969b1  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            wwn-0x5002538c409a8258  ONLINE       0     0     0
            wwn-0x5002538c40ba6713  ONLINE       0     0     0
          mirror-8                  ONLINE       0     0     0
            wwn-0x5002538c409a825a  ONLINE       0     0     0
            wwn-0x5002538c40ba66ab  ONLINE       0     0     0
          mirror-9                  ONLINE       0     0     0
            wwn-0x5002538c409a8252  ONLINE       0     0     0
            wwn-0x5002538c409a8256  ONLINE       0     0     0
          mirror-10                 ONLINE       0     0     0
            wwn-0x5002538c40ba6716  ONLINE       0     0     0
            wwn-0x5002538c409a8253  ONLINE       0     0     0

errors: No known data errors
After the evacuation of the mirror pair containing the dead disk completed, I reintroduced the surviving disk of the pair as a spare, and started a zpool scrub to check data integrity across the whole pool.
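A hedged sketch of those two steps (the exact invocations weren't recorded; the device is the surviving member of mirror-3 from the status output above):

# re-add the surviving disk of the removed mirror as a hot spare,
# then re-read and verify every block in the pool
zpool add data spare wwn-0x500003965c89895c
zpool scrub data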
The scrub threw a lot of checksum errors, with apparently identical counts on both disks of several mirror pairs (indicating actual data corruption, since both copies are affected).
I stopped all workers at this point to reduce the load on the server.
Unfortunately, after zfs decided to use the spare disk to replace one of the disks it considered faulty and finished resilvering its data, it looks like the checksum errors got cleared.
The quantity of checksum errors was suspicious, considering that the automatic monthly scrub had not shown any integrity issues. I've started another scrub to double check last night's results...
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 02:49:03 with 4 errors on Wed Dec 16 11:32:55 2020
remove: Removal of vdev 3 copied 1.23T in 5h35m, completed on Tue Dec 15 19:28:52 2020
        88.3M memory used for removed device mappings
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000397bc8969c1  ONLINE       0     0     4
            wwn-0x50000396dc8a2191  ONLINE       0     0     4
          mirror-1                  ONLINE       0     0     0
            wwn-0x500003965c898904  ONLINE       0     0     2
            wwn-0x500003965c898910  ONLINE       0     0     2
          mirror-5                  ONLINE       0     0     0
            wwn-0x50000396dc8a2161  ONLINE       0     0     0
            wwn-0x500003972c8a44d1  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x50000396dc8a2185  ONLINE       0     0     4
            wwn-0x50000397bc8969b1  ONLINE       0     0     4
          mirror-7                  ONLINE       0     0     0
            wwn-0x5002538c409a8258  ONLINE       0     0     0
            wwn-0x5002538c40ba6713  ONLINE       0     0     0
          mirror-8                  ONLINE       0     0     0
            wwn-0x5002538c409a825a  ONLINE       0     0     0
            wwn-0x5002538c40ba66ab  ONLINE       0     0     0
          mirror-9                  ONLINE       0     0     0
            wwn-0x5002538c409a8252  ONLINE       0     0     0
            wwn-0x5002538c409a8256  ONLINE       0     0     0
          mirror-10                 ONLINE       0     0     0
            wwn-0x5002538c40ba6716  ONLINE       0     0     0
            wwn-0x5002538c409a8253  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /srv/softwareheritage/postgres/12/main/base/16401/11390005.9
        /srv/softwareheritage/postgres/12/main/base/16401/5682137.167
        /srv/softwareheritage/postgres/12/main/base/16401/17165.3
        /srv/softwareheritage/postgres/12/main/base/16401/5682137.102
This is really confusing, because this set of corrupted files is much, much, much smaller than what was reported yesterday. I feel that this points at a systemic issue rather than actual data corruption.
(I have some bits of yesterday's error logs in my paste history:
errors: Permanent errors have been detected in the following files:

        /srv/softwareheritage/postgres/12/main/base/16401/5682148.85
        /srv/softwareheritage/postgres/12/main/base/16401/1497181971.3
        /srv/softwareheritage/postgres/12/main/base/16401/5682148.433
        /srv/softwareheritage/postgres/12/main/base/16401/5682148.67
        /srv/softwareheritage/postgres/12/main/base/16401/5682148.210
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.65
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.66
I've now (manually) switched the read-only users of the database (moma, storage01.euwest.azure) over to the replica on somerset, so we can dig into this further without more visible downtime...
At this point, the four files referenced by the postgres errors do seem to actually be corrupted: zfs refuses to read some blocks of these files, and (apparently) only these blocks.
Considering the small extent of the actual damage (until proven otherwise), it's worth investigating restoring these database objects specifically rather than restoring the full database from backup.
Identifying the affected postgres objects
PostgreSQL database files are named after the oid (database object id) of the object they contain; to find out which object is stored in a given file, we can query the pg_class catalog table:
select oid, relname from pg_class where oid in (17165, 5682137, 11390005);
   oid    │      relname
──────────┼───────────────────
 11390005 │ origin_visit_pkey
  5682137 │ directory_pkey
    17165 │ person
(3 rows)
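For completeness, the 16401 and 16402 components of the file paths are database oids, which can be resolved the same way (a hedged example, not taken from the original session):

# map the "base/<oid>" path components to database names
sudo -u postgres psql -c "select oid, datname from pg_database where oid in (16401, 16402);"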
Fixing the issue
The two indexes (origin_visit_pkey, directory_pkey) can be rebuilt with the REINDEX command.
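A minimal sketch of that step, assuming it was run against the main database (the exact invocation isn't in the log; REINDEX ... CONCURRENTLY is available on this PostgreSQL 12 cluster and avoids holding an exclusive lock during the rebuild):

# rebuild the two indexes whose underlying files were corrupted
sudo -u postgres psql softwareheritage -c 'reindex index concurrently origin_visit_pkey;'
sudo -u postgres psql softwareheritage -c 'reindex index concurrently directory_pkey;'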
The person table is complete on the replica (somerset), checked with select max(id) from person, so we should be able to dump it from there and restore it on belvedere.
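A hedged sketch of the dump side (run on somerset; the resulting file then gets transferred to belvedere and loaded in the session below):

# dump the person table from the somerset replica to a flat file
sudo -u postgres psql softwareheritage -c "\copy person to '/tmp/person'"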
In the meantime, I restored the person table's data from somerset:
softwareheritage=# begin;
BEGIN
softwareheritage=# alter table release drop CONSTRAINT release_author_fkey;
ALTER TABLE
softwareheritage=# alter table revision drop CONSTRAINT revision_author_fkey;
ALTER TABLE
softwareheritage=# alter table revision drop CONSTRAINT revision_committer_fkey;
ALTER TABLE
softwareheritage=# truncate table person;
TRUNCATE TABLE
softwareheritage=# \copy person from '/tmp/person'
COPY 40318974
softwareheritage=# alter table release add constraint release_author_fkey foreign key (author) references person(id) not valid;
ALTER TABLE
softwareheritage=# alter table release validate constraint release_author_fkey;
ALTER TABLE
softwareheritage=# savepoint after_release;
SAVEPOINT
softwareheritage=# alter table revision add constraint revision_author_fkey foreign key (author) references person(id), add constraint revision_committer_fkey foreign key (committer) references person(id);
ALTER TABLE
softwareheritage=# commit;
COMMIT
After this was done, I restarted a zpool scrub. When it was at 80% with no corruption detected, I restarted a worker.
lo and behold:
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 02:45:53 with 1692 errors on Thu Dec 17 00:01:33 2020
remove: Removal of vdev 3 copied 1.23T in 5h35m, completed on Tue Dec 15 19:28:52 2020
        88.3M memory used for removed device mappings
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            wwn-0x50000397bc8969c1  DEGRADED     0     0   380  too many errors
            wwn-0x50000396dc8a2191  DEGRADED     0     0   380  too many errors
          mirror-1                  DEGRADED     0     0     0
            wwn-0x500003965c898904  DEGRADED     0     0   188  too many errors
            wwn-0x500003965c898910  DEGRADED     0     0   188  too many errors
          mirror-5                  DEGRADED     0     0     0
            wwn-0x50000396dc8a2161  DEGRADED     0     0   462  too many errors
            wwn-0x500003972c8a44d1  DEGRADED     0     0   462  too many errors
          mirror-6                  DEGRADED     0     0     0
            wwn-0x50000396dc8a2185  DEGRADED     0     0   344  too many errors
            wwn-0x50000397bc8969b1  DEGRADED     0     0   344  too many errors
          mirror-7                  DEGRADED     0     0     0
            wwn-0x5002538c409a8258  DEGRADED     0     0   980  too many errors
            wwn-0x5002538c40ba6713  DEGRADED     0     0   980  too many errors
          mirror-8                  DEGRADED     0     0     0
            wwn-0x5002538c409a825a  DEGRADED     0     0   286  too many errors
            wwn-0x5002538c40ba66ab  DEGRADED     0     0   286  too many errors
          mirror-9                  DEGRADED     0     0     0
            wwn-0x5002538c409a8252  DEGRADED     0     0   298  too many errors
            wwn-0x5002538c409a8256  DEGRADED     0     0   298  too many errors
          mirror-10                 DEGRADED     0     0     0
            wwn-0x5002538c40ba6716  DEGRADED     0     0   470  too many errors
            wwn-0x5002538c409a8253  DEGRADED     0     0   470  too many errors

errors: Permanent errors have been detected in the following files:

        /srv/softwareheritage/postgres/12/main/base/16401/17157.12
        /srv/softwareheritage/postgres/12/secondary/base/16404/1039026315.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.4
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.7
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.4
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.11
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.7
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.12
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.10
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.8
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.70
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527481.77
        /srv/softwareheritage/postgres/12/secondary/base/16402/12351813.83
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.4
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.10
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.13
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.8
        /srv/softwareheritage/postgres/12/secondary/base/16404/1039026315.30
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.132
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.9
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.8
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.2
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698606.9
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.3
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.6
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.1
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042698709.7
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699422.5
        /srv/softwareheritage/postgres/12/secondary/base/16404/1042699581.4
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527473.71
        /srv/softwareheritage/postgres/12/secondary/base/16402/1527453.8
None of these files look corrupted (pv -E $file reports no errors, while it did report some errors on the leftover files from yesterday's scrub).
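A minimal sketch of that check, assuming the list of flagged paths has been saved to a text file (hypothetical name files.txt, one path per line):

# read every flagged file in full; pv -E skips over read errors and
# reports them instead of aborting, so any unreadable block shows up here
while read -r f; do
    echo "== $f"
    pv -E "$f" > /dev/null
done < files.txt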
These files look active (i.e. they're files for tables postgres is actively writing to).
This points to either:
a disk controller issue (a race condition between writes and reads due to caching?)
a memory (RAM) issue
I've double-checked and all the files referenced can be read without issue; this means that the checksums of the data blocks are, in fact, okay.
The RAM is all ECC and I don't see any memory errors flagged anywhere, so I very much doubt it's at fault.
This leaves a controller issue. The disks are set as non-RAID, but there's very likely a write-back cache enabled by the controller. Using the Dell OpenManage tools, there doesn't seem to be a way to confirm that this cache applies to non-RAID disks (the controller does report 2048 MB of on-board cache), nor a way to disable it ("command not supported").
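For reference, a hedged sketch of the kind of OpenManage queries involved (the controller id 0 is an assumption; the exact commands aren't in the log):

# the controller reports its on-board cache size here, but exposes no
# obvious per-disk write-cache toggle for non-RAID disks
omreport storage controller controller=0
omreport storage pdisk controller=0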
I guess the next step is using the controller in HBA mode instead of RAID mode. That means having to transfer the OS out of the current RAID1.
To switch the controller to HBA mode instead of RAID mode, we needed to clear all virtual disks first. Thankfully we had one spare SSD in the system, so I did the following:
boot on live OS
make filesystem on temp ssd
rsync data from old RAID1 to temp ssd
once that's done, delete the virtual disk from the storage config (via iDRAC) and switch the disks to non-RAID mode
re-partition the two disks: ESP (1 GB), RAID partition (rest of the disk); see the sketch after this list
create an mdadm RAID10 (far2 layout) array on the two RAID partitions: `mdadm --create /dev/md0 --level 10 --layout f2 --raid-devices 2 /dev/sdr2 /dev/sdt2`
create a vfat filesystem for the ESP: mkfs.vfat -F32 /dev/sdr1
mount the new partitions, rsync the data from temp ssd to new partitions
update /etc/fstab with new partition UUIDs (use blkid to find them)
update-initramfs -k all -u to update mdadm config in the initramfs
update-grub to update the grub config
grub-install dummy to update the boot manager setup (efi variables)
efibootmgr -v to check that the new entry for "Debian" points at the new EFI system partition (compare the filesystem UUID with the output of blkid)
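A hedged sketch of the re-partitioning step referenced above (device names and the choice of sgdisk are assumptions):

# wipe and re-partition both SSDs: a 1 GB EFI system partition and a
# Linux RAID partition spanning the rest of the disk
for disk in /dev/sdr /dev/sdt; do
    sgdisk --zap-all "$disk"
    sgdisk --new=1:0:+1G --typecode=1:ef00 "$disk"
    sgdisk --new=2:0:0   --typecode=2:fd00 "$disk"
done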
After all that was done, I switched the controller to HBA mode in the iDRAC and set the change to apply at the next reboot, then unmounted all partitions and rebooted out of the live system.
The system came back up from the (new software) RAID.
All workers have been restarted now and zfs is not seeing any more issues. We'll need to order replacements for the oldest disks in the array, which can happen in a follow-up task.