26/07/2021: Unstuck infrastructure outage then post-mortem
Unstuck infrastructure.
What happened so far:
- Icinga alerts (IRC notifications) around 25/07/2021 03:27 about socket timeouts [1]
- Then escalation: most public-facing services went down
- Analysis started the next morning, on 26/07
- First: status.softwareheritage.org was manually updated to announce the issue on our channels
- Unable to get SSH access to any machine (sshd hung shortly after authentication)
- Used iDRAC and the serial console to access the hypervisor(s) and analyze the trouble
- Around noon, identified the cause chain: ceph spewed lots of logs, which filled up the / disk, which crashed all ceph monitors, which made the RBD disks (used by all/most VMs, including the firewalls) unavailable (diagnostic checks sketched after this list)
- Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation
- Deleted /var/log/ceph.log from all ceph monitors to free disk space
- Restarted the ceph monitors (see the log cleanup sketch after this list)
- Restarted hypervisor3 (around noon), which looked particularly in pain (hence the version discrepancy noticed later)
- Progressively restarted the VMs (see the VM restart sketch after this list)
- This unstuck most services
- Updated status.softwareheritage.org with a partial service disruption notification
- The logs were still dumping too much information though, dangerously close to reproducing the initial issue
- Stopped the workers (see the worker stop sketch after this list)
- for host in {branly,hypervisor3,beaubourg} [3] (see the per-host restart sketch after this list):
  - Cleaned up the voluminous logs
  - Noticed a version discrepancy: 14.2.16 on {branly,beaubourg} vs 14.2.22 on hypervisor3 [4]
  - Restarted ceph-mon@
  - Restarted ceph-osd@*
  - Restarted ceph-mgr@
- ... Investigation continues and service restarts are ongoing
- VMs/services were restarted progressively over the 26-27/07 period, with extra monitoring of the hypervisors' status through a grafana dashboard [5]
- The investigation has not yet identified the root cause of the issue
- The swh status page [6] has not yet been updated with the new status; this should be done tomorrow (28/07).
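For reference, a minimal sketch of the kind of checks used to confirm the diagnosis above (root filesystem full on the ceph monitors, cluster state once the monitors answered again); the exact commands were not recorded during the incident, so this is an after-the-fact reconstruction:

```bash
# How full is the root filesystem on a monitor/hypervisor?
df -h /

# What is eating the space? (the ceph cluster log turned out to be the culprit)
du -sh /var/log/* | sort -h | tail

# Once the monitors are reachable again, check the overall cluster state
ceph -s
ceph health detail
```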
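The log cleanup and monitor restart steps above roughly correspond to the following sketch; the paths are the ones reported in this log, while the use of gzip/scp and the monitor unit instance name are assumptions:

```bash
# Archive the huge cluster log off-box for later investigation
gzip -c /var/log/ceph.log > /tmp/ceph.log.gz
scp /tmp/ceph.log.gz saam:/srv/storage/space/logs/ceph.log.gz

# Free the root filesystem on the monitor
rm /var/log/ceph.log

# Restart the monitor daemon on this host (systemd template unit)
systemctl restart "ceph-mon@$(hostname -s)"
```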
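The progressive VM restart could look like the sketch below, assuming the hypervisors run Proxmox VE and its qm CLI; both the Proxmox assumption and the pause between starts are illustrative, not taken from the log:

```bash
# List the VMs known to this hypervisor and their current state
qm list

# Start the stopped VMs one by one, pausing between starts so the freshly
# recovered ceph/RBD backend is not hit by everything at once
for vmid in $(qm list | awk '$3 == "stopped" {print $1}'); do
    qm start "$vmid"
    sleep 60
done
```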
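Stopping the workers presumably means stopping their systemd services on the worker machines so they stop adding load while ceph recovers; the swh-worker@* unit name below is purely illustrative, as the log does not name the services:

```bash
# Stop all worker service instances on this machine (unit name illustrative)
systemctl stop "swh-worker@*"

# Verify nothing is left running
systemctl list-units "swh-worker@*" --no-legend
```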
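The per-host cleanup/restart loop above, expanded into a runnable sketch; the ssh wrapper and the unit instance names are assumptions, and `ceph versions` is where the 14.2.16 vs 14.2.22 discrepancy shows up:

```bash
for host in branly hypervisor3 beaubourg; do
    ssh "$host" '
        # Drop the voluminous log so / does not fill up again
        rm -f /var/log/ceph.log

        # Restart the ceph daemons hosted on this hypervisor
        systemctl restart "ceph-mon@$(hostname -s)"
        systemctl restart "ceph-osd@*"      # all OSD instances on this host
        systemctl restart "ceph-mgr@$(hostname -s)"
    '
done

# Daemon versions across the cluster; during the incident this showed
# 14.2.16 on branly/beaubourg vs 14.2.22 on hypervisor3
ceph versions
```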
[1] migrated/migration$1092
[3] The main hypervisors our infrastructure relies upon
[4] The most likely fix happened when restarting ceph-osd@*, which somehow dropped the replaying instructions that were dumping lots of log errors.
[5] https://grafana.softwareheritage.org/goto/Z9UD7sW7z?orgId=1
Migrated from T3444 (view on Phabricator)