26/07/2021: Unstuck infrastructure outage then post-mortem
Unstuck infrastructure.
What happened so far:
- Icinga alerts (IRC notifications) around 25/07/2021 03:27 about socket timeouts [1]
- Then escalation: most public-facing services went down
- Analysis started the next morning, on 26/07
- First: status.softwareheritage.org was manually updated to announce the issue on our channels
- Unable to get SSH access to any machine (sshd hung shortly after authentication)
- Used iDRAC and the serial console to access the hypervisor(s) and analyze the trouble
- Around noon, identified the cause chain: ceph spewed lots of logs, which filled up the / disk, which crashed all ceph monitors, which made the RBD disks (used by all/most VMs, including the firewalls) unavailable (diagnostic checks sketched after this list)
- Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation
- Deleted /var/log/ceph.log from all ceph monitors to free disk space
- Restarted the ceph monitors (see the log cleanup sketch after this list)
- Restarted hypervisor3 (around noon), which looked particularly in pain (hence the version discrepancy noticed later)
- Progressively restarted the VMs (see the VM restart sketch after this list)
- This unstuck most services
- Updated status.softwareheritage.org with a partial service disruption notification
- The logs were still dumping too much information though, dangerously close to reproducing the initial issue
- Stopped the workers (see the worker stop sketch after this list)
- for host in {branly,hypervisor3,beaubourg} [3] (see the per-host restart sketch after this list):
  - Cleaned up the voluminous logs
  - Noticed a version discrepancy: 14.2.16 on {branly,beaubourg} vs 14.2.22 on hypervisor3 [4]
  - Restarted ceph-mon@
  - Restarted ceph-osd@*
  - Restarted ceph-mgr@
- ... Investigation continues and service restarts are ongoing
- VMs/services were restarted progressively over the 26-27/07 period, with extra monitoring of the hypervisors' status through a grafana dashboard [5]
- The investigation has not yet identified the root cause of the issue
- The swh status page [6] has not yet been updated with the new status; this should be done tomorrow (28/07).
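For reference, a minimal sketch of the kind of checks used to confirm the diagnosis above (root filesystem full on the ceph monitors, cluster state once the monitors answered again); the exact commands were not recorded during the incident, so this is an after-the-fact reconstruction:

```bash
# How full is the root filesystem on a monitor/hypervisor?
df -h /

# What is eating the space? (the ceph cluster log turned out to be the culprit)
du -sh /var/log/* | sort -h | tail

# Once the monitors are reachable again, check the overall cluster state
ceph -s
ceph health detail
```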
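The log cleanup and monitor restart steps above roughly correspond to the following sketch; the paths are the ones reported in this log, while the use of gzip/scp and the monitor unit instance name are assumptions:

```bash
# Archive the huge cluster log off-box for later investigation
gzip -c /var/log/ceph.log > /tmp/ceph.log.gz
scp /tmp/ceph.log.gz saam:/srv/storage/space/logs/ceph.log.gz

# Free the root filesystem on the monitor
rm /var/log/ceph.log

# Restart the monitor daemon on this host (systemd template unit)
systemctl restart "ceph-mon@$(hostname -s)"
```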
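The progressive VM restart could look like the sketch below, assuming the hypervisors run Proxmox VE and its qm CLI; both the Proxmox assumption and the pause between starts are illustrative, not taken from the log:

```bash
# List the VMs known to this hypervisor and their current state
qm list

# Start the stopped VMs one by one, pausing between starts so the freshly
# recovered ceph/RBD backend is not hit by everything at once
for vmid in $(qm list | awk '$3 == "stopped" {print $1}'); do
    qm start "$vmid"
    sleep 60
done
```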
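Stopping the workers presumably means stopping their systemd services on the worker machines so they stop adding load while ceph recovers; the swh-worker@* unit name below is purely illustrative, as the log does not name the services:

```bash
# Stop all worker service instances on this machine (unit name illustrative)
systemctl stop "swh-worker@*"

# Verify nothing is left running
systemctl list-units "swh-worker@*" --no-legend
```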
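The per-host cleanup/restart loop above, expanded into a runnable sketch; the ssh wrapper and the unit instance names are assumptions, and `ceph versions` is where the 14.2.16 vs 14.2.22 discrepancy shows up:

```bash
for host in branly hypervisor3 beaubourg; do
    ssh "$host" '
        # Drop the voluminous log so / does not fill up again
        rm -f /var/log/ceph.log

        # Restart the ceph daemons hosted on this hypervisor
        systemctl restart "ceph-mon@$(hostname -s)"
        systemctl restart "ceph-osd@*"      # all OSD instances on this host
        systemctl restart "ceph-mgr@$(hostname -s)"
    '
done

# Daemon versions across the cluster; during the incident this showed
# 14.2.16 on branly/beaubourg vs 14.2.22 on hypervisor3
ceph versions
```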
[1] migrated/migration$1092
[3] The main hypervisors our infrastructure relies upon
[4] The most likely fix happened when restarting ceph-osd@*, which somehow dropped the replaying instructions that were dumping lots of log errors.
[5] https://grafana.softwareheritage.org/goto/Z9UD7sW7z?orgId=1
Migrated from T3444 (view on Phabricator)