
26/07/2021: Infrastructure outage, now unstuck; post-mortem to follow

The infrastructure is unstuck.

What happened so far:

  • Icinga alerts (IRC notifications) around 03:27 on 25/07/2021 about socket timeouts [1]

  • The issue then escalated and most public-facing services went down

  • Analysis started the next morning (26/07)

  • First, status.softwareheritage.org was manually updated to announce the issue on our communication channels

  • Unable to get SSH access to any machine (sshd was hanging shortly after authentication)

  • Used iDRAC connections and the serial console to access the hypervisor(s) and investigate (console sketch below)

  • Around noon, Ceph was identified as the culprit: it was spewing lots of logs, which filled the disk on /, which crashed all Ceph monitors, which in turn made the RBD disks (used for all/most VMs, including the firewalls) unavailable (disk-usage sketch below)

  • Copied the huge /var/log/ceph.log file to saam:/srv/storage/space/logs/ceph.log.gz for further investigation

  • Deleted /var/log/ceph.log from all Ceph monitors to free disk space

  • Restarted the Ceph monitors (sketch below)

  • Restarted hypervisor3 (around noon), which looked particularly unhealthy (hence the version discrepancy noted later)

  • Progressive restart of the VMs (sketch below)

  • This unstuck most services

  • Updated status.softwareheritage.org with a partial service disruption notification

  • Logs were still dumping too much information though, leaving us dangerously close to a repeat of the initial issue

  • Stopped the workers (sketch below)

  • For each host in {branly,hypervisor3,beaubourg} [3] (loop sketch below):

    • Cleaned up the voluminous logs
    • Noticed a version discrepancy: 14.2.16 on {branly,beaubourg} vs. 14.2.22 on hypervisor3 [4]
    • Restarted ceph-mon@
    • Restarted ceph-osd@*
    • Restarted ceph-mgr@
  • ... Investigation continues and service restarts are ongoing

  • VMs/services were restarted progressively over the 26-27/07 period, with extra monitoring of the hypervisor statuses through a Grafana dashboard [5]

  • The investigation has not yet identified the source of the issue

  • The SWH status page [6] has not yet been updated with the new status; this should be done tomorrow (28/07).

  • [1] migrated/migration$1092

  • [2] https://branly.internal.softwareheritage.org:8006/

  • [3] The main hypervisors our infrastructure relies upon

  • [4] The most likely fix happened when restarting ceph-osd@*, which somehow dropped the replaying instructions that were dumping lots of log errors.

  • [5] https://grafana.softwareheritage.org/goto/Z9UD7sW7z?orgId=1

  • [6] https://status.softwareheritage.org/
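
Console sketch (for the iDRAC/serial console step above): with sshd hanging, one way in is the serial-over-LAN console exposed by each machine's iDRAC/BMC, via ipmitool. This is a minimal sketch only; the iDRAC hostname and user below are placeholders, not our actual naming scheme or credentials.

    # Open a serial-over-LAN console on a hypervisor whose sshd hangs,
    # going through its iDRAC/BMC (hostname and user are placeholders)
    ipmitool -I lanplus -H hypervisor3-idrac.internal.example -U console -a sol activate
    # Detach with the ~. escape sequence; a stale session can be cleared with:
    # ipmitool -I lanplus -H hypervisor3-idrac.internal.example -U console -a sol deactivate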
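
Disk-usage sketch (for the "logs filled / " diagnosis above): the kind of quick checks that point at /var/log/ceph.log filling the root filesystem. Nothing here is specific to our setup beyond the paths already mentioned in the timeline.

    # Confirm the root filesystem is full and find what is eating it
    df -h /
    du -xh --max-depth=2 /var | sort -h | tail -n 10
    ls -lh /var/log/ceph.log          # the culprit in this incident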
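
Monitor-recovery sketch (for the copy / delete / restart steps above): how those three bullets translate to commands on one Ceph monitor. The paths are the ones given in the timeline; the assumption that the ceph-mon systemd instance is named after the short hostname is the usual Ceph convention, to be verified first with systemctl list-units 'ceph-mon@*'.

    # Preserve the huge log off-host (compressed on the fly), then reclaim the space
    gzip -c /var/log/ceph.log | ssh saam 'cat > /srv/storage/space/logs/ceph.log.gz'
    rm /var/log/ceph.log
    # Bring the monitor back; the instance name is assumed to be the short hostname
    systemctl restart ceph-mon@"$(hostname -s)"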
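
VM-restart sketch (for the progressive restart of the VMs): reference [2] points at a web UI on port 8006, which suggests Proxmox VE; assuming that, VMs can also be brought back one by one from the hypervisor shell with the qm tool. The VM id below is illustrative only.

    # List VMs still stopped on this hypervisor, then start them one at a time,
    # checking each service before moving on (123 is an illustrative VM id)
    qm list | awk '$3 == "stopped"'
    qm start 123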
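
Worker-stop sketch (for the "stopped the workers" step): a sketch assuming the workers run as systemd template units matching a swh-worker@* pattern on the worker hosts; the unit name is an assumption, so substitute whatever the real units are called.

    # On each worker host: stop all worker instances so they do not pile load
    # back onto the still-fragile cluster (unit pattern is an assumption)
    systemctl stop 'swh-worker@*'
    systemctl list-units --no-legend 'swh-worker@*'   # verify nothing is left running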
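
Loop sketch (for the per-host cleanup and restarts): the cleanup, version check, and daemon restarts expressed as one bash loop run from an admin machine, assuming root SSH access to the three hypervisors. The log-cleanup line is an example only (keep whatever still needs investigating), and the ceph-mon/ceph-mgr instance names are assumed to be the short hostnames.

    for host in branly hypervisor3 beaubourg; do
        ssh root@"$host" '
            ceph --version                                # where 14.2.16 vs 14.2.22 showed up
            rm -f /var/log/ceph.log                       # example cleanup only
            systemctl restart ceph-mon@"$(hostname -s)" ceph-mgr@"$(hostname -s)"
            systemctl restart "ceph-osd@*"                # systemctl matches the glob against loaded units
        '
    done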


Migrated from T3444 (view on Phabricator)
