Reliable monitoring of services: for users and for admins
This meta-task tracks activities geared towards building reliable monitoring indicators of the Software Heritage services, both for our users and for our own admins. Every service disruption should be tracked clearly, so that we avoid messages on IRC saying "oh, it's normal that XXX does not work, we did YYY some weeks ago".
On the user side, this includes:
- ensuring that status.softwareheritage.org always faithfully represents the operational status of the infrastructure
- refining, if necessary, the list of services shown on status.softwareheritage.org
- clearly announcing, ahead of time, scheduled downtime and changes to the APIs/WebApp or any other user-visible feature
- adding missing monitoring points as needed
On the admin side, we also need to clearly identify the key indicators we want to follow, and reduce the noise in them: these indicators should all be green during normal operation, and should only show alerts that are meaningful and require intervention.
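
To make the "all green during normal operation" goal concrete, here is a minimal sketch of how key indicators could be separated from noisy checks. Everything in it is an assumption made for illustration: the `Indicator` type, the `key` flag, the state names and the example indicator names are hypothetical and do not describe our actual monitoring setup.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Indicator:
    name: str    # hypothetical indicator name
    state: State
    key: bool    # True if this is one of the key indicators admins follow


def alerts_requiring_intervention(indicators):
    """Return the key indicators that are currently not OK."""
    return [i for i in indicators if i.key and i.state is not State.OK]


def dashboard_is_green(indicators):
    """The admin view is green when no key indicator is alerting."""
    return not alerts_requiring_intervention(indicators)


if __name__ == "__main__":
    indicators = [
        Indicator("webapp", State.OK, key=True),
        # A known-noisy check is kept out of the key set, so it cannot
        # turn the admin view red on its own.
        Indicator("noisy-disk-check", State.WARNING, key=False),
    ]
    print(dashboard_is_green(indicators))
```

The design choice illustrated here is that noise is reduced by deciding explicitly which indicators are key, rather than by ignoring alerts after the fact: anything flagged as key that is not OK is, by definition, something an admin should act on.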
Migrated from T3129 (view on Phabricator)