Wikimedia Cloud Services team/Alerts

For any new alert runbook, create a runbook page in the runbook category of the specific service instead (see Portal:Cloud_VPS/Admin/Runbooks or Portal:Toolforge/Admin/Runbooks).

This page describes the WMCS team's alerting system and procedures. All of the team's alerts go through this same alerting system.

Current alerts

You can see the list of current alerts being triggered in two places:

Old alerts

You can check the old alerts in this logstash dashboard.

Ideal scenario

Ideally, every alert has two links:

  • one called 'runbook', pointing to the detailed steps for handling that specific alert and for troubleshooting and fixing the issue that triggered it (more info in Portal:Cloud_VPS/Admin/Runbooks)
  • one called 'dashboard', pointing to a relevant Grafana dashboard
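
As a minimal sketch of the convention above, the following Python checks whether an alert carries both links. It assumes the links are exposed as annotations named `runbook` and `dashboard` on an Alertmanager-style alert dictionary. That layout, the `missing_links` helper, and the sample alert are illustrative assumptions, not taken from the WMCS configuration.

```python
REQUIRED_LINKS = ("runbook", "dashboard")


def missing_links(alert):
    """Return the names of the required links that the alert lacks."""
    # Assumption: links live under the alert's "annotations" mapping.
    annotations = alert.get("annotations", {})
    return [name for name in REQUIRED_LINKS if not annotations.get(name)]


# Hypothetical alert, for illustration only.
example_alert = {
    "labels": {"alertname": "ExampleAlert", "team": "wmcs"},
    "annotations": {"runbook": "https://example.org/runbook"},
}
print(missing_links(example_alert))
```

A check like this could be run over the list of firing alerts to find the ones that still need a runbook or a dashboard.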

Current status (when writing this)

We are currently transitioning to a Prometheus + Alertmanager only setup, but some alerts still live in the old Icinga instance. As a result, some alerts still lack a 'runbook' link, a 'dashboard' link, or both.

Handling a page

If you are close to a laptop or another device with a browser

When you get paged, the first thing to do is to go to the Wikimedia alertmanager dashboard.

From there you can see all the alerts that are firing. Ack the one or ones that are paging either by clicking the tick mark on the alert, or by creating a silence with a comment that starts with "ACK! " followed by some information, such as the task number you are using to track the incident.
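
The silence-based ack can be sketched as a payload in the shape accepted by the Alertmanager v2 silences endpoint (`POST /api/v2/silences`). This is only an illustration: the `build_ack_silence` helper, matching on `alertname` alone, the default duration, and the placeholder task ID are all assumptions, and nothing here is sent to a real server.

```python
import json
from datetime import datetime, timedelta, timezone


def build_ack_silence(alertname, author, task, hours=4, now=None):
    """Build a silence payload whose comment starts with "ACK! "."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Assumption: silencing by alert name only; real silences may
        # need extra matchers (instance, project, ...) to stay narrow.
        "matchers": [
            {
                "name": "alertname",
                "value": alertname,
                "isRegex": False,
                "isEqual": True,
            }
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": f"ACK! Tracking in {task}",
    }


# Placeholder values, for illustration only.
payload = build_ack_silence("ExampleAlert", "someone", "T000000")
print(json.dumps(payload, indent=2))
```

Keeping the "ACK! " prefix at the very start of the comment is what matters here; the rest of the comment is free text for whoever looks at the alert next.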

If you are not close to a laptop

If you are sure you will be able to handle it shortly, you can ack the page in the Splunk mobile app or via SMS.

Otherwise, you can just let the paging system wake up the next person in the rotation to handle the incident.

NOTE: If you acked the alert in Splunk, remember to also resolve the incident there; otherwise it will page again in 24h if the alert is still firing.

Special temporary case: if the alert has the label 'source=icinga'

Some alerts still page through Icinga. If the alert that is paging you has the 'source=icinga' label, you have to go to Icinga to silence it. To do so:

  • click on the elapsed time collapsible (to the lower left in the alert box)
  • click on the '' link, that will open the icinga page for that alert
  • in icinga, click 'Schedule downtime for this service'
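
The decision described in this section can be sketched as a small check on the alert's labels. The Alertmanager-style `labels` dictionary and the `silence_location` helper are assumptions for illustration; only the 'source=icinga' label comes from this page.

```python
def silence_location(alert):
    """Return where the alert has to be silenced: "icinga" or "alertmanager"."""
    labels = alert.get("labels", {})
    # Alerts forwarded from Icinga carry the label source=icinga.
    if labels.get("source") == "icinga":
        return "icinga"
    return "alertmanager"


# Hypothetical alerts, for illustration only.
print(silence_location({"labels": {"alertname": "A", "source": "icinga"}}))
print(silence_location({"labels": {"alertname": "B", "source": "prometheus"}}))
```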

Related Links