Wikimedia Cloud Services team/Alerts
This page describes the WMCS team's alerting system and procedures. All team alerts go through this same alerting system.
Current alerts
You can see the list of current alerts being triggered in two places:
- Wikimedia-wide alertmanager, filtering by the 'team=wmcs' label.
- VPS alertmanager, filtering by the projects you are interested in.
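As a rough illustration of the label filter, the sketch below builds a query URL for the Alertmanager v2 HTTP API (`GET /api/v2/alerts` with a `filter` matcher). The base URL is a placeholder, and whether the alertmanager instances above expose this API to you directly is an assumption; the web dashboards remain the documented way to view alerts.

```python
from urllib.parse import urlencode


def alerts_url(base_url: str, team: str = "wmcs") -> str:
    """Build an Alertmanager v2 API URL listing active alerts for one team.

    base_url is a placeholder (assumption): use the address of the
    alertmanager instance you actually have access to.
    """
    # 'filter' takes a label matcher such as team="wmcs";
    # 'active=true' keeps alerts that are not silenced or inhibited.
    query = urlencode({"filter": f'team="{team}"', "active": "true"})
    return f"{base_url.rstrip('/')}/api/v2/alerts?{query}"


if __name__ == "__main__":
    # Hypothetical host, for illustration only.
    print(alerts_url("https://alertmanager.example.org"))
```

The function only builds the URL, so it can be pasted into a browser, `curl`, or any HTTP client.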
Old alerts
You can check the old alerts in this logstash dashboard.
Ideal scenario
Ideally, every alert has two links:
- 'runbook', pointing to the detailed steps on how to handle that specific alert, and how to troubleshoot and fix the issue that triggered it. More info in Portal:Cloud_VPS/Admin/Runbooks
- 'dashboard', pointing to a relevant grafana dashboard.
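To see how far the current alerts are from this ideal, a small check like the one below could be run over alerts in the shape returned by the Alertmanager v2 API (a dict with `labels` and `annotations`). This is only a sketch: it assumes the two links are stored as annotations named `runbook` and `dashboard`, which this page does not confirm.

```python
# Annotation names are an assumption based on the link names used on this page.
REQUIRED_LINKS = ("runbook", "dashboard")


def missing_links(alert: dict) -> list:
    """Return the names of the required links that the alert does not carry."""
    annotations = alert.get("annotations", {})
    return [name for name in REQUIRED_LINKS if not annotations.get(name)]


if __name__ == "__main__":
    # Hypothetical alert, for illustration only.
    example = {
        "labels": {"alertname": "ExampleAlert", "team": "wmcs"},
        "annotations": {"runbook": "https://example.org/runbook"},
    }
    print(missing_links(example))
```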
Current status (at the time of writing)
We are currently transitioning to a prometheus + alertmanager only setup, but some alerts still live in the old icinga instance. As a result, some alerts still lack a runbook link, a dashboard link, or both.
Handling a page
If you are close to a laptop or another device with a browser
When you get paged, the first thing to do is to go to the Wikimedia alertmanager dashboard.
From there you can see all the alerts that are firing. You can ack the ones that are paging by clicking the tick mark on the alert, or by creating a silence with a comment that starts with "ACK!" and adding some information there, such as the number of the task you are using to track the incident.
If you are not close to a laptop
If you are sure you will be able to handle it shortly, you can ack the page in the splunk mobile app or via SMS.
Otherwise, you can just let the paging system wake up the next person in the rotation to handle the incident.
NOTE: If you acked the alert on splunk, remember to also resolve the incident on splunk; otherwise it will page again in 24h if the alert is still firing.
Special temporary case: if the alert has the label 'source=icinga'
Some alerts still page through icinga. If the alert that is paging you has the 'source=icinga' label, you will have to go to icinga to silence it. To do so:
- click on the elapsed time collapsible (at the lower left of the alert box)
- click on the 'wikimedia.org' link, which opens the icinga page for that alert
- in icinga, click 'Schedule downtime for this service'
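The two ack paths in this section can be sketched as follows: alerts labelled 'source=icinga' are sent to icinga, and everything else gets an alertmanager silence whose comment starts with "ACK!". The silence body follows the Alertmanager v2 API (`POST /api/v2/silences`), but the function names, the default duration, the choice to match on every label of the alert, and the example task number and user are all illustrative assumptions, not the team's tooling.

```python
from datetime import datetime, timedelta, timezone


def ack_location(alert: dict) -> str:
    """Return where the alert has to be acked: 'icinga' or 'alertmanager'."""
    labels = alert.get("labels", {})
    return "icinga" if labels.get("source") == "icinga" else "alertmanager"


def build_ack_silence(alert: dict, task: str, author: str,
                      hours: int = 4, now: datetime = None) -> dict:
    """Build an Alertmanager v2 silence body with an 'ACK!' comment.

    The 4 hour default and the match-on-all-labels choice are assumptions.
    """
    if ack_location(alert) == "icinga":
        raise ValueError("icinga-sourced alerts must be silenced in icinga")
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # One exact-match matcher per label, so only this alert is silenced.
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(alert.get("labels", {}).items())
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        # The comment starts with "ACK!" and carries the tracking task.
        "comment": f"ACK! Tracking in {task}",
    }


if __name__ == "__main__":
    # Hypothetical alert, task number and user, for illustration only.
    alert = {"labels": {"alertname": "ExampleAlert", "team": "wmcs"}}
    print(build_ack_silence(alert, task="T000000", author="example-user"))
```

The sketch stops at building the payload; sending it requires an HTTP client and access to the alertmanager API, neither of which is covered here.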