|Allows humans to contact SRE about urgent emergencies.|
|Puppet classes|klaxon, profile::klaxon|
Klaxon is a simple web application that allows Wikimedia Foundation staff, as well as other trusted contributors, to manually notify SRE about outages and other emergency situations.
It can be accessed at https://klaxon.wikimedia.org/.
SREs and other technical contributors who wish to contribute to Klaxon should also see Klaxon/Administration.
What kinds of emergencies should I use Klaxon for?
- Outages that affect many users, demand an urgent response from SRE, and aren't already known to the SRE team.
- The compromise of credentials for shell accounts or accounts with NDA access.
- A security vulnerability that is being actively exploited on WMF-run sites or infrastructure.
What shouldn't I use Klaxon for?
- Issues where automated monitoring has already paged SRE. (This is visible in Klaxon itself.)
- Issues that are not urgent, or that can wait for business hours to be handled. (File a Phabricator ticket instead.)
- Contacting someone other than SRE.
Who receives pages submitted to Klaxon?
Klaxon is a thin web application on top of an API provided by the Splunk On-Call service (previously known as VictorOps). This is the same service that the SRE team uses to receive push notifications, SMS messages, and phone calls when automated monitoring notices an issue, so pages submitted through Klaxon reach SRE through that service.
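To illustrate how a web application can sit on top of the Splunk On-Call API, the sketch below builds an alert body in the shape accepted by Splunk On-Call's generic REST endpoint integration. This is not Klaxon's actual code. Whether Klaxon uses this particular integration is an assumption, and the function name `build_page`, the `klaxon-` entity prefix, and the example values are all hypothetical.

```python
import json
import time


def build_page(summary: str, description: str, submitter: str) -> dict:
    """Build an alert body shaped like a Splunk On-Call REST integration message.

    Assumption: Klaxon's real payload may differ; this only illustrates the idea.
    """
    if not summary.strip():
        raise ValueError("a page needs a non-empty summary")
    now = int(time.time())
    return {
        "message_type": "CRITICAL",  # CRITICAL messages open an incident
        "entity_id": f"klaxon-{now}",  # hypothetical naming scheme
        "entity_display_name": summary.strip(),
        "state_message": f"{description.strip()}\n\nSubmitted by: {submitter}",
        "state_start_time": now,
        "monitoring_tool": "klaxon",
    }


if __name__ == "__main__":
    page = build_page(
        "Logins failing on all wikis",
        "See the linked Phabricator task for details.",
        "example-user",
    )
    # Nothing is sent anywhere; the body is only printed for inspection.
    print(json.dumps(page, indent=2))
```

The sketch deliberately stops before any network call, since the endpoint URL and credentials are deployment details not covered here.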
Who is allowed to send pages using Klaxon?
- The wmf group, for Wikimedia Foundation staff
- The wmde group, for Wikimedia Deutschland staff
- The nda group, for volunteer contributors who have signed a non-disclosure agreement
- The ops group, people who have root in production
Eventually, we will likely expand this to anyone who has a shell account or deployment access to MediaWiki or other services (which often overlaps with one of the above groups, but doesn't always).
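The group rule above amounts to a set-membership check. The following is a minimal sketch of that check, not Klaxon's implementation. The function name `may_send_page`, and the idea that a user's groups arrive as a list of strings, are assumptions made for illustration.

```python
# Groups listed above as allowed to send pages.
ALLOWED_GROUPS = frozenset({"wmf", "wmde", "nda", "ops"})


def may_send_page(user_groups) -> bool:
    """Return True if the user belongs to at least one allowed group."""
    return not ALLOWED_GROUPS.isdisjoint(user_groups)


if __name__ == "__main__":
    print(may_send_page(["nda", "some-other-group"]))  # True
    print(may_send_page(["some-other-group"]))         # False
```

Keeping the allowed groups in a single set means that a later expansion, such as the one described above, would only require adding entries to it.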
Should I ever put confidential data or sensitive information in Klaxon?
If you need to share PII as part of reporting an outage – even if it is just your own IP address – open a WMF-NDA task with the details. If you don't have permission to do that, open a security issue instead.
If you need to urgently report a security issue being actively exploited, open a security issue with the details.
You can then refer to those task numbers within Klaxon.
Klaxon is hosted on Wikimedia infrastructure, and relies upon our SSO service also hosted there – isn't that a problem?
We believe our automated monitoring (which includes externally-hosted meta-monitoring) is more than sufficient to detect issues on the scale of "an entire datacenter went offline" or "lots of critical infrastructure suffered a hard failure".
Klaxon is not intended as a substitute for other kinds of defenses-in-depth; rather, it is intended to allow trusted users to easily escalate urgent issues which fell through the cracks of automated monitoring (which is invariably imperfect).
Klaxon looks not entirely unlike a status page – should it be used for that purpose?
While Klaxon is a quick way to check whether SRE has been paged recently, it is not a proper user-facing status page. For a user-facing status page, see Wikimediastatus.net.
For one thing, there's a complicated, not-completely-overlapping relationship between pages, automated alerts, and user-affecting incidents – an incident can exist without any pages ever occurring, and vice versa. Moreover, Klaxon displays machine-generated alert summaries, which are often difficult to interpret even for the SRE team themselves. And finally, a true status page would have to be hosted externally, not on WMF networks and infrastructure.
In order to produce a proper user-facing status page, much more work would be needed – not just technical work, but also process work. This is out of scope for Klaxon, but future work will hopefully follow soon.
Why do you keep talking about "pages" and "paging"?
The term originates from so-called pager devices (also known as 'beepers'). Unfortunately, like many computing terms, "page" is overloaded with multiple meanings.