Wikimedia Cloud Services team/EnhancementProposals/WMCS-SRE-book

The WMCS SRE book is a collection of agreements, best practices and engineering operations workflows that all of us in the team try to follow and apply in our day-to-day work.
You can find here hints and tips for common tasks and situations we face. It is not intended to contain very technical information (like how to restart service X).

Community interaction/communication

This section describes our practices for interacting with the community for some special operations, including but not limited to:

deprecation plans (we are shutting down a service or are no longer supporting a given technology)
expected downtime (we have to do a planned operation in a window)
outage/incident communication (we suffered an unexpected issue that affected our users/community)

Communication channels

We interact with our users using different mechanisms:

Using Phabricator tasks and a number of Phabricator tags/workboards.
On IRC, in the #wikimedia-cloud channel on libera.chat.
Via mailing lists: cloud-announce@lists.wikimedia.org, cloud@lists.wikimedia.org, and cloud-admin@lists.wikimedia.org.
On wiki: News articles on wikitech, talk page messages on wikitech, talk page messages on "home" wikis
Direct email

Because each channel has a different nature (real time, async, group, direct, etc.), they are often used for different purposes.

TODO: introduce here general rules for picking a channel to send a particular type of communication

Deprecation plans

TODO: add information and examples here. Context: Grid migration, Ubuntu deprecation in CloudVPS/Toolforge, Jessie deprecation, etc.

Expected downtime

We often need to perform operations that introduce downtime for our services or otherwise negatively affect our users' experience. This section contains details on how to handle these situations.

General rules:

If the operation is going to cause downtime to users (any amount), announce it to the mailing lists at least 1 week ahead of the window.
When communicating operation windows, be precise about which kinds of downtime users should expect (all services down? network failure? database failure?).
When communicating operation windows, include concrete information on what services are affected (does this affect Cloud VPS but not Toolforge? or the other way around?)
When communicating operation windows, be very explicit about times and dates. Use an international format for dates (YYYY-MM-DD) and the UTC timezone for times.
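The date and time rule above can be sketched in Python. This is only an illustration, not a tool the team uses: the function name format_window, the output layout and the example dates are assumptions made for this sketch.

```python
from datetime import datetime, timedelta, timezone


def format_window(start: datetime, duration: timedelta) -> str:
    """Render an operation window with YYYY-MM-DD dates and UTC times.

    Hypothetical helper: the name and the output layout are assumptions.
    """
    if start.tzinfo is None:
        # A naive datetime is ambiguous, so refuse to guess its timezone.
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    end_utc = start_utc + duration
    fmt = "%Y-%m-%d %H:%M UTC"
    return f"{start_utc.strftime(fmt)} to {end_utc.strftime(fmt)}"


# Made-up window: starts at 16:00 local time in a UTC+2 zone, lasts 90 minutes.
local_start = datetime(2030, 1, 15, 16, 0, tzinfo=timezone(timedelta(hours=2)))
window = format_window(local_start, timedelta(minutes=90))
print(window)
```

Converting to UTC before formatting avoids announcing a window in the sender's local time by accident.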

Recommended timeline:

1 week before the operation window: initial announcement email
1 minute before the operation window: email letting users know the operation is starting
1 minute after the operation ends: optional email letting users know the outcome (not always required)
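As a rough sketch, the recommended timeline can be derived from the start and end of the operation window. The function name communication_timeline and the example window are assumptions made for illustration; only the three offsets come from the list above.

```python
from datetime import datetime, timedelta, timezone


def communication_timeline(window_start: datetime, window_end: datetime):
    """Return (send_time, description) pairs for the recommended emails.

    Hypothetical helper: only the offsets are taken from the timeline above.
    """
    return [
        (window_start - timedelta(weeks=1), "initial announcement email"),
        (window_start - timedelta(minutes=1), "email: the operation is starting"),
        (window_end + timedelta(minutes=1), "optional email: outcome of the operation"),
    ]


# Made-up operation window, expressed in UTC.
start = datetime(2030, 1, 15, 14, 0, tzinfo=timezone.utc)
end = datetime(2030, 1, 15, 15, 30, tzinfo=timezone.utc)

timeline = communication_timeline(start, end)
for when, what in timeline:
    print(when.strftime("%Y-%m-%d %H:%M UTC"), "-", what)
```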

Examples

Example email sent by Andrew regarding an OpenStack upgrade to the cloud-announce mailing list. The email was sent 7 days in advance:

Example email sent by Arturo regarding an operation to reboot several cloudvirt servers, sent to the cloud-announce mailing list. The email was sent 7 days in advance:

Changes tracking

TODO: Do we have internal rules for introducing patches into the several repos we use for work? ops/puppet, dns, etc.

Tickets and work tracking

This section describes our practices for using a ticketing system (Phabricator) to track our work and issues related to our systems and services.

Server Admin Logs

This section describes our practices for recording the operations we do, like changes to servers or services.

Logging an operation to a SAL creates a paper trail of what we do. This improves collaboration in a team that is distributed by nature. It also helps improve the transparency of our operations.

Use !log admin ... in #wikimedia-cloud to log all operations related to the Cloud VPS service in general. This is the SAL for the Cloud VPS service.
Use !log tools ... in #wikimedia-cloud to log all operations related to the Toolforge service. This is the SAL for the Toolforge service.
Use !log ... in #wikimedia-operations to log all operations related to physical hardware. This is the SAL for general operations tasks.
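The mapping above can be summarised as a small lookup. This is a sketch for illustration only: the scope keys ("cloudvps", "toolforge", "hardware"), the function name sal_message and the example log text are assumptions. The channels and !log prefixes are the ones listed above.

```python
# Scope -> (IRC channel, command prefix). The scope keys are hypothetical labels.
SAL_TARGETS = {
    "cloudvps": ("#wikimedia-cloud", "!log admin"),
    "toolforge": ("#wikimedia-cloud", "!log tools"),
    "hardware": ("#wikimedia-operations", "!log"),
}


def sal_message(scope: str, text: str) -> tuple[str, str]:
    """Return the channel to post in and the full !log line for a given scope."""
    if scope not in SAL_TARGETS:
        raise ValueError(f"unknown scope: {scope!r}")
    channel, prefix = SAL_TARGETS[scope]
    return channel, f"{prefix} {text}"


# Made-up log entry for a Toolforge operation.
channel, line = sal_message("toolforge", "restarting an example service")
print(channel, "|", line)
```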

TODO: add information. Describe how we use the different !log mechanisms in the different channels.

Personal availability

This section includes information regarding how we try to coordinate to ensure we always have enough human availability to support our services.

TODO: information on vacations, unavailability communication, etc. Does this fit into this document?

On-call and paging

See also Wikimedia_Cloud_Services_team/Clinic_duties, which covers the clinic duties we do while on call.

TODO: include here all we know about how we coordinate paging/on-call, etc? Does this fit into this document?

Additional notes

Please take these additional notes into consideration.

This document is intended to complement other standard and industry procedures. It does not replace them; it only adds information and tips.
Changes to this document require agreement among the affected people (i.e., the WMCS folks).
We have agreed to, and follow, our Technical Engagement Team Social Norms.