SLO/MariaDB

Status: draft

Organizational

Service

MariaDB. We are currently only building SLOs for core databases (sX and xX).

Teams

Data persistence

Architectural

Environmental dependencies

Diagram of sections and how they receive reads and writes

The service runs in 2 datacenters on roughly 250 hosts. More details can be found in MariaDB#Sections and shards.

Service dependencies

Hard dependencies: server hardware, network, infra-layer software, etc.

Soft dependencies: most dev teams, especially teams in the MW engineering group.

Client-facing

Clients

MediaWiki relies on the MariaDB databases for all operations.
Many services rely on databases in the misc cluster:
- Phabricator
- Mailman
- Znuny
- toolhub
- ...

Request Classes

Web-based requests (using wikiuser):
- Read operations
- Write operations
Maintenance script (CLI) requests (using wikiadmin):
- Read operations
- Write operations

Service Level Indicators (SLIs)

Mostly inspired by the book Database Reliability Engineering.

Percentage of queries that return an error (killed for being slow, syntax error, connection issues, etc.) out of all queries returned.
Average time each MediaWiki request (to index.php) spends waiting for databases to return query results.
Read-only time of each section (or: number of write queries affected by read-only time, estimated by measuring the drop in queries).
Number of queries served divided by CapEx.
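
The indicators above can be sketched as simple calculations. This is a minimal illustration only: the function names, the `QueryStats` shape, and the assumption that counters are already aggregated per measurement window are hypothetical and do not describe the actual metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    """Aggregated query counters for one measurement window (hypothetical shape)."""
    total_queries: int
    errored_queries: int


def error_percentage(stats: QueryStats) -> float:
    """Percentage of queries that returned an error, out of all queries returned."""
    if stats.total_queries == 0:
        return 0.0
    return 100.0 * stats.errored_queries / stats.total_queries


def average_db_wait(wait_seconds_per_request: list[float]) -> float:
    """Mean time (seconds) a request spent waiting on database queries."""
    if not wait_seconds_per_request:
        return 0.0
    return sum(wait_seconds_per_request) / len(wait_seconds_per_request)


def estimated_writes_affected(
    baseline_writes_per_second: float,
    observed_writes_per_second: float,
    read_only_seconds: float,
) -> float:
    """Estimate affected write queries from the drop in write rate
    during a read-only period (assumes a steady baseline rate)."""
    drop = max(baseline_writes_per_second - observed_writes_per_second, 0.0)
    return drop * read_only_seconds


def queries_per_capex(total_queries: int, capex: float) -> float:
    """Number of queries served divided by CapEx (any consistent currency unit)."""
    if capex <= 0:
        raise ValueError("capex must be positive")
    return total_queries / capex
```

Each indicator would be computed separately per request class (web reads, web writes, CLI reads, CLI writes) so that a target can later be set for each class.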

Operational

Monitoring

There are all-hands pages for these failure scenarios:

Any replica or primary that is serving traffic going down.
Any replica or primary that is serving traffic having replication lag above a certain threshold.

There are IRC alerts/warnings for lower priority issues. Any IRC alert, for a condition that threatens to violate the SLO if uncorrected, also has a paging alert at a higher threshold.
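
The two-tier scheme described above (an IRC alert at a lower threshold, a page at a higher one) can be sketched as follows for replication lag. The function name and the threshold parameters are hypothetical; this page does not state the real threshold values, so they are passed in rather than hard-coded.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    IRC = "irc"    # lower-priority alert or warning
    PAGE = "page"  # all-hands page


def classify_replication_lag(
    lag_seconds: float,
    irc_threshold_seconds: float,
    page_threshold_seconds: float,
) -> AlertLevel:
    """Map a replication lag measurement to an alert level.

    The paging threshold must be higher than the IRC threshold, so an
    uncorrected IRC-level condition escalates to a page as it worsens.
    """
    if page_threshold_seconds <= irc_threshold_seconds:
        raise ValueError("page threshold must be above the IRC threshold")
    if lag_seconds >= page_threshold_seconds:
        return AlertLevel.PAGE
    if lag_seconds >= irc_threshold_seconds:
        return AlertLevel.IRC
    return AlertLevel.OK
```

For example, with illustrative thresholds of 5 and 10 seconds, a lag of 7 seconds would raise only an IRC alert, while a lag of 12 seconds would page.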

Troubleshooting

Many failure scenarios are covered under MariaDB/troubleshooting. Common cases account for the vast majority of issues, but some require in-depth investigation. Further automation is helping us reduce the complexity of troubleshooting in many cases (e.g. by just cloning the host from another replica).

Deployment

Debian packaging

Service Level Objectives

Instructions

Realistic targets

What are the realistic targets for each SLI? Why?

Ideal targets

What are the ideal targets for each SLI? Why?

Reconciliation

Reconcile the realistic vs. ideal targets, documenting any decisions made along the way.

Once the SLO is final, consider collapsing the above three sections.

What are the agreed-upon SLOs, for each SLI and each request class?