SLO/MariaDB
Status: draft
Organizational
Service
MariaDB. We are currently only building SLOs for core databases (sX and xX).
Teams
Data persistence
Architectural
Environmental dependencies
The service runs in two datacenters on roughly 250 hosts. More details can be found in MariaDB#Sections and shards.
Service dependencies
Hard dependencies: server hardware, network, infra-layer software, etc.
Soft dependencies: most development teams, especially those in the MW engineering group.
Client-facing
Clients
- MediaWiki relies on the MariaDB databases for all operations.
- Many services rely on databases in the misc cluster:
  - Phabricator
  - Mailman
  - Znuny
  - toolhub
  - ...
Request Classes
- Web-based requests (using wikiuser):
  - Read operations
  - Write operations
- Maintenance script requests (using wikiadmin), also known as CLI requests:
  - Read operations
  - Write operations
Service Level Indicators (SLIs)
Mostly inspired by the book Database Reliability Engineering.
- Percentage of all returned queries that returned with an error (killed for being slow, syntax error, connection issues, etc.).
- Average time spent in each MediaWiki request (to index.php) waiting for databases to return query results.
- Read-only time of each section (or: number of write queries affected by read-only time, estimated by measuring the drop in queries)
- Number of queries done divided by CapEx
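As a rough illustration of how the first and third indicators above could be computed, here is a minimal Python sketch. The names `QueryCounts`, `error_percentage` and `estimated_writes_affected` are hypothetical, as are the input figures; the sketch assumes query and error counters are already available from some metrics source and does not describe how they are actually collected.

```python
from dataclasses import dataclass


@dataclass
class QueryCounts:
    """Hypothetical counters for one section over one measurement window."""
    total: int   # all queries returned, successful or not
    errors: int  # queries returned with an error (killed, syntax, connection, ...)


def error_percentage(counts: QueryCounts) -> float:
    """Percentage of returned queries that returned with an error."""
    if counts.total == 0:
        # Assumption: an idle window counts as error-free rather than undefined.
        return 0.0
    return 100.0 * counts.errors / counts.total


def estimated_writes_affected(
    baseline_writes_per_sec: float,
    observed_writes_per_sec: float,
    read_only_seconds: float,
) -> float:
    """Estimate write queries affected by read-only time from the drop in write rate."""
    # Clamp at zero so a window busier than the baseline does not yield a negative estimate.
    drop = max(baseline_writes_per_sec - observed_writes_per_sec, 0.0)
    return drop * read_only_seconds


if __name__ == "__main__":
    window = QueryCounts(total=200_000, errors=50)
    print(f"error percentage: {error_percentage(window):.3f}%")
    print(f"writes affected: {estimated_writes_affected(120.0, 0.0, 90.0):.0f}")
```

The baseline write rate is the weak point of the estimate: it has to come from a comparable period (same time of day, same section), otherwise the computed drop reflects normal traffic variation rather than read-only time.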
Operational
Monitoring
There are all-hands pages for these failure scenarios:
- Any replica or primary that is serving traffic going down.
- Any replica or primary that is serving traffic having replication lag above a certain threshold.
There are IRC alerts/warnings for lower priority issues. Any IRC alert, for a condition that threatens to violate the SLO if uncorrected, also has a paging alert at a higher threshold.
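The two-tier scheme described above (an IRC alert at a lower threshold, a page at a higher one) can be sketched as follows for replication lag. This is only an illustration: the function `classify_lag`, the `Severity` enum and both threshold values are hypothetical, and the rule that hosts not serving traffic never page is an assumption of this sketch, not a statement of the actual alerting configuration.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    IRC = "irc"    # lower-priority alert or warning
    PAGE = "page"  # all-hands page


def classify_lag(
    lag_seconds: float,
    serving_traffic: bool,
    irc_threshold: float = 5.0,    # hypothetical value
    page_threshold: float = 30.0,  # hypothetical value
) -> Severity:
    """Map a host's replication lag to an alert severity."""
    if page_threshold <= irc_threshold:
        raise ValueError("page_threshold must be higher than irc_threshold")
    if lag_seconds >= page_threshold and serving_traffic:
        return Severity.PAGE
    if lag_seconds >= irc_threshold:
        # Assumption: hosts not serving traffic get an IRC alert at most.
        return Severity.IRC
    return Severity.OK


if __name__ == "__main__":
    for lag, serving in [(2.0, True), (10.0, True), (45.0, True), (45.0, False)]:
        print(lag, serving, classify_lag(lag, serving).value)
```

Keeping both thresholds in one function makes the invariant explicit: the paging threshold is always the higher of the two, so a page is always preceded by the IRC alert for the same condition.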
Troubleshooting
Many failure scenarios are covered under MariaDB/troubleshooting. The common cases account for the vast majority of issues, but some require in-depth investigation. Further automation is helping us reduce the complexity of troubleshooting in many cases (e.g. by simply recloning the host from another replica).
Deployment
Debian packaging
Service Level Objectives
Realistic targets
What are the realistic targets for each SLI? Why?
Ideal targets
What are the ideal targets for each SLI? Why?
Reconciliation
Reconcile the realistic vs. ideal targets, documenting any decisions made along the way.
Once the SLO is final, consider collapsing the above three sections.
What are the agreed-upon SLOs, for each SLI and each request class?