SLO/EditCheck
Status: draft
Organizational
Service
The Edit Check framework is a component within Visual Editor. It consists of:
- an API for creating "checks" that examine a user's in-progress edit and provide feedback to them;
- a display layer that shows those checks to users during the edit and before save;
- a core set of checks that are built in to the component; and
- a system for tagging the edits in which Edit Checks were shown.
Teams
The Editing team is responsible for the development and maintenance of the Edit Check codebase.
The Site Reliability Engineering team is responsible for maintaining Edit Check instances running on production infrastructure.
The Release Engineering team is responsible for deploying updates to Edit Check on production infrastructure.
Architectural
Environmental dependencies
VisualEditor, running on MediaWiki, is the specific platform within which Edit Check operates.
Parsoid issues could in some cases affect Edit Check specifically (not just VisualEditor in general).
Browser ecosystem issues could prevent Edit Check from displaying or operating correctly (in ways that might not affect VisualEditor in general). Note that issues with unsupported browsers should be filtered from Edit Check SLIs.
Community-editable wiki configuration could prevent Edit Check from displaying or operating correctly (in ways that might not affect VisualEditor in general); this should be excluded from Edit Check SLIs.
Service dependencies
Edit Check makes HTTP calls to MediaWiki core services in general and Parsoid specifically.
The SpamBlacklist extension is called to test whether URLs are blocked.
AbuseFilter rules will be loaded and transformed for use in the browser, to evaluate text prior to publishing.
The individual core checks could fail without otherwise affecting the Edit Check framework as a whole.
Client-facing
Clients
Feature users: end-user editors, wiki administrators / patrollers
API users: teams who are writing checks (Growth, Moderation Tools), gadget authors on-wiki
Request Classes
There will only be one request class. Unsupported browsers will be included in the SLO. Rationale: The SLI analytics will only happen for browsers that pass VisualEditor’s browser checks, and it seems highly unlikely to us that an error would turn on whether such a browser were on the “supported” list or not. If that were happening at a high rate, Editing would probably want to, and be able to, fix the issue anyway. So for simplicity we will include any browser that gets to the point of running the analytics code.
Service Level Indicators (SLIs)
Availability SLI: The percentage of all VisualEditor edit requests in which Edit Check completes successfully, either showing one or more checks or determining that no check should be shown. In terms of counters, this is equivalent to:
(preSaveChecksShown + preSaveChecksNotShown) / preSaveChecksAvailable
If the check is shown but the user cancels the edit (incrementing preSaveChecksAbandoned) or closes the tab (not incrementing any counter), Edit Check is still considered available.
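As an illustration, the availability SLI can be sketched from the counters as below. This is a minimal sketch, not the production implementation. It assumes that preSaveChecksAvailable counts every eligible VisualEditor edit request, that preSaveChecksShown and preSaveChecksNotShown count the two successful outcomes, and that abandoned edits were already counted as shown. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditCheckCounters:
    """Counter totals for one measurement window (hypothetical container)."""
    pre_save_checks_available: int      # assumed: all eligible edit requests
    pre_save_checks_shown: int          # assumed: one or more checks were shown
    pre_save_checks_not_shown: int      # assumed: Edit Check decided to show nothing
    pre_save_checks_abandoned: int = 0  # user cancelled; not counted against availability


def availability_sli(c: EditCheckCounters) -> Optional[float]:
    """Return the fraction of requests where Edit Check completed, or None with no traffic."""
    if c.pre_save_checks_available == 0:
        return None  # no requests in the window, so the SLI is undefined
    completed = c.pre_save_checks_shown + c.pre_save_checks_not_shown
    return completed / c.pre_save_checks_available


if __name__ == "__main__":
    window = EditCheckCounters(1000, 300, 690, 50)
    print(availability_sli(window))
```

Abandoned edits are carried in the container only for completeness. Under the assumptions above they do not enter the ratio, which matches the statement that Edit Check is still considered available when the user cancels.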
Latency SLI: None, although one may be added in a future version.
Operational
Monitoring
Monitoring will be by Graphite metrics, based on stored data for preSaveChecksAvailable, preSaveChecksShown, preSaveChecksNotShown and preSaveChecksAbandoned.
If feasible, alerts will be issued when the real-time SLI for the past hour falls below 0.95 (95%).
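The alert condition could be evaluated along the following lines. This is a sketch under stated assumptions, not the actual alerting configuration. It assumes the counters are available as per-interval samples covering the past hour, and that the SLI is the share of available requests that ended as either shown or not shown. The function names and the sample layout are hypothetical.

```python
from typing import Iterable, Optional, Tuple

ALERT_THRESHOLD = 0.95  # from the text: alert when the past-hour SLI falls below this

# One sample per interval: (available, shown, not_shown). Hypothetical layout.
Sample = Tuple[int, int, int]


def hourly_sli(samples: Iterable[Sample]) -> Optional[float]:
    """Aggregate the samples for the past hour into a single SLI value."""
    available = shown = not_shown = 0
    for a, s, n in samples:
        available += a
        shown += s
        not_shown += n
    if available == 0:
        return None  # no traffic in the window, so nothing to evaluate
    return (shown + not_shown) / available


def should_alert(samples: Iterable[Sample], threshold: float = ALERT_THRESHOLD) -> bool:
    """Alert only when there was traffic and the SLI is below the threshold."""
    sli = hourly_sli(samples)
    return sli is not None and sli < threshold
```

Summing the counters before dividing weights each interval by its traffic, so a quiet minute with a few failures does not trigger an alert on its own.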
Troubleshooting
A reduction of the SLI would most likely be due to a coding error. Troubleshooting would therefore usually mean Editing fixing the broken code, then deploying the fix via the train or as a backport.
Deployment
Deployment is via the train (or backports). There will also be community configuration in future.
Service Level Objectives
Realistic targets (reasonable worst case)
It’s realistic to suppose that a reduction of the SLI would be due to a coding error, which would require Editing to fix it. Such an error could take a day to notice as it rolls out on the train, and Editing could fix it within a day. So if there’s one case of deploying broken code each quarter, it shouldn’t decrease the SLI by more than one percentage point. This means an SLO of 99% is realistic.
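The arithmetic behind this estimate can be sketched as follows. The sketch assumes a 91-day quarter, traffic spread evenly across days, and an incident lasting two days (one to notice, one to fix). The failing fraction is a hypothetical parameter for the share of requests that fail while the broken code is live; the text gives no such figure.

```python
QUARTER_DAYS = 91  # assumed quarter length in days


def sli_reduction(days_affected: float, failing_fraction: float,
                  quarter_days: float = QUARTER_DAYS) -> float:
    """Quarterly SLI loss, assuming traffic is spread evenly across days."""
    if quarter_days <= 0:
        raise ValueError("quarter_days must be positive")
    return (days_affected / quarter_days) * failing_fraction


def max_failing_fraction(budget: float, days_affected: float,
                         quarter_days: float = QUARTER_DAYS) -> float:
    """Largest failing fraction during the incident that stays within the budget."""
    if days_affected <= 0:
        raise ValueError("days_affected must be positive")
    return min(1.0, budget * quarter_days / days_affected)


if __name__ == "__main__":
    # Two affected days with every request failing.
    print(sli_reduction(2, 1.0))
    # Share of requests that may fail over two days within a 1% budget.
    print(max_failing_fraction(0.01, 2))
```

Under these assumptions, a two-day incident in which every request fails would cost about 2.2 percentage points. The one-point figure therefore holds when fewer than roughly 45% of requests fail during the incident, for example because the train reaches wikis in stages.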
Ideal targets (reasonable best case)
The case where an Edit Check failure completely breaks VisualEditor would be fairly catastrophic, whereas the case where Edit Check fails without breaking VisualEditor wouldn’t be too serious. We do not imagine we would end up deploying code that breaks VisualEditor in all cases, but it could happen that, say, code causes VisualEditor to break for certain rare inputs (say, <1%). Edit Check failure would therefore be mostly, but not entirely, decoupled from any yet-to-be-determined VisualEditor SLI. So, assuming we wanted a VisualEditor SLO of something like 99.95%, an Edit Check SLO of 99% would still be appropriate.
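One way to read this reasoning as arithmetic is sketched below. It assumes that the "<1%" figure is the share of Edit Check failures that also break VisualEditor, and that Edit Check uses its whole error budget. Both are interpretations rather than statements from the text, and the function names are hypothetical.

```python
def visual_editor_impact(edit_check_slo: float, breaking_fraction: float) -> float:
    """Upper bound on the VisualEditor failure rate attributable to Edit Check.

    Assumes Edit Check fails for (1 - edit_check_slo) of requests and that
    breaking_fraction of those failures also break VisualEditor.
    """
    return (1.0 - edit_check_slo) * breaking_fraction


def fits_visual_editor_budget(edit_check_slo: float, breaking_fraction: float,
                              visual_editor_slo: float) -> bool:
    """True if the Edit Check contribution fits within the VisualEditor error budget."""
    budget = 1.0 - visual_editor_slo
    return visual_editor_impact(edit_check_slo, breaking_fraction) <= budget


if __name__ == "__main__":
    # 99% Edit Check SLO, 1% of failures breaking VisualEditor, 99.95% VisualEditor SLO.
    print(visual_editor_impact(0.99, 0.01))
    print(fits_visual_editor_budget(0.99, 0.01, 0.9995))
```

With these inputs the Edit Check contribution is about 0.01% of requests, which fits inside the 0.05% error budget implied by a 99.95% VisualEditor SLO. If a much larger share of Edit Check failures broke VisualEditor, the 99% target would no longer fit.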