SLO/Abstract Wikipedia
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.
Status: draft
Organizational
The Service
The Wikifunctions integration allows Wikipedia editors to embed calls that execute Wikifunctions functions dynamically inside article content, with the results shown to readers of those pages. This SLO defines reliability, latency, and availability targets to ensure a good user experience on Wikipedia pages with Wikifunctions call results embedded in them.
Responsible Teams
The Abstract Wikipedia team is primarily responsible for this project. You can talk to them on Slack in #talk-to-abstract-wikipedia.
Architectural
Environmental Dependencies
The integration into MediaWiki runs as a client/repo system within MediaWiki, similar to Wikibase's. It is called synchronously by Parsoid on page render, and can return either a result or a "pending" state (asynchronous content fragment). A pending state causes Parsoid to set a short TTL and call again later, by which time the request should have completed and the resulting fragment be available.
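The result-or-pending contract described above can be sketched roughly as follows. This is a minimal illustration, not the integration's actual code (which lives in MediaWiki/PHP): the `FragmentResponse` type, the `respond` function, the placeholder markup, and the 60-second TTL are all hypothetical, since the source only says the TTL is "short".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: the source only states that the TTL is "short".
PENDING_TTL_SECONDS = 60


@dataclass(frozen=True)
class FragmentResponse:
    """What the integration hands back to the renderer for one function call."""
    html: str                   # rendered result, or placeholder markup while pending
    pending: bool               # True when the back-end has not finished yet
    ttl_seconds: Optional[int]  # None means "use the normal cache lifetime"


def respond(cached_result: Optional[str]) -> FragmentResponse:
    """Return the finished result if it is cached, otherwise a pending placeholder."""
    if cached_result is not None:
        return FragmentResponse(html=cached_result, pending=False, ttl_seconds=None)
    # Not ready yet: answer immediately with a placeholder and a short TTL so
    # that the renderer asks again later, when the result should be cached.
    return FragmentResponse(
        html="<span>(result pending)</span>",
        pending=True,
        ttl_seconds=PENDING_TTL_SECONDS,
    )
```

The point of the design is that the synchronous call never blocks on the back-end: it either serves a cached result or returns quickly with a placeholder.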
The back-end service, which converts the requests into outputs by running user-written code inside a series of sandboxes, runs as a k8s service, but for the purposes of the main part of this SLO it is considered a hard dependency.
Service Dependencies
Hard dependencies
Parsoid and MediaWiki
- MediaWiki's (Parsoid's) asynchronous content fragments system, and MW's page caching management more generally
- Note: No SLO currently exists for MW overall.
- The Wikifunctions-specific memcached pool accessed by the MW client and repo code
- Note: No SLO currently exists for the main memcached service, nor for the Wikifunctions-specific one.
Wikifunctions Backend Node Services
- The "linker" service, function-orchestrator
- The "executor" service, function-evaluator
Soft dependencies
- Wikifunctions ZObject fetch API (part of the MediaWiki Action API)
- Note: No SLO currently exists for MW overall, nor for the Action API.
- Wikidata entities fetch API (part of the MediaWiki Action API)
- Note: No SLO currently exists for MW overall, nor for the Action API.
- CirrusSearch's Wikidata entity search API (part of the MediaWiki Action API)
- Note: No SLO currently exists for MW overall, nor for the Action API; the SLO for CirrusSearch does not cover its responses to MW Action API requests.
- [Internal] Logstash logging, Prometheus metrics, etc.
Client-facing
Users
Feature users:
- Readers
- Editors who add and create new functions, implementations, tests
Clients
- MediaWiki Parsoid page rendering, for pages on which the content is used
- Human readers of MediaWiki articles on which the content is used
Request Classes
Service Level Indicators (SLIs)
[wip] Those in bold will be the first SLIs to be dispatched to a Pyrra db
For the Integration
- Integration combined latency-availability SLI: The percentage of all requests that complete within the 50 ms threshold and receive a non-error response, defined as above, shall be at least the limit.
- Integration latency SLI, percentiles: The percentile latencies, as measured from Parsoid's perspective, of requests made when rendering a page with a Wikifunctions call misses the parser cache and so triggers a call from Parsoid to Wikifunctions, up to the point the call returns to Parsoid (with a successful result, a system-disabled error, or a placeholder).
- Integration latency SLI, acceptable limit: All such requests should return back to Parsoid within this limit, as measured from the standpoint of Parsoid's calling code.
- Content availability: The minimum percentage of (post-parser-cache) page impressions that trigger a Parsoid parse with at least one Wikifunctions call on the page and are served fully, without placeholders.
- Measured live in Grafana
- MediaWiki system load: Extra MediaWiki update jobs triggered because asynchronous Wikifunctions content was not ready shall represent less than this threshold of total jobs run, including during an outage and recovery from an outage.
- Currently unmeasurable / at 0%
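As an illustration of how the percentile latency SLI could be derived from raw samples, here is a minimal sketch using the nearest-rank method. The sample latencies are made up for the example, and the production measurement (for instance via Prometheus histograms) may well use a different estimator.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies (ms) as seen from the caller's side.
latencies_ms = [2.1, 2.8, 3.0, 3.4, 4.9, 6.2, 11.0, 18.5, 19.9, 47.0]
summary = {
    name: percentile(latencies_ms, p)
    for name, p in [("P50", 50), ("P95", 95), ("P99", 99)]
}
```

With only ten samples, P95 and P99 both land on the slowest request, which is one reason tail percentiles are only meaningful over a large request volume.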
For the Backend
- Back-end API combined latency-availability SLI: The percentage of all requests that complete within the 10 s threshold and receive a non-error response, defined as above, shall be at least the limit.
- Back-end API availability SLI: The percentage of all requests to the back-end API receiving a non-error response, defined as HTTP status code 200 or 4xx, should be at least this threshold.
- Back-end API latency SLI, percentiles: The percentile request latencies, as measured from MediaWiki's perspective, up to the final successful Wikifunctions call's response being ready in the cache.
- Back-end API latency SLI, acceptable limit: All such requests must complete within this limit, measured from MediaWiki's perspective, and shall return a timeout on hitting that limit.
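The back-end availability and combined latency-availability SLIs above can be sketched as simple ratios over a batch of requests. This is a hedged illustration: the `Request` record and function names are hypothetical, and treating "within the threshold" as less-than-or-equal is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned by the back-end API
    duration_s: float  # duration as seen by the caller, in seconds


def is_non_error(status: int) -> bool:
    """Non-error as defined by the availability SLI: HTTP 200 or any 4xx."""
    return status == 200 or 400 <= status <= 499


def _as_list(requests: Iterable[Request]) -> List[Request]:
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests to measure")
    return reqs


def availability_pct(requests: Iterable[Request]) -> float:
    """Percentage of requests that received a non-error response."""
    reqs = _as_list(requests)
    good = sum(1 for r in reqs if is_non_error(r.status))
    return 100.0 * good / len(reqs)


def combined_pct(requests: Iterable[Request], threshold_s: float = 10.0) -> float:
    """Percentage of requests that were both non-error and within the threshold."""
    reqs = _as_list(requests)
    good = sum(
        1 for r in reqs
        if is_non_error(r.status) and r.duration_s <= threshold_s
    )
    return 100.0 * good / len(reqs)


# Made-up sample; a 1 s threshold is used here only to show the two SLIs diverging.
sample = [Request(200, 0.2), Request(404, 0.1), Request(500, 0.3), Request(200, 2.5)]
```

On this sample, `availability_pct(sample)` counts three of four requests as good, while `combined_pct(sample, threshold_s=1.0)` counts only two, because the slow 200 response fails the latency condition.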
Operational
Monitoring
The team practices Chores, in which failure logs and metrics are monitored on a daily, rotational basis. The microservice receives requests for /_info from a blackbox probe, which will page SRE if the service is unreachable or otherwise not responding.
- Metrics
- Logging
- Tracing
- Jaeger UI is used to visualize full request cycle
- Alerting
- AlertManager fires alerts to the #aw-alerts Slack channel using Alert Rules on Grafana
- Other
- A convenience script, check-wf-service.sh, found in the deployment-charts repo, which indicates service health
Troubleshooting
- Function requests are stateless, so moderately simple to debug, though for non-trivial issues some familiarity is generally needed.
- Requests may return with HTTP 200 responses whilst having an 'error' state internally, hampering external team review and debugging.
- The orchestrator service maintains a cache of relevant objects, which may hinder debugging.
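Because an HTTP 200 can still carry an internal error, a debugging helper has to look inside the response body as well as at the status code. The sketch below illustrates that idea only: the JSON shape and the `error` key are hypothetical stand-ins, not the actual Wikifunctions response format.

```python
import json


def classify(status: int, body: str, error_key: str = "error") -> str:
    """Classify a response as 'http-error', 'malformed', 'function-error', or 'ok'.

    The payload layout is an assumption made for illustration: a JSON object
    that carries a truthy value under `error_key` when the call failed.
    """
    if status != 200:
        return "http-error"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "malformed"
    if isinstance(payload, dict) and payload.get(error_key):
        return "function-error"
    return "ok"
```

For example, `classify(200, '{"error": "timeout in evaluator"}')` is reported as a function error even though the transport succeeded, which is the case that status-code-only monitoring would miss.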
Deployment
- Service updates are through the standard k8s service deployment with helm charts.
- The team has a practice of deploying updates to the service's images once a week, as a shared responsibility across the engineering function.
- Middle-ware (PHP) and front-end (JS) code rides the MediaWiki train in the normal fashion
Service Level Objectives
Realistic targets
| SLI | Target | Current |
|---|---|---|
| Integration latency SLI, percentiles | P50: ≤ 5 ms; P95: ≤ 20 ms; P99: ≤ 25 ms; P999: ≤ 50 ms | Recent data: P50: ≤ 2.92 ms; P95: ≤ 19.6 ms; P99: ≤ 21.4 ms |
| Integration availability SLI | | |
| Integration content availability | 60% | Current measure: ~60%. This is currently far too low, but due to the small scale of use and cache expiry we expect this to be more noise than signal until wider deployment. |
| Integration combined latency-availability SLI | ? | |
| MediaWiki system load | 5% | Currently unmeasurable / at 0% (blocked) |
| Back-end API latency SLI, percentiles | P50: ≤ 250 ms; P95: ≤ 500 ms; P99: ≤ 1000 ms; P999: ≤ 10 s | Recent data: P50: ≤ 23.1 ms; P95: ≤ 389 ms; P99: ≤ 777 ms. We expect requests to get more complex over time, hence the P50 target being much higher than current data. The 10 s limit is enforced by k8s request logic. |
| Back-end API availability SLI | 99.5% | Recent data: 99.8% |
| Back-end API combined latency-availability SLI | 98.5% | Recent data: 98.8% |
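To put the back-end targets and recent data in the table above into perspective, the share of the error budget already used can be worked out as the observed failure rate divided by the allowed failure rate. The following is a small sketch of that arithmetic; the function name is hypothetical, and the measurement window behind "recent data" is not stated in the source.

```python
def error_budget_consumed(target_pct: float, observed_pct: float) -> float:
    """Fraction of the error budget used: observed failure rate over allowed failure rate."""
    allowed = 100.0 - target_pct
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return (100.0 - observed_pct) / allowed


# Targets and recent data taken from the table above.
availability_used = error_budget_consumed(target_pct=99.5, observed_pct=99.8)
combined_used = error_budget_consumed(target_pct=98.5, observed_pct=98.8)
```

With these figures, roughly 40% of the availability budget and roughly 80% of the combined latency-availability budget are in use, so the combined SLI has noticeably less headroom.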
Ideal targets
| SLI | Target | Rationale |
|---|---|---|
| Integration latency SLI, percentiles | P50: ≤ 5 ms; P95: ≤ 20 ms; P99: ≤ 25 ms; P999: ≤ 30 ms | We want the extra load of adding a function call to a page to be unnoticeable to users, and not impair service costs. |
| Integration availability SLI | | |
| Integration content availability | 85% | We want almost all function calls to be immediately available, only very rarely showing them a placeholder. |
| Integration combined latency-availability SLI | P999 ≤ 50ms | We want all function calls to return to the parser nearly-immediately. |
| MediaWiki system load | 5% | We want to avoid being a significant burden on the general Wikimedia production MediaWiki ecosystem. |
| Back-end API latency SLI, percentiles | P50: ≤ 250 ms; P95: ≤ 500 ms; P99: ≤ 1000 ms | We want new function calls' results to be swiftly and reliably available. |
| Back-end API latency SLI, acceptable limit | 10,000 ms | We want a hard limit on how long to wait for new function calls' results. |
| Back-end API availability SLI | 99.9% | We want the overall back-end system to be extremely reliably available, so as to avoid any disruption for users. |
| Back-end API combined latency-availability SLI | 99% | We want function calls made to the back-end to be reliably available and only very rarely result in surprise re-work or slowness. |
Reconciliation
| SLI | Target |
|---|---|
| Integration latency SLI, percentiles: The percentile latencies, as measured from Parsoid's perspective, of requests made when rendering a page with a Wikifunctions call misses the parser cache and so triggers a call from Parsoid to Wikifunctions, up to the point the call returns to Parsoid (with a successful result, a system-disabled error, or a placeholder). | P50: ≤ 5 ms; P95: ≤ 20 ms; P99: ≤ 25 ms; P999: ≤ 50 ms |
| Integration availability SLI: The percentage of all requests to the integration API receiving a response should be at least this threshold. | 99.99% |
| Integration combined latency-availability SLI: The percentage of all fragment requests from Parsoid to our integration API that complete within the 100 ms threshold shall be at least the limit. | 95% |
| [eventually] Wikifunctions call content availability: The minimum percentage of (post-parser-cache) page impressions that trigger a Parsoid parse with at least one Wikifunctions call on the page and are served fully, without placeholders. | 60% |
| [eventually] MediaWiki system load: Extra MediaWiki update jobs triggered because asynchronous Wikifunctions content was not ready shall represent less than this threshold of total jobs run, including during an outage and recovery from an outage. | 5% |
| Back-end API latency SLI, percentiles: The percentile request latencies, as measured from MediaWiki's perspective, up to the final successful Wikifunctions call's response being ready in the cache. | P50: ≤ 250 ms; P95: ≤ 500 ms; P99: ≤ 1000 ms; P999: ≤ 10 s |
| Back-end API availability SLI: The percentage of all requests to the back-end API receiving a non-error response, defined as HTTP status code 200 or 4xx, should be at least this threshold. | 99.5% |
| Back-end API combined latency-availability SLI: The percentage of all requests that complete within the 10 s threshold and receive a non-error response, defined as above, shall be at least the limit. | 98.5% |
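One way to make the reconciled latency targets checkable is to encode them as data and compare measured percentiles against them. The sketch below does this with the targets from the table above and the recent data quoted under "Realistic targets"; the `TARGETS_MS` layout and the `breaches` helper are hypothetical and are not part of any existing tooling.

```python
from typing import Dict

# Reconciled latency targets from the table above, in milliseconds.
TARGETS_MS: Dict[str, Dict[str, float]] = {
    "integration": {"P50": 5, "P95": 20, "P99": 25, "P999": 50},
    "backend": {"P50": 250, "P95": 500, "P99": 1000, "P999": 10_000},
}


def breaches(service: str, measured_ms: Dict[str, float]) -> Dict[str, float]:
    """Return the measured percentiles that exceed their target.

    Percentiles with no measurement, or with no target, are skipped.
    """
    targets = TARGETS_MS[service]
    return {
        name: value
        for name, value in measured_ms.items()
        if name in targets and value > targets[name]
    }


# Recent data as quoted in the "Realistic targets" table (no P999 reported).
recent_integration = {"P50": 2.92, "P95": 19.6, "P99": 21.4}
recent_backend = {"P50": 23.1, "P95": 389, "P99": 777}
```

Both recent data sets come back with no breaches, though the integration P95 (19.6 ms against a 20 ms target) leaves very little margin.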