
SLO/Abstract Wikipedia

From Wikitech
This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.

Status: draft

Organizational

The Service

The Wikifunctions integration allows Wikipedia editors to embed calls to execute Wikifunctions functions dynamically inside article content, which is shown to readers of the pages. This SLO defines reliability, latency, and availability targets to ensure a good user experience on Wikipedia pages with Wikifunctions call results embedded in them.

Responsible Teams

The Abstract Wikipedia team is primarily responsible for this project. You can talk to them on Slack in #talk-to-abstract-wikipedia.

Architectural

Environmental Dependencies

The integration into MediaWiki runs as a client/repo system in MediaWiki land, similar to Wikibase's. It is called synchronously by Parsoid on page render, and can return either a result or a "pending" state (an asynchronous content fragment). A pending state causes Parsoid to set a short TTL and call again later, by which time the request should have completed and the resulting fragment should be available.
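The result-or-pending behaviour described above can be illustrated with a minimal sketch. This is not the WikiLambda or Parsoid implementation: the `Fragment` type, the `render_fragment` function, the in-memory cache and queue, and both TTL values are hypothetical names and numbers chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Hypothetical TTLs in seconds; the real values are not stated in this document.
PENDING_TTL_S = 60
READY_TTL_S = 86400


@dataclass
class Fragment:
    status: str              # "ready" or "pending"
    content: Optional[str]   # rendered output, or None while pending
    ttl_s: int               # how long the caller may cache this fragment


def render_fragment(call_key: str, cache: Dict[str, str], in_flight: Set[str]) -> Fragment:
    """Return a cached result if one exists; otherwise record the call as
    in flight and return a pending placeholder with a short TTL."""
    cached = cache.get(call_key)
    if cached is not None:
        return Fragment("ready", cached, READY_TTL_S)
    # No result yet: remember that the call needs evaluating, and ask the
    # caller to come back soon by returning a short-lived placeholder.
    in_flight.add(call_key)
    return Fragment("pending", None, PENDING_TTL_S)
```

In this sketch the first render of a call returns a placeholder with the short TTL; once some other process has written the result into `cache`, the next render returns the finished fragment.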

The back-end service, which converts requests into outputs by running user-written code inside a series of sandboxes, runs as a k8s service, but for the purposes of the main part of this SLO it is considered a hard dependency.

Service Dependencies

Hard dependencies

Parsoid / MediaWiki

  • MediaWiki's (Parsoid's) asynchronous content fragments system, and MW's page caching management more generally
    • Note: No SLO currently exists for MW overall.
  • The Wikifunctions-specific memcached pool accessed by the MW client and repo code
    • Note: No SLO currently exists for the main memcached service, nor for the Wikifunctions-specific one.

Wikifunctions Backend Node Services

Soft dependencies

Extension:WikiLambda

  • Wikifunctions ZObject fetch API (part of the MediaWiki Action API)
    • Note: No SLO currently exists for MW overall, nor for the Action API.
  • Wikidata entities fetch API (part of the MediaWiki Action API)
    • Note: No SLO currently exists for MW overall, nor for the Action API.
  • CirrusSearch's Wikidata entity search API (part of the MediaWiki Action API)
    • Note: No SLO currently exists for MW overall, nor for the Action API; the SLO for CirrusSearch does not cover its responses to MW Action API requests.
  • [Internal] Logstash logging, Prometheus metrics, etc.

Client-facing

Users

Feature users:

  • Readers
  • Editors who add and create new functions, implementations, tests

Clients

  • MediaWiki Parsoid page rendering, for pages on which the content is used
  • Human readers of MediaWiki articles on which the content is used

Request Classes

Service Level Indicators (SLIs)

[wip] Those in bold will be the first SLIs to be dispatched to a Pyrra db

For the Integration

  • Integration combined latency-availability SLI: The percentage of all requests that complete within the 50 ms threshold and receive a non-error response, defined as above, shall be at least the limit.
  • Integration latency SLI, percentiles: The percentile latencies, as measured from Parsoid's perspective, of requests made when a page containing a Wikifunctions call has a parser cache miss, so that Parsoid calls Wikifunctions and waits for a response (a successful result, a system-disabled error, or a placeholder).
  • Integration latency SLI, acceptable limit: All such requests should return to Parsoid within this limit, as measured from the standpoint of Parsoid's calling code.
  • Content availability: The minimum percentage of (post-parser-cache) page impressions that trigger a Parsoid parse with at least one Wikifunctions call on them and are served fully, without placeholders.
    • Measure live in grafana
  • MediaWiki system load: Extra MediaWiki update jobs, triggered because asynchronous Wikifunctions content was not ready, will represent less than this threshold of total jobs run, including during an outage and recovery from an outage.
    • Currently unmeasurable / at 0%

For the Backend

  • Back-end API combined latency-availability SLI: The percentage of all requests that complete within the 10 s threshold and receive a non-error response (HTTP status code 200 or 4xx, as for the availability SLI) shall be at least the limit.
  • Back-end API availability SLI: The percentage of all requests to the back-end API receiving a non-error response, defined as HTTP status code 200 or 4xx, should be at least this threshold.
  • Back-end API latency SLI, percentiles: The percentile request latencies, as measured from MediaWiki's perspective, up to the final successful Wikifunctions call's response being ready in the cache.
  • Back-end API latency SLI, acceptable limit: All such requests must complete within this limit, measured from MediaWiki's perspective, and shall return a timeout on hitting that limit.
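As a rough illustration of how the back-end availability and combined latency-availability SLIs relate, the sketch below computes both from a list of request records. The `Request` type and the function names are hypothetical, and treating an empty sample as 100% is an assumption; production SLIs would be derived from metrics rather than from raw records like these.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    status: int         # HTTP status code of the response
    latency_ms: float   # latency as seen by the caller


def is_non_error(status: int) -> bool:
    """Non-error as defined for the availability SLI: HTTP 200 or any 4xx."""
    return status == 200 or 400 <= status <= 499


def availability_pct(requests: Sequence[Request]) -> float:
    """Percentage of requests that received a non-error response."""
    if not requests:
        return 100.0  # assumption: no traffic means no budget is consumed
    good = sum(1 for r in requests if is_non_error(r.status))
    return 100.0 * good / len(requests)


def combined_pct(requests: Sequence[Request], threshold_ms: float) -> float:
    """Percentage of requests that were both non-error and within the threshold."""
    if not requests:
        return 100.0  # same assumption as above
    good = sum(
        1 for r in requests
        if is_non_error(r.status) and r.latency_ms <= threshold_ms
    )
    return 100.0 * good / len(requests)
```

Because the combined SLI adds a latency condition to the availability condition, it can never exceed the availability SLI over the same set of requests.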

Operational

Monitoring

The team practices Chores, in which failure logs and metrics are monitored on a daily, rotational basis. The microservice receives requests for /_info from a blackbox probe, which will page SRE if the service is unreachable or otherwise not responding.

Troubleshooting

  • Function requests are stateless, so moderately simple to debug, though for non-trivial issues some familiarity with the system is generally needed.
  • Requests may return with HTTP 200 responses whilst having an 'error' state internally, hampering external team review and debugging.
  • The orchestrator service maintains a cache of relevant objects, which may hinder debugging.
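To make the second point above concrete, the sketch below separates transport-level failures from failures reported inside an HTTP 200 body. The payload shape is an assumption: it supposes a JSON object with a top-level "error" key, which is a hypothetical stand-in and not the actual response format of the service.

```python
import json


def classify_response(status: int, body: str) -> str:
    """Classify a back-end response for debugging purposes.

    Assumes (hypothetically) that an internal failure is signalled by a
    non-null top-level "error" key in a JSON object body.
    """
    if status != 200:
        return "http_error"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unparseable"
    if isinstance(payload, dict) and payload.get("error") is not None:
        return "internal_error"
    return "ok"
```

Under the availability definition used in this SLO (HTTP 200 or 4xx), a response classified here as "internal_error" still counts as a non-error response, which is why it is hard to spot from status codes alone.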

Deployment

  • Service updates are through the standard k8s service deployment with helm charts.
  • The team has a practice of deploying updates to the service's images once a week, as a shared responsibility across the engineering function.
  • Middleware (PHP) and front-end (JS) code rides the MediaWiki train in the normal fashion.

Service Level Objectives

Realistic targets

| SLI | Target | Current |
| --- | --- | --- |
| Integration latency SLI, percentiles | P50: ≤ 5 ms; P95: ≤ 20 ms; P99: ≤ 25 ms; P999: ≤ 50 ms | Recent data: P50: ≤ 2.92 ms; P95: ≤ 19.6 ms; P99: ≤ 21.4 ms |
| Integration availability SLI | | |
| Integration content availability | 60% | Current measure: ~60%. This is currently far too low, but due to the small scale of use and cache expiry we expect this to be more noise than signal until wider deployment. |
| Integration combined latency-availability SLI | ? | |
| MediaWiki system load | 5% | Currently unmeasurable / at 0% (blocked) |
| Back-end API latency SLI, percentiles | P50: ≤ 250 ms; P95: ≤ 500 ms; P99: ≤ 1000 ms; P999: ≤ 10 s | Recent data: P50: ≤ 23.1 ms; P95: ≤ 389 ms; P99: ≤ 777 ms. We expect requests to get more complex over time, hence the P50 being much higher than current data. 10 s limit enforced by k8s request logic. |
| Back-end API availability SLI | 99.5% | Recent data: 99.8% |
| Back-end API combined latency-availability SLI | 98.5% | Recent data: 98.8% |
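Percentile targets such as those above can be checked against a batch of latency samples. The sketch below uses the nearest-rank percentile method, which is an assumption made for illustration; the monitoring stack may estimate percentiles differently (for example from histogram buckets). The function names and the example target dictionary are hypothetical, with the target values copied from the realistic integration latency targets.

```python
from typing import Dict, Sequence

# Realistic integration latency targets in milliseconds, keyed by
# per-mille rank (500 = P50, 999 = P999).
INTEGRATION_TARGETS_MS: Dict[int, float] = {500: 5, 950: 20, 990: 25, 999: 50}


def nearest_rank(samples: Sequence[float], permille: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    permille/1000 of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Integer ceiling of permille * n / 1000 avoids float rounding issues.
    rank = max(1, (permille * len(ordered) + 999) // 1000)
    return ordered[rank - 1]


def check_targets(samples: Sequence[float], targets: Dict[int, float]) -> Dict[int, bool]:
    """Map each per-mille rank to whether the observed value meets its target."""
    return {p: nearest_rank(samples, p) <= limit for p, limit in targets.items()}
```

With small sample counts the high percentiles collapse onto the slowest request, so results such as P999 are only meaningful once the sample is large enough.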

Ideal targets

| SLI | Target | Rationale |
| --- | --- | --- |
| Integration latency SLI, percentiles | P50: ≤ 5 ms; P95: ≤ 20 ms; P99: ≤ 25 ms; P999: ≤ 30 ms | We want the extra load of adding a function call to a page to be unnoticeable to users, and not impair service costs. |
| Integration availability SLI | | |
| Integration content availability | 85% | We want almost all function calls to be immediately available, only very rarely showing readers a placeholder. |
| Integration combined latency-availability SLI | P999 ≤ 50 ms | We want all function calls to return to the parser nearly immediately. |
| MediaWiki system load | 5% | We want to avoid being a significant burden on the general Wikimedia production MediaWiki ecosystem. |
| Back-end API latency SLI, percentiles | P50: ≤ 250 ms; P95: ≤ 500 ms; P99: ≤ 1000 ms | We want new function calls' results to be swiftly and reliably available. |
| Back-end API latency SLI, acceptable limit | 10,000 ms | We want a hard limit on how long to wait for new function calls' results. |
| Back-end API availability SLI | 99.9% | We want the overall back-end system to be extremely reliably available, so as to avoid any disruption for users. |
| Back-end API combined latency-availability SLI | 99% | We want function calls made to the back-end to be reliably available and only very rarely result in surprise re-work or slowness. |

Reconciliation

| SLI | Definition | Target |
| --- | --- | --- |
| Integration latency SLI, percentiles | The percentile latencies, as measured from Parsoid's perspective, of requests made when a page containing a Wikifunctions call has a parser cache miss, so that Parsoid calls Wikifunctions and waits for a response (a successful result, a system-disabled error, or a placeholder). | P50: ≤ 5 ms; P95: ≤ 20 ms; P99: ≤ 25 ms; P999: ≤ 50 ms |
| Integration availability SLI | The percentage of all requests to the integration API receiving a response should be at least this threshold. | 99.99% |
| Integration combined latency-availability SLI | The percentage of all fragment requests from Parsoid to our integration API that complete within the 100 ms threshold shall be at least the limit. | 95% |
| [eventually] Wikifunctions call content availability | The minimum percentage of (post-parser-cache) page impressions that trigger a Parsoid parse with at least one Wikifunctions call on them and are served fully, without placeholders. | 60% |
| [eventually] MediaWiki system load | Extra MediaWiki update jobs, triggered because asynchronous Wikifunctions content was not ready, will represent less than this threshold of total jobs run, including during an outage and recovery from an outage. | 5% |
| Back-end API latency SLI, percentiles | The percentile request latencies, as measured from MediaWiki's perspective, up to the final successful Wikifunctions call's response being ready in the cache. | P50: ≤ 250 ms; P95: ≤ 500 ms; P99: ≤ 1000 ms; P999: ≤ 10 s |
| Back-end API availability SLI | The percentage of all requests to the back-end API receiving a non-error response, defined as HTTP status code 200 or 4xx, should be at least this threshold. | 99.5% |
| Back-end API combined latency-availability SLI | The percentage of all requests that complete within the 10 s threshold and receive a non-error response, defined as above, shall be at least the limit. | 98.5% |
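A percentage target implies an error budget: the number of requests that may miss the objective in a given window. The sketch below works this out with exact rational arithmetic. The function names and the window of one million requests are hypothetical, used only to make the arithmetic concrete.

```python
import math
from fractions import Fraction


def allowed_failures(target_pct: str, total_requests: int) -> int:
    """Number of requests that may miss the objective without breaching it.

    target_pct is passed as a decimal string (for example "98.5") so that
    Fraction can represent it exactly, avoiding float rounding.
    """
    budget = 1 - Fraction(target_pct) / 100
    return math.floor(budget * total_requests)


def budget_remaining(target_pct: str, total_requests: int, failed: int) -> int:
    """Failures still permitted in the window; negative means the SLO is breached."""
    return allowed_failures(target_pct, total_requests) - failed
```

For example, at the 98.5% combined target, a window of 1,000,000 requests permits 15,000 misses; a measured 98.8% over that window corresponds to 12,000 misses, leaving 3,000 in the budget.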