WMDE/Wikidata/SSR Service


This page provides a brief overview of the Termbox Server-side Rendering (SSR) service[1].

Overview

The service was introduced in 2019, initially to serve server-side rendered content for the Wikidata/Wikibase "term box", i.e. the part of the item/property page UI where labels, descriptions and aliases are shown and can be edited.

The service is used as part of generating the HTML output sent from MediaWiki to the user's browser.

The HTML generated server-side can optionally be "enhanced" by client-side JavaScript.

There are a server-side and a client-side variant of the code; both are distributions of the same implementation.

The client-side variant is deployed into Wikibase at the file system level through git submodules.

If no server-side rendering service is configured, or if the service malfunctions, the client-side code acts as a fallback.[2]
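The fallback rule above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the actual Wikibase code: the names chooseTermboxOutput, SsrResult, TermboxOutput and the termbox-mount-point class are hypothetical. The only grounded part is the behaviour itself: with no configured service or a failed call, an empty mount point is emitted for the client-side code to render into.

```typescript
// Hypothetical sketch of the fallback decision; all names are illustrative.
type SsrResult =
  | { ok: true; html: string }
  | { ok: false; reason: string };

interface TermboxOutput {
  html: string;
  renderedBy: "server" | "client-fallback";
}

// Assumed placeholder element that the client-side code would mount into.
const FALLBACK_MOUNT_POINT = '<div class="termbox-mount-point"></div>';

function chooseTermboxOutput(
  ssrUrl: string | null,
  result: SsrResult | null
): TermboxOutput {
  // No service configured, no response, or a failed response: fall back.
  if (ssrUrl === null || result === null || !result.ok) {
    return { html: FALLBACK_MOUNT_POINT, renderedBy: "client-fallback" };
  }
  // Otherwise use the server-rendered HTML as is.
  return { html: result.html, renderedBy: "server" };
}
```

Keeping the decision in one pure function makes both paths easy to test without a running service.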

Technology

The SSR service is a Node.js service written in TypeScript. The code is "compiled" to JavaScript using webpack. The "compiled" code and "compiled" CSS are in the dist folder of the git repository.

The service uses Vue.js as the UI framework.

The service is deployed on the WMF services Kubernetes cluster using Helm. This means that the service is packaged as a Docker image. The Docker image is built by the Deployment pipeline.

Deployment

The images used in production can be found on the WMF Docker registry. The deployment pipeline automatically builds new images after code is merged to the master branch.

On Beta, the image is just run by Docker. The configuration for this can be found in the git repo in the infrastructure folder. The instructions for applying those changes can also be found there.

In Wikimedia production, the service is managed using Kubernetes and Helm. Kubernetes deployments are configured in the operations/deployment-charts repo. There are four releases in total:

  • 2 production releases, one for the eqiad cluster and one for codfw. These talk to Wikidata (wikidata.org, wikidatawiki) and are used by Wikidata as well.
  • 1 staging release, in the staging cluster. This one also talks to Wikidata, but is not used by anything.
  • 1 test release, also in the staging cluster. This one talks to Test Wikidata (test.wikidata.org, testwikidatawiki) and is used by Test Wikidata as well.

When deploying a new version of the Termbox, you should usually first update the test release (values-test.yaml) and deploy that to the staging cluster, then test that it works on Test Wikidata (check that a newly created item has an SSR termbox). Then, update the version in the production release (values.yaml; this will also update the staging release, because values-staging.yaml does not override the version). If you want to test the staging release before deploying the production release, you will have to do so using curl, because the staging release is not used by any wiki:

curl 'https://staging.svc.eqiad.wmnet:4004/termbox?entity=Q42&revision=1841500264&language=en&editLink=%2Fw%2Findex.php%2FSpecial%3ASetLabelDescriptionAliases%2FQ42&preferredLanguages=en%7Cde'
# should return some HTML starting with <section class="wikibase-entitytermsview"

If this works, then deploy the production release to the eqiad and codfw clusters and check that new Wikidata items have an SSR termbox on mobile.
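The query string of the staging request above can be assembled programmatically. The following is a minimal sketch: the parameter names and the /termbox path are taken from the curl command, while the TermboxRequest interface and the helper names buildTermboxUrl and looksLikeTermboxHtml are hypothetical and not part of the service.

```typescript
// Hypothetical helper types and names; only the parameters and the
// /termbox path come from the curl example.
interface TermboxRequest {
  entity: string;
  revision: number;
  language: string;
  editLink: string;
  preferredLanguages: string[];
}

function buildTermboxUrl(baseUrl: string, req: TermboxRequest): string {
  const params: Array<[string, string]> = [
    ["entity", req.entity],
    ["revision", String(req.revision)],
    ["language", req.language],
    ["editLink", req.editLink],
    // Preferred languages are joined with "|" before encoding.
    ["preferredLanguages", req.preferredLanguages.join("|")],
  ];
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${baseUrl}/termbox?${query}`;
}

// Checks the response body against the prefix named in the curl comment.
function looksLikeTermboxHtml(body: string): boolean {
  const expectedPrefix = '<section class="wikibase-entitytermsview"';
  return body.trim().indexOf(expectedPrefix) === 0;
}
```

Called with the base URL https://staging.svc.eqiad.wmnet:4004 and the values from the curl example, buildTermboxUrl reproduces the same URL, including the percent-encoded editLink and preferredLanguages values.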

Some useful metrics for monitoring the deployment can be found in Grafana.

Architecture

Wikidata Termbox SSR Architecture Diagram
Wikidata Termbox SSR Sequence Diagram

Sequence diagram "source code".

Initial deployment & load details

The initial responsibility of this service will be the rendering of the term box for Wikidata items and properties for mobile web views.

Currently wikidata.org gets no more than 80k[3] mobile web requests per day (including cached pages, and non item/property pages).

If we were to assume that all of these requests were to uncached item and property pages, this SSR service would be hit about 55 times per minute (80,000 requests / 1,440 minutes per day).

In reality, some of these page views are not for item or property pages and some are cached, so we are looking at no more than 1 call per second.
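The worst-case estimate above is plain arithmetic and can be reproduced with a short sketch. The function name worstCaseLoad is hypothetical; the only input is the 80k requests per day figure quoted above.

```typescript
// Upper bound on SSR calls, assuming every mobile web request
// hits an uncached item or property page.
const MINUTES_PER_DAY = 24 * 60; // 1440

function worstCaseLoad(requestsPerDay: number): {
  perMinute: number;
  perSecond: number;
} {
  const perMinute = requestsPerDay / MINUTES_PER_DAY;
  return { perMinute, perSecond: perMinute / 60 };
}

const load = worstCaseLoad(80000);
// load.perMinute is about 55.6, load.perSecond is about 0.93
```

Both values are upper bounds, which is why the text settles on "no more than 1 call per second".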

Availability objectives and accepted operational errors

The Service Level Objective (SLO) for the Termbox SSR is an error rate of less than 0.1%. The current error rate and numbers of errors can be seen at the Grafana Termbox SSR SLO dashboard.
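As a rough illustration of what the SLO means in numbers, the check below compares an observed error rate with the 0.1% threshold. It is a sketch only: the function names and the way errors and requests are counted are assumptions, and the real figures come from the Grafana dashboard.

```typescript
// 0.1% expressed as a fraction of all requests.
const SLO_MAX_ERROR_RATE = 0.001;

function errorRate(errors: number, totalRequests: number): number {
  // Treat "no traffic" as "no errors" to avoid dividing by zero.
  return totalRequests === 0 ? 0 : errors / totalRequests;
}

function meetsSlo(errors: number, totalRequests: number): boolean {
  // The SLO asks for an error rate strictly below the threshold.
  return errorRate(errors, totalRequests) < SLO_MAX_ERROR_RATE;
}
```

For example, 50 errors in 100,000 requests (0.05%) meets the objective, while 200 errors in 100,000 requests (0.2%) does not.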

This availability is affected by errors raised inside Termbox SSR (i.e. the Node.js app living in Kubernetes) that are caused by operational or performance issues in MediaWiki. They are unavoidable to a degree and acceptable as long as their overall frequency stays low (see the SLO above). Most of these errors carry one of the following three messages:

  • timeout of 3000ms exceeded
    • Some of these timeout errors seem to happen surprisingly often during the health checks that are run periodically (config, docs). This is judged to be strange but probably harmless.
    • Disregarding the health checks that go to the unused datacenter above, these errors also seem to correspond almost perfectly to the errors logged in MediaWiki PHP logstash with the message Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server and content Request failed with status 0. Usually this means network failure or timeout
  • Request failed with status code 500
    • i.e., the MediaWiki API having some server problem.
  • Request failed with status code 503
    • These seem to be triggered by the Envoy Proxy that sits between the Termbox SSR and the MediaWiki API. More detailed information about that is available in another Phabricator comment.

These errors are discussed in more detail in a Phabricator comment. Detailed descriptions of them are visible on logstash. Note that there seems to be a bug in how Prometheus calculates the numbers shown in Grafana, so they can diverge from what is shown in logstash.
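When looking through log entries, it can help to bucket messages into the three known kinds listed above. The sketch below is illustrative only: the category labels and function names are hypothetical, and the matching relies solely on the message texts quoted in the list.

```typescript
// Hypothetical category labels for the three accepted error kinds.
type ErrorCategory = "timeout" | "mediawiki-500" | "envoy-503" | "other";

function classifyTermboxError(message: string): ErrorCategory {
  // Matches e.g. "timeout of 3000ms exceeded", for any timeout value.
  if (/^timeout of \d+ms exceeded$/.test(message)) {
    return "timeout";
  }
  if (message === "Request failed with status code 500") {
    return "mediawiki-500";
  }
  if (message === "Request failed with status code 503") {
    return "envoy-503";
  }
  return "other";
}

function countByCategory(messages: string[]): Record<ErrorCategory, number> {
  const counts: Record<ErrorCategory, number> = {
    timeout: 0,
    "mediawiki-500": 0,
    "envoy-503": 0,
    other: 0,
  };
  for (const message of messages) {
    counts[classifyTermboxError(message)] += 1;
  }
  return counts;
}
```

A growing "other" bucket would suggest errors outside the accepted operational ones, which deserve a closer look.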

Debugging and Testing Production

To connect to the production services for testing, use ssh port forwarding as follows:

ssh -4 -L 3030:termbox.svc.codfw.wmnet:3030 <username>@bast1002.wikimedia.org

You can alter the bastion host as needed. You can also alter the service e.g. eqiad vs codfw.
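The variable parts of the command above are the datacenter, the username and the bastion host. A small sketch that assembles the command string for either datacenter could look as follows. The function name is hypothetical, and the default bastion is simply the one shown above, not a recommendation.

```typescript
type Datacenter = "eqiad" | "codfw";

// Builds the ssh port-forwarding command shown above for a given datacenter.
// The local port 3030 is forwarded to the termbox service on the same port.
function buildPortForwardCommand(
  username: string,
  datacenter: Datacenter,
  bastion: string = "bast1002.wikimedia.org"
): string {
  const target = `termbox.svc.${datacenter}.wmnet:3030`;
  return `ssh -4 -L 3030:${target} ${username}@${bastion}`;
}
```

Once the tunnel is open, requests sent to port 3030 on localhost reach the selected production service.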

References

  1. Source code of the service
  2. wikibase TermboxView falling back to termbox client code mount point DOM element
  3. https://tools.wmflabs.org/siteviews/?platform=mobile-web&source=pageviews&agent=user&range=last-year&sites=wikidata.org