This page provides a brief overview of the Server-side Rendering Service.
- Grafana dashboard for termbox service
- Grafana dashboard for envoy proxy, filtered for termbox
- Grafana dashboard for Termbox SSR Service Level Objective (SLO)
- Grafana dashboard for Wikidata alerts with a panel showing Termbox request errors (requests from MediaWiki to Termbox)
- Logstash, Logstash 2 (Todo: create some gadgets to see at a glance whether events are spiking, maybe consolidate this)
The service was introduced in 2019, initially to serve server-side rendered content for the Wikidata/Wikibase "term box", i.e. the part of the item/property page UI where labels, descriptions and aliases are shown and can be edited.
The service is used as part of generating the HTML output sent from MediaWiki to the user's browser.
There are a server-side and a client-side variant of the code, both of which are distributions of the same implementation.
The client-side variant is deployed into Wikibase at the file-system level through git submodules.
If no server-side rendering service is configured, or if it malfunctions, the client-side code acts as a fallback.
The service uses Vue.js as the UI framework.
The service is deployed on the WMF services Kubernetes cluster using Helm. This means that the service is packaged as a Docker image. The Docker image is built by the Deployment pipeline.
The images used in production can be found on the WMF Docker registry, which lacks a convenient UI; the easiest way to see the current images is this tool. New images are built automatically by the deployment pipeline after code is merged to the master branch.
The production clusters used on wikidata.org are managed using Kubernetes and Helm. The same setup is used for a staging instance and for an instance serving test.wikidata.org. The configuration for these can be found in the operations/deployment-charts repo. Details on applying configuration changes to the production clusters can be found at Migrating from scap-helm.
The instance used for beta is simply run with Docker. Its configuration can be found in the git repo in the infrastructure folder, along with the instructions for applying changes.
Some useful metrics for monitoring the deployment are shown in Grafana.
Sequence diagram "source code".
Initial deployment & load details
The initial responsibility of this service will be rendering the term box for Wikidata items and properties in mobile web views.
Currently wikidata.org gets no more than 80k mobile web requests per day (including cached pages and non-item/property pages).
If we assume that all of these requests go to item and property pages that are not cached, the SSR service would be hit about 55 times per minute.
In reality, some of these page views are not for item or property pages and some are served from cache, so we are looking at no more than 1 call per second.
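The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 80k figure is the upper bound quoted in the text, and the function name `worst_case_rates` is made up for illustration.

```python
# Worst-case load estimate, assuming every mobile web request
# reaches the SSR service (no caching, only item/property pages).
MOBILE_REQUESTS_PER_DAY = 80_000  # upper bound quoted in the text

MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def worst_case_rates(requests_per_day: int) -> tuple[float, float]:
    """Return (requests per minute, requests per second)."""
    per_minute = requests_per_day / MINUTES_PER_DAY
    per_second = requests_per_day / SECONDS_PER_DAY
    return per_minute, per_second


per_minute, per_second = worst_case_rates(MOBILE_REQUESTS_PER_DAY)
print(f"{per_minute:.1f} requests/minute, {per_second:.2f} requests/second")
# prints "55.6 requests/minute, 0.93 requests/second"
```

Even under the worst-case assumption the rate stays just below one request per second, which matches the conclusion above.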
Availability objectives and accepted operational errors
The Service Level Objective (SLO) for the Termbox SSR is an error rate of less than 0.1%. The current error rate and numbers of errors can be seen at the Grafana Termbox SSR SLO dashboard.
That availability is impacted by errors triggered inside Termbox SSR (i.e. the NodeJS app living in Kubernetes) that are caused by operational or performance issues in MediaWiki. They are unavoidable to a degree and acceptable as long as their overall frequency stays low (see the SLO above). The bulk of those errors consists of the following three error messages:
timeout of 3000ms exceeded
- Some of these timeout errors seem to happen surprisingly often during the periodic health checks (config, docs). This is considered strange but probably harmless.
- Disregarding the health checks above that go to the unused datacenter, these errors also seem to correspond almost perfectly to the errors logged in the MediaWiki PHP Logstash with the message
Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server
and content
Request failed with status 0. Usually this means network failure or timeout
- The timeout for this connection from MediaWiki/PHP to the Termbox SSR is currently based on the Wikibase default configuration
Request failed with status code 500
- This means that the MediaWiki API had a server-side problem.
Request failed with status code 503
- These seem to be triggered by the Envoy Proxy that sits between the Termbox SSR and the MediaWiki API. More detailed information about that is available in another Phabricator comment.
These errors are discussed in more detail in a Phabricator comment, and detailed descriptions of them are visible in Logstash. Note that there seems to be a bug in how Prometheus calculates the numbers shown in Grafana, so they can diverge from what is shown in Logstash.
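When looking at exported log messages, it can help to tally them by the three categories above and compare the resulting error rate against the 0.1% SLO. The following is a minimal sketch under stated assumptions, not the way the dashboards compute their numbers: the category labels, the `summarize` helper and the sample data are hypothetical, and only the message texts and the SLO threshold come from this page.

```python
import re
from collections import Counter

SLO_MAX_ERROR_RATE = 0.001  # 0.1 %, the SLO stated on this page

# Patterns for the three dominant error messages listed above.
# The labels on the left are illustrative names, not official ones.
ERROR_PATTERNS = {
    "timeout": re.compile(r"timeout of \d+ms exceeded"),
    "mediawiki_500": re.compile(r"Request failed with status code 500"),
    "envoy_503": re.compile(r"Request failed with status code 503"),
}


def classify(message: str) -> str:
    """Map a log message to one of the known categories, or 'other'."""
    for label, pattern in ERROR_PATTERNS.items():
        if pattern.search(message):
            return label
    return "other"


def summarize(error_messages: list[str], total_requests: int):
    """Return (counts per category, error rate, whether the SLO is met)."""
    counts = Counter(classify(m) for m in error_messages)
    rate = len(error_messages) / total_requests if total_requests else 0.0
    return counts, rate, rate < SLO_MAX_ERROR_RATE


# Hypothetical sample: 4 errors out of 10,000 requests.
sample = [
    "timeout of 3000ms exceeded",
    "Request failed with status code 503",
    "timeout of 3000ms exceeded",
    "Request failed with status code 500",
]
counts, rate, ok = summarize(sample, total_requests=10_000)
print(dict(counts), f"{rate:.2%}", ok)
```

Because of the Prometheus discrepancy noted above, a tally like this from Logstash data may not match the Grafana figures exactly.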
Debugging and Testing Production
To connect to the production services for testing, use SSH port forwarding as follows:
ssh -4 -L 3030:termbox.svc.codfw.wmnet:3030 <username>@bast1002.wikimedia.org
You can change the bastion host as needed. You can also point the tunnel at a different instance of the service, e.g. eqiad instead of codfw.
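Once the tunnel is up, a quick way to confirm that the forwarded port is accepting connections is a plain TCP connect to localhost. The sketch below assumes the local port 3030 from the command above; the helper name `port_is_open` is made up for illustration, and the check says nothing about whether the service itself answers requests correctly.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 3030,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("tunnel reachable on localhost:3030")
    else:
        print("nothing listening on localhost:3030")
```

A successful connect only shows that SSH is listening locally; if the remote end is unreachable, the failure shows up on the first real request.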