SLO/linkrecommendation

Status: approved

Organizational

Service

There are three deployments:

  • linkrecommendation external (served by API gateway, for third party clients)
  • linkrecommendation internal (accessed via internal proxies, production services only)
  • linkrecommendation dataset loader (cron job for regular creation/updates of datasets that the service needs in order to run)

For documentation of the service, see Add_Link

Teams

Architectural

Environmental dependencies

The service runs as a Kubernetes deployment. Its configuration is in the operations/deployment-charts repo.

Service dependencies

  • hard dependency: MariaDB/misc#m2. The linkrecommendation service uses a database on this cluster to store its datasets and cannot function without access to it. The database name is "mwaddlink". The database user is "linkrecommendation". The database admin user is "adminlinkrecommendation". (See "charts/linkrecommendation/values.yaml".)
  • soft dependency: GET requests to the service make use of MEDIAWIKI_PROXY_API_URL and MEDIAWIKI_PROXY_API_BASE_URL. If proxying is broken, no GET requests can be served. The external service primarily uses GET requests, whereas the primary (only?) consumer of the internal service, GrowthExperiments/maintenance/refreshLinkRecommendations.php, uses POST.

Client-facing

Clients

  • internal deployment: GrowthExperiments/maintenance/refreshLinkRecommendations.php. Runs via a Puppet-managed cron job on MediaWiki maintenance hosts. It issues a POST with article content to the linkrecommendation internal service, receives the suggestion data in response, and caches the results in a MediaWiki database table on x1.
  • external deployment: QA, internal testers. It is not known whether there are any gadgets or other users.
  • dataset loader deployment: Cron job managed via operations/deployment-charts. It regularly polls https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/ and iterates over the list of wikis in https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/wikis.txt, comparing the checksums of the published datasets with those stored in the linkrecommendation service database. If a checksum doesn't match (update) or doesn't exist (new dataset), the linkrecommendation application downloads the datasets and imports them into the linkrecommendation service's MySQL database.
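
The checksum comparison performed by the dataset loader can be sketched roughly as follows. This is a minimal illustration, not the actual mwaddlink code: the function name plan_dataset_actions, the action labels, and the simplification to one checksum per wiki are all assumptions made for the example, and the checksums are passed in as plain dictionaries instead of being fetched from the published datasets or read from the database.

```python
from typing import Dict, List, Optional


def plan_dataset_actions(
    wikis: List[str],
    published: Dict[str, str],
    stored: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Decide, per wiki, whether a dataset import is needed.

    Hypothetical helper for illustration only.
    published: wiki -> checksum of the published dataset
    stored:    wiki -> checksum recorded in the service database
               (absent or None if the dataset was never imported)
    """
    actions: Dict[str, str] = {}
    for wiki in wikis:
        remote = published.get(wiki)
        if remote is None:
            # Nothing published for this wiki; skipping is an assumed behaviour.
            actions[wiki] = "skip"
            continue
        local = stored.get(wiki)
        if local is None:
            actions[wiki] = "import_new"  # checksum doesn't exist: new dataset
        elif local != remote:
            actions[wiki] = "update"  # checksum doesn't match: re-import
        else:
            actions[wiki] = "up_to_date"
    return actions


# Example data (made up): one wiki current, one stale, one never imported.
wikis = ["cswiki", "arwiki", "bnwiki"]
published = {"cswiki": "aaa", "arwiki": "bbb", "bnwiki": "ccc"}
stored = {"cswiki": "aaa", "arwiki": "old"}
result = plan_dataset_actions(wikis, published, stored)
print(result)
```

Only wikis marked "import_new" or "update" would trigger a download and import, so a run in which every checksum matches does no import work.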

Request Classes

GET and POST requests to the external service are nice-to-have. The POST requests to the internal service are what matter most to the Growth team. If those fail, users of Special:Homepage on wikis where link recommendation tasks are deployed would see fewer or no link recommendation tasks.

Service Level Indicators (SLIs)

  • Availability SLI: The percentage of all requests receiving a non-error response, defined as "HTTP status code 200"
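
As a rough illustration of the availability SLI, the percentage can be computed from per-status request counts. This is a sketch under stated assumptions: the function name availability_sli and the sample counts are invented for the example, and a real measurement would be derived from the service's metrics, not from a hand-built dictionary. It follows the definition above, where only HTTP 200 counts as a non-error response.

```python
from typing import Mapping, Optional


def availability_sli(status_counts: Mapping[int, int]) -> Optional[float]:
    """Return the percentage of requests answered with HTTP 200.

    Hypothetical helper for illustration only.
    status_counts maps an HTTP status code to the number of requests
    that received it. Returns None when there were no requests, since
    the percentage is undefined in that case.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None
    return 100.0 * status_counts.get(200, 0) / total


# Made-up sample: 1000 requests, of which 970 returned HTTP 200.
counts = {200: 970, 400: 10, 500: 20}
sli = availability_sli(counts)
print(sli)
```

With this strict definition, client errors such as HTTP 400 also lower the SLI; counting only server errors as failures would be a different definition and would give a higher figure for the same traffic.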

The latency is not that important to us, because the primary consumer is a GrowthExperiments maintenance script that runs via cron, so whether it takes 2 seconds or 20 seconds to process a request is not catastrophic. It just means it takes longer to fill up the cache of suggestions in MediaWiki.

Operational

Monitoring

Troubleshooting

  • Most engineers on the Growth team have experience deploying the service and troubleshooting issues with it. Add_Link is reasonably well documented. That said, there are multiple moving parts, and it takes a while to understand the full sequence of events.

Deployment

Helmfile adjustments are made with a patch to operations/deployment-charts. Application updates are made with patches to research/mwaddlink; these result in an updated Docker image, which is then deployed with a patch to operations/deployment-charts that references the updated image.

Service Level Objectives

  • The external-facing service is not covered by the SLO.
  • Internal service availability (checking for HTTP 500 errors) is currently the only SLI.
    • There are still some failure modes that wouldn't be covered: notably, if capacity issues prevented the cron job from completing in a timely manner as we scale it to include most Wikipedias, then link recommendations could be out of date even if the backend were 100% available. That's imperfect: ideally, the SLO should be violated if and only if the user experience is degraded. But the error rate is the best candidate SLI of the metrics we have available today, so we'll go forward with it.
  • 95% availability SLO for the internal service (about 4.5 days of downtime per quarter)
    • Paging alerts aren't needed, and SRE support can be limited to normal working hours.
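
The downtime figure quoted for the 95% SLO follows from simple arithmetic, sketched below. The helper name allowed_downtime_days is made up for the example, and the 90-day quarter is an assumption; calendar quarters run 90 to 92 days, so the exact budget varies slightly.

```python
def allowed_downtime_days(slo_percent: float, window_days: float) -> float:
    """Return the downtime budget, in days, for an availability SLO.

    Hypothetical helper for illustration only: the budget is the share
    of the window during which the service may be unavailable.
    """
    return window_days * (100.0 - slo_percent) / 100.0


# 95% availability over an assumed 90-day quarter leaves a 5% budget.
budget = allowed_downtime_days(95.0, 90)
print(budget)  # 4.5
```

For a 91-day quarter the same calculation gives roughly 4.55 days, so "4.5 days" is a rounded figure.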