SLO/Lift Wing

Organizational

What is the service?

Machine Learning/LiftWing is a set of services run by the Machine Learning team. While these services share common libraries/frameworks and are all topically related (machine learning and inference for WMF-related data), they are somewhat independent: LW is not a cohesive API of complementary REST endpoints, but rather a project that covers topically similar services.

The base infrastructure is a Kubernetes cluster that is spread across the eqiad and codfw data centers. On this cluster runs a collection of ML services and some supporting infrastructure. The two locations are independent and mirrored, with the intent that even with the loss of an entire data center, the services running in the other data center are sufficient to handle all expected traffic. There is also a smaller staging cluster in codfw. Note that this SLO does not apply to the staging cluster or the services running there.

The services currently running on the cluster are mostly inherited from ORES and are intended to provide the same (or better) functionality as that project does at the moment, while being more maintainable than the custom/bespoke ORES setup.

In addition, there are ML services on the cluster that in a sense are post-ORES, in that they have no ORES equivalent and were created with LW in mind, taking advantage of the added features of this system. Over time, the fraction of newer services is expected to increase, while eventually, the old ORES services will be turned off.

Finally, there are auxiliary services that either provide functionality for other services on the cluster or provide legacy functionality for former clients of ORES.

For the purposes of the SLIs and SLOs discussed here, we group the services into classes with similar SLOs and consider the Kubernetes infrastructure they run on separately. More on this below.

Service structure and frameworks

The Inference Services (ISes) are usually stateless microservices that use the MediaWiki API (MWAPI) as a backend. Each of these services has its own distinct endpoint, reachable internally through DNS Discovery and externally through the API Gateway.

On the API gateway, the URL structure is slightly different for technical reasons. More information can be found in the LiftWing usage documentation.
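
For illustration, a request to one of these models through the API Gateway might look roughly like the sketch below. The model name, payload, and exact URL are hypothetical examples; the LiftWing usage documentation has the authoritative URL structure and authentication details.

  import requests

  # Hypothetical example: ask a goodfaith-style model to score one revision.
  # The URL and model name are illustrative only; see the LiftWing usage
  # documentation for the current, authoritative form.
  API_GW_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-goodfaith:predict"

  response = requests.post(
      API_GW_URL,
      json={"rev_id": 12345},  # the revision to score
      headers={"User-Agent": "my-tool/0.1 (ops@example.org)"},
      timeout=10,
  )
  response.raise_for_status()
  print(response.json())  # model-specific prediction payload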

Most of these services use a framework to simplify request handling and integration with Kubernetes, typically Kserve.
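
As a rough sketch of what a Kserve-based Inference Service looks like, the following minimal (hypothetical) model server loads a model at startup and answers predict requests; the class name, model name, and placeholder logic are illustrative, not taken from an actual LiftWing service.

  from kserve import Model, ModelServer

  class ExampleRevisionModel(Model):
      """Hypothetical Inference Service wrapping a single ML model."""

      def __init__(self, name: str):
          super().__init__(name)
          self.model = None
          self.load()

      def load(self):
          # In a real service the model binary would be fetched from
          # Swift (Thanos) storage when the pod starts.
          self.model = lambda features: {"score": 0.5}  # placeholder model
          self.ready = True

      def predict(self, payload: dict, headers: dict = None) -> dict:
          rev_id = payload["rev_id"]
          # A real service would fetch revision data from the MediaWiki API
          # here and turn it into model features.
          features = {"rev_id": rev_id}
          return {"predictions": self.model(features)}

  if __name__ == "__main__":
      ModelServer().start([ExampleRevisionModel("example-model")])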

For the purposes of the SLO discussion, we will consider the ORES services and the newer ML services written and designed for LW separately.

Glossary

All these names can be confusing, so here is a short summary:

  • Inference Service: A microservice that uses a machine learning model to make predictions on revisions, articles etc.
  • Kserve: A Python framework that helps with wrapping an ML model in an HTTP service, and with integration with Kubernetes.
  • LiftWing: A set of typically Kserve-based microservices that run on the ML-Serve Kubernetes cluster.
  • ORES Legacy: A service running on ML-Serve that facilitates the migration of ORES clients by offering the same (or very similar) API, backed by LiftWing services running on ML-Serve. This service is only a relay that queries LiftWing services (and a Redis cache).
  • ORES: An older, custom-built cluster of inference services that ML-Serve and LiftWing are meant to supplant. Hardware and most of the stack (except the ML models) are completely different from LW.

Who are the responsible teams?

Lift Wing is primarily run by the Machine Learning team, who take care of both the Kubernetes setup and the ML services running on top of it. The services themselves may be fully maintained by the Machine Learning team (e.g. the services that provide ORES-like functionality) or be provided by other teams (e.g. the Research team), with their day-to-day maintenance handled by the ML team.

Since the ML team is the team primarily responsible for the operation of Lift Wing, it has to sign off on services running on LW. The staging cluster allows new services to be iterated on until they meet the requirements for running on the production cluster.

Architectural

What are the service's dependencies?

Dependencies of the LW/Inference system fall into two categories:

  • Dependencies the Kubernetes framework has.
  • Dependencies of the individual inference services themselves.

The inference services implicitly depend on Kubernetes, of course, and the latency, outage budgets, etc. of the individual services include the portion contributed by Kubernetes. Nevertheless, we will eventually also define an SLO for the Kubernetes portion, which will likely become a separate document.

Hard dependencies

ML-Serve Kubernetes

The ML-Serve Kubernetes cluster runs the Lift Wing Inference Services (typically based on Kserve), plus auxiliary services such as the ORES Legacy endpoint.

This cluster naturally depends on the base infrastructure of the data centers it is hosted in: power, networking etc.

The more specific hard dependencies are:

  • PyBal and DNS Discovery to route traffic towards Lift Wing
  • The API Gateway to route external traffic
  • A set of etcd servers run by the ML team for leader election in the Kubernetes layer
  • Swift (Thanos) for the storage of model binaries
  • The WMF Docker Registry to store and fetch the Docker images of services running on the cluster

Inference Services

Most if not all Inference Services will depend on the MediaWiki API (MWAPI) and some logging service like Logstash, but may have additional dependencies, like Feature Stores, Score Caches and the like. Since the shape of degradation of a particular Inference Service is highly variable, distinguishing hard and soft dependencies broadly is difficult. For now, all of the services running on the LW Kubernetes cluster have these hard dependencies:

  1. Kubernetes itself
  2. Swift (Thanos)
  3. The WMF Docker registry
  4. MW API (to fetch revisions)
  5. Some future services may use a Feature Store service

Swift and Docker are needed during service (re)starts, and so are less critical for the steady state than the rest.

Soft dependencies

ML-Serve Kubernetes

Lift Wing is monitored via Icinga, Prometheus and Grafana. As such, normal, controlled operation requires these services to be up, but Lift Wing can operate without them for short times.

Inference Services

In addition to the monitoring infrastructure mentioned above:

  • Inference Services based on Kserve use Logstash for logging requests and debug messages.
  • Some services may use Score Caches (e.g. Redis or Cassandra) to improve response times. Depending on the users of the service, such a cache may become a hard dependency (see the sketch below).
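
As an illustration of the score-cache pattern, the sketch below shows a simple cache-aside lookup against Redis; the key scheme, TTL, and host name are hypothetical and not taken from the actual LiftWing setup.

  import json
  import redis

  cache = redis.Redis(host="score-cache.example.internal", port=6379)

  def cached_score(model: str, rev_id: int, compute_score) -> dict:
      """Return a cached score if present, otherwise compute and cache it."""
      key = f"{model}:{rev_id}"                  # hypothetical key scheme
      hit = cache.get(key)
      if hit is not None:
          return json.loads(hit)
      score = compute_score(rev_id)              # fall back to the model itself
      cache.setex(key, 3600, json.dumps(score))  # illustrative one-hour TTL
      return score

If the cache is unavailable, a service treating it as a soft dependency can simply skip the lookup and pay the extra latency; clients that rely on the cached response times effectively turn it into a hard dependency.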

Client-facing

Who are the service's clients?

Clients fall into two classes:

  • External users, via the API GW (this includes applications on Toolforge or Cloud VPSes)
  • Internal users, via the Discovery endpoint or the API GW

External users

These are external users that call Lift Wing endpoints. Examples are researchers that analyze Wiki edits, bots that surface metadata, moderation tools and the like. Some of these were formerly using ORES and will have to migrate to using Lift Wing services (either ISes or ORES Legacy) before September 2023.

Internal users

One of the main internal uses of Lift Wing is the population of Kafka topics with LW-generated content (scores) computed from the stream of changes of various wikis. Lift Wing itself does not create the topics for Kafka (in contrast to how ORES achieves the same effect), but only generates the scores and hands them back to a service that then puts them in the correct Kafka streams.
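
To illustrate this division of labour, a relay along these lines is sketched below: it consumes change events, asks a Lift Wing Inference Service for a score, and writes the result to the appropriate Kafka topic. Topic names, the endpoint URL, and the event shape are hypothetical; the real pipeline is operated by a separate service, not by Lift Wing itself.

  import json
  import requests
  from kafka import KafkaConsumer, KafkaProducer

  consumer = KafkaConsumer(
      "revision-create",  # hypothetical input topic
      bootstrap_servers="kafka.example.internal:9092",
      value_deserializer=lambda v: json.loads(v.decode("utf-8")),
  )
  producer = KafkaProducer(
      bootstrap_servers="kafka.example.internal:9092",
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )

  for event in consumer:
      rev_id = event.value["rev_id"]
      # Ask a Lift Wing Inference Service for a score (illustrative URL).
      resp = requests.post(
          "https://inference.example.internal/v1/models/enwiki-goodfaith:predict",
          json={"rev_id": rev_id},
          timeout=10,
      )
      if resp.ok:
          # Lift Wing only returns the score; the relay decides which
          # topic the score ends up in.
          producer.send("revision-score", resp.json())  # hypothetical output topic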

The ORES extension is an example of an application that uses (or will use) Lift Wing to provide additional data on Wiki pages (e.g. surfacing whether a particular edit was likely damaging).

Other clients may not be stream-based/stream-triggered; they instead query a service for a prediction on behalf of a user, or issue a series of queries based on some other criteria.

Write the service level indicators

All the ISes and auxiliary services have these two main SLIs:

  1. Latency of requests
  2. Availability

These SLIs are complex in that, from the user perspective, the latency/availability of the service itself is combined with the latency and availability of Lift Wing. That is, the overall latency experienced by the user is the combination of Lift Wing's processing and routing of the request and the latency of the service itself. Since services running on LW can be nearly arbitrarily complex, a lot of the latency and error budget (as seen by the user) will be allocated towards the service rather than Lift Wing itself. Still, for the purposes of the SLO, we consider services running on LW as "black boxes"; that is, we do not, for example, subtract the backend latency.

Telemetry on inference services should expose the backend-added latency. In a similar vein, errors caused by the backend services must be exported as separate counts. Ideally, this distinction (a server error inside the Inference Service vs. an error returned by a backend) should be exposed to the user as well. While it is impossible to distinguish these perfectly (is an unreachable backend a problem of the IS configuration, or is the backend service down?), a best effort should be made.
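
A minimal sketch of such telemetry, using the Python prometheus_client library with hypothetical metric names (real services would follow WMF Prometheus naming conventions), could separate service-side and backend-side latency and errors like this:

  import requests
  from prometheus_client import Counter, Histogram

  REQUEST_LATENCY = Histogram(
      "inference_request_duration_seconds",
      "End-to-end request latency as seen by the Inference Service",
  )
  BACKEND_LATENCY = Histogram(
      "inference_backend_duration_seconds",
      "Time spent waiting on backends such as the MediaWiki API",
  )
  ERRORS = Counter(
      "inference_errors_total",
      "Errors, split by where they originated",
      ["source"],  # e.g. "service" vs. "backend"
  )

  def fetch_revision(rev_id: int) -> dict:
      """Illustrative backend call with separate latency and error accounting."""
      with BACKEND_LATENCY.time():
          try:
              resp = requests.get(
                  "https://en.wikipedia.org/w/api.php",
                  params={"action": "query", "prop": "revisions",
                          "revids": rev_id, "format": "json"},
                  timeout=10,
              )
              resp.raise_for_status()
              return resp.json()
          except requests.RequestException:
              ERRORS.labels(source="backend").inc()
              raise

The request handler itself would wrap the whole request in REQUEST_LATENCY.time() and count its own failures with ERRORS.labels(source="service"), so that service-caused and backend-caused errors can be reported (and budgeted) separately.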

  • Request Latency SLI, acceptable fraction: The percentage of all requests answered within the chosen latency threshold.
  • Service Availability SLI: The percentage of all requests receiving a non-error response (note: client-caused errors like 404, 429, etc. are not counted here; see the sketch below).
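
As a back-of-the-envelope illustration, the two SLIs could be computed from request counts as follows (the counter names and numbers are hypothetical):

  def availability_sli(total: int, server_errors: int, client_errors: int) -> float:
      """Fraction of requests answered with a non-error response.

      Client-caused errors (404, 429, ...) are excluded from the request
      population altogether, so they affect neither numerator nor denominator.
      """
      eligible = total - client_errors
      return (eligible - server_errors) / eligible

  def latency_sli(total: int, within_threshold: int) -> float:
      """Fraction of all requests answered within the chosen latency threshold."""
      return within_threshold / total

  # Illustrative numbers only:
  print(availability_sli(total=100_000, server_errors=500, client_errors=2_000))  # ~0.9949
  print(latency_sli(total=100_000, within_threshold=98_500))                      # 0.985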

Operational

Every service experiences an outage sometimes, so the SLO should reflect its expected time to recovery. If the expected duration of a single outage exceeds the error budget, then the SLO reduces to "we promise not to make any mistakes." Relying on such an SLO is untenable.

Answer these questions for the service as it is, not as it ought to be, in order to arrive at a realistically supportable SLO. Alternatively, you may be able to make incremental improvements to the service as you progress through the worksheet. Resist the temptation to publish a more ambitious SLO than you can actually support immediately, even if it feels like you should be able to support it.

How is the service monitored?

Lift Wing is monitored via Prometheus+Grafana and Icinga, sharing the monitoring configuration and infrastructure with the rest of WMF.

How complex is the service to troubleshoot?

Lift Wing itself uses Kubernetes (and assorted related services) to host services. It is a fairly complex stack of technologies, and especially routing and TLS can be hard to debug. Fortunately, the wider group of SREs at WMF has experience with operating and debugging Kubernetes, so if the ML team can't quickly solve a problem themselves, help from outside the team may be available.

Issues within the services running on Lift Wing may be debugged and resolved by the ML Team, by the team that provided the service, or by a combination of the two.

Regarding "who is responsible for the debugging and ongoing maintenance of a service running on Lift Wing", the ML Team strives to only onboard services where a clear mutual understanding exists regarding ownership. While the team does runs services it also owns, it has finite resources and thus can not completely maintain many more such services. Of particular concern is also the re-training of models to avoid drift and deterioration of predictions.

How is the service deployed?

Lift Wing services are deployed using Helm charts, similar to other Kubernetes installations at WMF, using the same pipeline as the Wikikube clusters.

Write the service level objectives

The four SLO reporting quarters are:

  • December 1 - February 28 (or 29)
  • March 1 - May 31
  • June 1 - August 31
  • September 1 - November 30

Calculate the realistic targets

Note: We don't have any historical data here, or at least not at a volume that is useful for predicting what is feasible.

As mentioned before, we have four broad classes of services, plus the Lift Wing Kubernetes infrastructure considered separately:

  1. Inference Services derived from ORES models (articlequality, damaging, drafttopic, etc.)
    • Service Availability SLI: 98% of all requests are answered with non-errors.
    • Request Latency SLI, acceptable fraction: <=5 seconds, 98%
  2. Stable and reliable Inference Services (revertrisk language-agnostic, etc.)
    • Service Availability SLI: 98% of all requests are answered with non-errors.
    • Request Latency SLI, acceptable fraction: <=500 milliseconds, 98%
  3. Experimental Inference Services (revertrisk multilingual, outlink, etc.)
    • Service Availability SLI: 95% of all requests are answered with non-errors.
    • Request Latency SLI, acceptable fraction: <=500 (or 5000 depending on the model) milliseconds, 95%
  4. Auxiliary services (ORES Legacy etc.)
    • Service Availability SLI: 95% of all requests are answered with non-errors.
    • Request Latency SLI, acceptable fraction: <=500 milliseconds, 95%
  5. Lift Wing Kubernetes API
    • Service Availability SLI: 99% of all requests are answered with non-errors.
    • Request Latency SLI, acceptable fraction: <=500 milliseconds, 99%

Note that the bare ORES-derived services have a much higher (more lenient) latency SLO than the latency ORES currently delivers. This is due to ORES having a caching layer that hides the actual latency of the models running there. For the services running on Lift Wing, the ORES Legacy service provides this caching functionality (using the same Redis infrastructure that ORES uses for this).
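
To put the availability targets above into error-budget terms: over a ~91-day reporting quarter, and assuming roughly constant traffic, they translate into the following budgets of fully-failed time (a rough, illustrative calculation; since the SLIs are request-based, the budget is really 1-5% of requests, and the time figures only hold under the constant-traffic assumption):

  QUARTER_HOURS = 91 * 24  # one ~91-day reporting quarter

  for name, target in [("99% (Lift Wing Kubernetes API)", 0.99),
                       ("98% (ORES-derived / stable ISes)", 0.98),
                       ("95% (experimental / auxiliary)", 0.95)]:
      budget_hours = (1 - target) * QUARTER_HOURS
      print(f"{name}: {budget_hours:.1f} hours of full outage per quarter")

  # Prints approximately: 21.8 hours (99%), 43.7 hours (98%), 109.2 hours (95%)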

Calculate the ideal targets

TODO: For a busy service and a smooth-running LW, the targets above are too defensive. A decent service running on LW shouldn't really have 5% of its requests fail. With very low traffic numbers there is, of course, a granularity problem, even at the one-quarter scale.

Reconcile the realistic vs ideal targets

Now that you've worked out what SLO targets you'd like to offer, and what targets you can actually support, compare them. If you're lucky, the realistic values are the same or better than the ideal ones: that's great news. Publish the ideal values as your SLO, or choose a value in between. (Resist the urge to set a stricter SLO just because you can; it will constrain your options later.)

If you're less lucky, there's some distance between the SLO you'd like to offer and the one you can support. This is an uncomfortable situation, but it's also a natural one for a network of dependent services establishing their SLOs for the first time. Here, you'll need to make some decisions to close the gap. (Resist, even more strongly, the urge to set a stricter SLO just because you wish you could.)

One approach is to make the same decisions you would make if you already had an SLO and you were violating it. (In some sense, that's effectively the case: your service isn't reliable enough to meet its clients' expectations, you just didn't know it yet.) That means it's time to refocus engineering work onto the kind of projects that will bolster the affected SLIs. Publish an SLO that reflects the promises you can keep right now, but continue to tighten it over time as you complete reliability work.

The other approach is to do engineering work to relax clients' expectations. If they're relying on you for a level of service that you can't provide, there may be a way to make that level of service unnecessary. If your tail latency is high but you have spare capacity, they can use request hedging to avoid the tail. If they can't tolerate your rate of outages in a hard dependency, maybe they can rely on you as a soft dependency by adding a degraded mode.

Despite the use of "you" and "they" in the last couple of paragraphs, this is collaborative work toward a shared goal. The decision of which approach to take doesn't need to be adversarial or defensive.

You should also expect this work to comprise the majority of the effort involved in the SLO process. Where the earlier steps were characterized by documentation and gathering, here your work is directed at improving the practical reality of your software in production.

Regardless of the approach you take to reconciliation, you should publish a currently-realistic SLO, and begin measuring your performance against it, sooner rather than later. You can publish your aspirational targets too (as long as it's clearly marked that you don't currently guarantee to meet them) so that other teams can consider them in their longer-term planning. In the meantime, you'll be able to prioritize work to keep from backsliding on the progress you've already made.

📝 Clearly document any decisions you made during reconciliation. Finally, clearly list the agreed SLOs -- that is, SLIs and associated targets. There should be as many SLOs as the number of SLIs multiplied by the number of request classes -- or, if some request classes are ineligible for any guarantee, say which.

References

  • Jones, Wilkes, and Murphy with Smith, "Service Level Objectives" in Site Reliability Engineering, O'Reilly 2016 (free online)
  • Alex Hidalgo, Implementing Service Level Objectives, O'Reilly 2020 (WMF Tech copy)