Machine Learning/LiftWing/KServe


KServe is a Python framework and K8s infrastructure aimed at standardizing the way people run and deploy HTTP servers that wrap ML models. The Machine Learning team uses it in the Lift Wing K8s cluster to implement the new model serving infrastructure that should replace ORES.

You can check this video (slides) about KServe to get an overview!

How does KServe fit into the Kubernetes picture?

As described above, KServe represents two things:

  • A Python framework to load model binaries and wrap them in a consistent, standard HTTP interface/server.
  • A set of Kubernetes resources and controllers able to deploy the aforementioned HTTP servers.

Before concentrating on Kubernetes, it is wise to learn a bit about how the Python framework works and how to write custom code to serve your model. Once KServe's internals and architecture are understood, it should be relatively easy to start playing with Docker to test a few things. Once that is done, the ML team will take care of helping people add the K8s configuration to deploy the model on Lift Wing.

KServe architecture

KServe uses FastAPI, Uvicorn and asyncio behind the scenes: it assumes that the code that handles/wraps the model is as async as possible (i.e. composed of coroutines rather than blocking code). The idea is to have the following code split:

  • Transformer code, which takes care of handling the client's input and retrieving the necessary features from services/feature stores/etc. It corresponds to a separate Docker image and container.
  • Predictor code, which gets the features via HTTP from the Transformer and passes them to the model, which computes a score. The result is then returned to the client.

By default both the Transformer and the Predictor run an asyncio loop, so any blocking code limits the scalability of the service (since the loop is single-threaded). KServe also offers the possibility of using Ray workers to parallelize models; see what the ML Team tested in https://phabricator.wikimedia.org/T309624.

When writing the code of your model server, keep in mind the following:

  • In Python only one thread can run at any given time due to the GIL. Any CPU-bound code holds the GIL until it finishes, stopping all other threads/executions in the meantime (especially I/O-bound ones, like sending/receiving HTTP requests).
  • Python upstream suggests using multiple processes for heavy CPU-bound code. The ML team is writing libraries to ease the execution of CPU-bound functions/code in separate processes; please reach out to them if you think your code needs it.
  • Any HTTP or similar I/O operation that you do in your code should use a corresponding asyncio library, like aiohttp (libraries like requests and urllib are not async, for example); a small example follows this list. Please reach out to the ML team for more info if you need it.
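
For instance, a hypothetical async helper that queries the MediaWiki API with aiohttp could look like the sketch below (the function name, parameters and query are illustrative, not taken from the inference-services code):

import aiohttp

# Illustrative only: an async helper that fetches revision data from the
# MediaWiki API without blocking the event loop (unlike requests/urllib).
async def get_revision(session: aiohttp.ClientSession, rev_id: int) -> dict:
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": str(rev_id),
        "format": "json",
    }
    async with session.get("https://en.wikipedia.org/w/api.php", params=params) as resp:
        return await resp.json()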

Repositories

The starting point is the inference-services repository, where we keep all the configurations and Python code needed to generate the Docker images that will run on Kubernetes.

New service

If you have a new service that you want the ML Team to deploy on Lift Wing, we suggest that you first build and test your own model server locally using KServe via Docker.

Here is an example (https://github.com/AikoChou/kserve-example/tree/main/alexnet-model) of how to build a model server for image classification using a pre-trained AlexNet model.

Create your model server

In model-server/model.py, the AlexNetModel class extends the kserve.Model base class for creating a custom model server.

The base model class defines a group of handlers:

  • load: loads your model into memory from a local file system or remote model storage.
  • preprocess: pre-processes the raw data or applies custom transformation logic.
  • predict: executes the inference for your model.
  • postprocess: post-processes the prediction result or turns the raw prediction result into a user-friendly inference response.

Based on your needs, you can write custom code for these handlers. Note that the latter three handlers are executed in sequence, meaning that the output of preprocess is passed to predict as input, and the output of predict is passed to postprocess as input. In this AlexNet example, you will see that we only write custom code for the load and predict handlers, so we basically do everything (preprocess, predict, postprocess) in a single predict handler. A minimal sketch of this pattern is shown below.
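
As a reference, a minimal custom model server following this pattern might look like the sketch below (modeled on the AlexNet example and the kserve==0.8.0 API pinned further down; the class name, model loading and scoring logic are purely illustrative):

from typing import Dict

import kserve


class MyModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False

    def load(self):
        # Load the binary from the local file system (e.g. /mnt/models/model.bin)
        # or from remote model storage; the details depend on your model.
        self.model = object()  # placeholder for e.g. torch.load(...) / joblib.load(...)
        self.ready = True

    def predict(self, request: Dict) -> Dict:
        # With no separate Transformer, pre/post-processing can live here as well.
        instances = request["instances"]
        return {"predictions": [self._score(instance) for instance in instances]}

    def _score(self, instance) -> float:
        # Hypothetical helper: run the actual inference for a single input.
        return 0.0


if __name__ == "__main__":
    model = MyModel("my-model")
    model.load()
    kserve.ModelServer().start([model])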

Having a separate Transformer to do pre/post-processing is not mandatory, but it is recommended. For a more complex example with a Transformer and a Predictor, see https://github.com/AikoChou/kserve-example/tree/main/outlink-topic-model; a rough sketch of the Transformer side follows.
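
The sketch below is illustrative only, following the generic upstream transformer pattern rather than the exact outlink code (class name and payload shapes are assumptions):

from typing import Dict

import kserve


class MyTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # The base Model forwards predict() calls to this host over HTTP,
        # so the Transformer only needs pre/post-processing code.
        self.predictor_host = predictor_host

    def preprocess(self, inputs: Dict) -> Dict:
        # Feature engineering: turn the client's request into the predictor's input.
        return {"instances": [inputs]}

    def postprocess(self, outputs: Dict) -> Dict:
        # Turn the raw prediction into a user-friendly response.
        return outputs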

In model-server/requirement.txt, add the dependencies for your service, plus the KServe dependencies below that align with our production environment.

kserve==0.8.0
kubernetes==17.17.0
protobuf==3.19.1
ray==1.9.0

Create a Dockerfile

Docker provides the ability to package and run an application in an isolated environment. If you look at the Dockerfile in the example, you will see that we first specify a base image, "python3-build-buster:0.1.0", from the Wikimedia Docker Registry, to make sure the application can run in our WMF environment. The rest of the steps in the Dockerfile are simple: 1) copy the model-server directory into the container; 2) pip install the necessary dependencies from requirement.txt; 3) define an entry point for the KServe application to run the model.py script.

Deploy locally and Test

To deploy locally with Docker, please follow the instructions to deploy the AlexNet model locally and test it.

To deploy KServe in a local Kubernetes cluster and test all its functionality, you can use minikube by following the instructions in our local development guide.

Multiprocessing with asyncio

Enabling multiprocessing with asyncio can be done by specifying the corresponding variables in the helmfile values. At the moment this is only available for revscoring model servers.

predictor:
  custom_env:
    - name: ASYNCIO_USE_PROCESS_POOL
      value: "True"
    - name: ASYNCIO_AUX_WORKERS
      value: "5"
    - name: PREPROCESS_MP
      value: "True"
    - name: INFERENCE_MP
      value: "False"

Setting PREPROCESS_MP to True/False enables/disables multiprocessing for preprocessing the model's input, while doing the same with INFERENCE_MP toggles multiprocessing for the inference part. A conceptual sketch of what this means is shown below.
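
Conceptually, offloading a CPU-bound preprocess step to a process pool looks like the sketch below (illustrative only: the real implementation lives in the revscoring model servers and the ML team's libraries, and extract_features is a hypothetical stand-in):

import asyncio
from concurrent.futures import ProcessPoolExecutor

# Roughly what ASYNCIO_AUX_WORKERS=5 corresponds to: a pool of auxiliary processes.
process_pool = ProcessPoolExecutor(max_workers=5)


def extract_features(rev_id: int) -> dict:
    # Hypothetical CPU-bound work (e.g. revscoring feature extraction)
    # that would otherwise block the event loop.
    return {"rev_id": rev_id}


async def preprocess(payload: dict) -> dict:
    loop = asyncio.get_running_loop()
    # The event loop stays free to serve other requests while the
    # CPU-bound work runs in a separate process.
    return await loop.run_in_executor(process_pool, extract_features, payload["rev_id"])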

The team has run load-testing experiments for the revscoring model servers which demonstrated a significant improvement in the model server's robustness for the editquality models, while for other model servers enabling multiprocessing did not improve the server's capacity to handle increased load. More information, along with the test results, can be found in the relevant Phabricator task.

Service already present in the inference-services repository

Testing services already present in the inference-services repository locally is possible with Docker, but it requires a little bit of knowledge about how KServe works.

Example 1 - Testing enwiki-goodfaith

Let's imagine that we want to run the enwiki revscoring editquality goodfaith model locally, to test how it works.

Prerequisites

In the inference-services repo, run the following command to build the Docker image:

blubber .pipeline/revscoring/blubber.yaml production | docker build --tag SOME-DOCKER-TAG-THAT-YOU-LIKE --file - .

If you are curious about what Dockerfile gets built, remove the docker build command and inspect the output of Blubber. After the build process is done, we should see a Docker image named after the tag passed to the docker build command. Use the following command to check:

docker image ls

Next, check the model.py file related to editquality (contained in the model-server directory) and familiarize yourself with the __init__() function. All the environment variables retrieved there are usually passed to the container by the Kubernetes settings, so with Docker we'll have to set them explicitly.

Now you can create your specific playground directory under /tmp or somewhere else. The important bit is that you place the model binary file inside it. In this example, let's suppose that we are under /tmp/test-kserve, and that the model binary is stored in a subdirectory called models (so the binary's path is /tmp/test-kserve/models/model.bin). The name of the model file is important: the standard is model.bin (so please rename your binary in case it doesn't match).

Run the following command to start a container:

docker run -p 8080:8080 -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL=https://en.wikipedia.org --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG-THAT-YOU-LIKE

If everything goes fine, you'll see something like:

[I 220725 09:06:00 model_server:150] Registering model: enwiki-goodfaith
[I 220725 09:06:00 model_server:123] Listening on port 8080
[I 220725 09:06:00 model_server:125] Will fork 1 workers
[I 220725 09:06:00 model_server:128] Setting max asyncio worker threads as 8

Now we are ready to test the model server! First, create a file called input.json with the following content:

{ "rev_id": 1097728152 }

Open another terminal and execute:

curl localhost:8080/v1/models/enwiki-goodfaith:predict -i -X POST -d@input.json --header "Content-type: application/json" --header "Accept-Encoding: application/json"

If everything goes fine, you should see some scores in the HTTP response.
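
The same check can also be scripted; here is a minimal client-side sketch (this is just a test script, not model-server code, so the synchronous standard library is fine here):

# quick_test.py - sends the same request as the curl example above.
import json
import urllib.request

payload = json.dumps({"rev_id": 1097728152}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8080/v1/models/enwiki-goodfaith:predict",
    data=payload,
    headers={"Content-type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))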

Example 2 - Testing calling a "fake" EventGate (two containers)

A more complicated example is how to test code that needs to call services (besides the MW API). One example is the testing of https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247

In the above code change, we are trying to add support for EventGate. The new code allows us to create and send specific JSON events via HTTP POST to EventGate, but in our case we don't need to re-create the whole infrastructure locally; a simple HTTP server that echoes the POST content is enough to verify the functionality.

The Docker daemon creates containers in a default network called bridge, that we can use to connect two containers together. The idea is to:

  • Create a KServe container as explained in Example 1.
  • Create an HTTP server in another container using Python.

The latter is simple. Let's create a directory with two files: a small server.py echo script (a sketch of it follows the Dockerfile below) and the following Dockerfile:

FROM python:3-alpine

EXPOSE 6666

RUN mkdir /ws
COPY server.py /ws/server.py

WORKDIR /ws

CMD ["python", "server.py", "6666"]

We can then build and execute the container:

  • docker build . -t simple-http-server
  • docker run --rm -it -p 6666 simple-http-server

Before creating the KServe container, let's check the running container's IP:

  • docker ps (to get the container id)
  • docker inspect #container-id | grep IPAddress (let's assume it is 172.19.0.3)

As you can see in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247, two new variables have been added to __init__: EVENTGATE_URL and EVENTGATE_STREAM. So let's add them to the run command:

docker run -p 8080:8080 -e EVENTGATE_STREAM=test -e EVENTGATE_URL="http://172.19.0.3:6666" -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL=https://en.wikipedia.org --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG

Now you can test the new code via curl, and you should see the HTTP POST sent by the KServe container to the "fake" EventGate simple HTTP server!

Example 3 - Testing outlink-topic-model (two containers)

The KServe architecture highly encourages the use of Transformers for the pre/post-process functions (so basically for feature engineering) and of a Predictor for the model itself. Transformer and Predictor are separate Docker containers, which will also become separate pods in k8s (but we don't need to worry much about this last bit).

This example is a variation of the second one, since it involves spinning up two containers and using the default bridge network to let them communicate with each other. The Transformer can be instructed to contact the Predictor on a certain IP:port combination, to pass it the features collected during the preprocess step.

Let's use the outlink model example (at the moment the only transformer/predictor example in inference-services) to see the steps:

Build the Transformer's Docker image locally:

blubber .pipeline/outlink/transformer.yaml production | docker build --tag outlink-transformer --file - .

Build the Predictor's Docker image locally:

blubber .pipeline/outlink/blubber.yaml production | docker build --tag outlink-predictor --file - .

Download the model from https://analytics.wikimedia.org/published/datasets/one-off/isaacj/articletopic/model_alloutlinks_202012.bin to a temp path (see Example 2). Then start the predictor:

docker run --rm -v `pwd`:/mnt/models outlink-predictor

(note: `pwd` represents the directory that will be mounted in the container; it needs to contain the model binary downloaded above, renamed to 'model.bin')

Run docker ps and docker inspect #container-id to find the IP address of the Predictor's container (see Example 2 for more info).

Run the transformer:

docker run -p 8080:8080 --rm outlink-transformer --predictor_host PREDICTOR_IP:8080 --model_name outlink-topic-model

(note: PREDICTOR_IP needs to be replaced with what you found during the previous step)

Then you can send requests to localhost:8080 via curl or your preferred HTTP client. You'll hit the Transformer first, the features will be retrieved and then sent to the Predictor. The score will be generated and then returned to the client.
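
As an example, a client-side test could look like the sketch below. The URL follows the same v1 predict pattern used in Example 1 together with the model name passed above; the payload fields ("lang" and "page_title") are an assumption for illustration, so check the outlink transformer's preprocess code for the actual schema:

# outlink_test.py - hypothetical client-side request to the Transformer.
import json
import urllib.request

# "lang" and "page_title" are assumed field names, not taken from the real code.
payload = json.dumps({"lang": "en", "page_title": "Douglas Adams"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8080/v1/models/outlink-topic-model:predict",
    data=payload,
    headers={"Content-type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))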

Admin only - Upgrade KServe to a new version

Before starting, please check https://github.com/kserve/kserve/releases and familiarize yourself with the changes in the new version.

Upgrading KServe is a process that involves two macro parts:

  • Upgrade all isvc Docker images in the inference-services repository. This can be done as a first step, in a totally friendly and relaxed rolling upgrade of every kserve package installed by pip.
  • Upgrade the K8s control plane, namely the Go controllers that extend the K8s API to support InferenceService custom resources (and related).

The first step is the easiest but very tedious, since upgrading kserve often means a little bit of Python dependency hell in our isvcs' requirements.txt. Pay particular attention if the Python version needs to be bumped, and if possible couple the upgrade with an Operating System bump (via the Blubber config). The OS bump is not mandatory, but coupling the two is usually nice and easy, unless there is some major reason/breaking change that suggests otherwise.

The second step has some sub-steps:

  • Upgrade the KServe Docker images in the production-images repository.
  • Build and release the new images to the Docker Registry. Please see Kubernetes/Images#Image building for more info (SRE only).
  • Upgrade the kserve Helm chart in the deployment-charts repository. The new config can be retrieved in various places, but the easiest is to look at the files attached to the GitHub release for the new KServe version (like https://github.com/kserve/kserve/releases/tag/v0.12.0, bottom of the page). The new yaml is usually huge, something like 20k lines, and a line-by-line comparison is unbearable for any human being. The procedure that we have used so far is the following:
    • Use an editor that allows placing two yaml files side by side, and that also allows collapsing yaml blocks if required.
    • Compare the current kserve.yaml file in deployment-charts with the new one downloaded from Github.
    • Find all the occurrences of schema in the new yaml file and collapse them (we don't need to customize them, since they are mostly related to webhooks). Now the things to compare are far fewer and more manageable :)
    • Check all occurrences of custom values that we add in deployment-charts' kserve.yaml, for example the ones surrounded by {{ etc.. }}, and replicate them in the new kserve.yaml file.
    • Read the README file in deployment-charts' kserve chart, since it lists a series of customizations that we applied over time.
    • When you are done, copy the new kserve.yaml file over the deployment-charts one, bump the chart's version and check the diff in Gerrit. It will now be pretty easy to compare the old and new versions and spot anything you missed or touched/modified inadvertently.
    • Merge the change once it gets reviewed, and prepare to deploy :)
  • Deploy the new chart to ml-staging-codfw, and check various bits:
    • Deleting an isvc pod should work fine (and the new storage-initializer's image should work).
    • You shouldn't see error logs (or any horrors related to yaml parsing etc.) in the KServe control plane (kserve namespace).
  • Finally deploy to prod!

At this point the task is completed! Note for the reader: in the future we may want to use the upstream kserve chart config, even if I am not 100% sure whether it would simplify the above or not (since we'll have to apply customizations anyway).