Machine Learning/LiftWing/KServe
KServe is a Python framework and K8s infrastructure that aim to standardize the way people run and deploy HTTP servers wrapping ML models. The Machine Learning team uses it in the LiftWing K8s cluster to implement the new model serving infrastructure that should replace ORES.
You can check this video (slides) about KServe to get an overview!
How KServe fits into the Kubernetes picture
As described above, KServe represents two things:
- A Python framework to load model binaries and wrap them around a consistent and standard HTTP interface/server.
- A set of Kubernetes resources and controllers able to deploy the aforementioned HTTP servers.
Before concentrating on Kubernetes, it is wise to learn a bit about how the Python framework works and how to write custom code to serve your model. Once you are familiar with KServe's internals and architecture, it should be relatively easy to start playing with Docker to test a few things. Once that is done, the ML team will help you add the K8s configuration needed to deploy the model on Lift Wing.
KServe architecture
KServe uses FastAPI, Uvicorn and asyncio behind the scenes: it assumes that the code that handles/wraps the model is as async as possible (that is, composed of coroutines rather than blocking code). The idea is to split the code as follows:
- Transformer code, which handles the client's inputs and retrieves the necessary features from services, feature stores, etc. It corresponds to a separate Docker image and container.
- Predictor code, which gets the features via HTTP from the Transformer and passes them to the model, which computes a score. The result is then returned to the client.
By default both the Transformer and the Predictor run an asyncio loop, so any blocking code limits the scalability of the service (since the loop is single-threaded). KServe also offers the possibility to use Ray workers to parallelize models; see what the ML Team tested in https://phabricator.wikimedia.org/T309624.
When writing the code of your model server, keep in mind the following:
- In Python only one thread can run at any given time due to the GIL. Any CPU-bound code holds the GIL until it finishes, stalling all the other threads/executions in the meantime (especially I/O-bound ones, like sending and receiving HTTP requests).
- Python upstream suggests using multiple processes for heavy CPU-bound code. The ML team is writing libraries to ease running CPU-bound functions in separate processes; please reach out to them if you think that your code needs it.
- Any HTTP or similar operation in your code should use a corresponding asyncio library, like aiohttp (libraries like requests and urllib are not async, for example). Please reach out to the ML team if you need more info.
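As an illustration of the second point, here is a minimal sketch of moving CPU-bound work off the asyncio loop with the standard library. It is not the ML team's library: cpu_bound_score and score_in_executor are hypothetical names, and the workload is a placeholder.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor


def cpu_bound_score(n: int) -> int:
    """Placeholder for heavy CPU-bound work (hypothetical workload)."""
    return sum(i * i for i in range(n))


async def score_in_executor(executor: Executor, n: int) -> int:
    """Run the CPU-bound function in the executor so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, cpu_bound_score, n)


async def main() -> None:
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Both calls run in worker processes, so they do not hold the GIL
        # of the process that runs the event loop.
        results = await asyncio.gather(
            score_in_executor(pool, 1_000_000),
            score_in_executor(pool, 2_000_000),
        )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

While the worker processes compute, the event loop remains available for I/O-bound coroutines such as HTTP calls. Arguments and return values are pickled between processes, so this approach fits best when inputs and outputs are small.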
Repositories
The starting point is surely the inference-services repository, where we keep all our configurations and Python code needed to generate the Docker images that will run on Kubernetes.
New service
If you have a new service that you want the ML Team to deploy on Lift Wing, we would suggest you first build and test your own model server using KServe locally via Docker.
The example at https://github.com/AikoChou/kserve-example/tree/main/alexnet-model shows how to build a model server for image classification using a pre-trained Alexnet model.
Create your model server
In model-server/model.py, the AlexNetModel class extends the kserve.Model base class for creating a custom model server.
The base model class defines a group of handlers:
- load: loads your model into memory from a local file system or remote model storage.
- preprocess: pre-processes the raw data or applies customized transformation logic.
- predict: executes the inference for your model.
- postprocess: post-processes the prediction result or turns the raw prediction result into a user-friendly inference response.
Based on your needs, you can write custom code for these handlers. Note that the latter three handlers are executed in sequence, meaning that the output of preprocess is passed to predict as its input, and the output of predict is passed to postprocess as its input. In this Alexnet example we only write custom code for the load and predict handlers, so we basically do everything (preprocess, predict, postprocess) in a single predict handler.
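The handler chaining can be illustrated with a plain-Python sketch. It deliberately does not import kserve: ToyModel and handle are hypothetical stand-ins that only mirror the handler names, and the "model" is a made-up weight, so treat it as a picture of the data flow rather than as the KServe API.

```python
import asyncio
from typing import Dict


class ToyModel:
    """Plain-Python stand-in mirroring the handler names; NOT kserve.Model."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ready = False
        self.weights: Dict[str, float] = {}

    def load(self) -> None:
        # A real server would read a model binary from disk or remote storage.
        self.weights = {"length": 0.01}
        self.ready = True

    async def preprocess(self, request: Dict) -> Dict:
        # Turn the raw client input into features.
        return {"length": len(request["text"])}

    async def predict(self, features: Dict) -> Dict:
        # Receives exactly what preprocess returned.
        score = sum(self.weights[k] * v for k, v in features.items())
        return {"score": score}

    async def postprocess(self, prediction: Dict) -> Dict:
        # Receives exactly what predict returned and shapes the response.
        return {"model": self.name, "score": round(prediction["score"], 2)}


async def handle(model: ToyModel, request: Dict) -> Dict:
    """Chain the three handlers in the order described above."""
    features = await model.preprocess(request)
    prediction = await model.predict(features)
    return await model.postprocess(prediction)


if __name__ == "__main__":
    model = ToyModel("toy")
    model.load()
    print(asyncio.run(handle(model, {"text": "hello world"})))
```

Collapsing everything into predict, as the Alexnet example does, corresponds to leaving preprocess and postprocess as pass-through functions.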
Having a separate Transformer to do pre/post-processing is not mandatory, but it is recommended. For a more complex example with a transformer and a predictor, see https://github.com/AikoChou/kserve-example/tree/main/outlink-topic-model.
In model-server/requirement.txt, add the dependencies for your service, plus the KServe dependencies below, which match our production environment:
kserve==0.8.0
kubernetes==17.17.0
protobuf==3.19.1
ray==1.9.0
Create a Dockerfile
Docker provides the ability to package and run an application in an isolated environment. If you look at the Dockerfile in the example, you will see that we first specify a base image, "python3-build-buster:0.1.0", from the Wikimedia Docker Registry, to make sure the application can run in our WMF environment. The rest of the steps in the Dockerfile are simple: 1) copy the model-server directory to the container; 2) pip install the necessary dependencies listed in requirement.txt; 3) define an entry point for the KServe application that runs the model.py script.
Deploy locally and Test
To deploy locally with Docker, please follow the instructions to deploy the Alexnet model locally and test it.
To deploy KServe in a local Kubernetes cluster and test all its functionality, you can use minikube, following the instructions in our local development guide.
Multiprocessing with asyncio
Enabling multiprocessing with asyncio can be done by specifying the corresponding variables in the helmfile values. At the moment this is only available for revscoring model servers.
predictor:
  custom_env:
    - name: ASYNCIO_USE_PROCESS_POOL
      value: "True"
    - name: ASYNCIO_AUX_WORKERS
      value: "5"
    - name: PREPROCESS_MP
      value: "True"
    - name: INFERENCE_MP
      value: "False"
Setting PREPROCESS_MP to True/False enables/disables multiprocessing for preprocessing the model's input, while doing the same with INFERENCE_MP toggles multiprocessing for the inference part.
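Environment variables always reach the process as strings, which is why the values above are quoted. The sketch below shows one way a model server could turn them into typed settings. How the revscoring model servers actually parse these variables is not shown here, so env_flag, read_mp_settings and the defaults are assumptions for illustration.

```python
import os
from typing import Dict, Mapping, Union


def env_flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret 'True'/'False' style strings (case-insensitive) as booleans."""
    raw = env.get(name)
    if raw is None:
        return default  # assumed default, not taken from the real servers
    return raw.strip().lower() == "true"


def read_mp_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Union[bool, int]]:
    """Collect the multiprocessing-related variables into one dictionary."""
    return {
        "use_process_pool": env_flag(env, "ASYNCIO_USE_PROCESS_POOL"),
        "aux_workers": int(env.get("ASYNCIO_AUX_WORKERS", "1")),  # assumed default
        "preprocess_mp": env_flag(env, "PREPROCESS_MP"),
        "inference_mp": env_flag(env, "INFERENCE_MP"),
    }


if __name__ == "__main__":
    print(read_mp_settings())
```

Passing the mapping as a parameter makes the parsing easy to exercise locally with a plain dictionary before setting the values in the helmfile.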
The team ran load tests on the revscoring model servers. They showed a significant improvement in robustness for the editquality models, while for the other model servers enabling multiprocessing did not improve the capacity to handle increased load. More information, along with test results, can be found on the relevant Phabricator task.
Service already present in the inference-services repository
Testing services already present in the inference-services repository locally is possible with Docker, but it requires a little knowledge of how KServe works.
Example 1 - Testing enwiki-goodfaith
Let's imagine that we want to run the enwiki revscoring editquality goodfaith model locally, to test how it works.
Prerequisites
- Clone the inference-services repository
- Download Blubber
- Download the model binary (in our case, they are available in https://github.com/wikimedia/editquality/tree/master/models).
In the inference-services repo, run the following command to build the Docker image:
blubber .pipeline/revscoring/blubber.yaml production | docker build --tag SOME-DOCKER-TAG-THAT-YOU-LIKE --file - .
If you are curious about what Dockerfile gets built, remove the docker build command and see the output of Blubber. After the build process is done, we should see a Docker image named after the tag added to the docker build command. Use the following command to check:
docker image ls
Next, check the model.py file related to editquality (contained in the model-server directory) and familiarize yourself with the __init__() function. All the environment variables retrieved there are usually passed to the container via Kubernetes settings, so with Docker we have to set them explicitly.
Now you can create your specific playground directory under /tmp or somewhere else. The important bit is that you place the model binary file inside it. In this example, let's suppose that we are under /tmp/test-kserve, and that the model binary is stored in a subdirectory called models (so the binary's path is /tmp/test-kserve/models/model.bin). The name of the model file is important: the standard is model.bin (so please rename your binary if it doesn't match).
Run the following command to start a container:
docker run -p 8080:8080 -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL=https://en.wikipedia.org --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG-THAT-YOU-LIKE
If everything goes fine, you'll see something like:
[I 220725 09:06:00 model_server:150] Registering model: enwiki-goodfaith
[I 220725 09:06:00 model_server:123] Listening on port 8080
[I 220725 09:06:00 model_server:125] Will fork 1 workers
[I 220725 09:06:00 model_server:128] Setting max asyncio worker threads as as 8
Now we are ready to test the model server!
First, create a file called input.json with the following content:
{ "rev_id": 1097728152 }
Open another terminal and execute:
curl localhost:8080/v1/models/enwiki-goodfaith:predict -i -X POST -d@input.json --header "Content-type: application/json" --header "Accept-Encoding: application/json"
If everything goes fine, you should see some scores in the HTTP response.
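The same request can also be sent from a small Python test script. This is a sketch using only the standard library: build_predict_request and predict are hypothetical helper names, and the blocking urllib call is acceptable here only because this is a client-side test script, not model server code.

```python
import json
import urllib.request


def build_predict_request(
    rev_id: int,
    model_name: str = "enwiki-goodfaith",
    host: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    url = f"{host}/v1/models/{model_name}:predict"
    body = json.dumps({"rev_id": rev_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rev_id: int) -> dict:
    """Send the request to the local container and decode the JSON response."""
    with urllib.request.urlopen(build_predict_request(rev_id), timeout=30) as resp:
        return json.loads(resp.read())
```

With the container from this example running, calling predict(1097728152) sends the same payload as input.json.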
Example 2 - Testing calling a "fake" EventGate (two containers)
A more complicated example is how to test code that needs to call services (besides the MW API). One example is the testing of https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247
In the above code change, we are trying to add support for EventGate. The new code would allow us to create and send specific JSON events via HTTP POSTs to EventGate, but in our case we don't need to re-create the whole infrastructure locally; a simple HTTP server to echo the POST content is enough to verify the functionality.
The Docker daemon creates containers in a default network called bridge, that we can use to connect two containers together. The idea is to:
- Create a KServe container as explained in Example 1.
- Create an HTTP server in another container using Python.
The latter is simple. Let's create a directory with two files:
- server.py, containing something like https://gist.github.com/mdonkers/63e115cc0c79b4f6b8b3a6b797e485c7 (there are countless examples of using Python's HTTP server to log GET and POST requests received).
- Dockerfile, containing something like the following:
FROM python:3-alpine
EXPOSE 6666
RUN mkdir /ws
COPY server.py /ws/server.py
WORKDIR /ws
CMD ["python", "server.py", "6666"]
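For server.py, a minimal sketch along the lines of the linked gist could look like the following. It is not the gist's code: EchoHandler and main are names chosen here, and it only handles POST, which is what is needed to see the events sent by the KServe container.

```python
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Log every POST body and send it back to the caller."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Print the received payload so it shows up in the `docker run` output.
        print(f"POST {self.path}\n{body.decode('utf-8', errors='replace')}", flush=True)
        self.send_response(200)
        self.send_header("Content-Type", self.headers.get("Content-Type", "text/plain"))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main(argv) -> None:
    if len(argv) != 2:
        print("usage: python server.py PORT")
        return
    # Bind to all interfaces so other containers on the bridge network can connect.
    server = HTTPServer(("0.0.0.0", int(argv[1])), EchoHandler)
    server.serve_forever()


if __name__ == "__main__":
    main(sys.argv)
```

The port is read from the command line, which matches the CMD line of the Dockerfile above.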
We can then build and execute the container:
docker build . -t simple-http-server
docker run --rm -it -p 6666 simple-http-server
Before creating the KServe container, let's check the running container's IP:
- docker ps (to get the container id)
- docker inspect #container-id | grep IPAddress (let's assume it is 172.19.0.3)
As you can see in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247, two new variables have been added to __init__: EVENTGATE_URL and EVENTGATE_STREAM. So let's add them to the run command:
docker run -p 8080:8080 -e EVENTGATE_STREAM=test -e EVENTGATE_URL="http://172.19.0.3:6666" -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL=https://en.wikipedia.org --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG
Now you can test the new code via curl, and you should see the HTTP POST sent by the KServe container to the "fake" EventGate simple HTTP server!
Example 3 - Testing outlink-topic-model (two containers)
The KServe architecture highly encourages the use of Transformers for the pre/post-process functions (so basically for feature engineering) and to use a Predictor for the models. Transformer and Predictor are separate Docker containers, that will also become separate pods in k8s (but we don't need to worry a lot about this last bit).
This example is a variation of the second one, since it involves spinning up two containers and using the default bridge network to let them communicate with each other. The Transformer can be instructed to contact the Predictor on a certain IP:port combination, to pass it the features collected during the preprocess step.
Let's use the outlink model example (at the moment the only transformer/predictor example in inference-services) to see the steps:
Build the Transformer's Docker image locally:
blubber .pipeline/outlink/transformer.yaml production | docker build --tag outlink-transformer --file - .
Build the Predictor's Docker image locally:
blubber .pipeline/outlink/blubber.yaml production | docker build --tag outlink-predictor --file - .
Download the model from https://analytics.wikimedia.org/published/datasets/one-off/isaacj/articletopic/model_alloutlinks_202012.bin into a temp path (see Example 1).
Start the predictor:
docker run --rm -v `pwd`:/mnt/models outlink-predictor
(note: `pwd` represents the directory that will be mounted in the container, it needs to have the model binary downloaded above and renamed 'model.bin')
Run docker ps and docker inspect #container-id to find the IP address of the Predictor's container (see Example 2 for more info).
Run the transformer:
docker run -p 8080:8080 --rm outlink-transformer --predictor_host PREDICTOR_IP:8080 --model_name outlink-topic-model
(note: PREDICTOR_IP needs to be replaced with what you found during the previous step)
Then you can send requests to localhost:8080 via curl or your preferred HTTP client. You'll hit the Transformer first; the features will be retrieved and then sent to the Predictor. The score will be generated and then returned to the client.
KServe Inference Batcher
The KServe Inference Batcher is a component that groups multiple incoming inference requests into a single batch before sending them to the model server. This mechanism improves overall throughput and resource utilisation by allowing models to process several inputs simultaneously instead of handling each request individually.
When batching is enabled, an additional model agent sidecar container is injected into the InferenceService pod. Incoming requests are first received by this sidecar, which temporarily stores them until a batch is formed and then forwards the batched request to the predictor container for inference. A batch is triggered either when the maximum batch size is reached or when the maximum waiting time expires.
The POST payload for batch requests looks something like this, where each request may contain one or more instances. Instances are the separate data samples on which we wish to run inference.
{
  "instances": [
    {
      "rev_id": 1234,
      "lang": "en"
    },
    {
      "rev_id": 2345,
      "lang": "de"
    }
  ]
}
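A payload of this shape can be assembled programmatically. The sketch below assumes the instance fields shown above (rev_id and lang); build_batch_payload is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, List, Tuple


def build_batch_payload(revisions: Iterable[Tuple[int, str]]) -> str:
    """Serialize (rev_id, lang) pairs as one request with several instances."""
    instances: List[Dict] = [
        {"rev_id": rev_id, "lang": lang} for rev_id, lang in revisions
    ]
    return json.dumps({"instances": instances})


if __name__ == "__main__":
    print(build_batch_payload([(1234, "en"), (2345, "de")]))
```

One HTTP request built this way carries two instances, so it counts twice towards the batch size.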
Batching behavior is primarily controlled by two parameters, and inference is triggered when either of the two conditions is met:
- maxBatchSize – the number of instances that are sent. When this number is reached, a batch computation is triggered.
- maxLatency – the maximum time (in milliseconds) the batcher waits/aggregates before sending a batch, even if the batch size is not reached.
This mechanism allows operators to trade off latency and throughput depending on workload characteristics.
Why use the Batcher in LiftWing
Batching can be useful for LiftWing workloads where:
- Individual inference requests are relatively small.
- Many requests arrive within a short time window.
- The underlying model benefits from vectorized or batched computation (e.g., GPU inference or optimized tensor operations).
By aggregating requests, the batcher can reduce per-request overhead and improve overall throughput of the model service. This may be particularly beneficial when serving high-traffic endpoints where small latency increases are acceptable in exchange for higher efficiency.
Example configuration
Batching is enabled directly in the InferenceService specification under the predictor field in the deployment-charts repository:
inference_services:
  model-server-name:
    predictor:
      batcher:
        maxBatchSize: 32
        maxLatency: 500
In this example:
- Up to 32 requests will be grouped together. In reality the number of requests can be smaller if some of the requests contain more than one instance.
- If the batch size is not reached, the batcher will aggregate requests inside a 500 ms time window.
Scenarios based on the above configuration
This section presents several real-world scenarios to help better understand the behavior of the KServe batcher.
The first 3 scenarios assume that each request contains 1 data instance, while the last one demonstrates the batcher's behavior when requests have more than 1 instance.
Scenario 1: High request rate (batch size reached quickly)
If requests arrive very quickly, the batcher will trigger inference as soon as 32 instances are collected, without waiting for the latency timeout.
Example timeline
- t = 0 ms → request 1 arrives
- t = 5 ms → request 10 arrives
- t = 20 ms → request 32 arrives
At this point:
- The batch size limit (32) is reached.
- The batcher immediately forwards the batch to the predictor container.
Result
- 32 requests are processed together.
- Total wait time for the earliest request: ~20 ms.
- maxLatency is not reached.
This scenario typically occurs with high-traffic inference endpoints.
Scenario 2: Moderate request rate (latency limit reached)
If requests arrive more slowly, the batch may not reach 32 instances before the latency threshold expires.
Example timeline
- t = 0 ms → request 1 arrives
- t = 120 ms → request 2 arrives
- t = 240 ms → request 3 arrives
- t = 410 ms → request 4 arrives
- t = 500 ms → latency threshold reached
At t = 500 ms:
- Only 4 requests are available with 1 instance each.
- The batcher triggers inference because maxLatency is reached.
Result
- A batch of 4 requests is sent to the predictor.
- The earliest request waited 500 ms.
This scenario balances batching with bounded waiting time.
Scenario 3: Very low request rate (single-request batch)
If requests arrive infrequently, the batcher may send very small batches, sometimes even a single request.
Example timeline
- t = 0 ms → request 1 arrives
- t = 500 ms → latency threshold reached
At this point:
- Only 1 instance is in the queue.
- The batcher sends a batch of size 1 to the model.
Result
- No batching benefit occurs.
- The request still experiences the maximum wait time (500 ms).
This scenario may happen for low-traffic models, where batching might not provide significant performance improvements.
Scenario 4: Requests with multiple instances
If some requests contain multiple instances, the batch actually sent for computation can be considerably larger than maxBatchSize.
Example timeline
- t = 0 ms → request 1 arrives with 16 instances
- t = 120 ms → request 2 arrives with 32 instances
So at t = 120 ms:
- When the second request arrives, the total number of queued instances is 48, which exceeds maxBatchSize (32), so the batch is triggered.
Result
- A batch of 48 instances is sent to the predictor.
This scenario illustrates the relationship between requests and instances and shows that, in practice, batches can contain an arbitrary number of instances.
Summary
The batcher sends requests to the model when either condition is met:
- The number of queued instances reaches maxBatchSize (32).
- The oldest request has waited maxLatency (500 ms).
This design ensures that batching improves throughput when traffic is high while preventing requests from waiting indefinitely when traffic is low.
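The two triggers can be sketched as a small simulation that reproduces the four scenarios above. This is a simplified model for building intuition, not KServe's actual batcher implementation: Request and simulate_batches are hypothetical names, and the sketch assumes the latency window starts when the first request of a batch is queued.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Request:
    arrival_ms: int
    instances: int = 1


def simulate_batches(
    requests: Iterable[Request],
    max_batch_size: int = 32,
    max_latency_ms: int = 500,
) -> List[Tuple[int, int]]:
    """Return one (dispatch_time_ms, total_instances) pair per batch."""
    batches: List[Tuple[int, int]] = []
    window_start = None  # arrival time of the oldest queued request
    queued = 0           # instances currently waiting
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Flush an open batch whose latency window expired before this arrival.
        if window_start is not None and req.arrival_ms >= window_start + max_latency_ms:
            batches.append((window_start + max_latency_ms, queued))
            window_start, queued = None, 0
        if window_start is None:
            window_start = req.arrival_ms
        queued += req.instances
        # A whole request is queued at once, so a batch can exceed the limit.
        if queued >= max_batch_size:
            batches.append((req.arrival_ms, queued))
            window_start, queued = None, 0
    if window_start is not None:
        # Whatever is left is sent when its latency window expires.
        batches.append((window_start + max_latency_ms, queued))
    return batches


if __name__ == "__main__":
    high_rate = [Request(i * 20 // 31) for i in range(32)]
    print(simulate_batches(high_rate))                                  # scenario 1
    print(simulate_batches([Request(t) for t in (0, 120, 240, 410)]))   # scenario 2
    print(simulate_batches([Request(0)]))                               # scenario 3
    print(simulate_batches([Request(0, 16), Request(120, 32)]))         # scenario 4
```

Changing max_batch_size and max_latency_ms in the calls shows how the trade-off between latency and throughput shifts for a given arrival pattern.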