Machine Learning/LiftWing/Streams
If you want to call a specific Lift Wing model server every time an event is posted to an event stream in Kafka, we suggest using our ChangeProp rules defined for Lift Wing. ChangeProp can be configured to listen to a Kafka topic and call Lift Wing to generate a score. In turn, Lift Wing can be configured to post an event containing the score to EventGate, which finally enqueues it to a Kafka topic.
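The per-event flow can be sketched as a couple of pure functions (all names and field layouts here are illustrative assumptions, not the actual ChangeProp or Lift Wing code):

```python
# Illustrative sketch of the pipeline: ChangeProp consumes a source event and
# calls the model server; Lift Wing wraps the model output in an event for
# EventGate, which enqueues it to the output Kafka topic.
# Function names and fields are hypothetical.

def build_predict_request(source_event):
    # ChangeProp forwards the revision id from the source event to Lift Wing.
    return {"rev_id": source_event["rev_id"]}

def build_score_event(source_event, prediction, stream):
    # Lift Wing posts an event shaped roughly like this to EventGate.
    return {
        "meta": {"stream": stream},
        "rev_id": source_event["rev_id"],
        "prediction": prediction,
    }

event = {"rev_id": 12345, "database": "enwiki"}
out = build_score_event(event, {"label": "History"}, "mediawiki.revision_score_example")
```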
As of December 2023, we have two model servers configured using ChangeProp. These servers publish events from Lift Wing to EventGate:
model server | source event stream | output event stream |
---|---|---|
revscoring-drafttopic | | mediawiki.revision_score_drafttopic (schema) |
outlink-topic-model | mediawiki.page_change.v1 | mediawiki.page_outlink_topic_prediction_change.v1 (schema) |
The requirements for you are the following:
- A model server needs to be deployed to Lift Wing, and it must have passed basic sanity checks from the ML team (namely, it needs to be able to sustain a decent traffic level without crashing, etc.).
- Decide what the source event stream is. For example, ORES has always been configured to score every rev-id registered in mediawiki.revision-create, but you may need a different source.
- Decide whether you need to filter the traffic in the stream. For example, say that your model on Lift Wing supports only enwiki and itwiki: you can specify this in the task for the ML team (more on this later on).
- Decide the schema of the event that will be generated by Lift Wing and posted to EventGate. For example, all the ORES scores use the mediawiki.revision-score schema. We also have the mediawiki.page_prediction_classification_change schema to represent a classification model output (topic, revert, quality, etc.). If you need a different one, you'll have to work with Data Engineering to create and deploy it. In that case, please also inform the ML team, since we'll need to add the necessary code to your model server to support the use case.
- Your new event stream will contain the events generated by a specific model server, enqueued in a Kafka topic. We have some conventions about stream naming, and a mediawiki-config deployment is needed to declare the stream in stream configuration. (When using eventgate-main, it will also need a deployment.) We'll follow up with you in the task about this, don't worry!
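To make the schema decision more concrete, an event posted by a model server to EventGate has roughly the envelope below. This is an illustrative sketch only: the `$schema` URI, version, and fields are assumptions to check against the actual schema repository with Data Engineering.

```python
# Rough shape of a score event (illustrative; the $schema URI and field names
# below are assumptions, not the authoritative schema definition).
score_event = {
    "$schema": "/mediawiki/revision/score/3.0.0",  # hypothetical schema URI/version
    "meta": {
        "stream": "mediawiki.revision_score_drafttopic",  # stream name, see table above
        "domain": "en.wikipedia.org",
    },
    "rev_id": 123456,
    "prediction": ["History"],
}
```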
After reading the above, you can create a task for the ML team with what you have decided; we'll take it from there and work with you to implement the new stream!
Streams (Admins only, Machine Learning team)
Once a task has been created with the above information, we need to do the following steps:
- If the stream uses a new schema or version (and eventgate-main), follow these instructions to update the schema repositories in the eventgate-wikimedia Docker image.
- Create a mediawiki-config change like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/884155, and add members of ML and DE to it. When you have +1s and you are happy to go, don't merge it but schedule a deployment in Deployments (somebody will help you with the deployment; no need to be able to perform a full MediaWiki deployment). Once the deployment is done, you should see the new stream listed in https://meta.wikimedia.org/w/api.php?action=streamconfigs&all_settings.
- Ask any SRE member of DE/ML/ServiceOps to roll-restart EventGate Main's pods (in both DCs). The reason is explained in Event Platform/EventGate/Administration#EventStreamConfig change.
- Verify that the target event chosen by the requestor is supported by Lift Wing. For example, at the time of writing (Dec 2023) we support mediawiki.revision-create and mediawiki.page_change; any other request will need a code change in the events.py module in the inference-services repo.
- Verify that the model server has been deployed to the Lift Wing staging cluster and that it is working correctly (more specifically, that it supports sending events to EventGate).
- File a code change with the Change-Prop staging config (look for "liftwing"). The testing setup is a little different from production; see the example in the dedicated section below. After the change has been merged, ask an SRE to deploy it to Change-Prop's staging cluster (ServiceOps' staging cluster, basically).
- When you are happy with the overall workflow, you are ready to deploy to prod! Please modify Change-Prop's production config like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886918. Please make sure that the target model server is configured with autoscaling (see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886917) if needed (for example, if the model server sustains only a few requests per second, more pods are likely needed to handle the extra traffic generated by Change-Prop). If you are unsure, ask an SRE :)
- Deploy Change-Prop to production (ask an SRE); after this you should start seeing traffic hitting the model server and events generated to the target Kafka topic!
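The streamconfigs verification step above amounts to checking that the new stream name appears in the `streams` object of the API's JSON response. A minimal sketch of that check (the sample payload below is made up, not real API output):

```python
import json

# Hedged sketch: the streamconfigs API response contains a "streams" object
# keyed by stream name. The payload here is a made-up sample for illustration.
sample_response = json.loads("""
{
  "streams": {
    "mediawiki.page_outlink_topic_prediction_change.v1": {
      "stream": "mediawiki.page_outlink_topic_prediction_change.v1"
    }
  }
}
""")

def stream_is_declared(response, stream_name):
    # The new stream should show up as a key after the mediawiki-config deploy.
    return stream_name in response.get("streams", {})
```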
Staging configuration and testing
```yaml
liftwing:
  uri: 'https://inference-staging.svc.codfw.wmnet:30443'
  models:
    goodfaith:
      concurrency: 2
      match_config:
        database: '/^(en|zh)wiki$/'
      namespace: revscoring-editquality-goodfaith
      kafka_topic: 'liftwing.test-events'
```
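As a rough illustration of what the match_config above expresses (ChangeProp's actual matching logic lives in its own codebase; this only demonstrates the regex semantics):

```python
import re

# The '/^(en|zh)wiki$/' pattern above means: only events whose "database"
# field is exactly "enwiki" or "zhwiki" are forwarded to the model server.
DATABASE_RE = re.compile(r"^(en|zh)wiki$")

def matches(event):
    # Events without a "database" field, or with a non-matching one, are skipped.
    return bool(DATABASE_RE.match(event.get("database", "")))
```

For example, `matches({"database": "enwiki"})` is `True`, while `matches({"database": "dewiki"})` is `False`.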
The Change-Prop staging config is a little different from production, since we don't want a continuous stream of events to evaluate, just a few, to check that the whole pipeline works. In this case:
- The Kafka topic to listen to for events is liftwing.test-events in the Kafka Main eqiad cluster (Change-Prop's staging config is configured only with the eqiad cluster as of now, Feb 2023). This topic mimics, in this case, mediawiki.revision-create, so we can send any revision-create events to it and use them to test Change-Prop. The main benefit of this setup is that we don't trigger any other Change-Prop config/rule (since many of the workflows use mediawiki.revision-create as well), only the Lift Wing ones.
- Find the Kafka topic that represents your source of data. For example, in our case we have mediawiki.revision-create, and the corresponding topic is eqiad.mediawiki.revision-create (ask an SRE to help you).
- From a stat100x node, collect an event from mediawiki.revision-create into a file called test.json using kafkacat:

```
kafkacat -t eqiad.mediawiki.revision-create -b kafka-main1001.eqiad.wmnet:9093 -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt -o latest -c 1 > test.json
```
- Verify that the match_config rule specified in the configuration highlighted above works. In the above example we are matching a field in the source event called "database" against the regex that follows.
- Then send the event to staging.liftwing.test-events (from a stat100x node):

```
cat test.json | kafkacat -P -t staging.liftwing.test-events -b kafka-main1001.eqiad.wmnet:9093 -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt
```
- You should now see in the model-server's logs on Lift Wing staging an access log entry with the new request logged.
- Last but not least, verify on the target Kafka topic that the event has been posted correctly (you can use kafkacat -C as described above from a stat100x node).
- If you get a validation error from EventGate, check Logstash for more info.
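A common cause of EventGate validation errors is a missing $schema or meta.stream field in the event you produced. A quick local sanity check on your test event before producing it (illustrative only; this is not EventGate's actual validation, which checks the full JSONSchema):

```python
# Minimal envelope check: EventGate needs at least a $schema URI and a
# meta.stream value to route and validate an event. This only covers those
# basics, not the full schema validation EventGate performs.

def missing_envelope_fields(event):
    missing = []
    if "$schema" not in event:
        missing.append("$schema")
    if "stream" not in event.get("meta", {}):
        missing.append("meta.stream")
    return missing
```

For example, `missing_envelope_fields({"meta": {"dt": "2023-12-01T00:00:00Z"}})` reports both missing fields, pointing at the likely validation failure.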
Current settings to publish events from Lift Wing staging to EventGate:
model server | source Kafka topic | target Kafka topic |
---|---|---|
revscoring-drafttopic | liftwing.test-events | mediawiki.revision-score-test |
outlink-topic-model | liftwing.test-outlink-events | mediawiki.page_prediction_change.rc0 (T349919) |