Machine Learning/Technical Meeting Notes
2022-02-23
Feast
- https://phabricator.wikimedia.org/T294434
- Andy: what do we want to learn? what would be a good demo for online feature store?
- Just load a bunch of revscoring features in?
- What kind of storage makes sense for us? Swift, parquet, sql?
- Luca: How do we want to structure the 3+3 nodes in eqiad & codfw?
- we have codfw, eqiad still needs to be racked
- score cache could be a separate Redis instance
- Tobias: having separate instances would allow us to tune each instance.
- Luca: Feast wants a single Redis endpoint in its host config; if we have multiple nodes, we may need a proxy in the middle.
- Let's figure out how we save the registry, and how to handle the single Redis endpoint.
- Follow-up: how do we load data into Feast (Airflow)? How much space do we need? etc. (See the sketch below.)
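- A minimal sketch of what loading and serving could look like with the Feast Python SDK, assuming a feature_store.yaml that points the online store at the Redis endpoint discussed above; the repo path, feature view, feature names, and entity key below are hypothetical:
```python
from datetime import datetime

from feast import FeatureStore

# Hypothetical feature repo; feature_store.yaml is assumed to point the
# online store at the single Redis endpoint (or proxy) discussed above.
store = FeatureStore(repo_path=".")

# Loading: an Airflow task could run this periodically to copy the latest
# offline feature values into the Redis online store.
store.materialize_incremental(end_date=datetime.utcnow())

# Serving: fetch precomputed revscoring-style features for a revision.
# "revision_features", its fields, and "rev_id" are placeholder names.
features = store.get_online_features(
    features=["revision_features:num_refs", "revision_features:num_words"],
    entity_rows=[{"rev_id": 12345}],
).to_dict()
print(features)
```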
ORES deploy
- https://phabricator.wikimedia.org/T300195
- Andy: was there an issue on beta?
- Luca: There was a local proxy issue that Taavi fixed; not sure if the deployment-prep VM is fixed.
- Nothing is burning, no weird errors, things seem to work.
- Might be nice for learning if Aiko wants to work on the task w/ Luca
- Aiko: I would like to learn more about the difference between ORES & Lift Wing
Changeprop calling ORES
- https://phabricator.wikimedia.org/T301409
- Luca: we could post to eventgate in our model.py for Lift Wing
- Tobias: could we do this in Istio on our side? A bit like request logging, right?
- Luca: it's just a simple POST request, so we could do it in the Python code; we could maybe try Knative Eventing, but we are using an older version of Knative.
- Tobias: agreed, doing it w/ a library or wrapper makes a lot of sense; the only tricky bit is that you don't want to delay the call.
- Andy: does consistency matter? Can we just fire off a POST request via asyncio and then return our prediction to the user? (See the sketch below.)
- Luca: that should be fine, if some are missing it's not super problematic.
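- A minimal sketch of the fire-and-forget idea, assuming aiohttp is available in the model server; the eventgate URL and event payload below are placeholders, not the real stream config:
```python
import asyncio
import logging

import aiohttp

# Placeholder endpoint; the real eventgate URL / stream config is not shown here.
EVENTGATE_URL = "https://eventgate.example.org/v1/events"

async def send_score_event(event: dict) -> None:
    # Best-effort POST: failures are logged, never raised, so a dropped
    # event does not affect the prediction returned to the user.
    try:
        timeout = aiohttp.ClientTimeout(total=5)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            await session.post(EVENTGATE_URL, json=event)
    except Exception:
        logging.exception("eventgate POST failed")

async def predict(request: dict) -> dict:
    prediction = {"score": 0.97}  # placeholder for the real model output
    # Fire and forget: schedule the POST on the event loop without awaiting
    # it, so the response to the caller is not delayed.
    asyncio.create_task(
        send_score_event({"rev_id": request.get("rev_id"), "prediction": prediction})
    )
    return prediction
```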
Lift Wing migration
- https://phabricator.wikimedia.org/T301409
- Luca: moving models will take space on the cluster with the current cgroup configs
- was hoping the new system would not need as much space
- Tobias: Lift Wing is not homebrew, which is an advantage.
- Andy: the revscoring images are pretty big, also they include a ton of assets
- other models won't necessarily be like this
- Luca: we are halfway through migrating ORES images and cpu/memory is filling up.
- maybe fine with 8 nodes?
2022-02-16
ORES deploy
- Luca: we are unblocked, recent patches are now running on beta
API Gateway platform
- Chris: I think there is now a big push to get it in a good place, which is awesome for us
- Luca: We should connect and start making sure everything works as expected
- header/path mapping, etc.
editquality migration
- Luca: we may need to change Swift clusters to MOSS
- the paths should stay the same
eventgate scores
- https://phabricator.wikimedia.org/T301878
- Luca: we can just include a parameter that will make an HTTP POST to eventgate
2022-02-09
API Gateway
- https://phabricator.wikimedia.org/T288789
- Are we blocked?
- Chris: short-term, Hugh & us will unblock; unsure of the long-term status of the project
- Tobias: will let you know the outcome of the upcoming meeting; all our asks could turn out to be no big issue.
Transformers (again)
- https://phabricator.wikimedia.org/T294419#7688032
- Images are big (transformer + predictor)
- also transformer + predictor both need to mount/load model into pod from storage
- editquality will have 30+ isvcs running two large images
- Chris: Do we want to use transformers on future models? Yes. The ORES models are a special case.
- Kevin: the transformers seemed fine until we needed to load the models into the separate pods, now it seems really heavy.
- Andy: my one argument for keeping the heavy transformers was that we could use it in an explainer, but that does not seem to work (maybe a kserve bug?)
- Tobias: Forcing transformers architecture on revscoring models may not help us gain much, other than keeping it alive longer.
- Luca: for editquality it might make sense to go back to predictor
- the MW transformers are not async either, so there is a bottleneck.
- Kevin: Also, regarding maintainability, we are currently loading models + revscoring and all its dependencies in both the transformer and predictor. Loading them in only the predictor is much more maintainable. It's more DRY.
ORES migration
- Chris: We should get all the models onto Lift Wing before ORES dies (hardware out of warranty, stretch is ending LTS in a few months, etc.) and then we can improve the models.
- Luca: Looking at traffic in ores1001 - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=ores1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores
- Lots of hosts doing nothing
- Tobias: I wish we could easily sample say 5% of current ORES-bound traffic and see what happens to it on LW
- Luca: We could use changeprop and see how it handles on LW
- Chris: We could do an experiment, although every single model we are not able to migrate over is a conversation we need to have with the community.
- Andy: It would be pretty easy to migrate editquality over to LW. We just need to add the model binary files to Thanos Swift and then update the helm files.
- Chris: Let's load every single ORES model into LW. We've got 110+ of them; let's start moving them.
- Luca: For editquality, let's get rid of transformer, move to predictor-only and then start spinning up pods.
- Andy: I will make a task and then Kevin and I can split the work from there.
Staging Environment
- Kevin: What is our staging environment going to be?
- Chris: Should we do dev on staging?
- Luca: We have ml-sandbox
- Kevin: ml-sandbox is good
- Andy: I think we were unsure of what the testing cluster would be used for. Also the cluster-local-gateway networking issues hadn't been solved on ml-sandbox yet, so we were unsure if we should continue maintaining our dev cluster. Things are good now and I think ml-sandbox is good for dev.
2022-01-26
ORES deploy planning
- Chris: 4 tasks
- Deploy dutch model
- Logging
- Celery update
- PyYaml update
- Andy: Should we wait on the logger changes and fix security bugs before full deploy?
- Luca: I have a logger pr: https://github.com/wikimedia/ores/pull/355
- Chris: What are the chances of the nlwiki model breaking things?
- Maybe we do the model deployment first -> then logging -> then the dependency upgrades
- Luca: the Celery update will be tricky; what about PyYAML?
- Andy: there is a wrapper for PyYAML that will need to be updated: https://github.com/halfak/yamlconf
- Luca: Let's see if we can get a new version pushed to PyPI; otherwise we can fork and install it ourselves.
- Tobias: re: upgrades, let's do a risk assessment and do the smallest upgrade first, then iterate.
- Luca: We can test on canary for a few days
- Luca: it would be helpful to know who is using ORES the most
- Andy: list of ORES applications: https://www.mediawiki.org/wiki/ORES/Applications
API Gateway Integration
- https://phabricator.wikimedia.org/T288789
- Chris: Where are we on this?
- Tobias: Luca and I have been discussing our wants & needs; we still need to get info about feasibility.
- Chris: Let's figure out nice-to-haves, needs for a production system, and what we need to get to MVP.
- Luca: Almost all the things we have asked for have been delivered, but we need to start testing the integration
- Hugh has been very helpful in deploying changes to prod.
Image Recommendation?
- Chris: it's not really a standalone ML model
- we won't need to host this (built-in logistic regression feature in elasticsearch)
- Luca: where is this hosted / who owns this?
- Chris: Cormac on Platform, I believe.
2022-01-19
Lift Wing MVP check-in
- https://phabricator.wikimedia.org/T272917
- Luca: what is left for SRE?
- Finish load balancer endpoint - https://phabricator.wikimedia.org/T289835
- API Gateway integration - https://phabricator.wikimedia.org/T288789
- still need rate limiting, but ok to start testing
- egress gateway works
- cert-manager is deployed - https://phabricator.wikimedia.org/T298976
- load testing - https://phabricator.wikimedia.org/T296173
- Luca: Is score-caching and feature store outside MVP scope?
- Chris: yes
- Andy: we need to finish up transformers
- We know how to create an inference service (predictor, transformer, models); see the predictor sketch below
- We can upload model binaries to thanos swift via statboxes using model_upload script
- Dev work on testing transformers on the ML sandbox - not an MVP blocker
- Need to figure out how to run transformers on ml-sandbox, cluster-local-gateway issue?
- Also need to decide where to store dev model binaries (PVC, MinIO, Swift, or keep using the old bucket?)
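- For reference, a rough sketch of a predictor-only inference service, assuming a kserve release where custom models subclass kserve.Model and the server entry point is kserve.ModelServer; the model path and scoring call are placeholders (the storage initializer is assumed to have already pulled the binary from Thanos Swift):
```python
import pickle

import kserve

class EditqualityPredictor(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False

    def load(self) -> None:
        # The storage initializer is assumed to have downloaded the binary
        # from Thanos Swift to /mnt/models before the container starts.
        with open("/mnt/models/model.bin", "rb") as f:
            self.model = pickle.load(f)  # placeholder for the real revscoring loader
        self.ready = True

    def predict(self, request: dict) -> dict:
        # Placeholder scoring call; the real service would run revscoring here.
        features = request.get("features", {})
        return {"predictions": self.model.score(features)}

if __name__ == "__main__":
    model = EditqualityPredictor("editquality")
    model.load()
    kserve.ModelServer().start([model])
```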
Wikidata ORES spikes
- https://phabricator.wikimedia.org/T299137
- score_errored
- no visibility
- missing logs
- Luca: I think it happens during feature extraction?
- Luca: adding more logging would help us figure out what is happening
- Chris: this will help keep ORES stable while we migrate to Lift Wing
- Andy: this could be helpful for us to see if there is a bug buried somewhere.
- Luca: maybe not fix the bug but help us know where the issue is
ORES deploy
- Andy: we need to deploy the new nlwiki articlequality model, maybe this week?
- Luca: let's include the logging updates for the spikes
- Andy: +1, I'm happy to do the deploy; maybe we can record it and use it as a side-by-side comparison video w/ Lift Wing?
ORES clients
- Luca: let's catalog all clients and bots and how people are calling ORES (APIs, etc.); a good starting task, maybe for Aiko?
- Andy: +1 - we need local contacts for wiki communities too
- Hal (privacy engineer) had suggested having 'service cards' that describe all downstream users for a model, might be good to have early-on for lift wing.
- Luca: a preliminary list of users, tools, etc.
2022-01-12
Feast spike & hardware order
- Luca: We should discuss our plans for the feature store(s); we have to review procurement tasks for DC Ops this week. Current plan:
- 3 redis-like nodes in eqiad
- 3 redis-like nodes in codfw
- 2 nodes in eqiad (for the offline store, but it was in early stages, not even sure if we need those)
- Online Feature Store Task: https://phabricator.wikimedia.org/T294434
- Score Cache:
- Luca: ORES models may not need a feature store right away
- Chris: from product perspective, score cache would handle MVP use-case
- Tobias: Having a cache and not needing it is lower risk than doing full feature store
- Chris: let's try to use the same boxes for the score cache and then later on for the online feature store.
- build the score cache first, then progress to the online feature store
nlwiki articlequality deployment
- https://www.mediawiki.org/w/index.php?title=Topic:Wnlvrrqtbmfig1l3&topic_showPostId=wnlvrrqtbqdko5jb&fromnotif=1#flow-post-wnlvrrqtbqdko5jb
- Andy: the new version ranges for revscoring on the model-class repos have some implications for the inference-service image builds. The 'easiest' solution so far has been to install the model-class repo via git+ssh and pin it to a specific commit in requirements.txt (see the example below)
- Chris: Let's deploy for now and use the git+ssh hack while we finish the MVP
- Andy: Will merge PRs this week and plan for a deployment next week
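- A hypothetical requirements.txt entry illustrating the git+ssh pin described above; the repo URL, commit placeholder, and egg name are examples, not the real values:
```
# Hypothetical pin: install the model-class repo straight from git at a fixed commit.
git+ssh://git@github.com/wikimedia/articlequality.git@<pinned-commit-sha>#egg=articlequality
```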
ORES migration
- Luca: we will need to update clients / all users of ORES with new endpoints
- New URLs will not be a simple redirect due to the API Gateway, etc.
- Chris: Let's start getting in touch with downstream users, I will start asking around.
ORES wikidata spike
- Luca: we are seeing occasional spikes: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=72&orgId=1&refresh=1m
- is this a model error or an ORES error?
- Tobias: Classic problem - We have signal but we aren't sure what is noise.
- Luca: Let's make a task (update: https://phabricator.wikimedia.org/T299137)
- ML monitoring
- Prometheus -> Grafana
- logs -> logstash(?)
- Status codes
- Tobias: 4xx is client screwed up, 5xx is we screwed up
- Tracing
- Tobias: there are some great rpc tracing tools that let you explore each step in a workflow, it would be helpful to have something similar
- Andy: I've seen zipkin and jaeger recommended for distributed tracing in our stack
Load Testing
- https://phabricator.wikimedia.org/T296173
- Luca: Per pod performance is not great at the moment
- Maybe we need to tune CPU & Memory?
- Tobias: What is being 'starved'? Does the container still see the full machine?
- Luca: there is 'blocking' code; not sure if this is IO-bound, will ask in Slack
- max asyncio workers is cpu-count + 4 (see the sketch below)
- https://github.com/kserve/kserve/commit/c10e6271897d7fd058f5618d5e0e70b31496f64c
- Luca: Let's bump the limit to two for now.
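- For context on the "cpu-count + 4" figure above: a short sketch assuming kserve falls back to Python's default ThreadPoolExecutor sizing (3.8+), with a placeholder blocking scoring function:
```python
import asyncio
import os

# Python's default ThreadPoolExecutor (3.8+) is sized min(32, cpu_count + 4),
# which is where the "cpu-count + 4" worker ceiling above comes from.
DEFAULT_WORKERS = min(32, (os.cpu_count() or 1) + 4)

def blocking_score(rev_id: int) -> dict:
    # Placeholder for CPU/IO-bound feature extraction and scoring.
    return {"rev_id": rev_id, "score": 0.5}

async def score(rev_id: int) -> dict:
    # Running blocking code in the default executor keeps the event loop free;
    # once all worker threads are busy, further requests queue up, which is
    # one way per-pod throughput gets capped.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_score, rev_id)

if __name__ == "__main__":
    print(DEFAULT_WORKERS, asyncio.run(score(12345)))
```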
2021-12-15
Transformers
- https://phabricator.wikimedia.org/T294419
- Andy: articlequality transformer image is done (need to update chart & deploy to ml-serve)
- Luca: Some changes will need to be made in deployment-charts to support transformer definition
- Chris: Do we need a separate transformer for post-processing?
- Andy: Nope, transformers can have both preprocess and postprocess methods (see the sketch below)
- Andy: Next step is to create transformers for editquality, draftquality and topic
- These images might also be large. They may need the same assets as the predictor due to feature extraction process. Another reason to start work on a feature store soon!
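- A rough sketch of what the next transformers could look like, assuming a kserve release where transformers also subclass kserve.Model and forward predict calls to the predictor via predictor_host; the feature extraction and response shaping below are placeholders:
```python
import kserve

class RevscoringTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # kserve routes self.predict() to the predictor service at this host.
        self.predictor_host = predictor_host
        self.ready = True

    def preprocess(self, inputs: dict) -> dict:
        # Placeholder: fetch the revision and extract features here (this is
        # the step that currently drags revscoring and its assets into the
        # transformer image).
        rev_id = inputs.get("rev_id")
        return {"features": {"rev_id": rev_id}}

    def postprocess(self, outputs: dict) -> dict:
        # Placeholder: reshape the predictor output into the ORES-style response.
        return {"scores": outputs.get("predictions")}
```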
Deployment Pipeline image issues
- What is 'stable' tag?: https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/.pipeline/config.yaml#40
- It's an alias to the most recent image
- Kevin: Why are there so many new images published? Some seem to be duplicates?
- Andy: It looks like we are publishing on each patchset, which is not good :( my bad!
- Happy to work on (or tag-team) this during the silent week. edit: https://phabricator.wikimedia.org/T297823
- Luca: We can manually delete older images from registry
- Any image pre-December 2021 should be deprecated now due to kserve migration https://phabricator.wikimedia.org/T293331