Machine Learning/Technical Meeting Notes

From Wikitech

2022-02-23

Feast

  • https://phabricator.wikimedia.org/T294434
  • Andy: what do we want to learn? what would be a good demo for online feature store?
    • Just load a bunch of revscoring features in?
    • What kind of storage makes sense for us? Swift, parquet, sql?
  • Luca: How do we want to structure the +3 nodes in eqiad & codfw
    • we have codfw, eqiad still needs to be racked
    • score cache could be a separate Redis instance
  • Tobias: having separate instances would allow us to tune each instance.
  • Luca: Feast wants a single Redis endpoint for host & config; if we have multiple nodes, we may need a proxy in the middle.
  • Let's figure out how we save the registry, and also how to handle the single Redis endpoint.
  • Followup: how do we load data into Feast? (airflow) How much space do we need? etc.

ORES deploy

  • https://phabricator.wikimedia.org/T300195
  • Andy: was there an issue on beta?
  • Luca: There was a local proxy issue that Taavi fixed; not sure if the deployment-prep VM is fixed.
    • Nothing is burning, no weird errors, things seem to work.
    • Might be nice for learning if Aiko wants to work on the task w/ Luca
  • Aiko: I would like to learn more about the difference between ORES & Lift Wing

Changeprop calling ORES

  • https://phabricator.wikimedia.org/T301409
  • Luca: we could post to eventgate in our model.py for Lift Wing
  • Tobias: could we do this in Istio on our side? a bit like request logging right?
  • Luca: it's just a simple POST request so we could do it in the python code. We could maybe try Knative eventing, but we are using an older version of Knative.
  • Tobias: agreed, doing it w/ a library or wrapper makes a lot of sense, only tricky bit is you don't want to delay the call.
  • Andy: does consistency matter? can we just fire off a post request via asyncio and then return our prediction to the user?
  • Luca: that should be fine, if some are missing it's not super problematic.
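
A minimal sketch of the fire-and-forget idea discussed above, assuming the model server runs on asyncio. `post_event`, `predict`, and the `sink` list are hypothetical names used only for illustration; the real code would POST to eventgate with an async HTTP client instead of appending to a list.

```python
import asyncio

# Hypothetical stand-in for the real HTTP POST to eventgate.
# A real implementation would call an async HTTP client here.
async def post_event(event, sink):
    await asyncio.sleep(0.05)  # simulate network latency
    sink.append(event)

# Strong references to in-flight tasks so they are not garbage-collected.
_background_tasks = set()

async def predict(rev_id, sink):
    score = {"rev_id": rev_id, "prediction": True}  # placeholder score
    # Schedule the POST without awaiting it, so the caller is not delayed.
    task = asyncio.create_task(post_event(score, sink))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return score

async def demo():
    sink = []
    score = await predict(12345, sink)
    sent_before_return = len(sink)  # the POST has not completed yet
    # Waiting here is only for the demo; a server would just keep running.
    await asyncio.gather(*_background_tasks)
    return score, sent_before_return, len(sink)
```

The prediction is returned before the event is delivered, which matches the "some missing events are fine" trade-off. A production version would probably also catch and log exceptions inside `post_event` so that a failed POST never surfaces to the user.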

Lift Wing migration

  • https://phabricator.wikimedia.org/T301409
  • Luca: moving models will take space on the cluster with the current cgroup configs
    • was hoping the new system would not
  • Tobias: Lift Wing is not homebrew, which is an advantage.
  • Andy: the revscoring images are pretty big, also they include a ton of assets
    • other models won't necessarily be like this
  • Luca: we are halfway through migrating ORES images and cpu/memory is filling up.
    • maybe fine with 8 nodes?

2022-02-16

ORES deploy

  • Luca: we are unblocked, recent patches are now running on beta

API Gateway platform

  • Chris: i think there is now a big push to get it in a good place, which is awesome for us
  • Luca: We should connect and start making sure everything works as expected
    • header pathing map etc..

editquality migration

  • Luca: we may need to change Swift clusters to MOSS
    • the paths should stay the same

eventgate scores

2022-02-09

API Gateway

  • https://phabricator.wikimedia.org/T288789
  • Are we blocked?
    • Chris: short-term, Hugh & we will unblock; unsure of the long-term status of the project
    • Tobias: will let you know outcome of upcoming meeting, all our asks could be no big issue.

Transformers (again)

  • https://phabricator.wikimedia.org/T294419#7688032
  • Images are big (transformer + predictor)
    • also transformer + predictor both need to mount/load model into pod from storage
  • editquality will have 30+ isvcs running two large images
  • Chris: Do we want to use transformers on future models? Yes. The ORES models are a special case.
  • Kevin: the transformers seemed fine until we needed to load the models into the separate pods, now it seems really heavy.
  • Andy: my one argument for keeping the heavy transformers was that we could use it in an explainer, but that does not seem to work (maybe a kserve bug?)
  • Tobias: Forcing transformers architecture on revscoring models may not help us gain much, other than keeping it alive longer.
  • Luca: for editquality it might make sense to go back to predictor
    • the mw transformers are not async either so there is a bottleneck.
  • Kevin: Also, regarding maintainability, we are currently loading models + revscoring and all its dependencies in both the transformer and predictor. Loading them in only the predictor is much more maintainable. It's more DRY.

ORES migration

  • Chris: We should get all the models onto Lift Wing before ORES dies (hardware out of warranty, stretch is ending LTS in a few months, etc.) and then we can improve the models.
  • Luca: Looking at traffic in ores1001 - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=ores1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores
    • Lots of hosts doing nothing
  • Tobias: I wish we could easily sample say 5% of current ORES-bound traffic and see what happens to it on LW
  • Luca: We could use changeprop and see how it handles on LW
  • Chris: We could do an experiment, although every single model we are not able to migrate over is a conversation we need to have with the community.
  • Andy: It would be pretty easy to migrate editquality over to LW. We just need to add the model binary files to Thanos Swift and then update the helm files.
  • Chris: Let's load every single ORES model into LW. We got 110+ of them, let's start moving them.
  • Luca: For editquality, let's get rid of transformer, move to predictor-only and then start spinning up pods.
  • Andy: I will make a task and then Kevin and I can split the work from there.
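
One possible way to pick the 5% sample Tobias mentions, sketched under assumptions: requests carry some stable identifier (here a hypothetical `request_id` string), and the decision is made by hashing it so that the same request always lands in the same bucket. Where this check would live (changeprop, a proxy, or elsewhere) is not decided in these notes.

```python
import hashlib

def in_sample(request_id: str, percent: float = 5.0) -> bool:
    """Deterministically select roughly `percent`% of requests by hashing their id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0..9999).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100

# Count how many of 100,000 hypothetical request ids fall into the sample.
sampled = sum(in_sample(f"rev-{i}") for i in range(100_000))
```

Hashing rather than calling a random number generator keeps the choice reproducible, so a sampled request can be compared between ORES and LW afterwards.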

Staging Environment

  • Kevin: What is our staging environment going to be?
    • Chris: Should we do dev on staging?
  • Luca: We have ml-sandbox
    • Kevin: ml-sandbox is good
  • Andy: I think we were unsure of what the testing cluster would be used for. Also the cluster-local-gateway networking issues hadn't been solved on ml-sandbox yet, so we were unsure if we should continue maintaining our dev cluster. Things are good now and I think ml-sandbox is good for dev.

2022-01-26

ORES deploy planning

API Gateway Integration

  • https://phabricator.wikimedia.org/T288789
  • Chris: Where are we on this?
  • Tobias: Luca and I have been discussing our wants & needs, still need to get info about feasibility.
  • Chris: Let's figure out nice-to-haves, needs for production system and what we need to get to MVP.
  • Luca: All things we have asked for have almost been delivered, but we need to start testing the integration
    • Hugh has been very helpful in deploying changes to prod.

Image Recommendation?

  • Chris: it's not really a standalone ML model
    • we won't need to host this (built-in logistic regression feature in elasticsearch)
  • Luca: where is this hosted/ who owns this?
    • Chris: Cormac on platform i believe.

2022-01-19

Lift Wing MVP check-in

Wikidata ORES spikes

  • https://phabricator.wikimedia.org/T299137
  • score_errored
  • no visibility
    • missing logs
  • Luca: i think it happens during feature extraction?
  • Luca: adding more logging would help us figure out what is happening
  • Chris: this will help keep ores stable while we migrate to lift wing
  • Andy: this could be helpful for us to see if there is a bug buried somewhere.
  • Luca: maybe not fix the bug but help us know where the issue is

ORES deploy

  • Andy: we need to deploy the new nlwiki articlequality model, maybe this week?
  • Luca: let's include the logging updates for the spikes
  • Andy: +1, i'm happy to do the deploy, maybe we can record it and use it as a side-by-side comparison video w/ lift wing?

ORES clients

  • Luca: let's catalog all clients, bots, and how people are calling ORES (APIs etc.). Maybe a good starting task for Aiko?
  • Andy: +1 - we need local contacts for wiki communities too
    • Hal (privacy engineer) had suggested having 'service cards' that describe all downstream users for a model, might be good to have early-on for lift wing.
  • Luca: a preliminary list of users, tools, etc.

2022-01-12

Feast spike & hardware order

  • Luca: We should discuss our plans for the Feature store(s), we have to review procurement tasks for dcops this week. Current Plan:
    • 3 redis-like nodes in eqiad
    • 3 redis-like nodes in codfw
    • 2 nodes in eqiad (for the offline store, but it was in early stages, not even sure if we need those)
  • Online Feature Store Task: https://phabricator.wikimedia.org/T294434
  • Score Cache:
    • Luca: ORES models may not need a feature store right away
    • Chris: from product perspective, score cache would handle MVP use-case
    • Tobias: Having a cache and not needing it is lower risk than doing full feature store
  • Chris: let's try to use the same boxes for score cache and then later on the online feature-store.
    • build score-cache first, then progress to online-feature store
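
A rough sketch of what the score cache could look like, using an in-memory dict as a stand-in for the Redis-like nodes. The class name, the `(wiki, model, rev_id)` key layout, and the TTL are assumptions for illustration, not decisions from the meeting.

```python
import time

class ScoreCache:
    """In-memory stand-in for a Redis-like score cache (illustration only)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expiry time, score)

    def get_or_compute(self, wiki, model, rev_id, compute):
        key = (wiki, model, rev_id)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # cache hit: skip feature extraction and scoring
        score = compute(rev_id)  # cache miss: run the model
        self._store[key] = (now + self._ttl, score)
        return score
```

Because the cache only wraps the scoring call, the same boxes could later back an online feature store without changing how callers request scores.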

nlwiki articlequality deployment

ORES migration

  • Luca: we will need to update clients / all users of ORES with new endpoints
    • New URLs will not be a simple redirect due to the API gateway etc.
    • Chris: Let's start getting in touch with down-stream users, I will start asking around.

ORES wikidata spike

  • Luca: we are seeing occasional spikes: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=72&orgId=1&refresh=1m
  • ML monitoring
    • Prometheus -> Grafana
    • logs -> logstash(?)
    • Status codes
      • Tobias: 4xx is client screwed up, 5xx is we screwed up
    • Tracing
      • Tobias: there are some great rpc tracing tools that let you explore each step in a workflow, it would be helpful to have something similar
      • Andy: I've seen zipkin and jaeger recommended for distributed tracing in our stack
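
Tobias's rule of thumb for status codes (4xx is the client's fault, 5xx is ours) can be written as a small helper for a monitoring script. This is only a sketch; the function names and label strings are hypothetical and not part of any existing dashboard.

```python
from collections import Counter

def classify_status(code: int) -> str:
    """Bucket an HTTP status code by who is most likely at fault."""
    if 400 <= code <= 499:
        return "client_error"  # the caller sent a bad request
    if 500 <= code <= 599:
        return "server_error"  # the service failed
    return "other"             # 1xx, 2xx, 3xx

def summarize(codes):
    """Count responses per bucket, e.g. to alert only on server errors."""
    return Counter(classify_status(c) for c in codes)
```

Alerting on the `server_error` bucket alone would keep client mistakes from paging the team.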

Load Testing

2021-12-15

Transformers

Deployment Pipeline image issues