ORES/Deployment

Deprecation warning: the ORES infrastructure has been deprecated and is no longer present or maintained in the WMF infrastructure.

This page is a guide on how to deploy a new version of ORES to the servers.

Prepare the source code

PyPI

So, your patches are merged into ores, revscoring, or other dependencies. Now you need to increment the version number. Try to do that in a SemVer fashion, e.g. only bumping the patch level (0.5.8 -> 0.5.9) for a backwards-compatible fix. The version needs to be updated in setup.py and __init__.py (and probably some other places too; use grep to check where the current version is used).
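
For example, to find every place where the current version string appears (the version number below is illustrative):

grep -rn "0.5.8" setup.py */__init__.py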

Then you need to push the new version to PyPI using:

python setup.py sdist bdist_wheel upload

If you have a GPG/PGP key, you can also sign the wheel and the sdist by adding the sign option to the command above.
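
One possible form, assuming a GPG key is configured locally (the identity value is a placeholder):

python setup.py sdist bdist_wheel upload --sign --identity="<your GPG key id>"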

Update revscoring in PyPI

If you need to upload a new revscoring version to PyPI, do the following:

  • Check out a copy of the wikimedia/revscoring repository from GitHub, or ensure that your local copy is up to date with all the commits that you want to publish. In the past I released a minor version of revscoring that was built from an unclean copy of my local repository, and I had to publish another version to fix the issue. Please double check before proceeding.
  • You need to build the wheel from a Python 3.7 virtual environment (to be extra sure about not breaking any compatibility). Using Docker is probably the best way to go!
  • Disable history on your bash shell with set +o history or similar.
  • Export two variables:
    • PYPI_USER=scoring-internal
    • PYPI_PASS=REDACTED (the real value is present in pwstore and accessible by SREs; ask them for the password. For SREs: check the machine-learning file in pwstore).
  • Run scripts/deploy.sh from the GitHub repository of revscoring (see the sketch after this list).
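
Putting the steps above together, a minimal sketch (the PYPI_PASS placeholder must be replaced with the value from pwstore; do not paste the real password anywhere persistent):

set +o history
export PYPI_USER=scoring-internal
export PYPI_PASS=<value from pwstore>
scripts/deploy.sh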

Update models

If you are making breaking changes to revscoring, old model files probably won't work, so you need to rebuild the models. Do this using the Makefile in the editquality & wikiclass repos. If a model changes substantially (new features, new algorithm, etc.), make sure to increment the model versions in the Makefile too.
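
A minimal sketch of a rebuild, assuming the editquality repository's Makefile layout; the target name below is illustrative, take the real ones from the Makefile:

cd editquality
make models/enwiki.damaging.gradient_boosting.model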

Update wheels

First, clone https://github.com/wiki-ai/ores-wmflabs-deploy:

git clone https://github.com/wiki-ai/ores-wmflabs-deploy

There are two files in ores-wmflabs-deploy called "requirements.txt" and "frozen-requirements.txt". Update the required version numbers and build wheels by creating a virtualenv and installing everything in it:

virtualenv -p python3 tmp
source tmp/bin/activate
pip install --upgrade pip
pip install wheel
pip wheel -w wheels/ -r requirements.txt

It's critical to do this in an environment that is binary-compatible with the production cluster. Once the wheels are ready, there is a repo in gerrit called wheels (research/ores/wheels) where we keep wheels and nltk data. You need to clone it, update the wheels, and make a patch:

git clone ssh://YOURUSERNAME@gerrit.wikimedia.org:29418/research/ores/wheels

Then copy the new wheel versions into the wheels folder and delete the old ones (a sketch of this step follows the commands below), then make a new patch:

cd wheels
git commit -m "New wheels for wiki-ai 1.2" -a
git review -R
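
For reference, the copy/delete step before the commit might look like this (the filenames and the relative path to the freshly built wheels are illustrative):

rm revscoring-2.5.0-py3-none-any.whl
cp ../ores-wmflabs-deploy/wheels/revscoring-2.6.0-py3-none-any.whl .
git add -A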

To rebuild the production wheels, use frozen-requirements.txt rather than requirements.txt.

Merge the code and prepare to deploy

There are two use cases: updating repositories with models and updating the ORES deploy repository.

Updating model repositories

In this case we want, for example, to update or add a model in one of the repositories, such as https://github.com/wikimedia/editquality. After doing all the work, the first step is to send a pull request to the GitHub repository and wait for approval from the WMF Machine Learning team before merging. For example: https://github.com/wikimedia/editquality/pull/233

Once the change is merged, we need to propagate the git LFS objects from GitHub to gerrit (since we deploy gerrit repositories in production), following what is suggested in https://phabricator.wikimedia.org/T212818#4865070:

$ git clone https://github.com/wikimedia/editquality
$ cd editquality
$ git lfs pull
$ git remote add gerrit https://gerrit.wikimedia.org/r/scoring/ores/editquality
$ git lfs push gerrit master

Updating the ORES deploy repository

This repository is the one that we deploy to production; it includes all the more specific model repositories as git submodules. If you don't need to change git submodules, just change the code, send a gerrit patch, and wait for the WMF Machine Learning team to review and merge.

If you need to update a submodule, for example editquality:

# Assumption - the working directory is the ores/deploy one i.e https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/ores/deploy
git submodule update --init submodules/editquality
cd submodules/editquality/
# Checkout new changes
git checkout master
git fetch origin master
# Confirm that the diff between origin and local is the expected one
git diff origin/master
git pull
cd ../../
# Now you should see a diff in the submodule sha
git diff
# Proceed with git add, commit and review

Deploy to the test server

Please deploy to the beta cluster well in advance of any production deployment (at least an hour; several days is better) to give time for smoke-testing and log-watching.

We have a series of increasingly production-like environments available for smoke testing each release; please take the time to go through each step (beta -> production). There is also an automatic canary deployment during scap, which stops after pushing to ores1001 and gives you the opportunity to check that server's health before proceeding.

Beta (ores-beta.wmflabs.org)

Deploy to beta

  1. ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud
  2. cd /srv/deployment/ores/deploy
  3. git pull && git submodule update --init
  4. Deploy with scap deploy -v "<relevant task -- e.g. T1234>" and check that everything works as expected.

Monitoring

If something does go wrong, you'll want to read the diagnostic messages. See /srv/log/ores/main.log and app.log. Monitor the logs throughout each of these deployment stages by going to the target server (for beta this is currently deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud) and running:

sudo tail -f /srv/log/ores/*.log

You can also view these logs on https://beta-logs.wmcloud.org

Open the beta cluster grafana dashboard for the ORES service: https://grafana-labs.wikimedia.org/d/000000015/ores-beta-cluster

Open the beta cluster ORES extension graphs at: https://grafana-labs.wikimedia.org/d/000000016/ores-extension

Read the recent server admin log messages for beta: toolforge:sal/deployment-prep

Configuration

The beta cluster configuration should match production; the only time it's appropriate for the config to differ is when you're testing new configuration that will be included with this deployment. Since the beta cluster configuration is applied as an override on top of the production configuration, the usual case is that you will make sure that InitialiseSettings-labs.php and CommonSettings-labs.php contain no ORES-specific configuration.
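
A quick way to check for stray overrides from a mediawiki-config checkout (a sketch; the grep pattern is only a first pass):

grep -in "ores" wmf-config/InitialiseSettings-labs.php wmf-config/CommonSettings-labs.php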

If you do plan to deploy a configuration change, consider what will happen if the code is rolled back. The safest type of change can be deployed either code- or configuration- first. If one cannot be deployed without the other, please review your rollback plan with the rest of the team.

Running tests

We will use httpbb to test all the models and verify that the deployment hasn't inadvertently changed any behavior.

Usage from deployment-deploy03:

aikochou@deployment-deploy03:~$ httpbb /srv/deployment/httpbb-tests/ores/test_ores.yaml --hosts=deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud --http_port=8081
Sending to deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud...
PASS: 124 requests sent to deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud. All assertions passed.

Deploy to production

Production cluster (ores.wikimedia.org)

You are doing a dangerous thing. Remember, breaking the site is extremely easy! Be careful at every step and try to have someone from the team and from ops supervising you. Also remember that ORES depends on a large number of puppet configurations; check whether your change is compatible with the puppet configs and change them if necessary.

Monitoring

It's crucial to watch all of these places; sometimes the service side won't error but will cause the wikis themselves to burst into flames.

Production ORES service graphs: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores.

Production ORES extension graphs: https://grafana.wikimedia.org/d/000000263/ores-extension

Site-wide error graphs: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1 [Broken Link T211982]

Watch the logs, especially for ERROR-level messages: https://logstash.wikimedia.org/app/kibana#/dashboard/ORES

Watch MediaWiki fatal logs: https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors

Note that the service "Scores processed" graph is the only indication of what's happening on each machine's Celery workers. This is the best place to watch for canary health. All of the "scores returned" graphs are only showing behavior at the uWSGI layer.

Prep work

We'll double check the hash that is deployed in case we need to revert and then update the code to current master.

  1. ssh ores1001.eqiad.wmnet
  2. cd /srv/deployment/ores/deploy.
  3. Record the latest revision (OLDHASH) with git log -1 (in case you need to roll back). Note that the revision on the deployment server (tin) is not a 100% reliable reference; it's possible that the code was rolled back, incompletely deployed, or that the last person was doing a deployment to an experimental cluster. You need to get the current revision from the production server itself.
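
A minimal sketch of the prep step (standard git options, nothing ORES-specific):

cd /srv/deployment/ores/deploy
git log -1 --format="%H %cd %s"   # save this hash as OLDHASH
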
Deploy to canary

Then you need to deploy to a single node, the canary node, and check that it works as expected. Right now the canary is ores1001.eqiad.wmnet.

  1. ssh deployment.eqiad.wmnet.
  2. Update the deploy repository with:
    1. cd /srv/deployment/ores/deploy
    2. git log (and verify that HEAD is the hash retrieved in Prep Work on ores1001)
    3. git fetch
    4. git log origin (and inspect the commits between origin and local branch)
    5. git pull
    6. git submodule update --init
  3. scap deploy -v "<relevant task -- e.g. T1234>" (This will automatically post a log line in #wikimedia-operations.)
  4. Let it run, but when prompted to continue do not hit "y" yet! You have just deployed to the canary server, please smoke test.
  5. ssh ores1001.eqiad.wmnet and check the service internally by running curl http://0.0.0.0:8081/v3/scores/fakewiki/$(date +%s)
    • It would be great to also test any other aspects you are changing (e.g. check that a newly added model returns data).
    • Note that you are testing uWSGI on the canary server, so any gross errors will show up; but if the request makes a call through celery (most requests do), the code won't necessarily run on the canary server, it can run on any node in the cluster. Try running the curl command 10 times for a reasonable chance (94%) of hitting the canary server; see the sketch after this list. Make sure to include ?features in the request to circumvent the cache.
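
For example, a small loop that repeats the smoke test above while bypassing the cache (a sketch; run it on the canary server):

for i in $(seq 1 10); do
  curl -s "http://0.0.0.0:8081/v3/scores/fakewiki/$(date +%s)?features"
  echo
  sleep 1   # so that $(date +%s), and therefore the fake revision id, changes between requests
done
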
Continue deployment to prod

If everything works as expected, we're ready to continue.

  1. Deploy it fully by answering "y" to the scap prompt.
  2. If everything looks OK, say "Victory! ORES deploy looks good" (or something equally effusive) in #wikimedia-operations.

In case of a production accident

The ORES extension has the potential to break a few critical pages, such as Special:RecentChanges. An issue with these pages is serious, and should be handled in basically the same way as if you took down the entire site.

Rollback

Your first instinct should be to roll back whatever you just deployed. Take the OLDHASH you recorded before deploying, then:

  1. Announce the problem and your intention to roll back in #wikimedia-operations.
  2. scap deploy -v -r <OLDHASH>

Disable the ORES extension

In the unlikely event that a rollback isn't going fast enough, or for some reason doesn't work, please disable the ORES extension on any sites that are having problems, or globally if appropriate.

  1. Announce what steps you'll take in #wikimedia-operations.
  2. Make a patch in the mediawiki-config repo, in wmf-config/InitialiseSettings.php, to disable $wmgUseORES on the sites you have identified.
  3. From the deployment server:
  4. cd /srv/mediawiki-staging
  5. git fetch
  6. git log HEAD..origin/master -- Make sure you're only pulling in your own change.
  7. git rebase
  8. scap deploy-file wmf-config/InitialiseSettings.php "<Explain why you're doing this>"

Monitor

Make sure the situation stabilizes. Sorry, but you break it, you buy it. Please stay on duty until you can be certain that nothing else is happening, or until someone else on the team agrees to adopt your putrid albatross.

Incident report

When you're feeling better, within a day or two, explain what happened.

  1. Create a wiki page as a subpage under Incident documentation; use the template and follow the instructions there.
  2. You should also email ops@ if you haven't already.
  3. Create a Phabricator task and tag with #wikimedia-incident

Unusual maintenance actions

Clear threshold cache

Thresholds are normally cached for a day, so if you want changes to threshold code to be reflected immediately, you'll have to purge the caches manually. Calculated threshold values are cached separately for every wiki and model. Clear them by logging into the deployment server and running, for example:

mwscript eval.php --wiki frwiki

$cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'damaging', 1 );
$cache->delete($key);
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'goodfaith', 1 );
$cache->delete($key);

Restarting Redis

Celery is unhappy when its Redis backing is restarted. Any time Redis crashes or is intentionally restarted, you may need to restart the Celery workers.

There are multiple Redis services for each datacenter:

  • two instances (master/replica) holding the celery queue (not persisted on disk)
  • two instances (master/replica) holding the ORES score cache (persisted on disk)

The two master instances run on the same rdb node (on different ports), and the same is true for the replicas. If you want to restart or reboot one of the Redis instances, you can follow this simple procedure:

  1. Make a code change like https://gerrit.wikimedia.org/r/c/operations/puppet/+/715209 to point ORES' config to the replica instance (a quick git grep in puppet should be sufficient to find the hostnames).
  2. On a cumin node, execute (after merging the above change) - cumin -m async -b 1 -s 30 'A:ores-codfw' 'run-puppet-agent' 'depool' 'sleep 5' 'systemctl restart celery-ores-worker ; systemctl restart uwsgi-ores' 'sleep 5' 'pool'
  3. Monitor the ORES grafana dashboards and verify that no TCP connections are hitting the Redis node to be rebooted (a simple netstat on the node is enough; see the sketch after this list).
  4. Revert the change from step 1.
  5. Run the cumin command again.
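
For step 3, a check along these lines on the Redis node itself is enough (6379 is the default Redis port; the instance you're restarting may listen on a different one):

sudo netstat -tnp | grep 6379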

Enabling ORES on a new wiki

TODO: bug T182054

Puppet-managed config changes

First, our configuration can be found in several places. In the code, it lives in the "config" folder. Then in the deploy repository there is another "config" folder that overrides the code configs, and finally the puppet ores module has the final configs that override the other two.

If you want to change configs in the code or the deploy repo, you just need to make the change, get it merged, and deploy it. Deployment causes the services to restart and pick up the new config, but changing the puppet-managed configs doesn't restart the services. You need to wait until the puppet agent runs on each ORES node (like ores1001) and changes the config files. The files can be found at /etc/ores/*.yaml, and once they have changed you need to manually restart the ORES services:

sudo service uwsgi-ores restart
sudo service celery-ores-worker restart

You need to do this on all nodes in both datacenters. You can test it on one or two nodes as canaries and, if everything's fine, use pssh or cumin (or fabric, capistrano, your choice) to run it automatically on the rest; see the sketch below. TODO: Make a script for this.
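
A possible cumin invocation, modeled on the Redis procedure above (the 'A:ores-eqiad' alias is an assumption; use whatever alias matches the cluster you're targeting):

cumin -m async -b 1 -s 30 'A:ores-eqiad' 'run-puppet-agent' 'systemctl restart uwsgi-ores ; systemctl restart celery-ores-worker'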