ORES/Deployment

From Wikitech

This page is a guide to deploying a new version of ORES to the servers.

Prepare the source code

PyPI

Once your patches are merged into ores, revscoring, or other dependencies, increment the version number. Try to follow SemVer, for example by bumping only the patch level (e.g. 0.5.8 -> 0.5.9). Update the version in setup.py and __init__.py, and use grep to find any other places where the current version is used.
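
The version bump described above can be sketched in Python. This is only an illustration under assumptions: the helper names bump_patch and files_mentioning are hypothetical, and the sketch handles plain MAJOR.MINOR.PATCH versions only (no pre-release or build suffixes).

```python
import re
from pathlib import Path


def bump_patch(version: str) -> str:
    """Return `version` with its patch component incremented."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a plain MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"{major}.{minor}.{patch + 1}"


def files_mentioning(root: str, version: str) -> list[str]:
    """List .py files under `root` that contain the literal version string.

    A rough stand-in for the grep step: it only tells you where to look.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        if version in path.read_text(encoding="utf-8", errors="ignore"):
            hits.append(str(path))
    return hits
```

For instance, bump_patch("0.5.8") gives "0.5.9", and files_mentioning(".", "0.5.8") lists the Python files that still carry the old version string.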

Then push the new version to PyPI using:

python setup.py sdist bdist_wheel upload

If you have GPG/PGP set up, you can try adding --sign to the end of the command above to also sign the wheel and the sdist.

Update models

If you are making breaking changes to revscoring, old model files probably won't work, so you need to rebuild the models. Do it using the Makefile in the editquality and wikiclass repos. If a model changes substantially (new features, new algorithm, etc.), make sure to increment the model version in the Makefile too.

Update wheels

First, clone https://github.com/wiki-ai/ores-wmflabs-deploy:

git clone https://github.com/wiki-ai/ores-wmflabs-deploy

There is a file in ores-wmflabs-deploy called "requirements.txt". Update the version numbers in it, then build the wheels by making a virtualenv and installing everything in it:

virtualenv -p python3 tmp
source tmp/bin/activate
pip install --upgrade pip
pip install wheel
pip wheel -w wheels/ -r requirements.txt

It's critical to do this in an environment that is binary-compatible with the production cluster. ores-misc-01.ores.eqiad.wmflabs is designed for that. Don't forget to install C dependencies beforehand, and watch the output carefully for any errors.
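
One way to catch a partly failed build is to compare the pinned requirements against the wheel files that were actually produced. The sketch below is a hedged illustration, not part of the ORES tooling: missing_wheels and its helpers are hypothetical names, it only understands plain name==version lines, and it relies on the standard wheel filename layout (name-version-python-abi-platform.whl).

```python
import re


def normalize(name: str) -> str:
    """Normalize a distribution name the way wheel filenames do, case-folded."""
    return re.sub(r"[-_.]+", "_", name).lower()


def pinned_requirements(lines):
    """Yield (name, version) for each 'name==version' line.

    Comments, blank lines and anything that is not an exact pin are skipped.
    """
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            yield name.strip(), version.strip()


def missing_wheels(requirement_lines, wheel_filenames):
    """Return the pins that have no matching .whl file."""
    built = set()
    for filename in wheel_filenames:
        if filename.endswith(".whl"):
            parts = filename[:-len(".whl")].split("-")
            if len(parts) >= 5:  # name, version, python tag, abi tag, platform tag
                built.add((normalize(parts[0]), parts[1]))
    return [
        f"{name}=={version}"
        for name, version in pinned_requirements(requirement_lines)
        if (normalize(name), version) not in built
    ]
```

Calling missing_wheels(open("requirements.txt"), os.listdir("wheels")) (with os imported) would list any pins that did not produce a wheel. An empty list does not prove the build was clean, so still read the pip output.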

Once the wheels are ready, they go into a gerrit repo called wheels (research/ores/wheels), where we keep wheels and nltk data. Clone it, update the wheels, and make a patch:

git clone ssh://YOURUSERNAME@gerrit.wikimedia.org:29418/research/ores/wheels

Then copy the new versions into the wheels folder, delete the old ones, and make a new patch:

cd wheels
git add -A
git commit -m "New wheels for wiki-ai 1.2"
git review -R

To rebuild the production wheels, use frozen-requirements.txt rather than requirements.txt.

Update ores-wmflabs-deploy

After the wheels patch has been +2'd and merged, update ores-wmflabs-deploy.

NOTE: This is not a required step for production, but we like to keep the repos in sync.

cd ores-wmflabs-deploy
git checkout -b wiki_ai_1.2
source tmp/bin/activate
pip freeze | grep -v setuptools > frozen-requirements.txt
cd submodules/wheels
git pull
cd ../..
git commit -a -m "Release wiki-ai 1.2"
git push -f origin wiki_ai_1.2

After that, make a PR on GitHub. Once it's merged, it's good to go!

If you also want to deploy to prod (ores.wikimedia.org), you need to backport your commits to gerrit too (ewww). The gerrit repo for ores is:

git clone ssh://YOURUSERNAME@gerrit.wikimedia.org:29418/mediawiki/services/ores

The other gerrit repos are:

  • "mediawiki/services/ores/deploy" for ores-wmflabs-deploy (note that these repos have diverged [FIXME: Mande?])
  • "mediawiki/services/ores/editquality" for editquality
  • "mediawiki/services/ores/wikiclass" for wikiclass

Deploy to the test server

Please deploy to the beta cluster well in advance of any production deployment to give time for smoke-testing and log-watching: at least an hour ahead, though several days is better.

We have a series of increasingly production-like environments available for smoke testing each release. Please take the time to go through each step: labs staging -> beta -> production. There is also an automatic canary deployment during scap, which stops after pushing to scb1002 and gives you the opportunity to compare that server's health to that of its brethren.

Labs (ores.wmflabs.org)

NOTE: This is not a required step for production, but we like to keep the repos in sync.

First, go to staging. Simply make your changes in the ores-wmflabs-deploy repo and run fab stage. Don't forget to log it in #wikimedia-cloud by typing: "!log ores-staging deploying <HASH> into staging".

Then check ores-staging.wmflabs.org to see if everything is healthy. If so, you are good to go to the labs setup. Rebase the "deploy" branch onto master.

git checkout deploy
git rebase origin/master
git push -f origin deploy

If working as expected, deploy with "fab deploy_web" and then "fab deploy_celery". Once it's done, test ores.wmflabs.org to see if everything is working as expected.

Beta (ores-beta.wmflabs.org)

Monitoring

If something does go wrong, you'll want to read the diagnostic messages. See /srv/log/ores/main.log and app.log. Monitor the logs throughout each of these deployment stages by going to the target server (for beta this is currently deployment-sca03.eqiad.wmflabs) and running:

sudo tail -f /srv/log/ores/*.log
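
When the tail output scrolls by too fast, a quick tally of log levels can help. The following is a minimal sketch that assumes each log line carries a standard Python logging level name (DEBUG, INFO, WARNING, ERROR, CRITICAL) somewhere in its text. The actual ORES log format is not described on this page, and count_levels is a hypothetical helper.

```python
import re
from collections import Counter

# Standard Python logging level names; assumed to appear literally in each line.
LEVEL = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")


def count_levels(lines):
    """Count log lines by the first level name found in each line.

    Lines without a level name (e.g. traceback continuation lines) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It can be fed any iterable of lines, for example an open log file, and comparing the ERROR count before and after a deployment gives a rough health signal.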

You can also view these logs on https://logstash-beta.wmflabs.org

Open the beta cluster grafana dashboard for the ORES service: https://grafana-labs.wikimedia.org/dashboard/db/ores-beta-cluster?orgId=1

Open the beta cluster ORES extension graphs at: https://grafana-labs.wikimedia.org/dashboard/db/ores-extension?orgId=1

Read the recent server admin log messages for beta: https://tools.wmflabs.org/sal/deployment-prep

Configuration

The beta cluster configuration should match production. The only time it's appropriate for the config to differ is when you're testing new configuration that will be included with this deployment. Since the beta cluster configuration is applied as an override on top of production configuration, in the usual case you should make sure that InitialiseSettings-labs.php and CommonSettings-labs.php contain no ORES-specific configuration.

If you do plan to deploy a configuration change, consider what will happen if the code is rolled back. The safest type of change can be deployed either code-first or configuration-first. If one cannot be deployed without the other, please review your rollback plan with the rest of the team.

Deploy to beta

  1. ssh deployment-tin.eqiad.wmflabs
  2. cd /srv/deployment/ores/deploy
  3. git pull
  4. git submodule update --init
  5. Record the NEWHASH at the top of git log -1
  6. Prepare a message to send to #wikimedia-cloud: "!log deployment-prep deploying ores <NEWHASH>"
  7. Deploy with scap deploy -v "<relevant task -- e.g. T1234>" and check whether everything works as expected.
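
Steps 5 and 6 above can be sketched as two small helpers: one pulls the hash out of the text printed by git log -1, the other builds the log message in the format shown in step 6. Both function names are hypothetical, and the sketch assumes the default git log layout, where the first line starts with "commit" followed by the hash.

```python
import re


def head_hash(git_log_output: str) -> str:
    """Extract the commit hash from the first line of `git log -1` output."""
    lines = git_log_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"commit ([0-9a-f]{7,40})\b", first_line)
    if match is None:
        raise ValueError("output does not start with a 'commit <hash>' line")
    return match.group(1)


def beta_log_message(commit: str) -> str:
    """Build the IRC log line for a beta deployment, as quoted in step 6."""
    return f"!log deployment-prep deploying ores {commit}"
```

Pasting the output of git log -1 into head_hash and passing the result to beta_log_message gives a line that is ready to send to the channel.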

Deploy to production

Production cluster (ores.wikimedia.org)

You are doing a dangerous thing. Remember, breaking the site is extremely easy! Be careful at every step and try to have someone from the team and ops supervising you. Also remember that ORES depends on a large number of puppet configurations: check that your change is compatible with the puppet configs, and change them if necessary.

Monitoring

It's crucial to watch all of these places. Sometimes the service side won't error but will cause the wikis themselves to burst into flames.

Production ORES service graphs: https://grafana.wikimedia.org/dashboard/db/ores?orgId=1

Production ORES extension graphs: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1

Site-wide error graphs: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1

Watch the logs, especially for ERROR-level messages: https://logstash.wikimedia.org/app/kibana#/dashboard/ORES?_g=()

Watch MediaWiki fatal logs: https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=()

Note that the service "Scores processed" graph is the only indication of what's happening on each machine's Celery workers. This is the best place to watch for canary health. All of the "scores returned" graphs are only showing behavior at the uWSGI layer.

Prep work

We'll double check the hash that is deployed in case we need to revert and then update the code to current master.

  1. ssh scb1001.eqiad.wmnet
  2. cd /srv/deployment/ores/deploy
  3. Record the latest revision (OLDHASH) with git log -1, in case you need to roll back. Note that the revision on the deployment server (tin) is not a 100% reliable reference: it's possible that the code was rolled back, incompletely deployed, or that the last person was doing a deployment to an experimental cluster. You need to get the current revision from the production server itself.
  4. Update the deploy repository with git pull && git submodule update --init

Deploy to canary

Next, deploy to a single node, called the canary node, to check that the new code works as expected. Right now, it's scb1002.eqiad.wmnet.

  1. ssh deployment.eqiad.wmnet. Then cd /srv/deployment/ores/deploy.
  2. scap deploy -v "<relevant task -- e.g. T1234>" (This will automatically post a log line in #wikimedia-operations.)
  3. Let it run, but when prompted to continue do not hit "y" yet! You have just deployed to the canary server, please smoke test.
  4. ssh scb1002.eqiad.wmnet and check the service internally by running curl http://0.0.0.0:8081/v3/scores/testwiki/$(date +%s)
    • If your change affects other behavior, test that too (e.g. if you are adding a new model, check that it returns data).
    • Note that you are testing uWSGI on the canary server, so any gross errors will show up, but if the request makes a call through celery (most requests do), you won't necessarily be running code on the canary server, but on any node in the cluster. Try running the curl command 10 times for a reasonable chance (94%) of hitting the canary server.
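
The 94% figure can be reproduced with a short calculation. This sketch assumes that each request is routed independently and uniformly to one of the worker nodes. The page does not state the number of nodes, but 94% for 10 requests is what this model gives for 4 nodes, so that value is used here as an assumption. The function name is hypothetical.

```python
def chance_of_hitting_canary(requests: int, nodes: int) -> float:
    """Probability that at least one of `requests` lands on the canary node.

    Assumes independent, uniform routing across `nodes` workers:
    P = 1 - (1 - 1/nodes) ** requests
    """
    if requests < 0 or nodes < 1:
        raise ValueError("requests must be >= 0 and nodes must be >= 1")
    return 1.0 - (1.0 - 1.0 / nodes) ** requests
```

Under this model, chance_of_hitting_canary(10, 4) is about 0.94. If the cluster has more nodes than that, run the curl command more times to keep the same odds.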

Continue deployment to prod

If everything works as expected, we're ready to continue.

  1. Deploy it fully by answering "y" to the scap prompt.
  2. If everything looks OK, say "Victory! ORES deploy looks good" (or something equally effusive) in #wikimedia-operations.

In case of a production accident

The ORES extension has the potential to break a few critical pages, such as Special:RecentChanges. An issue with these pages is serious, and should be handled in basically the same way as if you took down the entire site.

Rollback

Your first instinct should be to roll back whatever you just deployed. Take the OLDHASH you recorded before deploying, and do the following:

  1. Announce the problem and your intention to roll back in #wikimedia-operations.
  2. scap deploy -v -r <OLDHASH>

Disable the ORES extension

In the unlikely event that a rollback isn't going fast enough, or for some reason doesn't work, please disable the ORES extension on any sites that are having problems, or globally if appropriate.

  1. Announce what steps you'll take in #wikimedia-operations.
  2. Make a patch in the mediawiki-config repo, in wmf-config/InitialiseSettings.php, to disable $wmgUseORES on the sites you have identified.
  3. From the deployment server:
    • cd /srv/mediawiki-staging
    • git fetch
    • git log HEAD..origin/master (make sure you're only pulling in your own change)
    • git rebase
    • scap deploy-file wmf-config/InitialiseSettings.php "<Explain why you're doing this>"

Monitor

Make sure the situation stabilizes. Sorry but you break it, you buy it. Please stay on-duty until you can be certain that nothing else is happening, or someone else on the team agrees to adopt your putrid albatross.

Incident report

When you're feeling better, within a day or two, explain what happened.

  1. Create a wiki page as a subpage under Incident documentation, use the template and follow instructions there.
  2. You should have just emailed ops@?
  3. Create a Phabricator task and tag with #wikimedia-incident

Unusual maintenance actions

Clear threshold cache

Thresholds are normally cached for a day, so if you want changes to threshold code to be reflected immediately, you'll have to purge the caches manually. Calculated threshold values are cached separately for every wiki and model. Clear by logging into the deployment server and running, for example,

mwscript eval.php --wiki frwiki

$cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'damaging', 1 );
$cache->delete($key);
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'goodfaith', 1 );
$cache->delete($key);
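
Since thresholds are cached per wiki and per model, the eval.php input grows quickly when several models are involved. The sketch below generates the same kind of lines as the example above for a list of model names. It is only an illustration: purge_snippet is a hypothetical helper, and the trailing 1 in makeKey is copied from the example as-is, since its meaning is not documented on this page.

```python
import re

# First line of the eval.php example above, copied verbatim.
GET_CACHE = "$cache = MediaWiki\\MediaWikiServices::getInstance()->getMainWANObjectCache();"


def purge_snippet(models, suffix=1):
    """Return eval.php lines that delete the threshold cache key of each model.

    `suffix` is the last makeKey argument from the example (assumed, not documented).
    """
    lines = [GET_CACHE]
    for model in models:
        # Refuse anything that could break out of the single-quoted PHP string.
        if not re.fullmatch(r"[A-Za-z0-9_]+", model):
            raise ValueError(f"unexpected model name: {model!r}")
        lines.append(
            f"$cache->delete( $cache->makeKey( 'ORES', 'threshold_statistics', '{model}', {suffix} ) );"
        )
    return lines
```

Printing "\n".join(purge_snippet(["damaging", "goodfaith"])) gives text that can be pasted into the eval.php session for one wiki. Repeat the mwscript call with a different --wiki value for each wiki you need to purge.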

Restarting Redis

Celery is unhappy when its Redis backing is restarted. Any time Redis crashes or is intentionally restarted, you must restart the Celery workers. If this is an intentional restart, then stop all Celery workers prior to shutting down Redis.
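
The ordering for an intentional restart can be written down as a tiny sketch. The three callables are hypothetical placeholders for whatever actually stops the workers, restarts Redis, and starts the workers again on your hosts. This page does not name the service commands, so none are assumed here.

```python
from typing import Callable


def planned_redis_restart(
    stop_workers: Callable[[], None],
    restart_redis: Callable[[], None],
    start_workers: Callable[[], None],
) -> None:
    """Run an intentional Redis restart in the order described above."""
    stop_workers()    # 1. stop all Celery workers before Redis goes down
    restart_redis()   # 2. restart Redis while nothing is connected to it
    start_workers()   # 3. bring the Celery workers back up


def after_redis_crash(restart_workers: Callable[[], None]) -> None:
    """After an unplanned Redis restart, the Celery workers must be restarted."""
    restart_workers()
```

Passing the steps in as callables keeps the order explicit and lets you dry-run it with functions that only print what they would do.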

Enabling ORES on a new wiki

TODO: bug T182054