OCG


This content is being migrated to mw:OCG.

OCG, or offline content generation, is a service that converts Parsoid RDF into some offline form, typically a PDF document. It is only accessible externally through the Collection extension.

OCG is accessible internally at: http://ocg.svc.eqiad.wmnet:8000

Installing a development instance

 // Collection extension
 require_once("$IP/extensions/Collection/Collection.php");
 // configuration borrowed from wmf-config/CommonSettings.php
 // in operations/mediawiki-config
 $wgCollectionFormatToServeURL['rdf2latex'] =
 $wgCollectionFormatToServeURL['rdf2text'] = 'http://localhost:17080';
 
 // MediaWiki namespace is not a good default
 $wgCommunityCollectionNamespace = NS_PROJECT;
 
 // Sidebar cache doesn't play nice with this
 $wgEnableSidebarCache = false;
 
 $wgCollectionFormats = array(
 		'rdf2latex' => 'PDF',
 		'rdf2text' => 'Plain text',
 );
 
 $wgLicenseURL = "http://creativecommons.org/licenses/by-sa/3.0/";
 $wgCollectionPortletFormats = array( 'rdf2latex', 'rdf2text' );
  • Create a new directory, which we'll call $OCG, and check out the OCG service, bundler, and some backends:
mkdir $OCG ; cd $OCG
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Collection/OfflineContentGenerator mw-ocg-service
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Collection/OfflineContentGenerator/bundler mw-ocg-bundler
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Collection/OfflineContentGenerator/latex_renderer mw-ocg-latexer
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Collection/OfflineContentGenerator/text_renderer mw-ocg-texter
for f in mw-ocg-service mw-ocg-bundler mw-ocg-latexer mw-ocg-texter ; do
  cd $f ; npm install ; cd ..
done
  • Follow the Installation instructions in the mw-ocg-latexer/README.md (installing system dependencies in particular).
  • Follow the Running a development server instructions in mw-ocg-service/README.md to configure and start the OCG service. (Ignore the "Installation on ubuntu" part unless/until you want to install a production instance.)
  • Your wiki sidebar should now have "Download as PDF" and "Download as Plain text" entries. Visit an article on your wiki and try them out! Diagnostics appear on the console where mw-ocg-service is running, unless you've configured some other logger.
  • If you are running a private wiki, make sure that the $wgServer IP address is able to reach article content and the MediaWiki API. You can either allow this IP address using NetworkAuth, by adding it to the iprange array, or add the following to LocalSettings.php:
if ( @$_SERVER['REMOTE_ADDR'] == '<enter your $wgServer IP address>' || @$_SERVER['REMOTE_ADDR'] == '127.0.0.1' ) {
  $wgGroupPermissions['*']['read'] = true;
}
  • You can also use the bundler and backends directly from the command-line. See the mw-ocg-latexer/README.md for an example.

Monitoring

  • Logging happens in /var/log/ocg.log. There is a log rotation setup in /etc/logrotate.d/ocg. (But you need to be in ocg-render-admins to look at these.)
  • grafana dashboard: https://grafana.wikimedia.org/dashboard/db/ocg
  • graphite.wikimedia.org has a Graphite/ocg/pdf tree with other useful statistics like:
    • job_queue_length.value -- The number of currently pending jobs
    • status_objects.value -- The number of jobs we're currently tracking (this data is kept for a couple of days for caching purposes)
    • [backend|frontend].restarts.count -- The number of times the given thread has restarted (indication of fatal errors)
  • Dashboard for OCG is at https://logstash.wikimedia.org/app/kibana#/dashboard/OCG

When something goes wrong

C. Scott and Arlo know the most.

Hop into #mediawiki-parsoid on freenode.

Reverting an OCG deployment

Code

ssh tin
cd /srv/deployment/ocg/ocg
git deploy revert # pick the last good deployed version

If git deploy revert fails:

git deploy start
git reset --hard <desired changeset>
git submodule update --recursive
git deploy --force sync

Deploying changes

OCG is deployed using git-deploy. Briefly, you will run git deploy start, make whichever changes you need to make to the git clone (such as pulling, changing branches, committing live hacks, etc.), then run git deploy sync. The sync command pushes the new state to all backends and restarts them.

You should have deploy access and be a member of the deployment-prep project (so you can deploy to beta). Since the service never restarts properly on beta, it is a good idea to be listed on Special:NovaSudoer for the deployment-prep project (usually in the under_NDA group), so that you can sudo. In production, being a member of the ocg-render-admins puppet group is helpful in case salt fails to restart the ocg service; it also lets you read /var/log if things go wrong.

Since OCG does not have regularly scheduled deploy windows (yet!), ping greg-g on #wikimedia-operations and ask him to schedule a window for your deploy when necessary.

Pre-deploy checks; preparing the deploy commit

OCG is a collection of submodules organized under the OCG Collection service. There are two branches. The master branch holds the latest version of the code. In theory it would be what's running on our local pre-deploy testing machine (like parsoid.wmflabs.org and the round-trip-testing service for parsoid), but at the moment nothing automatically pulls from the master branch. The wmf-deploy branch is code we've deemed stable enough to deploy. It should be what's running in beta and on the ocg servers. In addition, the wmf-deploy branch has a prebuilt node_modules folder, which is built for the version of node we run on the cluster (nodejs 0.10 x64 on Ubuntu 14.04).

On your local machine, update the submodules as required and then run 'make'. The make script builds the node dependencies. Note that your local machine must match the architecture and configuration of the deploy cluster. At the moment that is x64 Ubuntu 14.04 and node 0.10.25. (In the future we may provision an appropriately-configured labs machine to build deploy commits.)
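
A quick sanity check of the build machine can be scripted before running make. The following is only a minimal sketch: the helper name matchesDeployTarget is hypothetical and not part of any OCG repository, and the target values are simply the node 0.10.25 / x64 figures quoted above. It does not check the Ubuntu release.

```javascript
// Hypothetical helper: compare a node version/arch pair against the
// deploy cluster target quoted above (node 0.10.25 on x64).
var TARGET = { version: 'v0.10.25', arch: 'x64' };

function matchesDeployTarget(version, arch) {
  var mismatches = [];
  if (version !== TARGET.version) {
    mismatches.push('node ' + version + ' != ' + TARGET.version);
  }
  if (arch !== TARGET.arch) {
    mismatches.push('arch ' + arch + ' != ' + TARGET.arch);
  }
  return mismatches; // empty array means the pair matches the target
}

// Check the machine this script is running on.
var found = matchesDeployTarget(process.version, process.arch);
if (found.length) {
  console.warn('Build machine does not match deploy target: ' + found.join('; '));
} else {
  console.log('Build machine matches deploy target.');
}
```

If the target versions change, only the TARGET object needs updating.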

  • Begin a deployment summary on OCG/Deployments. Don't include all commits, but only notable fixes and changes (ignore code cleanup updates, test case updates, etc).
    • I usually open an edit window on OCG/Deployments and update the various sections as I perform the steps below. I include the shortlog for the submodules in the commit message for the "Updating to latest masters" commit (below), and cut-and-paste that into the wiki summary. Then as the various master and wmf-deploy branch commits are created I update the wiki with the appropriate hashes and gerrit links, adding special notes if there are additional deploy branch commits which are being created or any other special work being done.
  • Prepare an ocg-collection repo commit and push for +2 (note that jenkins is not running on this repo, so you will need to V+2 and submit as well)
    • First update the submodules, roughly: cd ocg-collection ; git checkout master ; git pull origin master ; git submodule update ; git submodule foreach git pull origin master ; git add -u ; git commit -m "Updating to latest masters" ; git review
      • You probably want to edit that commit message a bit more before submitting it; see deployment summary discussion above.
    • Then build the dependencies: git checkout wmf-deploy ; git pull origin wmf-deploy ; git merge master ; git commit --amend; git review
      • Note the git commit --amend after git merge master to allow the gerrit hooks to add an appropriate Change-Id field to the merge commit.
    • If the package dependencies have changed, continue with: make production ; git add --all package.json node_modules ; git commit -m "Rebuilding dependencies" ; git review
      • In order to ensure that the binary versions match, these steps can be done on deployment-pdf01.eqiad.wmflabs. However, the labs machines have packages installed (like node-request) which are not installed in production. Be careful. You can perform the build under /home. After setting up your user.name and user.email using git config --global and doing mkdir ~/bin ; ln -s $(which nodejs-ocg) ~/bin/node, try: git clone https://gerrit.wikimedia.org/r/p/mediawiki/services/ocg-collection && cd ocg-collection && git submodule update --init --recursive and then the above commands (starting with make production) to (re)build the dependencies.
      • Other commands that might be useful: curl https://www.npmjs.org/install.sh | bash; sudo apt-get install g++
    • Run the service locally to ensure that nothing has broken: XXX MORE DETAILS HERE XXX
  • Add yourself to the "deployer" field of Deployments if you're not already there
  • Be online in freenode #wikimedia-operations and #wikimedia-releng (and stay online through the deployment window)

Deploying the latest version of OCG

We are going to deploy the latest version both to the beta cluster and to production. In theory we might separate these steps by a few days, but at the moment we just do a quick test on beta before deploying to production. Let's start by deploying to beta (see https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances for the .ssh/config needed):

$ ssh -A deployment-tin.eqiad.wmflabs
deployment-tin$ cd /srv/deployment/ocg/ocg
deployment-tin$ git deploy start
deployment-tin$ git pull
deployment-tin$ git submodule update --init --recursive
deployment-tin$ git deploy sync

You will then get status updates. If any minions are not ok, then retry the deploy until all are. Proceed with 'y' when all minions are ok for each step. If minions are not ok, in general you just need to press 'd' to check for the 'detailed status' (which does not restart the salt job; it just re-polls the status) until they are ok.

Nodes are not automatically restarted. To do this, use

deployment-tin$ git deploy service restart

If that fails (in early 2014 it used to be in the habit of failing) deployment-prep admins can sudo service ocg restart on individual boxes -- for beta, that's deployment-pdf01.eqiad.wmflabs and deployment-pdf02.eqiad.wmflabs. On the deployment cluster, you need to be a member of ocg-render-admins to sudo service ocg restart on the ocg100[123].eqiad.wmnet boxes. (Once we get a rsh group we could do something like dsh -g ocg sudo service ocg restart.)

Now go to #wikimedia-releng and report the deploy:

!log updated OCG to version <new hash>

Assuming this all worked, you will want to test the deploy on beta before moving on to production.

  • Do a test render on http://en.wikipedia.beta.wmflabs.org (use the "Download as PDF" link in the sidebar)
    • For example, [1]
      • Use 'force re-render' if necessary to ensure you're testing the latest code.
    • You may also wish to use the "Create a book" option in the sidebar and add a few articles to a book, then render that.

Now let's deploy to production. Before you begin, notify ops in #wikimedia-operations:

!log starting OCG deploy

Now we're going to repeat the above steps, but on deployment.eqiad.wmnet rather than deployment-tin.eqiad.wmflabs:

$ ssh -A deployment.eqiad.wmnet
mira$ cd /srv/deployment/ocg/ocg
mira$ git deploy start
mira$ git pull
mira$ git submodule update --init --recursive
mira$ git deploy sync

We used to have hosts here which were off but not yet depooled, so we would see minions fail. That shouldn't happen anymore, but if it does consult the old version of this page for advice.

You will then get status updates. If any minions are not ok, then retry the deploy until all are. Proceed with 'y' when all minions are ok for each step. If minions are not ok, in general you just need to press 'd' to check for the 'detailed status' (which does not restart the salt job; it just re-polls the status) until they are ok. Remember to then restart the service:

mira$ git deploy service restart
ocg1003.eqiad.wmnet: True
ocg1002.eqiad.wmnet: True
ocg1001.eqiad.wmnet: True

Fantastic! (Remember you can use sudo service ocg restart on the individual ocg100[123].eqiad.wmnet boxes if something goes wrong here.)

Once everything is done, log the deploy in #wikimedia-operations with something like

!log updated OCG to version <new hash> (T<bug number>, T<bug number>, ...)

listing the hash of the deployed OCG version (the hash of the wmf-deploy branch of the ocg-collection repository) as well as any bug numbers referenced in the deploy log. This creates a timestamped entry in the Server Admin Log and creates cross-references in the listed bugs to the SAL.

Post-deploy checks

Scripts

If you are a member of ocg-render-admins (or root) you can run scripts by doing:

$ ssh ocg1001.eqiad.wmnet
ocg1001$ sudo -u ocg -g ocg nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service/scripts/run-garbage-collect.js -c /etc/ocg/mw-ocg-service.js

The machines in question are ocg100[1234].eqiad.wmnet; see https://gerrit.wikimedia.org/r/#/c/150863/

sudo -u ocg -g ocg nodejs-ocg will put you in the same permissions context as ocg.

Maintenance scripts

All scripts take a `-c /etc/ocg/mw-ocg-service.js` configuration option which tells them to use the puppetized OCG configuration file.

  • /srv/deployment/ocg/ocg/mw-ocg-service/scripts/clear-queue.js -- Clears the job queue if it's gotten too long
  • /srv/deployment/ocg/ocg/mw-ocg-service/scripts/run-garbage-collect.js -- If the cron jobs have failed or if there are too many job status objects

There are actually three configuration files for the service:

mw-ocg-service/defaults.js -> /etc/ocg/mw-ocg-service.js -> /srv/deployment/ocg/ocg/LocalSettings.js
  • defaults.js holds all the "default stuff" and is well commented.
  • /etc/ocg/mw-ocg-service.js has all the puppetized stuff, e.g. the redis password, hosts, and file directories.
  • LocalSettings.js is committed to the git repo and has stuff that is more for performance tweaking.

The service initializes a configuration object, then loads the file specified with a "-c" command line option, and then passes the configuration object to it (the config files are treated as node modules and they have a known entry point). In production we pass the /etc/ file to -c and the /etc/ file then calls the LocalSettings.js file. So it's one big chain and each step can override the previous.
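
The chaining can be illustrated with a small self-contained sketch. The three functions below merely stand in for the three files, and the redis values and thread counts are invented for illustration; they are not the real OCG settings.

```javascript
// Each config "file" is a node module exporting function(config) -> config.
// These in-memory stand-ins use made-up values.

// Stand-in for LocalSettings.js: performance tweaks.
function localSettings(config) {
  config.coordinator.backend_threads = 4; // hypothetical value
  return config;
}

// Stand-in for /etc/ocg/mw-ocg-service.js: sets puppetized values, then
// chains to LocalSettings.js (the real file would use require()).
function etcConfig(config) {
  config.redis = { host: 'redis.example', password: 'secret' }; // hypothetical
  return localSettings(config);
}

// Stand-in for the service start-up: build the defaults, then hand the
// configuration object to the module named by "-c".
function loadConfig(configModule) {
  var defaults = { coordinator: { backend_threads: 1 }, redis: {} };
  return configModule(defaults);
}

var config = loadConfig(etcConfig);
// Each later step in the chain has overridden the earlier one.
console.log(config.coordinator.backend_threads, config.redis.host);
```

Because every step receives the object built so far, a private configuration file can chain to the puppetized one and then adjust only the values it cares about.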

(Note that commit da78e552232efe0078452b0f876b926332f49c84 to mw-ocg-service added a /etc/mw-collection-ocg.js configuration file, and a new mechanism for chaining configurations. We haven't switched to this new style configuration in production yet.)

Pruning the queue

If the queue mechanism appears to be working, but you'd like to expire some jobs in order to free up space, the following command (repeated on each of ocg100[1234].eqiad.wmnet) could be useful:

sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/run-garbage-collect.js -c ~/config.js

where ~/config.js contains something like:

module.exports = function(config) {
  // chain to standard configuration
  config = require('/etc/ocg/mw-ocg-service.js')(config);
  // drastically reduce job lifetimes and frequencies
  var gc = config.garbage_collection;
  ['every', 'job_lifetime', 'job_file_lifetime', 'failed_job_lifetime', 'temp_file_lifetime', 'postmortem_file_lifetime'].forEach(function(p) {
    gc[p] /= 1000; // maybe you don't need to be this dramatic
  });
  return config;
};

Decommissioning a host

If one of the OCG hosts needs to be taken down (for maintenance, upgrades, etc), the cache entries for that host need to be removed from redis. The clear-host-cache.js script will do this.

First, remove the host from the round-robin DNS name specified in the Collection extension configuration, so it is no longer the target of new job requests from PHP. This is the $wgCollectionMWServeURL variable, set to ocg.svc.eqiad.wmnet for production and deployment-pdf01 in labs.

You should also decommission the host in puppet, by writing a hieradata/hosts/ocg1003.yaml file (where ocg1003 is the name of the host being decommissioned) with the contents:

ocg::decommission: true

See https://gerrit.wikimedia.org/r/286070 for an example of this. This will stop the host from running new backend jobs. You should restart OCG on the affected host(s) once the puppet change has propagated for the configuration to take effect. (Once https://gerrit.wikimedia.org/r/284599 is enabled the explicit restart won't be necessary, but that's not turned on in our machine configuration yet. Baby steps.)

Once the DNS change has propagated and you've restarted OCG with the decommission configuration (restarting will wait for any existing jobs on that host to complete), you would run something like:

$ cd /srv/deployment/ocg/ocg/
$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-host-cache.js -c /etc/ocg/mw-ocg-service.js ocg1003.eqiad.wmnet

where ocg1003.eqiad.wmnet is the fully-qualified domain name of the host you want to decommission. If the hostname is omitted, the script will use the name of the host on which the script is running (presumably, you'd typically run this on the OCG host itself, but you could also run it on a different OCG host). (Note that within a week or so of the deployment of https://gerrit.wikimedia.org/r/286068 you will have to clear both the FQDN of the host and the bare hostname. You can do that simultaneously by specifying both on the command line.)

The script will not remove job status entries for pending jobs (unless you use the --force flag). It will complain on console if it finds pending jobs, and exit with a non-zero exit code. In that case, the operator should wait longer (say, 15 minutes) for the pending job to complete and the user to collect the results, before re-running the clear-host-cache script.
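
The wait-and-retry advice can be automated. The sketch below is an illustration only: runCommand, runUntilClean, their parameters and the retry limit are hypothetical. Only the script path, the -c option and the non-zero exit code on pending jobs come from the text above.

```javascript
var spawn = require('child_process').spawn;

// Run `cmd args` and call done(exitCode) exactly once when it finishes.
function runCommand(cmd, args, done) {
  var finished = false;
  function finish(code) {
    if (!finished) { finished = true; done(code); }
  }
  var child = spawn(cmd, args, { stdio: 'inherit' });
  child.on('error', function () { finish(127); }); // command could not start
  child.on('exit', function (code) { finish(code === null ? 1 : code); });
}

// Hypothetical helper: call `attempt` until it reports exit code 0,
// waiting `delayMs` between tries and giving up after `maxTries`.
function runUntilClean(attempt, delayMs, maxTries, done) {
  var tries = 0;
  (function next() {
    tries += 1;
    attempt(function (code) {
      if (code === 0 || tries >= maxTries) { return done(code, tries); }
      setTimeout(next, delayMs);
    });
  })();
}

// Example (left commented out so nothing runs by accident):
// retry every 15 minutes, at most 4 times.
// runUntilClean(function (cb) {
//   runCommand('sudo', ['-u', 'ocg', '-g', 'ocg', 'nodejs-ocg',
//     'mw-ocg-service/scripts/clear-host-cache.js',
//     '-c', '/etc/ocg/mw-ocg-service.js', 'ocg1003.eqiad.wmnet'], cb);
// }, 15 * 60 * 1000, 4, function (code, tries) {
//   console.log('exit code ' + code + ' after ' + tries + ' tries');
// });
```

Retrying with a bounded number of attempts keeps the operator in the loop: if jobs are still pending after the last try, the final exit code is passed through unchanged so a human can decide whether --force is appropriate.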

Clearing the job queue

If the job queue grows to ridiculous levels, it can impair usability for ordinary users. This can happen when someone decides to (say) spider all of wiktionary. In this case, it might be best to clear the entire job queue, aborting all jobs. The clear-queue.js script will do this, run on any of the OCG hosts:

$ cd /srv/deployment/ocg/ocg/
$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-queue.js -c /etc/ocg/mw-ocg-service.js

This script will set all pending job status entries to "failed" with the message "Killed by administrative action".

Clearing the cache for a given date range

Parsoid or RESTbase bugs might cause some cached content to be corrupted. After the bug is identified and fixed, the cache entries for some specific period of time might need to be removed from redis to clear the corruption. The clear-time-range.js script will do this:

$ cd /srv/deployment/ocg/ocg/
$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-time-range.js -c /etc/ocg/mw-ocg-service.js 2015-04-23T23:30-0700 2015-04-24T13:00-0700

where 2015-04-23T23:30-0700 and 2015-04-24T13:00-0700 are the start/end times of the time range in question.

The script will not remove job status entries for pending jobs (unless you use the --force flag). It will complain on console if it finds pending jobs, and exit with a non-zero exit code. In that case, the operator should wait longer (say, 15 minutes) for the pending jobs to complete and the user to collect the results, before re-running the clear-time-range script.
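
When the affected period is known in UTC (for example from a deploy log), the timestamps can be produced in the same style as the example arguments above. This is a minimal sketch; the helper name formatForTimeRange is hypothetical, and it only reproduces the format shown in the example, without claiming which other formats the script accepts.

```javascript
function pad(n) { return (n < 10 ? '0' : '') + n; }

// Hypothetical helper: render `date` as YYYY-MM-DDTHH:MM+hhmm (or -hhmm)
// for a fixed UTC offset given in minutes, e.g. -420 for -0700.
function formatForTimeRange(date, offsetMinutes) {
  // Shift the instant so that the UTC getters return the wall-clock
  // time at the requested offset.
  var local = new Date(date.getTime() + offsetMinutes * 60000);
  var sign = offsetMinutes < 0 ? '-' : '+';
  var abs = Math.abs(offsetMinutes);
  return local.getUTCFullYear() + '-' + pad(local.getUTCMonth() + 1) + '-' +
    pad(local.getUTCDate()) + 'T' + pad(local.getUTCHours()) + ':' +
    pad(local.getUTCMinutes()) + sign + pad(Math.floor(abs / 60)) + pad(abs % 60);
}

// 2015-04-24 06:30 UTC is 23:30 on the previous evening at UTC-7.
console.log(formatForTimeRange(new Date(Date.UTC(2015, 3, 24, 6, 30)), -420));
```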

Regression testing

It is useful to run large numbers of articles through the backend in order to find crashers. We use the mw:Parsoid data set for this, which consists of 10,000 articles from a large number of wikis (and 1,000 articles from a larger set of wikis). To facilitate this, run your local mw-ocg-service as follows:

cd $OCG/mw-ocg-service
./mw-ocg-service.js -c localsettings-wmf.js

with localsettings-wmf.js looking something like:

module.exports = function(config) {
    // Increase this if you don't mind hosing your local machine
    config.coordinator.frontend_threads =
        config.coordinator.backend_threads = 1;
    // point parsoid at production
    config.backend.bundler.parsoid_api =
        'http://parsoid-lb.eqiad.wikimedia.org/';
    // default to enwiki, although we'll be specifying prefixes explicitly
    config.backend.bundler.parsoid_prefix = 'enwiki';
    // optional, but useful if you want to collect postmortem info locally
    // make sure this directory exists
    config.backend.post_mortem_dir = __dirname + '/postmortem';
    // hand the modified configuration object back to the caller
    return config;
};

Now you can pull large quantities of articles through. Start with:

cd $OCG/ocg-collection # checked out from https://gerrit.wikimedia.org/r/p/mediawiki/services/ocg-collection
cd loadtest ; npm install # only the first time
./loadtest.js -p enwiki -o 0 # reads from ./pages.list, filters to only enwiki, outputs files named 0-*.txt

This will take a while. But once you have a list of crashers (in 0-failed-render.txt) you can make some fixes and then recheck just the crashers like:

cp 0-failed-render.txt 1.txt
./loadtest.js -o 1 1.txt

Rinse and repeat: copy 1-failed-render.txt to 2.txt once you've fixed some more bugs, and rerun to see what's left.

Test scripts

  • $OCG/ocg-collection/loadtest/loadtest.js - Add metabook jobs to the queue for load testing or to find regressions
  • $OCG/ocg-collection/loadtest/injectMetabooks.js - Older version of the above; deprecated.

These scripts are also on the production machines in /srv/deployment/ocg/ocg/loadtest/ but you probably shouldn't run them from there. If you want to inject jobs in the production queue, I recommend running the loadtest script locally after first running:

ssh -L 17080:ocg.svc.eqiad.wmnet:8000 tin

to redirect queries to the production service. (The OCG service port is firewalled from outside connections, hence the need for the ssh tunnel.)

Finding crashers / Debugging "Error: 1"

Here's how to find crashers and reproduce them. Hopefully after that you can fix them!

First, go to logstash: https://logstash.wikimedia.org/#/dashboard/elasticsearch/OCG%20Backend (For labs/beta this is: https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/OCG%20Backend )

Select an appropriate timeframe from the top-right dropdown, and then type in: "process died with" (with the quotes, and replacing the default *) in the QUERY field.

You should now see some crashers, and some event counts.

Clicking on an entry under "all events" will give you the basic parameters of the request in the full_message.job.metabook field.

  • Copy and paste the contents of the full_message.job.metabook field into a new local file, let's call it somebug.json.
  • For some reason, logstash adds spurious semicolons. Search and replace all semicolons in the JSON file with nothing.
  • Remove the line "parsoid": "http://10.x.x.x", from the end of the JSON file.
  • Run mw-ocg-bundler -v -D -o somebug.zip -m somebug.json. If this was a bundler crasher, this command should crash and you're done.

If this was a LaTeX crasher, then:

  • Run mw-ocg-latexer -v -D -o somebug.pdf somebug.zip. Hopefully this will crash for you.
  • To debug further, use mw-ocg-latexer -v -D -l -o somebug.tex somebug.zip.
  • Now you can rerun LaTeX with: TEXINPUTS=tex/: xelatex somebug.tex
  • Note that somebug.tex just includes the "real" TeX files in a temporary directory somewhere. Open that up in your editor. Commenting stuff out with % is a good first step to narrow down the bug. If this is a collection of multiple articles, comment out the \input lines in output.tex until you've figured out which article the problem is in. The XeLaTeX output hopefully then gives you the line number within that file. Good luck!

Older hints

(04:30:28 PM) mwalker: so... the process right now is to look in logstash for the collection id
(04:31:00 PM) mwalker: cscott, https://logstash.wikimedia.org/#/dashboard/elasticsearch/OCG%20Backend
(04:31:17 PM) mwalker: cscott, which will tell you the IP of the generating host
(04:33:19 PM) mwalker: cscott, once you have the IP; run `host <ip>` on any of the ocg servers or tin to
figure out what host that actually was; ssh there, and then look in /var/log/ocg.log for anything that
looks like what you've kicked off
