Dumps/New dumps and datasets

This Dumps infrastructure is in maintenance mode only. We will not be adding any new use cases. If you'd like to generate a new dump, please talk to the Data Products team for guidance on new infrastructure.

New dumps or datasets

Adding new dumps

So you want to generate dumps for a new extension or for new content; what should you do?

These guidelines describe what is necessary to get your dumps and datasets generated and added to our public webserver.

Talk to us first. How big will these dumps grow? How long will they take to run on one CPU? How much memory will they need? What resources will they need over the next three to five years? We need this information so that we can plan properly for server capacity.

Dumps that communicate with MediaWiki databases must use the vslow (dump) db server group, as described in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/db-eqiad.php or analogous files. We do this because long-running queries, as dump queries typically are, cause a problem when run on the databases used by the application servers. Typically one can use code like the following to get a connection to a database server in the right group:

// Request a replica connection from the 'dump' query group.
$lb = wfGetLBFactory()->newMainLB();
$db = $lb->getConnection( DB_REPLICA, 'dump' );

If your dump process retrieves revision content and not just metadata, it must be written to run in two passes, one pass to write out the metadata, and a second pass to re-use revision content from the previous run if available. Because retrieval of revision content from our external storage is very expensive, reusing previously retrieved content whenever possible is paramount, both for speed of the dump run, and for reducing the load on the external storage database servers.

Expect database servers to be depooled for maintenance without warning during your dump run. This means that any given dump job should be broken down into small tasks that take no longer than a few hours, and that can be rerun automatically up to some number of retries.
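The retry requirement can be sketched as a small shell wrapper. This is only an illustration: run_with_retries is a hypothetical helper name, not one of the existing dump helper functions, and the retry count and delay are arbitrary values you would tune for your job.

```shell
# Hypothetical helper: run a command, retrying on failure.
# Usage: run_with_retries MAX_TRIES DELAY_SECONDS command [args...]
run_with_retries() {
    local max_tries="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Success on any attempt ends the loop.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Wait before retrying, e.g. to let a depooled server be replaced.
        sleep "$delay"
    done
}
```

Each small task (for example, one page range for one wiki) would then be wrapped separately, along the lines of `run_with_retries 3 600 dump_one_task`, where dump_one_task stands in for your own script.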

If consistency checks can be run on your data to confirm that the output is valid, run them. A bug in deployed code can cause all kinds of things to go wrong; you can, for example, check that files have the right starting and ending content (for xml files), and that compressed files were written completely (gzip or bzip2 files).
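A minimal sketch of such checks follows. The function name check_dump_file is hypothetical, and the check assumes the output uses a mediawiki root element as in the standard export schema; adjust the expected opening and closing tags to whatever your own schema defines.

```shell
# Hypothetical helper: sanity-check one compressed xml dump file.
check_dump_file() {
    local file="$1"
    local reader
    case "$file" in
        *.gz)  reader="gzip" ;;
        *.bz2) reader="bzip2" ;;
        *) echo "unknown compression: $file" >&2; return 1 ;;
    esac
    # -t tests integrity; a truncated or corrupt file fails here.
    if ! "$reader" -t "$file" 2>/dev/null; then
        echo "compressed file is incomplete or corrupt: $file" >&2
        return 1
    fi
    # Assumption: the root element opens within the first few lines.
    if ! "$reader" -dc "$file" | head -n 5 | grep -q '<mediawiki'; then
        echo "missing xml header: $file" >&2
        return 1
    fi
    # Assumption: the closing tag is on the last line.
    if ! "$reader" -dc "$file" | tail -n 1 | grep -q '</mediawiki>'; then
        echo "missing xml footer: $file" >&2
        return 1
    fi
    return 0
}
```

A job could call this on each output file before moving it into place, and treat a non-zero return as a reason to retry or to report an error.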

Dumps of content stored by extensions (Flow, FlaggedRevs, etc.) should be part of the twice-monthly dump run, and should be generated in xml format with a corresponding schema; see https://www.mediawiki.org/xml/export-0.10.xsd for an example.

For more information on writing MediaWiki maintenance scripts for dumping data from MediaWiki tables, see mw:SQL/XML_Dumps/Writing_maintenance_scripts.

All other dumps are run on a weekly basis, more or less, and their files are stored in a directory tree with the following structure: other/dumpname/date/files where all files for all wikis generated on the same date go into the same directory. These dumps are listed on the page of 'other' dumps, see https://dumps.wikimedia.org/other/
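As an illustration of that layout, the following sketch builds an output path for one wiki. The function name, the YYYYMMDD date format and the file naming pattern are all assumptions made for the example; only the other/dumpname/date/ structure comes from the description above.

```shell
# Hypothetical helper: build the path for one output file.
# Files for all wikis from the same run date share one directory.
dump_output_path() {
    local basedir="$1"
    local dumpname="$2"
    local rundate="$3"
    local wiki="$4"
    # The file name pattern below is illustrative only.
    echo "${basedir}/other/${dumpname}/${rundate}/${wiki}-${rundate}-${dumpname}.gz"
}
```

For example, `dump_output_path /data mydump 20240101 enwiki` prints /data/other/mydump/20240101/enwiki-20240101-mydump.gz.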

If you want to add your weekly dumps to the index.html page there, you may submit a gerrit patchset to puppet, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html/other_index.html.

If you want an index.html page for your dumps, it should be provided with a gerrit patch to puppet, adding a file to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html (see the files in that directory for existing examples).

There are shell scripts with helper functions available for weekly dump jobs of this sort. You can look at an example, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/cron/dump-global-blocks.sh. These may change in the future, but if that happens, all such jobs will be migrated at once and you probably won't be asked to do anything except give your ok.

Output should be written to a temporary location or a temporary file and then moved into place; this makes our automated rsyncs happier, since they will then pick up only completed files. Use .tmp at the end of the filename, or write into the temporary directory, the location of which you can get in bash by using the helper scripts mentioned above:

args="output:temp"
results=`python3 "${repodir}/getconfigvals.py" --configfile "$configfile" --args "$args"`
tempDir=`getsetting "$results" "output" "temp"` || exit 1
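The .tmp variant of this approach might look like the sketch below. write_atomically is a hypothetical name, not part of the helper scripts; it assumes the temporary file and the final file are on the same filesystem, so that the mv is a rename rather than a copy.

```shell
# Hypothetical helper: write a command's output to FILE.tmp, then
# rename it to FILE only if the command succeeded.
# Usage: write_atomically FINAL_PATH command [args...]
write_atomically() {
    local final="$1"
    shift
    local tmp="${final}.tmp"
    if "$@" > "$tmp"; then
        # On the same filesystem the rename is atomic, so an rsync
        # never sees a partially written final file.
        mv "$tmp" "$final"
    else
        # Do not leave a partial file behind.
        rm -f "$tmp"
        return 1
    fi
}
```

A failed run leaves neither the final file nor the .tmp file, so the next attempt starts clean.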

Cron jobs to run weekly dumps should not generate output except on error. They should not, however, direct all error output to /dev/null; if there is an error, we need to know about it so we can ask you to fix it.
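One way to get this behaviour is to capture all output and replay it only when the job fails. The wrapper below is a sketch under that assumption; run_quietly is a hypothetical name and not one of the existing helper functions.

```shell
# Hypothetical helper: run a command silently, but print everything
# it wrote (stdout and stderr) if it exits with a non-zero status.
run_quietly() {
    local log
    log=$(mktemp) || return 1
    local status=0
    "$@" > "$log" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with status $status: $*" >&2
        cat "$log" >&2
    fi
    rm -f "$log"
    return "$status"
}
```

With this in place, a successful run produces no output at all, while a failed run produces the full output of the command on stderr, which cron can then report.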

To the extent possible, schedule your dump's weekly cron job so that it doesn't overlap with others (except for the Wikidata dumps). For now, this means double-checking the dates and times in <> and estimating your own job's run time.

We are available to support you in all of these things, and to coordinate merging and deployment of all puppet patches. The basic dump script should come from your team; after that we can work with you to iron out the rest of the details.

Happy dumping!

Adding new datasets

Other datasets that you provide may be served to the public. Before that happens, check the datasets to make sure they contain no private or sensitive information, specify how many old datasets you want to keep, and provide an estimate of how much space they will take, both now and over the next three to five years. These will be listed on the 'other' index.html page at https://dumps.wikimedia.org/other/, and you may add your content to that page via a gerrit change to puppet, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html/other_index.html.

If your dataset is to be copied to the public webserver via rsync, you should add a gerrit patch to puppet that does the fetch, see files in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/manifests/web/fetches/ for examples. This will require setting up rsyncd on the source host and configuring it to allow access by the dumps web server, as well as setting up ferm rules to permit rsync through. We recommend rsyncs be done no more often than once a day.

Index.html pages for your new datasets can be provided by adding an html page to our puppet repo, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html for examples.

All such datasets will be automatically made available to users of stats1005/6 and to users of WMF Cloud instances, via nfs.