Dumps/Adds-changes dumps
Adds/changes dumps overview
We have an experimental service available which produces dumps of added/changed content on a daily basis for all projects that have not been closed and are not private.
The code for this service is available in our git repository (master branch). It relies on the Python modules used by the regular dumps, which are available in the regular dumps repo.
The job runs out of cron on one of the snapshot hosts (see hiera/hosts for which one), as the datasets user. Everything except initial script deployment is puppetized. Scripts are deployed via scap3 as part of the general Dumps deployment.
Directory structure:
Everything for a given run is stored in dumproot/projectname/yyyymmdd/ much as we do for regular dumps.
How it works
We record, in the file maxrevid.txt, the largest revision id for the given project that is older than a configurable cutoff (currently, a revision must be at least 12 hours old). All revisions between this one and the revision recorded for the previous day will be dumped. The delay gives editors on the specific wiki some time to weed out vandalism, advertising spam and so on.
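The cutoff selection described above can be sketched in a few lines of Python. This is an illustration only, not the service's actual code: the function names, the in-memory list of (revision id, timestamp) pairs, and the choice to treat the previous day's recorded id as exclusive and the current one as inclusive are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def max_revid_before_cutoff(revisions, now, cutoff_hours=12):
    """Return the largest revision id that is at least cutoff_hours old.

    revisions is an iterable of (rev_id, timestamp) pairs (hypothetical
    input; the real service reads this from the wiki's database).
    Returns None if no revision is old enough.
    """
    cutoff = now - timedelta(hours=cutoff_hours)
    eligible = [rev_id for rev_id, timestamp in revisions if timestamp <= cutoff]
    return max(eligible) if eligible else None


def revision_range(previous_max, current_max):
    """Revision ids to dump for this run.

    Assumption: the previous day's recorded id was already dumped, so the
    range starts just after it and includes today's recorded id.
    """
    return range(previous_max + 1, current_max + 1)


if __name__ == "__main__":
    now = datetime(2019, 3, 12, 20, 0, tzinfo=timezone.utc)
    revisions = [
        (100, now - timedelta(hours=30)),
        (101, now - timedelta(hours=13)),
        (102, now - timedelta(hours=2)),  # too recent, left for the next run
    ]
    current = max_revid_before_cutoff(revisions, now)
    print(current, list(revision_range(98, current)))
```

Revision 102 is skipped here because it is younger than the cutoff; it would be picked up by the following day's run.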
We generate a stubs file containing metadata in xml format for each revision added since the previous day, consulting the file maxrevid.txt for the previous day to get the start of the range. We then generate a meta-history xml file which contains the text of these revisions grouped together and sorted by page id. Md5 sums of these are available in an md5sums.txt file. A status.txt file is available to indicate whether we had a successful run ("done") or not.
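A consumer of these files may want to check the downloaded stubs and content files against md5sums.txt. The sketch below assumes the file uses the common `md5sum` layout (hex digest, whitespace, file name, one entry per line); that layout is not stated on this page, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sums(run_dir, sums_name="md5sums.txt"):
    """Return the names of files that are missing or whose MD5 sum differs.

    Assumes each line of the sums file looks like '<hexdigest>  <filename>'.
    An empty list means every listed file matched.
    """
    run_dir = Path(run_dir)
    failures = []
    for line in (run_dir / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, filename = line.split(None, 1)
        # md5sum marks binary-mode entries with a leading '*'
        filename = filename.strip().lstrip("*")
        target = run_dir / filename
        if not target.is_file() or md5_of_file(target) != expected.lower():
            failures.append(filename)
    return failures
```

Calling `verify_md5sums("dumproot/projectname/yyyymmdd")` on a complete, undamaged run directory should return an empty list.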
After all wikis have run, we check the directories for successful runs and write a main index.html file with links for each project to the stub and content files for the latest successful run.
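Finding the latest successful run for a project can be sketched as follows, using the dumproot/projectname/yyyymmdd/ layout and the status.txt file described above. This is not the service's own code: the function name is hypothetical, and treating any status.txt whose content begins with "done" as a success is an assumption about the file's exact format.

```python
import re
from pathlib import Path

# Run directories are named yyyymmdd, so a plain string sort is chronological.
DATE_DIR = re.compile(r"^\d{8}$")


def latest_successful_run(dumproot, project):
    """Return the newest run directory whose status.txt reports success.

    Returns None if the project has no successful run.
    """
    project_dir = Path(dumproot) / project
    if not project_dir.is_dir():
        return None
    dated = sorted(
        (p for p in project_dir.iterdir() if p.is_dir() and DATE_DIR.match(p.name)),
        key=lambda p: p.name,
        reverse=True,
    )
    for run_dir in dated:
        status_file = run_dir / "status.txt"
        # Assumption: a successful run's status.txt starts with "done".
        if status_file.is_file() and status_file.read_text().strip().startswith("done"):
            return run_dir
    return None
```

A failed or still-running run for today is simply skipped, so the result falls back to the most recent earlier date that completed.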
When stuff breaks
You can rerun various jobs by hand for specified dates. Be on the snapshot host responsible (check hiera/hosts for the one that runs misc cron jobs). In a screen session, do:
sudo -u datasets -s
python3 /srv/deployment/dumps/dumps/xmldumps-backup/generatemiscdumps.py --configfile /etc/dumps/confs/addschanges.conf --dumptype incrdumps --date YYYYMMDD
If you want more information you can run the above script with --help for a usage message.
Some numbers
Here are a few fun numbers from the March 12, 2019 run.
wiki | revision count | stubs time | content time |
---|---|---|---|
wikidatawiki | 756837 | 2 m | 2 h 56 m |
enwiki | 82370 | 1 m | 27 m |
commons | 75576 | 1 m | 4 m |
dewiki | 19938 | 1 m | 4 m |
Explanation:
- stubs time: length of time to generate the gzipped file containing metadata for each new revision
- content time: length of time to generate the bzip2-compressed file containing content for each new revision
- h = hours, m = minutes