Dumps/OtherMisc

This page documents various dumpsets that are produced daily or weekly and are not part of the xml/sql dump generation.

All of these dumps run against database servers designated 'vslow, dumps', from a snapshot host dedicated to 'misc' dump generation (everything other than the xml/sql dumps).

The dump scripts are in our git puppet repo.

If a cron job encounters errors when it runs, its output is sent to ops-dumps@wikimedia.org.

  • Global block table:
    • dumped weekly
    • contains an sql-format dump of information in the global block table
    • managed by mw:Extension:GlobalBlocking (code)
    • Issues: unless the database server goes away during the run or the database credentials change, this job should run without intervention.
  • Cirrus search dumps:
    • dumped weekly
    • contains text indices, the file index (for commons) and the metadata index (for the entire cirrus cluster) in json format
    • run by a maintenance script in mw:Extension:CirrusSearch (code)
    • Issues: it's been quite reliable so far
  • Content Translation dumps:
    • dumped weekly
    • contains parallel corpora that can be used by developers working on machine translation.
    • run by a maintenance script in mw:Extension:ContentTranslation (code)
    • Issues: the job has run out of memory when the language files being dumped contain too much data; these files can be split apart to resolve the problem. Example: see this phab task.
  • Media info:
    • dumped weekly
    • two files for each wiki: one lists the titles of media files stored locally, the other lists the titles of media files used on the project but stored remotely (on Commons).
    • run by a shell wrapper around the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
  • Page titles:
    • dumped daily
    • contains a list of all page titles in the main namespace (NS 0) per project
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
  • Media titles:
    • dumped daily
    • contains a list of all titles in the File namespace (NS 6) per project
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
  • Short url mappings:
    • dumped weekly
    • each line contains an entry of the form short-url|long-url
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
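
The retry behaviour described for the onallwikis.py-based jobs above can be illustrated with a minimal sketch. This is not the code from the operations/dumps repo: the names `DatabaseUnavailable` and `run_with_retries` and the `wait_seconds` parameter are hypothetical, and it assumes that "up to three retries" means one initial attempt followed by at most three more.

```python
import time


class DatabaseUnavailable(Exception):
    """Hypothetical error standing in for a failed database connection."""


def run_with_retries(query, max_retries=3, wait_seconds=0.0):
    """Call query(); on DatabaseUnavailable, retry up to max_retries times.

    Assumption: the first call is not counted as a retry, so query() is
    called at most max_retries + 1 times before giving up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return query()
        except DatabaseUnavailable:
            if attempts > max_retries:
                # Retries exhausted: give up and let the failure propagate,
                # so that the cron job reports it.
                raise
            # Optional pause before the next attempt.
            time.sleep(wait_seconds)
```

In this sketch, a job would wrap each per-wiki query in `run_with_retries`, so that a brief database outage costs a few extra attempts while a persistent one still surfaces as an error.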