Dumps/Wikibase dumps overview

From Wikitech

All about Wikibase (rdf) dumps via cron

Okay, maybe not all about them, but enough to get you adding your own scripts for new datasets.

All dumps except for xml/sql dumps run from cron via shell scripts which can be found in our puppet manifests. See modules/snapshot/files/cron [1] for these. We’ll start by looking at common functions available to all scripts.

Common functions for dump scripts

The script dump_functions.sh [2] provides paths to a number of directories, such as the base directory tree for all of these dumps and the path to the dumps configuration files. It also provides a function for extracting values from the output of a small python script that reads multiple settings from those config files, such as paths to useful executables and the directory tree for temporary files. You should always source dump_functions.sh at the top of your script.

Configuration files

We have a few dumps configuration files available; one is for xml/sql dumps and you won’t want that. One is for WMCS instances and you won’t want that either. The one you do want is wikidump.conf.other in the config dir, available to you in the $confsdir variable. This name is historical; everything other than xml/sql dumps was considered “other” and is put in a separate directory tree, and even generated by separate hosts writing to a separate filesystem.

Where to write files, what to name them

We like output files to include the wiki db name, the date in YYYYMMDD format, and the kind of dump being produced in the name, with the output type (json, txt, ttl etc.) and the compression type as the extension. Typically files are arranged in directories by dump type and then date. For example, cirrussearch dumps are in the subdirectory cirrussearch, under a further subdirectory named for the YYYYMMDD date that run was started, with files for all wikis in the same directory. Output files for mediainfo data from commons in rdf format might be in wikibase/commonswiki/20200830/commons-20200830-mediainfo.nt.gz, for example.

Thus, at the beginning of the script you should save the date in the right format. The base directory tree for these dump output files is available to you in $cronsdir, so you can just tack on the dump name and the date as subdirectories after that.
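As a quick sketch of the convention (the base path below is invented for illustration; in a real script $cronsdir comes from dump_functions.sh):

```shell
#!/bin/bash
# Hypothetical illustration of the naming scheme described above.
# $cronsdir normally comes from sourcing dump_functions.sh; it is
# hardcoded here only so the example is self-contained.
cronsdir="/tmp/dumps-example"          # assumption, not the real path
today=$(date +'%Y%m%d')                # run date in YYYYMMDD format
dumpName="mediainfo"

# Directory layout: <base>/<dump type>/<wiki>/<date>/
targetDir="${cronsdir}/wikibase/commonswiki/${today}"
# File name: <wiki>-<date>-<dump name>.<output type>.<compression>
filename="commons-${today}-${dumpName}.nt.gz"

mkdir -p "$targetDir"
echo "${targetDir}/${filename}"
```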

Getting config values for your script

We use absolute paths for everything, including e.g. gzip and php. This makes things a bit safer and also lets us switch between versions of php on a rolling basis when we start moving to a new version. We also need to be able to get the directory where MWScript.php and its ilk reside. The getconfigvals.py [3] script is available for retrieving these values. You set up a single argument with the config file sections and setting names you want, call the script once, get the output, and then use the getsetting() function (included in the dump_functions.sh script we saw earlier) to extract each value into its own shell variable. You'll want to check that each value isn't empty (that something horrible hasn't gone wrong with the config file or whatever) by using checkval(), also provided in dump_functions.sh.

Example code

Enough blah blah, let’s look at code right from wikibasedumps-shared.sh [4] and see for ourselves. This code does all of the above for you, but it’s a good idea to know what all the pieces are before you start building on top of them.

#!/bin/bash
#############################################################
# This file is maintained by puppet!
# modules/snapshot/cron/wikibase/wikibasedumps-shared.sh
#############################################################
#
# Shared variable and function declarations for creating Wikibase dumps
# of any sort
#
# Marius Hoch < hoo@online.de >

Here we source the dump_functions script to get those directory paths and the settings retrieval/check functions. We now have the config dir path so we can pick up the right config file.

source /usr/local/etc/dump_functions.sh
configfile="${confsdir}/wikidump.conf.other"

Stash the date for the entire run in YYYYMMDD format. The other two variables have to do with how many runs we keep around (not really true as we have aggressive cleanup scripts running on all hosts) and how many pages to process in a batch. We dump output in batches so that if any batch fails it can be retried up to some reasonable number of times without having to start the whole job over from the beginning.

today=`date +'%Y%m%d'`
daysToKeep=70
pagesPerBatch=200000

Here we get the path to the multiversion directory, the path to the temporary directory tree, and the paths for php and lbzip2 (you may want gzip or something else here instead).

args="wiki:multiversion;output:temp;tools:php,lbzip2"
results=`python3 "${repodir}/getconfigvals.py" --configfile "$configfile" --args "$args"`

multiversion=`getsetting "$results" "wiki" "multiversion"` || exit 1
tempDir=`getsetting "$results" "output" "temp"` || exit 1
php=`getsetting "$results" "tools" "php"` || exit 1
lbzip2=`getsetting "$results" "tools" "lbzip2"` || exit 1

Once we have the settings we make sure they all have actual values in them (double check).

for settingname in "multiversion" "tempDir" "php" "lbzip2"; do
    checkval "$settingname" "${!settingname}"
done

Stash the directory for the specific wiki and run date.

targetDirBase=${cronsdir}/wikibase/${projectName}wiki
targetDir=$targetDirBase/$today

We need this in order to run our maintenance script.

multiversionscript="${multiversion}/MWScript.php"

# Create the dir for the day: This may or may not already exist, we don't care
mkdir -p $targetDir

Reminder: all of this is done for you in wikibasedumps-shared.sh, so you need only source that in your dumpwikibase(somenewformat).sh script.

Shared functions for wikibase dumps

A number of convenience functions are provided for you in wikibasedumps-shared.sh so please make use of them:

  • pruneOldLogs() tosses old log files, typically written to /var/log/wikidatadump/ or /var/log/commonsdump/
  • runDcat() runs DCAT.php on the output files if there is a config specified for it.
  • putDumpChecksums() generates md5 and sha1 checksums for an output file and stashes them in files to be made available for download.
  • getNumberOfBatchesNeeded() gets the number of batches we’ll need to run, based on the number of processes we run at once, the max page id, and a few other things.
  • setPerBatchVars() sets the first and last page id we want to retrieve in a specific batch, etc.
  • getTempFiles() gets a properly sorted list of all the output files matching some wildcard.
  • getFileSize() gets the byte count of one or more files. This is used as a sanity check to make sure we don’t suddenly have much tinier output files than expected.
  • handleBatchFailure() logs errors and handles retries of a batch.
  • getContinueBatchNumber() is used in the case that we have run this script manually to continue a previously failed run. It will determine where we left off.
  • moveLinkFile() is used at the end to move the temporary files we write into their permanent name and location.

These are all pretty self-explanatory, and the calls to them in e.g. dumpwikibaserdf.sh should be good examples.
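For instance, putDumpChecksums() boils down to something like this stand-in (the real function lives in wikibasedumps-shared.sh; the exact naming and layout here are assumptions for demonstration only):

```shell
#!/bin/bash
# Illustrative stand-in for putDumpChecksums(): write md5 and sha1
# checksum files alongside an output file. Everything here is invented
# for demonstration; see wikibasedumps-shared.sh for the real thing.
putDumpChecksums_demo() {
    local f="$1"
    # md5sum/sha1sum print "<hash>  <filename>"; keep only the hash.
    md5sum  "$f" | awk '{print $1}' > "${f}.md5"
    sha1sum "$f" | awk '{print $1}' > "${f}.sha1"
}

outfile=$(mktemp)
echo "fake dump data" > "$outfile"
putDumpChecksums_demo "$outfile"
```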

Per-project functions for wikibase rdf dumps

You’ll be setting up for some other format but these are the sort of things you’ll want to provide.

Let’s look at the example for commons rdf dumps: commonsrdf_functions.sh [5]

#!/bin/bash
#############################################################
# This file is maintained by puppet!
# puppet:///modules/snapshot/cron/wikibase/commonsrdf_functions.sh
#############################################################
# function used by wikibase rdf dumps, customized for Commons

usage() {
    echo -e "Usage: $0 commons [--continue] mediainfo ttl|nt [nt|ttl]\n"
    echo -e "\t--continue\tAttempt to continue a previous dump run."
    echo -e "\tttl|nt\t\tOutput format."
    echo -e "\t[nt|ttl]\t\tOutput format for extra dump, converted from above (optional)."

    exit 1
}

setProjectName() {
    projectName="commons"
}

setEntityType() {
    entityTypes="--entity-type mediainfo --ignore-missing"
}

setDumpFlavor() {
    dumpFlavor="full-dump"
}

setFilename() {
    filename=commons-$today-$dumpName
}

setDumpNameToMinSize() {
    # TODO: figure out what number makes sense here
    dumpNameToMinSize=(["mediainfo"]=1000)
}

setDcatConfig() {
    # TODO: add DCAT info
    dcatConfig=""
}

The usage message is because I will not remember 5 minutes later how to run these things, and neither will you by the time you come back to fix the next bug.

You can see we set the project name to commons, the filename to wiki-date-dumptype as described earlier, a minimum output file size below which the script will complain, and the dcat config, which in the case of commons does not exist yet.

Pretty simple stuff.

Now let’s see how all that gets used in dumpwikibaserdf.sh [6]

#!/bin/bash
#############################################################
# This file is maintained by puppet!
# puppet:///modules/snapshot/cron/wikibase/dumpwikibaserdf.sh
#############################################################
#
# Generate a RDF dump for wikibase datasets and remove old ones.
# This script requires a second shell script with function definitions
# in it specific to the given wikibase project and entity types;
# place it in modules/snapshot/cron/wikibase/<projectname>rdf_functions.sh
# using one of the existing files as a guide, and then add the
# project name to PROJECTS below.
# The project name should be the wiki db name without the 'wiki'
# suffix. If someday we move to run wikibase on wiktionaries
# or what have you, we'll redo the project and file name logic!

This code works only for commons and wikidata (db names commonswiki, wikidatawiki).

PROJECTS="wikidata|commons"

if [[ "$1" == '--help' ]]; then
    echo -e "$0 $PROJECTS --help for help"
    exit 1
fi

projectName=$1
if [ -z "$projectName" ]; then
    echo -e "Missing project name."
    echo -e "$0 $PROJECTS --help for help"
    exit 1
fi
if [ "$projectName" != "wikidata" -a  "$projectName" != "commons" ]; then
    echo -e "Unknown project name."
    echo -e "$0 $PROJECTS --help for help"
    exit 1
fi

Source all those common settings and functions.

. /usr/local/bin/wikibasedumps-shared.sh
. /usr/local/bin/${projectName}rdf_functions.sh

if [[ "$2" == '--help' ]]; then
     usage
     exit 1
fi

In case we are doing a manual run to finish up a run that broke partway through.

continue=0
if [[ "$2" == '--continue' ]]; then
    shift
    continue=1
fi

For rdf dumps this can currently be “all”, “truthy”, or “lexemes” (for commons, “mediainfo”).

dumpName=$2
if [ -z "$dumpName" ]; then
    echo "No dump name given."
    usage
    exit 1
fi

if [ $continue -eq 0 ]; then
    # Remove old leftovers, as we start from scratch.
    rm -f $tempDir/$projectName$dumpFormat-$dumpName.*-batch*.gz
fi

For rdf dumps this varies according to the dumpName; the dumpFlavor variable eventually gets passed in as an arg to the maintenance script. For other formats you may want something else, or not to have this at all.

setDumpFlavor

Yeah more special format stuff for the rdf jobs.

dumpFormat=$3
extraFormat=$4

if [[ "$dumpFormat" != "ttl" ]] && [[ "$dumpFormat" != "nt" ]]; then
    echo "Unknown format: $dumpFormat"
    usage
    exit 1
fi

if [ -n "$extraFormat" ]; then
    declare -A serdiDumpFormats
    serdiDumpFormats=(["ttl"]="turtle" ["nt"]="ntriples")
    extraIn=${serdiDumpFormats[$dumpFormat]}
    extraOut=${serdiDumpFormats[$extraFormat]}
    if [ -z "$extraIn" -o -z "$extraOut" -o "$extraIn" = "$extraOut" ]; then
        extraFormat=""
    fi
fi

Set up that dump output file name though!

setFilename

failureFile="/tmp/dump${projectName}${dumpFormat}-${dumpName}-failure"
logLocation="/var/log/${projectName}dump"
mainLogFile="${logLocation}/dump${projectName}${dumpFormat}-${filename}-main.log"

This is hardcoded: how many shards are we dumping at once? Increases to this are inevitable but should be coordinated with the DBAs and whoever is managing the host(s) generating the dumps.

shards=8

i=0
rm -f $failureFile

When is the output so small that it’s probably broken?

setDumpNameToMinSize

Set up batch info

getNumberOfBatchesNeeded ${projectName}wiki
numberOfBatchesNeeded=$(($numberOfBatchesNeeded / $shards))

if [[ $numberOfBatchesNeeded -lt 1 ]]; then
    # wiki is too small for default settings, change settings to something sane
    # this assumes wiki has at least four entities, which sounds plausible
    shards=4
    numberOfBatchesNeeded=1
    pagesPerBatch=$(( $maxPageId / $shards ))
fi
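To make the arithmetic concrete, here is a worked example with invented numbers (the real maxPageId comes from the database via getNumberOfBatchesNeeded, and the round-up formula below is an assumption about what that helper does):

```shell
#!/bin/bash
# Worked example of the small-wiki fallback above, with invented numbers.
maxPageId=1000000      # pretend the wiki's highest page id is one million
pagesPerBatch=200000
shards=8

# Assumed equivalent of getNumberOfBatchesNeeded: enough batches to
# cover every page id, rounding up.
numberOfBatchesNeeded=$(( (maxPageId + pagesPerBatch - 1) / pagesPerBatch ))  # 5
numberOfBatchesNeeded=$(( numberOfBatchesNeeded / shards ))                   # 5/8 = 0

if [[ $numberOfBatchesNeeded -lt 1 ]]; then
    # wiki is too small for default settings: fewer shards, one batch each
    shards=4
    numberOfBatchesNeeded=1
    pagesPerBatch=$(( maxPageId / shards ))                                   # 250000
fi
echo "$shards $numberOfBatchesNeeded $pagesPerBatch"
```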

What are we dumping again? This sets the entity-type arg for the rdf dumps maintenance script.

setEntityType

while [ $i -lt $shards ]; do
    (

Gotta get errors from anywhere in the pipeline.

        set -o pipefail
        errorLog=${logLocation}/dump$projectName$dumpFormat-$filename-$i.log

        batch=0

        if [ $continue -gt 0 ]; then
            getContinueBatchNumber "$tempDir/$projectName$dumpFormat-$dumpName.$i-batch*.gz"
        fi

        retries=0
        while [ $batch -lt $numberOfBatchesNeeded ] && [ ! -f $failureFile ]; do
            setPerBatchVars

            echo "(`date --iso-8601=minutes`) Starting batch $batch" >> $errorLog

Actually run the script, woo hoo!

            $php $multiversionscript extensions/Wikibase/repo/maintenance/dumpRdf.php \
                --wiki ${projectName}wiki \
                --shard $i \
                --sharding-factor $shards \
                --batch-size $(($shards * 250)) \
                --format $dumpFormat ${dumpFlavor:+--flavor} ${dumpFlavor:+"$dumpFlavor"} \
                $entityTypes \
                --dbgroupdefault dump \
                --part-id $i-$batch \
                $firstPageIdParam \
                $lastPageIdParam 2>> $errorLog | gzip -9 > $tempDir/$projectName$dumpFormat-$dumpName.$i-batch$batch.gz

            exitCode=$?
            if [ $exitCode -gt 0 ]; then
                handleBatchFailure
                continue
            fi

            retries=0
            let batch++
        done
    ) &
    let i++
done

wait

The failureFile is literally just a “did something fail” flag.

if [ -f $failureFile ]; then
    echo -e "\n\n(`date --iso-8601=minutes`) Giving up after a shard failed." >> $mainLogFile
    rm -f $failureFile

    exit 1
fi
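The pattern is simple enough to demo in isolation (everything below is invented for illustration): parallel subshells touch a shared flag file on failure, and the parent checks for it after wait.

```shell
#!/bin/bash
# Minimal demo of the failure-file pattern: background subshells can't
# set variables in the parent shell, so they signal failure by touching
# a shared file, which the parent checks after wait.
failureFile=$(mktemp -u)   # path only; the file is created on "failure"

for shard in 0 1 2 3; do
    (
        if [ "$shard" -eq 2 ]; then
            touch "$failureFile"   # pretend shard 2 failed
        fi
    ) &
done
wait

if [ -f "$failureFile" ]; then
    status="failed"
    rm -f "$failureFile"
else
    status="ok"
fi
echo "$status"
```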

i=0

Everything got written to temp files in the temp dir; do a size sanity check, then concatenate them all together.

while [ $i -lt $shards ]; do
    getTempFiles "$tempDir/$projectName$dumpFormat-$dumpName.$i-batch*.gz"
    if [ -z "$tempFiles" ]; then
        echo "No files for shard $i!" >> $mainLogFile
        exit 1
    fi
    getFileSize "$tempFiles"
    if [ $fileSize -lt ${dumpNameToMinSize[$dumpName]} ]; then
        echo "File size of $tempFiles is only $fileSize. Aborting." >> $mainLogFile
        exit 1
    fi
    cat $tempFiles >> $tempDir/$projectName$dumpFormat-$dumpName.gz
    let i++
done

This is rdf specific stuff.

if [ -n "$extraFormat" ]; then
    # Convert primary format to extra format
    i=0
    while [ $i -lt $shards ]; do
        getTempFiles "$tempDir/$projectName$dumpFormat-$dumpName.$i-batch*.gz"
        (
            set -o pipefail
            for tempFile in $tempFiles; do
                extraFile=${tempFile/$projectName$dumpFormat/$projectName$extraFormat}
                gzip -dc $tempFile | serdi -i $extraIn -o $extraOut -b -q - | gzip -9 > $extraFile
                exitCode=$?
                if [ $exitCode -gt 0 ]; then
                    echo -e "\n\n(`date --iso-8601=minutes`) Converting $tempFile failed with exit code $exitCode" >> $errorLog
                fi
            done
        ) &
        let i++
    done
    wait
fi

i=0
while [ $i -lt $shards ]; do
    getTempFiles "$tempDir/$projectName$dumpFormat-$dumpName.$i-batch*.gz"
    rm -f $tempFiles
    if [ -n "$extraFormat" ]; then
        getTempFiles "$tempDir/$projectName$extraFormat-$dumpName.$i-batch*.gz"
        cat $tempFiles >> $tempDir/$projectName$extraFormat-$dumpName.gz
        rm -f $tempFiles
    fi
    let i++
done

We recompress the gzip output to bzip2 here; your mileage may vary.

nthreads=$(( $shards / 2))
if [ $nthreads -lt 1 ]; then
    nthreads=1
fi

But first we move the concatenated temp file into place.

moveLinkFile $projectName$dumpFormat-$dumpName.gz $filename.$dumpFormat.gz latest-$dumpName.$dumpFormat.gz $projectName

gzip -dc "$targetDir/$filename.$dumpFormat.gz" | "$lbzip2" -n $nthreads -c > $tempDir/$projectName$dumpFormat-$dumpName.bz2
moveLinkFile $projectName$dumpFormat-$dumpName.bz2 $filename.$dumpFormat.bz2 latest-$dumpName.$dumpFormat.bz2 $projectName

The rdf specific extra format stuff, more of it

if [ -n "$extraFormat" ]; then
    moveLinkFile $projectName$extraFormat-$dumpName.gz $filename.$extraFormat.gz latest-$dumpName.$extraFormat.gz $projectName
    gzip -dc "$targetDir/$filename.$extraFormat.gz" | "$lbzip2" -n $nthreads -c > $tempDir/$projectName$extraFormat-$dumpName.bz2
    moveLinkFile $projectName$extraFormat-$dumpName.bz2 $filename.$extraFormat.bz2 latest-$dumpName.$extraFormat.bz2 $projectName
fi

Clean up old logs, run dcat if there’s a config, and we’re done at last!

pruneOldLogs
setDcatConfig
runDcat