WMDE/Wikidata/PropertySuggester update

From Wikitech

Occasionally, the data for the property suggester needs to be updated from the latest JSON dumps; we usually try to do this once a month. Here’s how it works:

One-time setup

Run the following commands on Toolforge, in your home directory.

Run the following commands on a production maintenance host (currently mwmaint1002), in your home directory.

TODO: Move the scripts somewhere else?

Each update

Instructions based on this gist.

  • Find the latest JSON dump beneath /public/dumps/public/wikidatawiki/entities/. We’ll use yyyymmdd as a placeholder for its name below.
    • Note that the dumps take several days to run – the date is when the dumps started, but the results will not be available that day.
  • Run ./scheduleUpdateSuggester yyyymmdd on Toolforge.
    • This will take almost three days (as of 2019-03-18).
    • Check the logs at updateSuggester.err for progress or problems during the creation. It will first log “processed XMB” lines (up to 706838.54MB as of 2019-03-14), then “processed Y entities” (see d:Special:Statistics for the approximate current number of entities), then “rows Z” (up to 1919000 as of 2019-03-18).
  • Compute a checksum of the result: jsub -sync y sha1sum analyzed-out (or use whatever hashing algorithm you prefer).
  • Compress the result: jsub -sync y gzip analyzed-out
  • Rsync analyzed-out.gz to your local machine and commit it to the wbs_propertypairs repo as yyyymmdd/wbs_propertypairs.csv.gz (the path the next step downloads from), with the commit message Add propertypairs from the yyyymmdd dump.
  • Download it to the maintenance host with https_proxy=http://webproxy.eqiad.wmnet:8080 wget 'https://github.com/wmde/wbs_propertypairs/raw/master/yyyymmdd/wbs_propertypairs.csv.gz'.
  • Unpack it: gzip -d wbs_propertypairs.csv.gz
  • Compare the checksum of the unpacked file (e.g. sha1sum wbs_propertypairs.csv) to the one obtained on Toolforge.
  • Update the actual table: mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv.
    • This will take some four minutes.
    • It will first log (to your terminal) a bunch of “deleting a batch” lines, then “X rows inserted” up to the total number of lines in the CSV file (which you can count with wc -l wbs_propertypairs.csv beforehand).
  • Run T132839-Workarounds.sh (on the maintenance host).
    • This takes about three minutes.
  • Log your changes: !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds
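
A few of the manual checks above (finding the newest dump, comparing checksums, counting CSV rows) could be wrapped in small shell helpers. The following is only a sketch: the function names latest_dump, verify_sha1 and expected_rows are hypothetical and not part of the existing scripts, and it assumes the dumps directory contains one sub-directory per dump named as an eight-digit date. Note that the newest directory may belong to a dump that is still running, so its contents still need a manual look.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the update procedure; none of these names
# exist in the original scripts.

# Print the newest yyyymmdd entry beneath a dumps directory.
# Assumption: dump directories are named as eight-digit dates.
latest_dump() {
    local dumps_dir="${1:-/public/dumps/public/wikidatawiki/entities}"
    # Keep only eight-digit names, then take the lexicographically largest,
    # which is also the most recent date.
    ls -1 "$dumps_dir" | grep -E '^[0-9]{8}$' | sort | tail -n 1
}

# Compare the SHA-1 of a file against an expected hex digest.
# Returns non-zero (and prints to stderr) on mismatch.
verify_sha1() {
    local file="$1" expected="$2" actual
    actual="$(sha1sum "$file" | cut -d ' ' -f 1)"
    if [ "$actual" = "$expected" ]; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH: expected $expected, got $actual" >&2
        return 1
    fi
}

# Print the number of lines in the CSV, i.e. the row count that the
# "X rows inserted" log lines should eventually reach.
expected_rows() {
    wc -l < "$1" | tr -d ' '
}
```

For example, after unpacking on the maintenance host, verify_sha1 wbs_propertypairs.csv followed by the checksum noted on Toolforge would confirm the file survived the round trip, and expected_rows wbs_propertypairs.csv gives the number to watch for while UpdateTable.php runs.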