Jump to content

Data Platform/Systems/Edit history administration

From Wikitech

We rebuild the mediawiki edit history from the clouddb db replicas. Those databases only hold public data.


Test Sqoop Run

If you're developing sqoop or the related systems, you can make a copy of the sqoop command that will run at the start of the month and modify it to query less data.

The systemd service refinery-sqoop-whole-mediawiki.service has the script which is run: /usr/local/bin/refinery-sqoop-whole-mediawiki, which in turn calls:

- /usr/local/bin/refinery-sqoop-mediawiki for public data, and

- /usr/local/bin/refinery-sqoop-mediawiki-production for private data.

Focusing on the public data script:

razzi@an-launcher1002:~$ sudo cat /usr/local/bin/refinery-sqoop-mediawiki
# NOTE: This file is managed by puppet

export PYTHONPATH=\${PYTHONPATH}:/srv/deployment/analytics/refinery/python

/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
    --job-name sqoop-mediawiki-monthly-$(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +'%Y-%m') \
    --labsdb \
    --output-dir /wmf/data/raw/mediawiki/tables \
    --wiki-file /mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv \
    --tables archive,category,categorylinks,change_tag,change_tag_def,content,content_models,externallinks,image,imagelinks,ipblocks,ipblocks_restrictions,iwlinks,langlinks,logging,page,pagelinks,page_props,page_restrictions,redirect,revision,slots,slot_roles,templatelinks,user,user_groups,user_properties,wbc_entity_usage \
    --user s53272 \
    --password-file /user/analytics/mysql-analytics-labsdb-client-pw.txt \
    --partition-name snapshot \
    --partition-value $(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +'%Y-%m') \
    --mappers 64 \
    --processors 10 \
    --yarn-queue production \
    --output-format avrodata \
    --log-file /var/log/refinery/sqoop-mediawiki.log

Make a copy to your home directory and change the permissions:

$ sudo cp /usr/local/bin/refinery-sqoop-mediawiki ~
$ sudo chown $(whoami):wikidev refinery-sqoop-mediawiki
$ chmod u+w refinery-sqoop-mediawiki

Then change the following:

- if you're testing changes to refinery, change the PYTHONPATH export and sqoop-mediawiki-tables invocation to use the local script

- change job name to add "test", like sqoop-mediawiki-2021-03-test-1

- change --output-dir to be somewhere in your hadoop user data, like /user/razzi/sqoop-tmp. This must be editable by analytics; the easiest way to accomplish this is to change the ownership using sudo -u hdfs hdfs dfs -chown analytics:analytics /user/razzi/sqoop-tmp

- change which wikis are sqooped by changing wiki-file to be a local .csv, such as small_sqoop_run.csv:

$ cat small-sqoop-run.csv

The columns are wiki id, group for processing (for testing having group 1 is fine), and article count, which is not used by sqoop.

- tables can be changed to run on a smaller number of tables.

- mappers and processors can be scaled down if you're running on fewer wikis

- yarn-queue can be changed to default

- log file should be a local log file, such as ~/sqoop.log. This file must be editable by the user that executes the script (analytics), so create it and change its permissions:

$ touch ~/sqoop.log
$ sudo chown analytics:analytics ~/sqoop.log

Putting it all together, a test script could look like:


export PYTHONPATH=${PYTHONPATH}:/home/razzi/refinery/python

/usr/bin/python3 /home/razzi/refinery/bin/sqoop-mediawiki-tables \
    --job-name sqoop-mediawiki-monthly-test-2021-03 \
    --labsdb \
    --output-dir /user/razzi/sqoop \
    --wiki-file /home/razzi/small-sqoop-run.csv \
    --tables archive \
    --user s53272 \
    --password-file /user/analytics/mysql-analytics-labsdb-client-pw.txt \
    --partition-name snapshot \
    --partition-value $(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +'%Y-%m') \
    --mappers 1 \
    --processors 1 \
    --yarn-queue default \
    --output-format avrodata \
    --log-file /home/razzi/sqoop.log

Which can be run as follows:

sudo -u analytics kerberos-run-command analytics ./refinery-sqoop-mediawiki

Rerun what's needed

If sqoop fails, it will print out something like:

2019-04-02T14:58:09 ERROR  **************************************************
2019-04-02T14:58:09 ERROR  *  Jobs to re-run:
2019-04-02T14:58:09 ERROR  *    - enwiktionary:revision
2019-04-02T14:58:09 ERROR  *    - svwiki:revision
2019-04-02T14:58:09 ERROR  *    - viwiki:pagelinks
2019-04-02T14:58:09 ERROR  *    - metawiki:pagelinks
2019-04-02T14:58:09 ERROR  **************************************************

To rerun, make a list of the wikis that failed for each table, run the sqoop for just those wikis and just the tables you need, passing a "rerun" snapshot to the sqoop script so that the output goes to a separate folder. Once done, move the output to the right location and manually put the _SUCCESS flags on HDFS. For example, for the wikis above, you would make two files "revision-wikis-failed.csv" and "pagelinks-wikis-failed.csv", and use them as follows:

export PYTHONPATH=\${PYTHONPATH}:/srv/deployment/analytics/refinery/python

/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
    --job-name sqoop-mediawiki-monthly-2019-03-rerun-revision \
    --labsdb \
    --output-dir /wmf/data/raw/mediawiki/tables \
    --wiki-file /tmp/revision-wikis-failed.csv \
    --tables revision \
    --user s53272 \
    --password-file /user/hdfs/mysql-analytics-labsdb-client-pw.txt \
    --partition-name snapshot \
    --partition-value 2019-03-rerun \
    --mappers 64 \
    --processors 3 \
    --output-format avrodata \
    --log-file /var/log/refinery/sqoop-mediawiki-rerun-2019-03-revision.log

/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
    --job-name sqoop-mediawiki-monthly-2019-03-rerun-pagelinks \
    --labsdb \
    --output-dir /wmf/data/raw/mediawiki/tables \
    --wiki-file /tmp/pagelinks-wikis-failed.csv \
    --tables pagelinks \
    --user s53272 \
    --password-file /user/hdfs/mysql-analytics-labsdb-client-pw.txt \
    --partition-name snapshot \
    --partition-value 2019-03-rerun \
    --mappers 64 \
    --processors 3 \
    --output-format avrodata \
    --log-file /var/log/refinery/sqoop-mediawiki-rerun-2019-03-pagelinks.log

And then you would move /wmf/data/raw/mediawiki/tables/revision/snapshot=2019-03-rerun/wiki_db=enwiktionary to /wmf/data/raw/mediawiki/tables/revision/snapshot=2019-03/wiki_db=enwiktionary and same for the other three outputs (two tables x two wikis). Finally, write the _SUCCESS flags that are missing from the root of the table directory on HDFS. But be aware that once you do, jobs depending on these flags may kick off (like the mediawiki-history-load job).

See sqoop errors

Logs are in "/var/log/refinery", grep for ERROR

nuria@an-coord1001:/var/log/refinery$ more  sqoop-mediawiki.log | grep  ERROR  | more
2018-03-02T10:22:27 ERROR  ERROR: zhwiki.revision (try 1)
2018-03-02T10:31:20 ERROR  ERROR: zhwiki.pagelinks (try 1)
2018-03-02T11:09:17 ERROR  ERROR: svwiki.pagelinks (try 1)
2018-03-02T11:30:38 ERROR  ERROR: zhwiki.pagelinks (try 2)
2018-03-02T13:17:17 ERROR  ERROR: viwiki.pagelinks (try 1)

QA: Assessing quality of a snapshot

Once denormalization has run we need to be able to look that the snapshot created is of quality (i.e. data should match last snapshot, bugs might have been introduced since last snapshot was run).

Automatic validation steps. What to do if things fail

There is a validation step that is automated, the algorithm that runs to check events is described here: See algorithm: Analytics/Systems/Cluster/Mediawiki_History_Snapshot_Check and executed for the top wikis in terms of events.

If automatic validation fails you will get an e-mail with a patch to an error file, a command like the one below would allow you to look at errors:

hdfs -text hdfs://analytics-hadoop/wmf/data/wmf/mediawiki/history_check_errors/snapshot=2018-10/part-00000-3d79f06b-5604-4525-8976-8a5f351313ff-c000.json.snappy > errors.txt

The format of this error file is:

 "wiki_db": "liwiki",
 "event_entity": "userHistory",
 "event_type": "alterblocks",
 "growths": {
   "growth_count_user_event": -0.016260162601626018, -> this is a ratio, number of events has decreased 1.6%  
   "growth_distinct_user_id": 0,
   "growth_distinct_user_text": 0,
   "growth_count_user_group_bot": 0,
   "growth_count_user_anonymous": null,
   "growth_count_user_self_created": 0

Things to look for in error file:

  • Ratios, we should have more events, not less, we accept a variability of plus minus 1% but variations over this threshold need to be looked at.
  • If number of events is smaller perhaps scoop failed in a non obvious way, do check size of snapshots, a healthy sequence of snapshot sizes looks like the following, check for all tables, you might see a difference:
nuria@stat1007:~/mediawiki-history-fail2018-10$ hdfs dfs -du -s -h /wmf/data/raw/mediawiki/tables/page/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
11.3 G  33.8 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-04
11.4 G  34.1 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-05
11.5 G  34.4 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-06
11.5 G  34.6 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-07
11.7 G  35.0 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-08
11.7 G  35.2 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-09

Have events disappeared or moved mysteriously?

Check the mediawiki history statistics. Look at the number of errors in the event entity that you're investigate. For example, how many parse errors were there in pageHistory? Compare that figure to last snapshot's. If you see any number of errors that have grown disproportionally, that might be the source of the problem. Once you see something to focus on, run the mediawiki history process in only one relatively small wiki (every a-team member has a favorite small wiki, joal's is simplewiki, fdans's is glwiki), but in the parameters, make sure that errors are not being dropped. Examine the errors and see if there is something definitely going wrong.

Manually compare data with available data sources

Example: Data is available for all wikipedias in pages like : https://en.wikipedia.org/wiki/Special:Statistics

For all wikipedias that page lists for example the number of articles, does the data returned by request below match that number?


A handy link to transform json data into cvs that can be exported into a spreadsheet for easy computations: [1]

Is there data for all types of editors including anonymous editors?


Data Loading


How is this data gathered: public data from labs

Sqoop job runs in 1003 (although that might change, check puppet) and thus far it logs to: /var/log/refinery/sqoop-mediawiki.log

How is this data gathered: ad-hoc private replicas

Let's go over how to run this process for ad-hoc private replicas (which we do once in a while to be able to analyze editing data that's not public).

  1. Keep in mind that after you do the next step, the following job will trigger automatically if the _SUCCESS flags are written and the Oozie datasets are considered updated: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/mediawiki/history/denormalize
  2. Run the same cron that pulls from labs replicas but with the following changes:
    • $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/prod_grouped_wikis.csv'
    • $db_host = 'analytics-store.eqiad.wmnet'
    • $db_user = 'research'
    • $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
    • $log_file = '/var/log/refinery/sqoop-mediawiki-<<something like manual-2017-07_private>>.log'
    • For the command itself:
      • --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private
      • --snapshot <<YYYY-MM>>_private
      • --timestamp <<YYYY(MM+1 in MM format (so 08 if you're doing the 07 snapshot))>>01000000 (eg 20170801000000)
      • remove --labsdb

IMPORTANT NOTE: After this sqoop is done, you'll probably want to run the mediawiki reconstruction and denormalization job. To do this, you'll need to do three things:

  • Put the _SUCCESS flag in all
  • Run the oozie job with the _private suffix as implemented in this change: https://gerrit.wikimedia.org/r/#/c/370322/
  • IMPORTANT: Copy the latest project_namespace_map snapshot to the same folder + _private because the spark job requires this, despite the correct path being configured on the oozie job. This is probably a small bug that we can fix if we end up running more than a handful of private snapshots.