Data Platform/Systems/Edit history administration
We rebuild the MediaWiki edit history from the clouddb replicas. Those databases hold only public data.
Administration
Test Sqoop Run
If you're developing sqoop or the related systems, you can make a copy of the sqoop command that will run at the start of the month and modify it to query less data.
The systemd service refinery-sqoop-whole-mediawiki.service runs the script /usr/local/bin/refinery-sqoop-whole-mediawiki, which in turn calls:
- /usr/local/bin/refinery-sqoop-mediawiki for public data, and
- /usr/local/bin/refinery-sqoop-mediawiki-production for private data.
Focusing on the public data script:
razzi@an-launcher1002:~$ sudo cat /usr/local/bin/refinery-sqoop-mediawiki
#!/bin/bash
# NOTE: This file is managed by puppet
#
export PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python
/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
--job-name sqoop-mediawiki-monthly-$(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +'%Y-%m') \
--labsdb \
--output-dir /wmf/data/raw/mediawiki/tables \
--wiki-file /mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/labs_grouped_wikis.csv \
--tables archive,category,categorylinks,change_tag,change_tag_def,content,content_models,externallinks,image,imagelinks,ipblocks,ipblocks_restrictions,iwlinks,langlinks,logging,page,pagelinks,page_props,page_restrictions,redirect,revision,slots,slot_roles,templatelinks,user,user_groups,user_properties,wbc_entity_usage \
--user s53272 \
--password-file /user/analytics/mysql-analytics-labsdb-client-pw.txt \
--partition-name snapshot \
--partition-value $(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +'%Y-%m') \
--mappers 64 \
--processors 10 \
--yarn-queue production \
--output-format avrodata \
--log-file /var/log/refinery/sqoop-mediawiki.log
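The nested date expression used for --job-name and --partition-value resolves to the previous month; anchoring on the 15th of the current month keeps the "-1 month" step from misbehaving around month ends:

```shell
# Evaluates to the previous month in YYYY-MM form; anchoring on the 15th
# avoids rollover issues that "-1 month" can hit on the 29th-31st.
date --date="$(date +%Y-%m-15) -1 month" +'%Y-%m'
```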
Make a copy to your home directory and change the permissions:
$ sudo cp /usr/local/bin/refinery-sqoop-mediawiki ~
$ sudo chown $(whoami):wikidev ~/refinery-sqoop-mediawiki
$ chmod u+w ~/refinery-sqoop-mediawiki
Then change the following:
- if you're testing changes to refinery, change the PYTHONPATH export and sqoop-mediawiki-tables invocation to use the local script
- change job name to add "test", like sqoop-mediawiki-2021-03-test-1
- change --output-dir to be somewhere in your hadoop user data, like /user/razzi/sqoop-tmp. This must be writable by the analytics user; the easiest way to accomplish this is to change the ownership using sudo -u hdfs hdfs dfs -chown analytics:analytics /user/razzi/sqoop-tmp
- change which wikis are sqooped by changing wiki-file to be a local .csv, such as small-sqoop-run.csv:
$ cat small-sqoop-run.csv
hrwiktionary,1,0
The columns are wiki id, group for processing (for testing, group 1 is fine), and article count, which is not used by sqoop.
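As an illustration (file path and the second wiki are hypothetical; any small wikis will do), such a file can be built and inspected like this:

```shell
# Hypothetical example: a test wiki file in /tmp; columns are wiki_db,
# processing group (group 1 is fine for tests), article count (unused).
cat > /tmp/small-sqoop-run.csv <<'EOF'
hrwiktionary,1,0
bewikisource,1,0
EOF
cut -d, -f1 /tmp/small-sqoop-run.csv   # list just the wiki_db column
```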
- tables can be changed to run on a smaller number of tables.
- mappers and processors can be scaled down if you're running on fewer wikis
- yarn-queue can be changed to default
- log file should be a local log file, such as ~/sqoop.log. This file must be writable by the user that executes the script (analytics), so create it and change its permissions:
$ touch ~/sqoop.log
$ sudo chown analytics:analytics ~/sqoop.log
Putting it all together, a test script could look like:
#!/bin/bash
export PYTHONPATH=${PYTHONPATH}:/home/razzi/refinery/python
/usr/bin/python3 /home/razzi/refinery/bin/sqoop-mediawiki-tables \
--job-name sqoop-mediawiki-monthly-test-2021-03 \
--labsdb \
--output-dir /user/razzi/sqoop \
--wiki-file /home/razzi/small-sqoop-run.csv \
--tables archive \
--user s53272 \
--password-file /user/analytics/mysql-analytics-labsdb-client-pw.txt \
--partition-name snapshot \
--partition-value $(/bin/date --date="$(/bin/date +%Y-%m-15) -1 month" +'%Y-%m') \
--mappers 1 \
--processors 1 \
--yarn-queue default \
--output-format avrodata \
--log-file /home/razzi/sqoop.log
The script can then be run as follows:
sudo -u analytics kerberos-run-command analytics ./refinery-sqoop-mediawiki
Rerun what's needed
If sqoop fails, it will print out something like:
2019-04-02T14:58:09 ERROR **************************************************
2019-04-02T14:58:09 ERROR * Jobs to re-run:
2019-04-02T14:58:09 ERROR * - enwiktionary:revision
2019-04-02T14:58:09 ERROR * - svwiki:revision
2019-04-02T14:58:09 ERROR * - viwiki:pagelinks
2019-04-02T14:58:09 ERROR * - metawiki:pagelinks
2019-04-02T14:58:09 ERROR **************************************************
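The failed wiki:table pairs can be collected into per-table wiki files with a sketch like the following (file names and paths are hypothetical; the sample log mimics the output above):

```shell
# Hypothetical sketch: turn "Jobs to re-run" log lines into one
# <table>-wikis-failed.csv per table (group 1, article count 0).
cat > /tmp/rerun-sample.log <<'EOF'
2019-04-02T14:58:09 ERROR * - enwiktionary:revision
2019-04-02T14:58:09 ERROR * - svwiki:revision
2019-04-02T14:58:09 ERROR * - viwiki:pagelinks
2019-04-02T14:58:09 ERROR * - metawiki:pagelinks
EOF
d=$(mktemp -d)
grep -o '[a-z_]*:[a-z_]*$' /tmp/rerun-sample.log |
while IFS=: read -r wiki table; do
    echo "${wiki},1,0" >> "$d/${table}-wikis-failed.csv"
done
cat "$d/revision-wikis-failed.csv"
```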
To rerun, make a list of the failed wikis for each table, then run sqoop for just those wikis and tables, passing a "rerun" snapshot to the sqoop script so that the output goes to a separate folder. Once done, move the output to the right location and manually write the _SUCCESS flags on HDFS. For the example above, you would make two files, "revision-wikis-failed.csv" and "pagelinks-wikis-failed.csv", and use them as follows:
export PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python

/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
--job-name sqoop-mediawiki-monthly-2019-03-rerun-revision \
--labsdb \
--output-dir /wmf/data/raw/mediawiki/tables \
--wiki-file /tmp/revision-wikis-failed.csv \
--tables revision \
--user s53272 \
--password-file /user/hdfs/mysql-analytics-labsdb-client-pw.txt \
--partition-name snapshot \
--partition-value 2019-03-rerun \
--mappers 64 \
--processors 3 \
--output-format avrodata \
--log-file /var/log/refinery/sqoop-mediawiki-rerun-2019-03-revision.log

/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
--job-name sqoop-mediawiki-monthly-2019-03-rerun-pagelinks \
--labsdb \
--output-dir /wmf/data/raw/mediawiki/tables \
--wiki-file /tmp/pagelinks-wikis-failed.csv \
--tables pagelinks \
--user s53272 \
--password-file /user/hdfs/mysql-analytics-labsdb-client-pw.txt \
--partition-name snapshot \
--partition-value 2019-03-rerun \
--mappers 64 \
--processors 3 \
--output-format avrodata \
--log-file /var/log/refinery/sqoop-mediawiki-rerun-2019-03-pagelinks.log
And then you would move /wmf/data/raw/mediawiki/tables/revision/snapshot=2019-03-rerun/wiki_db=enwiktionary to /wmf/data/raw/mediawiki/tables/revision/snapshot=2019-03/wiki_db=enwiktionary, and do the same for the other three outputs (two tables x two wikis). Finally, write the _SUCCESS flags that are missing from the root of the table directory on HDFS. Be aware that once you do, jobs depending on these flags (like the mediawiki-history-load job) may kick off.
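A sketch that prints the needed HDFS commands for review (the wiki:table pairs come from the example above; run the printed commands by hand once they look right):

```shell
# Hypothetical sketch: print the mv commands that relocate rerun output
# into the real snapshot; echo lets you review before running anything.
base=/wmf/data/raw/mediawiki/tables
for pair in enwiktionary:revision svwiki:revision viwiki:pagelinks metawiki:pagelinks; do
    wiki=${pair%%:*}; table=${pair#*:}
    echo hdfs dfs -mv "$base/$table/snapshot=2019-03-rerun/wiki_db=$wiki" \
                      "$base/$table/snapshot=2019-03/wiki_db=$wiki"
done
# Missing _SUCCESS flags can then be written with `hdfs dfs -touchz`, e.g.:
echo hdfs dfs -touchz "$base/revision/snapshot=2019-03/_SUCCESS"
```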
See sqoop errors
Logs are in "/var/log/refinery", grep for ERROR
nuria@an-coord1001:/var/log/refinery$ more sqoop-mediawiki.log | grep ERROR | more
2018-03-02T10:22:27 ERROR ERROR: zhwiki.revision (try 1)
2018-03-02T10:31:20 ERROR ERROR: zhwiki.pagelinks (try 1)
2018-03-02T11:09:17 ERROR ERROR: svwiki.pagelinks (try 1)
2018-03-02T11:30:38 ERROR ERROR: zhwiki.pagelinks (try 2)
2018-03-02T13:17:17 ERROR ERROR: viwiki.pagelinks (try 1)
QA: Assessing quality of a snapshot
Once denormalization has run, we need to check that the created snapshot is of good quality (i.e. data should match the last snapshot; bugs might have been introduced since the last snapshot was run).
Automatic validation steps. What to do if things fail
There is an automated validation step; the algorithm that checks events is described at Analytics/Systems/Cluster/Mediawiki_History_Snapshot_Check and is executed for the top wikis in terms of events.
If automatic validation fails, you will get an e-mail with a path to an error file; a command like the one below lets you look at the errors:
hdfs dfs -text hdfs://analytics-hadoop/wmf/data/wmf/mediawiki/history_check_errors/snapshot=2018-10/part-00000-3d79f06b-5604-4525-8976-8a5f351313ff-c000.json.snappy > errors.txt
The format of this error file is:
{
  "wiki_db": "liwiki",
  "event_entity": "userHistory",
  "event_type": "alterblocks",
  "growths": {
    "growth_count_user_event": -0.016260162601626018,  <- this is a ratio: the number of events has decreased by 1.6%
    "growth_distinct_user_id": 0,
    "growth_distinct_user_text": 0,
    "growth_count_user_group_bot": 0,
    "growth_count_user_anonymous": null,
    "growth_count_user_self_created": 0
  }
}
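A quick way to surface only the ratios outside the accepted band is an awk filter like the sketch below (the sample lines mimic the error-file format; the 1% threshold is the one discussed in the next section):

```shell
# Hypothetical sketch: flag any growth ratio outside +/-1%;
# sample data below stands in for a decompressed error file.
cat > /tmp/errors-sample.txt <<'EOF'
    "growth_count_user_event": -0.016260162601626018,
    "growth_distinct_user_id": 0,
    "growth_count_user_anonymous": null,
EOF
awk -F': *' '/"growth_/ {
    name = $1; gsub(/[ "]/, "", name)   # strip indentation and quotes
    val  = $2; gsub(/,/, "", val)
    v = val + 0                         # "null" becomes 0 and is skipped
    if (v > 0.01 || v < -0.01) print name, v
}' /tmp/errors-sample.txt
```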
Things to look for in error file:
- Ratios: we should have more events, not fewer. We accept a variability of plus or minus 1%; variations over this threshold need to be looked at.
- If the number of events is smaller, perhaps sqoop failed in a non-obvious way. Check the size of the snapshots for all tables; you might see a difference. A healthy sequence of snapshot sizes looks like the following:
nuria@stat1007:~/mediawiki-history-fail2018-10$ hdfs dfs -du -s -h /wmf/data/raw/mediawiki/tables/page/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
11.3 G  33.8 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-04
11.4 G  34.1 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-05
11.5 G  34.4 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-06
11.5 G  34.6 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-07
11.7 G  35.0 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-08
11.7 G  35.2 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-09
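The month-over-month growth in such a listing can be computed with a small awk sketch (sample data below mirrors the tail of the listing above; this assumes all sizes share the same unit, G here):

```shell
# Hypothetical sketch: percent change of the first size column between
# consecutive snapshots in `hdfs dfs -du -s -h` output.
cat > /tmp/du-sample.txt <<'EOF'
11.5 G  34.6 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-07
11.7 G  35.0 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-08
11.7 G  35.2 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-09
EOF
awk '{
    if (prev > 0) printf "%s: %+.1f%%\n", $NF, ($1 - prev) / prev * 100
    prev = $1
}' /tmp/du-sample.txt
```

A snapshot that shrinks by more than about 1% is worth investigating.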
Have events disappeared or moved mysteriously?
Check the mediawiki history statistics. Look at the number of errors in the event entity that you're investigating. For example, how many parse errors were there in pageHistory? Compare that figure to the last snapshot's. If any error count has grown disproportionately, that might be the source of the problem. Once you find something to focus on, run the mediawiki history process on only one relatively small wiki (every a-team member has a favorite small wiki; joal's is simplewiki, fdans's is glwiki), but in the parameters, make sure that errors are not being dropped. Examine the errors and see if there is something definitely going wrong.
Manually compare data with available data sources
Example: data is available for all wikipedias on pages like https://en.wikipedia.org/wiki/Special:Statistics
That page lists, for example, the number of articles; does the data returned by the request below match that number?
https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900
A handy link to transform JSON data into CSV that can be exported into a spreadsheet for easy computations: [1]
Is there data for all types of editors including anonymous editors?
https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/anonymous/all-page-types/monthly/2016030100/2018042400
Data Loading
Analytics/Systems/Data_Lake/Edits/Pipeline/Data_loading
How is this data gathered: public data from labs
The sqoop job runs on 1003 (although that might change; check puppet) and thus far it logs to /var/log/refinery/sqoop-mediawiki.log
How is this data gathered: ad-hoc private replicas
Let's go over how to run this process for ad-hoc private replicas (which we do once in a while to be able to analyze editing data that's not public).
- Keep in mind that after you do the next step, the following job will trigger automatically if the _SUCCESS flags are written and the Oozie datasets are considered updated: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/mediawiki/history/denormalize
- Run the same cron that pulls from labs replicas but with the following changes:
- $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/prod_grouped_wikis.csv'
- $db_host = 'analytics-store.eqiad.wmnet'
- $db_user = 'research'
- $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
- $log_file = '/var/log/refinery/sqoop-mediawiki-<<something like manual-2017-07_private>>.log'
- For the command itself:
- --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private
- --snapshot <<YYYY-MM>>_private
- --timestamp <<YYYYMM>>01000000, where the month is the one after the snapshot month (so 08 if you're doing the 07 snapshot), e.g. 20170801000000
- remove --labsdb
IMPORTANT NOTE: After this sqoop is done, you'll probably want to run the mediawiki reconstruction and denormalization job. To do this, you'll need to do three things:
- Put the _SUCCESS flag in all /wmf/data/raw/mediawiki/tables/<<table>>/snapshot=<<YYYY-MM>>_private directories
- Run the oozie job with the _private suffix as implemented in this change: https://gerrit.wikimedia.org/r/#/c/370322/
- IMPORTANT: Copy the latest project_namespace_map snapshot to the same folder + _private because the spark job requires this, despite the correct path being configured on the oozie job. This is probably a small bug that we can fix if we end up running more than a handful of private snapshots.