We rebuild the MediaWiki edit history from the new database replicas on labs. Those databases hold only public data.
Data Loading description
How is this data gathered: public data from labs
The Sqoop job runs on 1003 (that might change, so check puppet) and so far it logs to: /var/log/refinery/sqoop-mediawiki.log
How is this data gathered: ad-hoc private replicas
Let's go over how to run this process for ad-hoc private replicas (which we do once in a while to be able to analyze editing data that's not public).
- Keep in mind that after you do the next step, the following job will trigger automatically if the _SUCCESS flags are written and the Oozie datasets are considered updated: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/mediawiki/history/denormalize
- Run the same cron that pulls from the labs replicas, but with the following changes:
  - $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/prod_grouped_wikis.csv'
  - $db_host = 'analytics-store.eqiad.wmnet'
  - $db_user = 'research'
  - $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
  - $log_file = '/var/log/refinery/sqoop-mediawiki-<<something like manual-2017-07_private>>.log'
- For the command itself:
  - --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private
  - --snapshot <<YYYY-MM>>_private
  - --timestamp <<YYYY>><<MM+1>>01000000, with MM+1 written as two digits (so 08 if you're doing the 07 snapshot, e.g. 20170801000000)
  - remove --labsdb
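The --timestamp value is easy to get wrong by hand, especially for a December snapshot, where the year rolls over as well. Below is a minimal sketch of how the three snapshot-dependent arguments could be derived from a YYYY-MM string. The helper name private_sqoop_args is hypothetical and not part of refinery. It only builds strings and does not run Sqoop.

```python
from datetime import date


def private_sqoop_args(snapshot_month: str) -> dict:
    """Build the snapshot-dependent arguments for a private sqoop run.

    snapshot_month is 'YYYY-MM', e.g. '2017-07'.
    Hypothetical helper for illustration only; not part of refinery.
    """
    year, month = (int(part) for part in snapshot_month.split("-"))
    date(year, month, 1)  # raises ValueError if the month is not 1..12

    # The timestamp is midnight on the first day of the FOLLOWING month,
    # so a December snapshot rolls over to January of the next year.
    if month == 12:
        cutoff = date(year + 1, 1, 1)
    else:
        cutoff = date(year, month + 1, 1)

    label = f"{year:04d}-{month:02d}_private"
    return {
        "--job-name": f"sqoop-mediawiki-monthly-{label}",
        "--snapshot": label,
        "--timestamp": cutoff.strftime("%Y%m%d") + "000000",
    }


if __name__ == "__main__":
    for flag, value in private_sqoop_args("2017-07").items():
        print(flag, value)
```

For the 2017-07 snapshot this yields the timestamp 20170801000000, matching the example in the list above.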
IMPORTANT NOTE: After this sqoop is done, you'll probably want to run the mediawiki reconstruction and denormalization job. To do this, you'll need to do three things:
- Put the _SUCCESS flag in all
- Run the oozie job with the _private suffix as implemented in this change: https://gerrit.wikimedia.org/r/#/c/370322/
- IMPORTANT: Copy the latest project_namespace_map snapshot to the same folder + _private because the spark job requires this, despite the correct path being configured on the oozie job. This is probably a small bug that we can fix if we end up running more than a handful of private snapshots.
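The copy in the last step can be sketched as follows, assuming "the same folder + _private" means the source snapshot directory name with _private appended. The helper name private_copy_command and the example path are hypothetical. The function only builds the argument list for the standard hdfs dfs -cp command, so the result can be inspected before anything is run.

```python
from typing import List


def private_copy_command(snapshot_dir: str) -> List[str]:
    """Return the argv for copying a snapshot directory to '<dir>_private'.

    snapshot_dir is the HDFS path of the latest project_namespace_map
    snapshot; the caller supplies it, since the real path is not given here.
    Hypothetical helper for illustration only.
    """
    source = snapshot_dir.rstrip("/")  # avoid producing '<dir>/_private'
    return ["hdfs", "dfs", "-cp", source, source + "_private"]


if __name__ == "__main__":
    # Hypothetical path, used only to show the shape of the command.
    print(" ".join(private_copy_command("/tmp/example/snapshot=2017-07")))
```

To actually perform the copy, the returned list could be passed to subprocess.run(..., check=True) on a host with HDFS access, after checking that the printed command points at the intended directories.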