Analytics/Systems/Data Lake/Administration/Edit/Pipeline

We rebuild the MediaWiki edit history from the new DB replicas on labs; those databases only hold public data.

Data Loading description

Analytics/Systems/Data_Lake/Edits/Pipeline/Data_loading

How is this data gathered: public data from labs

The Sqoop job runs on 1003 (although that might change; check puppet) and so far it logs to: /var/log/refinery/sqoop-mediawiki.log
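
To follow a run while it is happening, a minimal check (assuming you are shelled into the host named above) is to tail that log:

  # On the sqoop host (1003 at the time of writing; confirm in puppet),
  # follow the current run's progress:
  tail -f /var/log/refinery/sqoop-mediawiki.log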

How is this data gathered: ad-hoc private replicas

Let's go over how to run this process for ad-hoc private replicas (which we do once in a while to be able to analyze editing data that's not public).

  1. Keep in mind that after you do the next step, the following job will trigger automatically if the _SUCCESS flags are written and the Oozie datasets are considered updated: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/mediawiki/history/denormalize
  2. Run the same cron that pulls from the labs replicas, but with the following changes (see the naming sketch after this list):
    • $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/prod_grouped_wikis.csv'
    • $db_host = 'analytics-store.eqiad.wmnet'
    • $db_user = 'research'
    • $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
    • $log_file = '/var/log/refinery/sqoop-mediawiki-<<something like manual-2017-07_private>>.log'
    • For the command itself:
      • --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private
      • --snapshot <<YYYY-MM>>_private
      • --timestamp <<YYYY(MM+1)>>01000000, i.e. midnight on the first day of the month after the snapshot (e.g. 20170801000000 for the 2017-07 snapshot)
      • remove --labsdb
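
The exact command lives in the cron definition in puppet, so it is not reproduced here. The sketch below only derives the naming pieces that change for a private run (snapshot, job name, timestamp, log file), using 2017-07 as an assumed example month, so you can paste them into that command:

  #!/bin/bash
  # Sketch only: derive the values that differ for a private snapshot run.
  # The 2017-07 month is an example; the command that consumes these values is
  # the one from the labs cron in puppet, minus --labsdb, with the overrides above.

  snapshot_month='2017-07'                              # YYYY-MM being sqooped
  snapshot="${snapshot_month}_private"
  job_name="sqoop-mediawiki-monthly-${snapshot}"
  log_file="/var/log/refinery/sqoop-mediawiki-manual-${snapshot}.log"

  # --timestamp is the first second of the *following* month (GNU date).
  timestamp="$(date -d "${snapshot_month}-01 +1 month" +%Y%m%d)000000"

  echo "--job-name ${job_name} --snapshot ${snapshot} --timestamp ${timestamp}"
  echo "log file: ${log_file}"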

IMPORTANT NOTE: After this sqoop is done, you'll probably want to run the MediaWiki reconstruction and denormalization job. To do this, you'll need to do three things:

  • Put the _SUCCESS flag in all
    /wmf/data/raw/mediawiki/tables/<<table>>/snapshot=<<YYYY-MM>>_private
    directories (see the HDFS sketch after this list)
  • Run the oozie job with the _private suffix as implemented in this change: https://gerrit.wikimedia.org/r/#/c/370322/
  • IMPORTANT: Copy the latest project_namespace_map snapshot to the same folder with a _private suffix, because the Spark job requires it even though the correct path is configured on the Oozie job. This is probably a small bug we could fix if we end up running more than a handful of private snapshots.
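
As a rough sketch of the first and third bullets (the snapshot month, the table list, and the project_namespace_map path below are assumptions to adapt to your run, not confirmed values):

  #!/bin/bash
  # Sketch under assumptions: adjust the snapshot month, the table list, and
  # the project_namespace_map path to what was actually sqooped in your run.

  snapshot='2017-07_private'
  tables_base='/wmf/data/raw/mediawiki/tables'

  # Write the _SUCCESS flag into every table's private snapshot directory so
  # the Oozie datasets are considered updated.
  for table in archive logging page revision user; do   # example list only
    hdfs dfs -touchz "${tables_base}/${table}/snapshot=${snapshot}/_SUCCESS"
  done

  # Copy the latest project_namespace_map snapshot to a *_private sibling,
  # since the Spark job reads it from there (path below is an assumption).
  pnm_base='/wmf/data/raw/mediawiki/project_namespace_map'
  hdfs dfs -cp "${pnm_base}/snapshot=2017-07" "${pnm_base}/snapshot=${snapshot}"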