Analytics/Systems/Cluster/Edit history administration


We rebuild the MediaWiki edit history from the new DB replicas on labs; those databases only hold public data.


Administration

See crons

Sqoop jobs run from analytics1003.

sudo -u hdfs crontab -u hdfs -l

See errors

Logs are in /var/log/refinery; grep for ERROR:

nuria@analytics1003:/var/log/refinery$ more  sqoop-mediawiki.log | grep  ERROR  | more
2018-03-02T10:22:27 ERROR  ERROR: zhwiki.revision (try 1)
2018-03-02T10:31:20 ERROR  ERROR: zhwiki.pagelinks (try 1)
2018-03-02T11:09:17 ERROR  ERROR: svwiki.pagelinks (try 1)
2018-03-02T11:30:38 ERROR  ERROR: zhwiki.pagelinks (try 2)
2018-03-02T13:17:17 ERROR  ERROR: viwiki.pagelinks (try 1)

QA: Assessing quality of a snapshot

Once denormalization has run, we need to check that the newly created snapshot is of good quality (i.e. its data should match the last snapshot, since bugs might have been introduced since the last snapshot was run).

Compare data with available data sources

Example: data is available for all Wikipedias on pages like https://en.wikipedia.org/wiki/Special:Statistics

For every Wikipedia, that page lists, for example, the number of articles. Does the data returned by the request below match that number?

https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900
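For example, a quick way to pull a comparable number out of that response on the command line (a sketch assuming curl and jq are available, and that the response follows the usual AQS shape with items[0].results[].new_pages; check the actual payload):

curl -s 'https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900' \
  | jq '[.items[0].results[].new_pages] | add'   # sum of new content pages across all months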

A handy link to transform JSON data into CSV that can be imported into a spreadsheet for easy computations: [1]
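If that link is not handy, jq can also produce CSV directly (a sketch with the same assumptions about the response shape as above; new_pages is the per-month field for the edited-pages/new endpoint):

url='https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900'
# one "timestamp,value" row per month, ready to paste into a spreadsheet
curl -s "$url" | jq -r '.items[0].results[] | [.timestamp, .new_pages] | @csv'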


Is there data for all types of editors including anonymous editors?

https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/anonymous/all-page-types/monthly/2016030100/2018042400
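A quick sanity check on that response (a sketch, again assuming the usual AQS shape with the per-month value in an edits field) is to list any months that come back with zero anonymous edits, which would point at a problem with the snapshot:

curl -s 'https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/anonymous/all-page-types/monthly/2016030100/2018042400' \
  | jq -r '.items[0].results[] | select(.edits == 0) | .timestamp'   # should print nothing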

Data Loading

Analytics/Systems/Data_Lake/Edits/Pipeline/Data_loading

How is this data gathered: public data from labs

The sqoop job runs on analytics1003 (although that might change; check puppet) and so far logs to /var/log/refinery/sqoop-mediawiki.log.
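For example, to follow a run in progress (on analytics1003, or wherever puppet currently runs the job):

tail -f /var/log/refinery/sqoop-mediawiki.log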

How is this data gathered: ad-hoc private replicas

Let's go over how to run this process for ad-hoc private replicas (which we do once in a while to be able to analyze editing data that's not public).

  1. Keep in mind that after you do the next step, the following job will trigger automatically if the _SUCCESS flags are written and the Oozie datasets are considered updated: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/mediawiki/history/denormalize
  2. Run the same cron job that pulls from the labs replicas, but with the following changes (a command sketch follows this list):
    • $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/prod_grouped_wikis.csv'
    • $db_host = 'analytics-store.eqiad.wmnet'
    • $db_user = 'research'
    • $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
    • $log_file = '/var/log/refinery/sqoop-mediawiki-<<something like manual-2017-07_private>>.log'
    • For the command itself:
      • --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private
      • --snapshot <<YYYY-MM>>_private
      • --timestamp <<YYYY(MM+1)>>01000000, where MM+1 is the month after the snapshot month in two-digit format (so 08 if you're doing the 07 snapshot), e.g. 20170801000000
      • remove --labsdb
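As a hedged sketch, for a 2017-07 private snapshot the adjusted command would look roughly like the following. The script path and any arguments not listed above are assumptions: copy the exact invocation from the production cron in puppet and change only the pieces described above.

sudo -u hdfs /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
    --job-name sqoop-mediawiki-monthly-2017-07_private \
    --snapshot 2017-07_private \
    --timestamp 20170801000000
    # ...plus the rest of the production arguments, with the wiki file, db host,
    # db user, password file and log file pointed at the values listed above,
    # and with --labsdb removed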

IMPORTANT NOTE: After this sqoop is done, you'll probably want to run the MediaWiki reconstruction and denormalization job. To do this, you'll need to do three things (a command sketch follows the list):

  • Put the _SUCCESS flag in all
    /wmf/data/raw/mediawiki/tables/<<table>>/snapshot=<<YYYY-MM>>_private
    directories
  • Run the oozie job with the _private suffix as implemented in this change: https://gerrit.wikimedia.org/r/#/c/370322/
  • IMPORTANT: Copy the latest project_namespace_map snapshot to the same folder + _private because the spark job requires this, despite the correct path being configured on the oozie job. This is probably a small bug that we can fix if we end up running more than a handful of private snapshots.
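
A hedged sketch of those steps for a 2017-07_private snapshot (run as a user with HDFS write access, e.g. via sudo -u hdfs; the table glob and the project_namespace_map path are assumptions, so verify them against what sqoop actually wrote):

snapshot=2017-07_private
base=/wmf/data/raw/mediawiki/tables
# Write the _SUCCESS flag into every imported table directory so the
# Oozie datasets are considered complete.
for d in $(hdfs dfs -ls -d "$base"/*/snapshot="$snapshot" | awk '{print $NF}'); do
    hdfs dfs -touchz "$d/_SUCCESS"
done
# Copy the latest project_namespace_map snapshot to a *_private sibling,
# which the spark job currently expects.
hdfs dfs -cp \
    /wmf/data/raw/mediawiki/project_namespace_map/snapshot=2017-07 \
    /wmf/data/raw/mediawiki/project_namespace_map/snapshot="$snapshot"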