Analytics/Systems/Cluster/Edit history administration

From Wikitech

We rebuild the MediaWiki edit history from the new DB replicas on labs; those databases only hold public data.


Administration

Rerun what's needed

If sqoop fails, it will print out something like:

2019-04-02T14:58:09 ERROR  **************************************************
2019-04-02T14:58:09 ERROR  *  Jobs to re-run:
2019-04-02T14:58:09 ERROR  *    - enwiktionary:revision
2019-04-02T14:58:09 ERROR  *    - svwiki:revision
2019-04-02T14:58:09 ERROR  *    - viwiki:pagelinks
2019-04-02T14:58:09 ERROR  *    - metawiki:pagelinks
2019-04-02T14:58:09 ERROR  **************************************************

To rerun, make a list of the wikis that failed for each table, then run sqoop for just those wikis and just the tables you need, passing a "rerun" snapshot to the sqoop script so that the output goes to a separate folder. Once done, move the output to the right location and manually write the _SUCCESS flags on HDFS. For example, for the wikis above, you would make two files, "revision-wikis-failed.csv" and "pagelinks-wikis-failed.csv", and use them as follows:

export PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python

/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
    --job-name sqoop-mediawiki-monthly-2019-03-rerun-revision \
    --labsdb \
    --output-dir /wmf/data/raw/mediawiki/tables \
    --wiki-file /tmp/revision-wikis-failed.csv \
    --tables revision \
    --user s53272 \
    --password-file /user/hdfs/mysql-analytics-labsdb-client-pw.txt \
    --partition-name snapshot \
    --partition-value 2019-03-rerun \
    --mappers 64 \
    --processors 3 \
    --output-format avrodata \
    --log-file /var/log/refinery/sqoop-mediawiki-rerun-2019-03-revision.log

/usr/bin/python3 /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables \
    --job-name sqoop-mediawiki-monthly-2019-03-rerun-pagelinks \
    --labsdb \
    --output-dir /wmf/data/raw/mediawiki/tables \
    --wiki-file /tmp/pagelinks-wikis-failed.csv \
    --tables pagelinks \
    --user s53272 \
    --password-file /user/hdfs/mysql-analytics-labsdb-client-pw.txt \
    --partition-name snapshot \
    --partition-value 2019-03-rerun \
    --mappers 64 \
    --processors 3 \
    --output-format avrodata \
    --log-file /var/log/refinery/sqoop-mediawiki-rerun-2019-03-pagelinks.log

And then you would move /wmf/data/raw/mediawiki/tables/revision/snapshot=2019-03-rerun/wiki_db=enwiktionary to /wmf/data/raw/mediawiki/tables/revision/snapshot=2019-03/wiki_db=enwiktionary and same for the other three outputs (two tables x two wikis). Finally, write the _SUCCESS flags that are missing from the root of the table directory on HDFS. But be aware that once you do, jobs depending on these flags may kick off (like the mediawiki-history-load job).
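The move-and-flag step above is mechanical enough to script. Below is a sketch that prints the `hdfs dfs` commands for the example rerun rather than executing them, so you can review before running (the table/wiki pairs and snapshot names are the ones from this example):

```python
# Sketch: print the `hdfs dfs` commands that move rerun output into the
# main snapshot and write the missing _SUCCESS flags. Printing (rather
# than executing) lets you review first; remember that jobs waiting on
# the _SUCCESS flags may kick off as soon as they are written.
BASE = "/wmf/data/raw/mediawiki/tables"
SNAPSHOT = "2019-03"
RERUN = SNAPSHOT + "-rerun"

# table -> wikis that were re-sqooped (from the example above)
rerun_jobs = {
    "revision": ["enwiktionary", "svwiki"],
    "pagelinks": ["viwiki", "metawiki"],
}

commands = []
for table, wikis in rerun_jobs.items():
    for wiki in wikis:
        src = f"{BASE}/{table}/snapshot={RERUN}/wiki_db={wiki}"
        dst = f"{BASE}/{table}/snapshot={SNAPSHOT}/wiki_db={wiki}"
        commands.append(f"hdfs dfs -mv {src} {dst}")
    # one flag per table, at the root of the snapshot directory
    commands.append(f"hdfs dfs -touchz {BASE}/{table}/snapshot={SNAPSHOT}/_SUCCESS")

print("\n".join(commands))
```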

See sqoop errors

Logs are in "/var/log/refinery", grep for ERROR

nuria@an-coord1001:/var/log/refinery$ more  sqoop-mediawiki.log | grep  ERROR  | more
2018-03-02T10:22:27 ERROR  ERROR: zhwiki.revision (try 1)
2018-03-02T10:31:20 ERROR  ERROR: zhwiki.pagelinks (try 1)
2018-03-02T11:09:17 ERROR  ERROR: svwiki.pagelinks (try 1)
2018-03-02T11:30:38 ERROR  ERROR: zhwiki.pagelinks (try 2)
2018-03-02T13:17:17 ERROR  ERROR: viwiki.pagelinks (try 1)
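To turn ERROR lines like these into the per-table wiki lists used for a rerun, something like the following sketch works (it aggregates the `wiki.table (try N)` lines shown above; the final "Jobs to re-run" summary printed by sqoop is the authoritative list, this is just a convenience):

```python
import re
from collections import defaultdict

# Sketch: collect the failed wiki/table pairs from the log, grouping
# wikis by table so each group can go into a "<table>-wikis-failed.csv"
# file for the rerun.
log_lines = """\
2018-03-02T10:22:27 ERROR  ERROR: zhwiki.revision (try 1)
2018-03-02T10:31:20 ERROR  ERROR: zhwiki.pagelinks (try 1)
2018-03-02T11:09:17 ERROR  ERROR: svwiki.pagelinks (try 1)
2018-03-02T11:30:38 ERROR  ERROR: zhwiki.pagelinks (try 2)
2018-03-02T13:17:17 ERROR  ERROR: viwiki.pagelinks (try 1)
""".splitlines()

pattern = re.compile(r"ERROR: (\w+)\.(\w+) \(try (\d+)\)")
failed = defaultdict(set)
for line in log_lines:
    match = pattern.search(line)
    if match:
        wiki, table, _attempt = match.groups()
        failed[table].add(wiki)

for table, wikis in sorted(failed.items()):
    print(f"{table}-wikis-failed.csv: {', '.join(sorted(wikis))}")
```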

QA: Assessing quality of a snapshot

Once denormalization has run, we need to verify that the snapshot created is of good quality (i.e. data should match the last snapshot; bugs might have been introduced since the last snapshot was run).

Automatic validation steps. What to do if things fail

There is an automated validation step; the algorithm that runs to check events is described at Analytics/Systems/Cluster/Mediawiki_History_Snapshot_Check and is executed for the top wikis in terms of events.


If automatic validation fails, you will get an e-mail with a path to an error file. A command like the one below lets you look at the errors:

hdfs dfs -text hdfs://analytics-hadoop/wmf/data/wmf/mediawiki/history_check_errors/snapshot=2018-10/part-00000-3d79f06b-5604-4525-8976-8a5f351313ff-c000.json.snappy > errors.txt

The format of this error file is:

{
  "wiki_db": "liwiki",
  "event_entity": "userHistory",
  "event_type": "alterblocks",
  "growths": {
    "growth_count_user_event": -0.016260162601626018,
    "growth_distinct_user_id": 0,
    "growth_distinct_user_text": 0,
    "growth_count_user_group_bot": 0,
    "growth_count_user_anonymous": null,
    "growth_count_user_self_created": 0
  }
}

The growth values are ratios: a growth_count_user_event of -0.016 means the number of events has decreased by 1.6%.
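As a sketch, the decoded errors.txt (one JSON object per line, in the format above) can be scanned for growth ratios outside the accepted ±1% band; the threshold and field names here are taken from this page:

```python
import json

# Sketch: flag growth ratios outside the accepted +/-1% band in an
# errors file decoded with `hdfs dfs -text ... > errors.txt`.
THRESHOLD = 0.01

def flag_errors(json_lines, threshold=THRESHOLD):
    """Yield (wiki_db, event, metric, ratio) for out-of-band growths."""
    for line in json_lines:
        record = json.loads(line)
        event = f"{record['event_entity']}/{record['event_type']}"
        for metric, ratio in record["growths"].items():
            # null means the metric could not be computed; skip it
            if ratio is not None and abs(ratio) > threshold:
                yield record["wiki_db"], event, metric, ratio

# One record from the example above, trimmed for brevity.
sample = ['{"wiki_db": "liwiki", "event_entity": "userHistory",'
          ' "event_type": "alterblocks", "growths":'
          ' {"growth_count_user_event": -0.016260162601626018,'
          ' "growth_distinct_user_id": 0,'
          ' "growth_count_user_anonymous": null}}']

for flagged in flag_errors(sample):
    print(flagged)
```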

Things to look for in error file:

  • Ratios: we should have more events, not fewer. We accept a variability of plus/minus 1%, but variations over this threshold need to be looked at.
  • If the number of events is smaller, perhaps sqoop failed in a non-obvious way; check the size of the snapshots for all tables, you might see a difference. A healthy sequence of snapshot sizes looks like the following:
nuria@stat1007:~/mediawiki-history-fail2018-10$ hdfs dfs -du -s -h /wmf/data/raw/mediawiki/tables/page/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
11.3 G  33.8 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-04
11.4 G  34.1 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-05
11.5 G  34.4 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-06
11.5 G  34.6 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-07
11.7 G  35.0 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-08
11.7 G  35.2 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-09
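The "each snapshot should be at least as large as the last" check can be sketched over `hdfs dfs -du` output like the above (the sizes here are the example values; a real check would parse the live output for every table):

```python
# Sketch: flag any snapshot whose raw size shrank versus the previous
# one, using the `hdfs dfs -du -s -h` example output above.
du_output = """\
11.3 G  33.8 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-04
11.4 G  34.1 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-05
11.5 G  34.4 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-06
11.5 G  34.6 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-07
11.7 G  35.0 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-08
11.7 G  35.2 G  /wmf/data/raw/mediawiki/tables/page/snapshot=2018-09
"""

sizes = []
for line in du_output.splitlines():
    # columns: raw size, unit, replicated size, unit, path
    size, _unit, _repl, _runit, path = line.split(maxsplit=4)
    snapshot = path.rsplit("snapshot=", 1)[1]
    sizes.append((snapshot, float(size)))

# A healthy table grows (or at worst stays flat) month over month.
suspicious = [
    (prev_snap, cur_snap)
    for (prev_snap, prev), (cur_snap, cur) in zip(sizes, sizes[1:])
    if cur < prev
]
print("suspicious drops:", suspicious or "none")
```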

Manually compare data with available data sources

Example: data is available for all Wikipedias in pages like https://en.wikipedia.org/wiki/Special:Statistics

For every Wikipedia, that page lists, for example, the number of articles; does the data returned by the request below match that number?

https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900

A handy link to transform JSON data into CSV that can be exported into a spreadsheet for easy computations: [1]

Is there data for all types of editors including anonymous editors?

https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/anonymous/all-page-types/monthly/2016030100/2018042400
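A minimal sketch of this comparison: build the AQS REST URL and sum the monthly values from a response. The response shape below (an "items" list with per-month "results" entries carrying a "new_pages" count) is an assumption based on the AQS convention; check the live API for the exact field names.

```python
# Sketch: build an AQS (wikistats REST) metrics URL and aggregate the
# monthly values from a response, to compare against Special:Statistics.
def aqs_url(metric_path, project_and_filters, start, end):
    base = "https://wikimedia.org/api/rest_v1/metrics"
    return f"{base}/{metric_path}/{project_and_filters}/{start}/{end}"

url = aqs_url(
    "edited-pages/new",
    "en.wikipedia.org/all-editor-types/content/monthly",
    "2001010100",
    "2018032900",
)

# A trimmed, hypothetical sample response; a real check would fetch the
# URL and json.loads() the HTTP body.
sample_response = {
    "items": [{
        "results": [
            {"timestamp": "2018-01-01T00:00:00.000Z", "new_pages": 60000},
            {"timestamp": "2018-02-01T00:00:00.000Z", "new_pages": 55000},
        ]
    }]
}

total = sum(r["new_pages"] for r in sample_response["items"][0]["results"])
print(url)
print("total new pages in sample:", total)
```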

Data Loading

Analytics/Systems/Data_Lake/Edits/Pipeline/Data_loading

How is this data gathered: public data from labs

The sqoop job runs on 1003 (although that might change; check puppet) and thus far it logs to /var/log/refinery/sqoop-mediawiki.log

How is this data gathered: ad-hoc private replicas

Let's go over how to run this process for ad-hoc private replicas (which we do once in a while to be able to analyze editing data that's not public).

  1. Keep in mind that after you do the next step, the following job will trigger automatically if the _SUCCESS flags are written and the Oozie datasets are considered updated: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/mediawiki/history/denormalize
  2. Run the same cron that pulls from labs replicas but with the following changes:
    • $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/prod_grouped_wikis.csv'
    • $db_host = 'analytics-store.eqiad.wmnet'
    • $db_user = 'research'
    • $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
    • $log_file = '/var/log/refinery/sqoop-mediawiki-<<something like manual-2017-07_private>>.log'
    • For the command itself:
      • --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private
      • --snapshot <<YYYY-MM>>_private
      • --timestamp set to the first day of the month after the snapshot, in YYYYMMDDHHMMSS format (e.g. 20170801000000 for the 2017-07 snapshot)
      • remove --labsdb
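The snapshot-dependent values in the list above can be derived mechanically. A sketch (the helper name is made up; it just applies the naming and timestamp rules stated above):

```python
from datetime import date

# Sketch: compute the --job-name, --snapshot and --timestamp values for
# a private snapshot. The timestamp is the first day of the month after
# the snapshot month, as YYYYMMDDHHMMSS.
def private_sqoop_args(year, month):
    next_month = date(year + (month == 12), month % 12 + 1, 1)
    snapshot = f"{year}-{month:02d}_private"
    return {
        "--job-name": f"sqoop-mediawiki-monthly-{snapshot}",
        "--snapshot": snapshot,
        "--timestamp": next_month.strftime("%Y%m%d%H%M%S"),
    }

args = private_sqoop_args(2017, 7)
print(args)
```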

IMPORTANT NOTE: After this sqoop is done, you'll probably want to run the mediawiki reconstruction and denormalization job. To do this, you'll need to do three things:

  • Put the _SUCCESS flag in all
    /wmf/data/raw/mediawiki/tables/<<table>>/snapshot=<<YYYY-MM>>_private
    directories
  • Run the oozie job with the _private suffix as implemented in this change: https://gerrit.wikimedia.org/r/#/c/370322/
  • IMPORTANT: Copy the latest project_namespace_map snapshot to the same folder + _private because the spark job requires this, despite the correct path being configured on the oozie job. This is probably a small bug that we can fix if we end up running more than a handful of private snapshots.