
Analytics/Archive/Webrequest partitions monitoring

This page contains historical information. It may be outdated or unreliable.

Monitoring

Icinga monitoring

Since the cluster runs on trusty while Icinga runs on precise, and trusty's send_nsca cannot talk to precise's Icinga, Icinga's webrequest partition monitoring has been turned off for now.

Bringing that monitoring back would come down to

  1. Finding a way for trusty hosts to use send_nsca to talk to Icinga (e.g.: installing precise's send_nsca on trusty cluster nodes),
  2. Reverting puppet's clean-up Ib56dd1, and
  3. Reverting the corresponding refinery clean-up Ifd05ff.

Manual monitoring

Until ops decides how/whether to bring back Icinga monitoring, we resort to manual monitoring.

Essentially, it comes down to checking whether the partition's data directory (e.g.: hdfs:///wmf/data/raw/webrequest/webrequest_text/hourly/2015/01/02/20) contains the done-flag (i.e.: an empty file called _SUCCESS).

If the done-flag exists, the partition has been checked and contains no missing lines and no duplicates.

If the done-flag does not exist, the partition either contains missing lines or duplicate lines, or the monitoring jobs did not run or failed (see the section Dealing with faulty partitions below).
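To check a single partition by hand, you can test for the done-flag directly. A minimal sketch, assuming the standard hdfs CLI and the example partition path from above:

 # Does the text partition for 2015-01-02T20 carry the done-flag?
 PARTITION=hdfs:///wmf/data/raw/webrequest/webrequest_text/hourly/2015/01/02/20
 hdfs dfs -test -e "$PARTITION/_SUCCESS" && echo "ok" || echo "faulty or not yet checked"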

Since manual monitoring is tedious, there is a script that helps automate this checking: /home/qchris/cluster-scripts/dump_webrequest_status.sh on stat1002. Running it should give you a table with the hours for the last two days (the script optionally takes the number of hours to look back as a parameter) and status indicators showing whether a partition has been automatically marked ok, manually marked ok, or is still faulty.

Note that “manually marked ok” is a state stored only in the raw data itself: Oozie jobs will start, but the duplicate/missing stats are not reset. Hence, after the raw data gets cleaned up (i.e.: after 30 days), “manually marked ok” partitions appear as “faulty” to the script, as the done-flag has been removed along with the raw data. So be extra careful when running the script on partitions that are older than 30 days.

Running the script from a cron job and getting the output mailed should give you an impression of how the partitions are doing and where repair work is necessary.
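A hedged sketch of such a crontab entry (the schedule, the mail recipient, and the 48-hour look-back are placeholders):

 # Mail the status table for the last 48 hours every morning at 07:00
 0 7 * * * /home/qchris/cluster-scripts/dump_webrequest_status.sh 48 | mail -s 'webrequest partition status' you@example.org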

Until mid-December, all four partitions got cared for. Then QChris was told not to care about partitions we do not use, and hence only text and mobile got cared for.

The upcoming jobs for legacy_tsvs will again use all available partitions.

Dealing with faulty partitions

Whatever you find, file a Phabricator task for it. This will help later on (like in 2 years) when trying to understand why today's data had an upwards/downwards bump. The root Phabricator task is task T72085. Please file tasks below that (or its children).

Oozie's automatic loading step for webrequest partitions only performs basic checks for whether there are missing/duplicate lines and how many of them there are. This data for faulty hosts is available underneath hdfs:///wmf/data/raw/webrequests_faulty_hosts as TSV. The TSVs' format can be seen in the corresponding Hive query.
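The files can be read straight off the HDFS fuse mount; for example (path pattern taken from the sample output further down):

 # Faulty-hosts TSV for the upload partition of 2015-01-10T20
 cat /mnt/hdfs/wmf/data/raw/webrequests_faulty_hosts/upload/2015/1/10/20/000000_0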

Sometimes it can be useful to compare with the stats of good partitions. The statistics for those partitions are available underneath hdfs:///user/hive/warehouse/wmf_raw.db/webrequest_sequence_stats. So run for example:

 cat /mnt/hdfs/user/hive/warehouse/wmf_raw.db/webrequest_sequence_stats/webrequest_source\=text/year\=2015/month\=1/day\=2/hour\=20/000000_0 | tr '\1' "$(printf '\t')"


Those automatic statistics are a bit short on data, and are for example missing timing information. To get timing information, you can use the script /home/qchris/cluster-scripts/hive_select_missing_sequence_runs.sh on stat1002. The script takes webrequest_source, year, month, day, and hour as parameters. So for example:

 /home/qchris/cluster-scripts/hive_select_missing_sequence_runs.sh upload 2015 1 10 20

Hosts
    211 cp3003.esams.wikimedia.org
    177 cp3004.esams.wikimedia.org
     34 cp3015.esams.wmnet

| Host | Start of issue | End of issue |
| --- | --- | --- |
| cp3003.esams.wikimedia.org | 2015-01-10T20:32:53 | 2015-01-10T20:36:03 |
| cp3004.esams.wikimedia.org | 2015-01-10T20:39:24 | 2015-01-10T20:39:58 |
| cp3015.esams.wmnet | 2015-01-10T20:33:08 | 2015-01-10T20:33:13 |

Minimal Start:
2015-01-10T20:32:53

Maximal End:
2015-01-10T20:39:58

Head out missing-sequence_runs-upload-2015-01-10-20.tsv
hostname        missing_start   missing_end     missing_count   dt_before_missing       dt_after_missing
cp3003.esams.wikimedia.org      1001904676      1001904711      36      2015-01-10T20:32:53     2015-01-10T20:32:53
cp3003.esams.wikimedia.org      1001906360      1001906374      15      2015-01-10T20:32:53     2015-01-10T20:32:53
cp3003.esams.wikimedia.org      1001906627      1001906975      349     2015-01-10T20:32:53     2015-01-10T20:32:53
cp3003.esams.wikimedia.org      1001907160      1001907776      617     2015-01-10T20:32:53     2015-01-10T20:32:53
cp3003.esams.wikimedia.org      1001908029      1001908941      913     2015-01-10T20:32:54     2015-01-10T20:32:54
cp3003.esams.wikimedia.org      1001909112      1001909371      260     2015-01-10T20:32:54     2015-01-10T20:32:54
cp3003.esams.wikimedia.org      1001909624      1001909778      155     2015-01-10T20:32:54     2015-01-10T20:32:54
cp3003.esams.wikimedia.org      1001910031      1001910645      615     2015-01-10T20:32:54     2015-01-10T20:32:54
cp3003.esams.wikimedia.org      1001911725      1001911730      6       2015-01-10T20:32:54     2015-01-10T20:32:54

Faulty hosts file: /mnt/hdfs/wmf/data/raw/webrequests_faulty_hosts/upload/2015/1/10/20/000000_0
cp3003.esams.wikimedia.org      994990553       1007486423      12424427        12495871        71444   0       0       -0.5717408574400296     upload  2015    1       10      20
cp3004.esams.wikimedia.org      992636128       1005102555      12416906        12466428        49522   0       0       -0.3972428990886564     upload  2015    1       10      20
cp3015.esams.wmnet      990369537       1002811711      12432250        12442175        9925    0       0       -0.0797690114469536     upload  2015    1       10      20

Total duplicates:
0

Total missing:
130891

upload 2015 01 10 20

This output gives a brief overview of how many output lines could be found for each host. This is followed by a table in Phabricator format (copy/paste :-)) of when the issues started/ended for each host, then the minimum start time and maximum end time of issues for this partition, and then the first 10 lines of the detailed stats, which helps to get a first impression. Then comes the full faulty_hosts file from the automatic Oozie job, to put things in perspective. Finally, the total number of duplicates and missing lines, and the webrequest_source/time for the partition again.

If the missing_start column is 0 and the issues basically cover the whole hour, most of the time the varnishes got restarted (caused for example by a configuration update). File such issues under task T74300.

If there are 1 or 2 missing messages at the end of one hour, and 1 or 2 missing messages at the beginning of the subsequent hour, it might be the result of a race condition. See task T71615. The HiveQL file /home/qchris/refinery/two_hour_stats.hql on stat1002 will help you compute stats for merging two hours and see whether or not this race condition triggered the issue.

Sometimes, there are partitions that miss a single message without neighbouring partitions being affected. File those at task T76977.

These days, esams is causing a lot of issues. So if a partition is only having issues with esams caches, task T74809 might be a good place to file it under.

If there are duplicates, consider deduplication. This is not productionized, but a rough helper script is available at I8f324b.

If there are only missing lines, it might be that Camus took more than two hours to finish importing the data. This basically only happens if the cluster is severely overloaded. See task T85704.

From time to time, analytics1021 gets kicked out of its partition leader role (See http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics1012.eqiad.wmnet|analytics1018.eqiad.wmnet|analytics1021.eqiad.wmnet|analytics1022.eqiad.wmnet&mreg[]=kafka.network.RequestMetrics.Produce-RequestsPerSec.OneMinuteRate&z=large&gtype=stack&title=kafka.network.RequestMetrics.Produce-RequestsPerSec.OneMinuteRate&aggregate=1&r=hour). If that causes issues for the partitions, file them under task T72087.

In general, ganglia's “Views” tab, and especially the “kafka” and “varnishkafka-webrequest” graphs, are helpful to see roughly how the pipeline is doing and what went on when.

Once you have understood the impact of the partition being faulty, and you think the jobs should run on the partition nonetheless, you can create the partition's done-flag by hand. To ease that, you can use the /home/qchris/cluster-scripts/hdfs_mark_webrequest_partition_done.sh script on analytics1027. Run it for example like

 sudo -u hdfs ~/cluster-scripts/hdfs_mark_webrequest_partition_done.sh text 2015 1 5 17

to manually mark the text partition for 2015-01-05T17 as ok.
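If the script is not at hand, creating the done-flag by hand amounts to creating the empty _SUCCESS file in the partition's data directory (see the path layout described above). A minimal sketch, assuming the same example partition and the hdfs user:

 # Equivalent by hand: create the empty _SUCCESS done-flag for text, 2015-01-05T17
 sudo -u hdfs hdfs dfs -touchz /wmf/data/raw/webrequest/webrequest_text/hourly/2015/01/05/17/_SUCCESS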

When creating the done-flag (either manually, or through the script), please also log it in the #wikimedia-analytics channel by saying

 !log Marked raw $WEBREQUEST_SOURCE webrequest partition for $YEAR-$MONTH-${DAY}T$HOUR/1H ok (See {{PhabT|INSERT_TASK_NUMBER_HERE}})

in that channel. If analytics-logbot is not in the channel, also document it in the Analytics/Server Admin Log.

Once you have marked a partition as ok, make sure that the relevant Oozie jobs also start automatically (or restart them if needed).
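A hedged sketch of checking and restarting jobs with the standard Oozie CLI (the coordinator id and action number are placeholders, OOZIE_URL is assumed to be set, and which coordinators are relevant depends on the current job setup):

 # List running coordinators and look for the webrequest ones
 oozie jobs -jobtype coordinator -filter status=RUNNING | grep -i webrequest
 # Rerun a stuck coordinator action for the repaired hour (placeholder id and action number)
 oozie job -rerun 0000123-150110123456789-oozie-oozi-C -action 42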

Also make sure that you log issues with datasets that depend on data from HDFS, like pagecounts-all-sites.