Analytics/Archive/Webrequest partitions monitoring
Monitoring
Icinga monitoring
Since the cluster runs on trusty while Icinga runs on precise, and trusty's send_nsca and precise's Icinga cannot talk to each other, Icinga's webrequest partition monitoring has been turned off for now.
Bringing that monitoring back would come down to
- Finding a way for trusty hosts to use send_nsca to talk to Icinga (e.g.: installing precise's send_nsca on the trusty cluster nodes),
- Reverting puppet's clean-up Ib56dd1, and
- Reverting the corresponding refinery clean-up Ifd05ff.
Manual monitoring
Until ops decides how/whether to bring back Icinga monitoring, we resort to manual monitoring.
Essentially, it comes down to checking whether the partition's data directory (e.g.: hdfs:///wmf/data/raw/webrequest/webrequest_text/hourly/2015/01/02/20) contains the done-flag (i.e.: an empty file called _SUCCESS).
If the done-flag exists, the partition has been checked and contains neither missing lines nor duplicates.
If the done-flag does not exist, the partition contains missing lines or duplicate lines, or the monitoring jobs did not run or failed (see the section Dealing with faulty partitions below).
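For a quick manual check of a single partition, you can test for the done-flag directly from a cluster node (a minimal sketch; the path is the example partition from above):
hdfs dfs -test -e /wmf/data/raw/webrequest/webrequest_text/hourly/2015/01/02/20/_SUCCESS && echo "marked ok" || echo "faulty, or monitoring did not run"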
Since manual monitoring is tedious, there is a script that helps automate this checking: /home/qchris/cluster-scripts/dump_webrequest_status.sh on stat1002.
Running it should give you a table with the hours for the last two days (the script optionally takes the number of hours to look back as a parameter) and status indicators showing whether a partition has been automatically marked ok, manually marked ok, or is still faulty.
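For example, to look back 72 hours instead of the default two days (a minimal sketch, assuming the hour count is simply passed as the first argument, as described above):
/home/qchris/cluster-scripts/dump_webrequest_status.sh 72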
Note that “manually marked ok” is a state only stored in the raw data itself. Oozie jobs will start, but the duplicate/missing stats are not reset. Hence, after the raw data gets cleaned up (i.e.: after 30 days), “manually marked ok” partitions appear as “faulty” to the script, as the done-flag has been removed along with the raw data. So be extra careful when running the script on partitions that are older than 30 days.
Running the script in a cron and getting the output mailed should give you an impression of how the partitions are doing, and where repair work is necessary.
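A minimal crontab sketch (the MAILTO address and schedule are placeholders, not the actual setup) that mails the status table once a day:
MAILTO=your.name@wikimedia.org
0 6 * * * /home/qchris/cluster-scripts/dump_webrequest_status.sh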
Until mid-December, all four partitions were cared for. Then QChris was told not to care about partitions we do not use, and hence only text and mobile got cared for.
The upcoming jobs for legacy_tsvs will again use all available partitions.
Dealing with faulty partitions
Whatever you find, file a Phabricator task for it. This will help later on (like in two years) when trying to understand why today's data had an upward/downward bump. The root Phabricator task is task T72085. Please file tasks below that one (or its children).
Oozie's automatic loading step for webrequest partitions only performs basic checks on whether there are missing/duplicate lines and how many of them there are. This data for faulty hosts is available underneath hdfs:///wmf/data/raw/webrequests_faulty_hosts
as TSV. The TSV's format can be seen in the corresponding Hive query.
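For example, to look at the faulty-hosts TSV for the upload partition of 2015-01-10T20 through the fuse mount on stat1002 (the same file that shows up in the example output further down):
cat /mnt/hdfs/wmf/data/raw/webrequests_faulty_hosts/upload/2015/1/10/20/000000_0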
(Sometimes it can be useful to compare with the stats of good partitions. The statistics for those partitions are available underneath hdfs:///user/hive/warehouse/wmf_raw.db/webrequest_sequence_stats
. So run for example:
cat /mnt/hdfs/user/hive/warehouse/wmf_raw.db/webrequest_sequence_stats/webrequest_source\=text/year\=2015/month\=1/day\=2/hour\=20/000000_0 | tr '\1' "$(printf '\t')"
).
Those automatic statistics are a bit short on data, and are for example missing timing information. To get timing information, you can use the script /home/qchris/cluster-scripts/hive_select_missing_sequence_runs.sh
on stat1002.
The script takes webrequest_source, year, month, day, and hour as parameters. So for example:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 13:04:13 // exit code: 0
cwd: ~/cluster-scripts
/home/qchris/cluster-scripts/hive_select_missing_sequence_runs.sh upload 2015 1 10 20

Hosts
    211 cp3003.esams.wikimedia.org
    177 cp3004.esams.wikimedia.org
     34 cp3015.esams.wmnet

| Host                       | Start of issue      | End of issue        |
| ---                        | ---                 | ---                 |
| cp3003.esams.wikimedia.org | 2015-01-10T20:32:53 | 2015-01-10T20:36:03 |
| cp3004.esams.wikimedia.org | 2015-01-10T20:39:24 | 2015-01-10T20:39:58 |
| cp3015.esams.wmnet         | 2015-01-10T20:33:08 | 2015-01-10T20:33:13 |

Minimal Start: 2015-01-10T20:32:53
Maximal End:   2015-01-10T20:39:58

Head of missing-sequence_runs-upload-2015-01-10-20.tsv
hostname                    missing_start  missing_end  missing_count  dt_before_missing    dt_after_missing
cp3003.esams.wikimedia.org  1001904676     1001904711   36             2015-01-10T20:32:53  2015-01-10T20:32:53
cp3003.esams.wikimedia.org  1001906360     1001906374   15             2015-01-10T20:32:53  2015-01-10T20:32:53
cp3003.esams.wikimedia.org  1001906627     1001906975   349            2015-01-10T20:32:53  2015-01-10T20:32:53
cp3003.esams.wikimedia.org  1001907160     1001907776   617            2015-01-10T20:32:53  2015-01-10T20:32:53
cp3003.esams.wikimedia.org  1001908029     1001908941   913            2015-01-10T20:32:54  2015-01-10T20:32:54
cp3003.esams.wikimedia.org  1001909112     1001909371   260            2015-01-10T20:32:54  2015-01-10T20:32:54
cp3003.esams.wikimedia.org  1001909624     1001909778   155            2015-01-10T20:32:54  2015-01-10T20:32:54
cp3003.esams.wikimedia.org  1001910031     1001910645   615            2015-01-10T20:32:54  2015-01-10T20:32:54
cp3003.esams.wikimedia.org  1001911725     1001911730   6              2015-01-10T20:32:54  2015-01-10T20:32:54

Faulty hosts file: /mnt/hdfs/wmf/data/raw/webrequests_faulty_hosts/upload/2015/1/10/20/000000_0
cp3003.esams.wikimedia.org  994990553  1007486423  12424427  12495871  71444  0  0  -0.5717408574400296  upload  2015  1  10  20
cp3004.esams.wikimedia.org  992636128  1005102555  12416906  12466428  49522  0  0  -0.3972428990886564  upload  2015  1  10  20
cp3015.esams.wmnet          990369537  1002811711  12432250  12442175  9925   0  0  -0.0797690114469536  upload  2015  1  10  20

Total duplicates: 0
Total missing: 130891
upload 2015 01 10 20 pass /home/qchris/cluster-scripts/hive_select_missing_sequence_runs.sh
This output gives a brief overview of how many output lines were found for each host. This is followed by a table in Phabricator format (copy/paste :-)) of when the issues started/ended for each host, then the minimum start time and maximum end time of issues for this partition, and then the first 10 lines of the detailed stats, which help to get a first impression. After that follows the full faulty_hosts file from the automatic Oozie job, to put things in perspective. Finally come the total number of duplicates and missing lines, and the webrequest_source/time for the partition again.
If the missing_start column is 0 and the issue basically covers the whole hour, most of the time the varnishes got restarted (caused for example by a configuration update). File such issues under task T74300.
If there are 1 or 2 missing messages at the end of one hour, and 1 or 2 missing messages at the beginning of the subsequent hour, it might be the result of a race condition. See task T71615. The HiveQL file /home/qchris/refinery/two_hour_stats.hql
on stat1002
will help you compute stats for merging two hours and see whether or not this race condition triggered the issue.
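A sketch of how to invoke it on stat1002 (how the file expects the two adjacent hours to be passed is an assumption; check the header of the .hql file for the hiveconf variables it actually uses):
hive -f /home/qchris/refinery/two_hour_stats.hql   # plus the --hiveconf parameters defined in the file's header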
Sometimes there are partitions that miss a single message without neighbouring partitions being affected. File those under task T76977.
These days, esams is causing a lot of issues. So if a partition is only having issues with esams caches, task T74809 might be a good place to file it under.
If there are duplicates, consider deduplication. This is not productionized, but a rough helper script is available at I8f324b.
If there are only missing lines, it might be that Camus took more than two hours to finish importing the data. This basically only happens if the cluster is severely overloaded. See task T85704.
From time to time, analytics1021 gets kicked out of its partition leader role (see http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics1012.eqiad.wmnet|analytics1018.eqiad.wmnet|analytics1021.eqiad.wmnet|analytics1022.eqiad.wmnet&mreg[]=kafka.network.RequestMetrics.Produce-RequestsPerSec.OneMinuteRate&z=large&gtype=stack&title=kafka.network.RequestMetrics.Produce-RequestsPerSec.OneMinuteRate&aggregate=1&r=hour). If that causes issues for the partitions, file them under task T72087.
In general, ganglia's “Views” tab, and especially the “kafka” and “varnishkafka-webrequest” graphs, are helpful to see roughly how the pipeline is doing, and let you see what went on when.
Once you understand the impact of the partition being faulty, and you think the jobs should run on the partition nonetheless, you can create the partition's done-flag by hand. To ease that, you can use the /home/qchris/cluster-scripts/hdfs_mark_webrequest_partition_done.sh
script on analytics1027.
Run it for example like
sudo -u hdfs ~/cluster-scripts/hdfs_mark_webrequest_partition_done.sh text 2015 1 5 17
to manually mark the text partition for 2015-01-05T17 as ok.
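Afterwards you can verify that the done-flag really got created, for example (path for the partition marked above):
hdfs dfs -ls /wmf/data/raw/webrequest/webrequest_text/hourly/2015/01/05/17/_SUCCESS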
When creating the done-flag (either manually or through the script), please also log it in the #wikimedia-analytics channel by saying
!log Marked raw $WEBREQUEST_SOURCE webrequest partition for $YEAR-$MONTH-${DAY}T$HOUR/1H ok (See {{PhabT|INSERT_TASK_NUMBER_HERE}})
in that channel. If analytics-logbot is not in the channel, also document it in the Analytics/Server Admin Log.
Once you have marked a partition as ok, make sure that the relevant Oozie jobs also start automatically (or restart them if needed).
Also make sure that you log issues with datasets that depend on data from HDFS, like pagecounts-all-sites.