Analytics/Data quality/Traffic per city entropy
This page describes a study by the Analytics team to find a metric suitable for monitoring wiki traffic and raising an alarm whenever a significant traffic drop happens in a given country. Such a drop could be caused by an outage or by an active censorship event. These events happen relatively frequently and are difficult to identify given the nature of wiki traffic. The "raw traffic counts" metric is too variable (it has too much noise, seasonality, natural peaks and drops, etc.) to be used for anomaly detection, because it would generate too many false positives. An ideal metric would show a steady signal when traffic is regular, and show anomalies only when traffic is affected by a censorship or outage event.
We used the same approach as with the user agent entropy metrics: we applied the entropy UDAF for Hive to several fields of the pageview_hourly data set. The field that gave the best results was city. Entropy over the city field tells us how varied the distribution of traffic is among all cities in a given country. This metric seemed to be quite steady while there were no outage or censorship events, but clearly indicated anomalies when those events occurred. Here's an example query:
ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION entropy AS 'org.wikimedia.analytics.refinery.hive.EntropyUDAF';

-- Outer query: one entropy value per day, computed over the per-city counts.
SELECT
    dt,
    entropy(counts) AS value
FROM (
    -- Inner query: user pageviews per day and city for Iraq.
    SELECT
        CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS dt,
        city,
        COUNT(*) AS counts
    FROM wmf.pageview_hourly
    WHERE
        year = 2019 AND
        ((month = 9 AND day >= 1) OR (month = 10 AND day <= 20)) AND
        agent_type = 'user' AND
        country_code = 'IQ'
    GROUP BY
        CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
        city
) AS aux
GROUP BY dt
ORDER BY dt ASC
LIMIT 10000;
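To make the metric more concrete, here is a minimal Python sketch of the per-day computation. It assumes the UDAF computes standard Shannon entropy over the per-city counts; the logarithm base (2 here), the function name shannon_entropy and the city counts are illustrative assumptions and are not taken from the refinery code.

```python
import math


def shannon_entropy(counts, base=2.0):
    """Shannon entropy of a distribution given as raw counts.

    Assumption: this mirrors what an entropy aggregate would compute;
    the base used by the real UDAF may differ.
    """
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total, base) for c in positive)


# Hypothetical pageview counts per city for one country and one day.
city_counts = {"City A": 500, "City B": 300, "City C": 200}
value = shannon_entropy(city_counts.values())
print(f"city entropy: {value:.4f}")
```

The value is highest when traffic is spread evenly across many cities and falls to zero when all traffic comes from a single city.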
To check whether the metric was suitable for anomaly detection we used the report created by the Traffic team: Monitoring Wikipedia Accessibility Around The World. That report defines a set of outage/censorship events and also a set of false positive alterations of traffic. Based on those, we checked whether the metric showed a significant change during censorship/outage events, and whether it stayed steady enough when there were false positive traffic alterations.
The following charts compare the city entropy metric with the raw traffic metric for a few examples of censorship/outage events and false positives.
Iraq Oct 2019
On top we see the raw traffic metric (without self-identified bots), and on the bottom the traffic per city entropy metric discussed above. The censorship event happens in the right half of the charts. It is clear that the event affects both metrics, but the raw traffic is more difficult to alarm on, because it is naturally more unstable, with natural peaks, drops and different levels before and after the event. The city entropy metric, on the other hand, is steadier when not anomalous (it has less noise) and shows the same level before and after the event, which makes the anomaly much easier to detect.
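As an illustration of why a steady baseline makes alarming easier, the sketch below flags points that deviate strongly from the median of the preceding days. This is not the detection method used in the study: the function name flag_anomalies, the window size, the threshold and the sample values are all hypothetical.

```python
import statistics


def flag_anomalies(values, window=7, threshold=5.0):
    """Return indices whose value deviates strongly from the preceding window.

    The baseline is the median of the previous `window` points, and the
    deviation is scaled by their median absolute deviation (MAD).
    """
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        median = statistics.median(baseline)
        mad = statistics.median(abs(v - median) for v in baseline)
        scale = max(mad, 1e-3)  # floor avoids division by zero on flat baselines
        if abs(values[i] - median) / scale > threshold:
            flagged.append(i)
    return flagged


# Made-up daily entropy values: steady around 5.0, with a two-day drop.
entropy_series = [5.00, 5.02, 4.99, 5.01, 5.00, 4.98, 5.01, 5.00, 4.20, 4.10, 5.01]
print(flag_anomalies(entropy_series))  # flags indices 8 and 9
```

A noisier series, such as raw traffic with weekly seasonality, would need a much looser threshold to avoid false alarms, and a loose threshold would in turn hide smaller events.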
China Apr 2019
On top is the same raw traffic metric, and on the bottom the same city entropy metric. Here too, city entropy is more stable, and it is also affected by the censorship event on the right side of the charts.
Venezuela Jan 2019
Same metrics here. We can see again that the city entropy metric has much less noise than the raw traffic one. In this case, the censorship event is not perceivable in either metric. According to the Research and Traffic teams, this is expected because the event affected a small number of users.
India Oct 2019
This is an example of a false positive. In the first chart, the raw traffic one, we can see a drop on the far right. It corresponds to the Diwali festival in India. It is not a censorship event or an outage, but rather a legitimate drop. We can see that the city entropy metric does not drop at all, which means it would not generate a false positive. The city entropy metric is also more stable overall.
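One plausible explanation for this behavior is that entropy depends only on each city's share of the traffic, not on the total volume. The Python sketch below illustrates that property with made-up numbers; that the Diwali drop was roughly uniform across cities is an assumption, not something the charts establish.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical per-city pageview counts.
normal_day = [500, 300, 200]
uniform_drop = [c * 0.6 for c in normal_day]  # every city loses 40% of its traffic
uneven_drop = [500, 300, 20]                  # a single city loses 90% of its traffic

print(shannon_entropy(normal_day))    # baseline
print(shannon_entropy(uniform_drop))  # same as baseline: shares are unchanged
print(shannon_entropy(uneven_drop))   # clearly lower: shares have shifted
```

A drop that hits all cities proportionally leaves the metric unchanged, while an event that affects only some cities shifts the shares and moves it.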
We concluded that:
- The "entropy of city distribution" metric is more steady (has less noise) than the raw traffic metric.
- It seems to change significantly whenever there's a censorship/outage event.
- It seems to be more robust to false positives than the raw traffic metric.
- It remains to be seen whether the "entropy of city distribution" metric has other properties that might cause false positives (for example, MaxMind upgrades might affect the metric).
- It remains to be seen whether the metric is indeed affected by most/all major censorship/outage events.