Analytics/Data quality/Traffic per city entropy

From Wikitech

This page describes a study by the Analytics team to find a metric suitable for monitoring wiki traffic and raising an alarm whenever a significant traffic drop happens in a given country. Such a drop could be caused by an outage or an active censorship event. These events happen relatively frequently and are difficult to identify given the nature of wiki traffic. The "raw traffic counts" metric is too variable (it has too much noise, seasonality, natural peaks and drops, etc.) to be used for anomaly detection, because it would generate too many false positives. An ideal metric would show a steady signal when traffic is regular, and show anomalies only when traffic is affected by a censorship or outage event.

Method

We used the same approach as with the user agent entropy metrics: we applied Hive's entropy UDAF to several fields of the pageview_hourly data set. The field that gave the best results was the city. Applying entropy to the city field tells us how varied the distribution of traffic is among all cities in a given country. This metric seemed to be quite steady while there were no outage or censorship events, but clearly indicated anomalies when those events occurred. Here's an example query:

ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION entropy AS 'org.wikimedia.analytics.refinery.hive.EntropyUDAF';
SELECT
    dt,
    entropy(counts) AS value
FROM (
    SELECT
        CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS dt,
        city,
        COUNT(*) AS counts
    FROM wmf.pageview_hourly
    WHERE
        year = 2019 AND
        ((month = 9 AND day >= 1) OR (month = 10 AND day <= 20)) AND
        agent_type = 'user' AND
        country_code = 'IQ'
    GROUP BY
        CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
        city
) AS aux
GROUP BY dt
ORDER BY dt ASC
LIMIT 10000;
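
To make the inner aggregation more concrete, here is a minimal Python sketch of what an entropy over per-city counts measures. It assumes the UDAF computes the Shannon entropy of the distribution implied by the counts; the logarithm base (bits here) and the exact behavior of the refinery implementation are assumptions, and the city names and counts are made up for illustration.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution implied by raw counts."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total  # share of traffic coming from this city
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical per-city pageview counts for one country on one day.
spread_out = {"city_a": 400, "city_b": 300, "city_c": 200, "city_d": 100}
concentrated = {"city_a": 970, "city_b": 10, "city_c": 10, "city_d": 10}

entropy_spread = shannon_entropy(spread_out.values())
entropy_concentrated = shannon_entropy(concentrated.values())

print(f"spread out:   {entropy_spread:.3f} bits")
print(f"concentrated: {entropy_concentrated:.3f} bits")
```

Traffic spread over many cities yields a higher value than traffic concentrated in a single city, so a sudden change in how traffic is distributed among cities shows up as a change in the metric.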

Criteria

To check whether the metric was suitable for anomaly detection, we used the report created by the Traffic team: Monitoring Wikipedia Accessibility Around The World. The report defines a set of outage/censorship events and also a set of traffic alterations that are false positives. Based on those, we observed whether the metric showed a significant change during censorship/outage events, and whether it stayed steady enough during the false positive traffic alterations.

Results

The following charts compare the city entropy metric with the raw traffic metric for a few examples of censorship/outage events and false positives.

Iraq Oct 2019

Comparison chart between raw traffic metric and city distribution entropy metric, for the event in Iraq in Oct 2019.

On top we see the raw traffic metric (without self-identified bots), and on the bottom the traffic per city entropy metric discussed above. The censorship event happens in the right half of the charts. The event clearly affects both metrics, but the raw traffic metric is harder to alarm on because it is naturally more unstable: it has natural peaks and drops, and sits at different levels before and after the event. The city entropy metric, on the other hand, is steadier when not anomalous (it has less noise) and shows the same level before and after the event, which makes the anomaly much easier to detect.
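
As an illustration of why a steady baseline is easier to alarm on, here is a hedged sketch of a simple detection rule. This page does not specify any alarm rule; the function `flag_anomalies`, its parameters, and the sample values are all hypothetical. The rule compares each day's value with the median of a trailing window, and measures the deviation in units of the window's median absolute deviation (MAD).

```python
import statistics

def flag_anomalies(values, window=14, threshold=5.0, min_scale=0.01):
    """Return indices whose value deviates strongly from the trailing window.

    All parameters are illustrative defaults, not tuned settings.
    min_scale keeps a perfectly flat window from flagging tiny changes.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        median = statistics.median(history)
        mad = statistics.median([abs(v - median) for v in history])
        scale = max(mad, min_scale)
        if abs(values[i] - median) / scale > threshold:
            flagged.append(i)
    return flagged

# Made-up daily entropy values: steady baseline, then a two-day drop.
series = [4.50, 4.52, 4.49, 4.51, 4.50, 4.48, 4.51, 4.50, 4.49,
          4.52, 4.50, 4.51, 4.49, 4.50, 4.51, 3.20, 3.10, 4.50]

anomalous_days = flag_anomalies(series)
print(anomalous_days)
```

With a low-noise series such as this one, the two-day drop stands far outside the usual variation. A noisy series with shifting levels would need a much looser threshold, which is the difficulty described above for raw traffic.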

China Apr 2019

Comparison chart between raw traffic metric and city distribution entropy metric, for the event in China in Apr 2019.

On top is the same raw traffic metric, and on the bottom the same city entropy metric. Here too, city entropy is more stable, and it is also affected by the censorship event on the right side of the charts.

Venezuela Jan 2019

Comparison chart between raw traffic metric and city distribution entropy metric, for the event in Venezuela in Jan 2019.

Same metrics here. We can see again that the city entropy metric has much less noise than the raw traffic one. In this case, the censorship event is not perceivable in either metric. According to the Research and Traffic teams, this is expected because the event affected a small number of users.

India Oct 2019

Comparison chart between raw traffic metric and city distribution entropy metric, for the event in India in Oct 2019.

This is an example of a false positive. In the first chart, the raw traffic one, we can see a drop on the far right. It corresponds to the Diwali festival in India. It is not a censorship event or an outage, but rather a legitimate drop in traffic. The city entropy metric does not drop at all, which means it would not generate a false positive. Overall, the city entropy metric is also more stable.
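
One property of entropy that may help explain this behavior: it depends only on the shares of traffic per city, not on the total volume. The sketch below uses made-up counts to show that a drop affecting every city proportionally leaves the entropy unchanged, while a drop concentrated in a few cities changes it. Whether the Diwali drop was actually proportional across cities is not stated on this page, so treat this as one plausible reading rather than an established cause.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution implied by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical per-city counts on a regular day.
normal_day = [5000, 3000, 1500, 500]

# Every city loses 40% of its traffic (e.g. a nationwide holiday).
proportional_drop = [c * 0.6 for c in normal_day]

# Only the two smallest cities lose 90% of their traffic.
localized_drop = [5000, 3000, 150, 50]

h_normal = shannon_entropy(normal_day)
h_proportional = shannon_entropy(proportional_drop)
h_localized = shannon_entropy(localized_drop)

print(f"normal:       {h_normal:.3f}")
print(f"proportional: {h_proportional:.3f}")
print(f"localized:    {h_localized:.3f}")
```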

Conclusions

We concluded that:

  1. The "entropy of city distribution" metric is more steady (has less noise) than the raw traffic metric.
  2. It seems to change significantly whenever there's a censorship/outage event.
  3. It seems to be more robust to false positives than the raw traffic metric.
  4. It remains to be seen whether the "entropy of city distribution" metric has other qualities that might cause false positives (for example, MaxMind upgrades might affect the metric).
  5. It remains to be seen whether the metric is indeed affected by most/all major censorship/outage events.