Analytics/AQS/Pageviews/Pageviews by country

From Wikitech

This page documents the implications and decisions taken regarding user privacy in the Pageviews by country data set.

Is Pageviews by country privacy sensitive?

The initial prototype version of this dataset was designed like this:

day         project        country  pageviews
2017-01-01  fr.wikipedia   ES       1234567
2017-01-02  de.wiktionary  BR       4321
2017-01-03  ja.wikivoyage  US       13
...         ...            ...      ...

This data set would not be privacy sensitive by itself, but it could threaten users' privacy when combined with other public WMF data sets, such as the revision table in MediaWiki databases, MediaWiki history derived data, or the Pageviews by project data set in the Pageview API. The initial prototype therefore had to be transformed before public release. The following sections explain what the final public data set looks like.

Monthly granularity

One possible attack on users' privacy combines Pageviews by country (PBC) with the revision table (REV) as follows. Suppose PBC states that fi.wiktionary had P pageviews from Greece on 2017-01-01, and REV records E edits for the same day and project. An edit usually generates 1 or 2 pageviews (except for bot edits), so E should always be smaller than P. However, if E is close enough to P, e.g. E = 123 and P = 130, one can assume with high confidence that all registered editors who edited fi.wiktionary on 2017-01-01 are located in Greece.
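
The comparison behind this attack can be sketched in a few lines of Python. This is only an illustration: the function names, the 0.9 threshold, and the second sample call are assumptions made here, not part of the study; only E = 123 and P = 130 come from the example above.

```python
def edit_to_pageview_ratio(edits: int, pageviews: int) -> float:
    """Return E / P, where E is the edit count for a project and day and
    P is the pageview count from one country for the same project and day."""
    if pageviews <= 0:
        return 0.0
    return edits / pageviews


def looks_risky(edits: int, pageviews: int, threshold: float = 0.9) -> bool:
    """Flag a tuple when E is close to P (hypothetical threshold)."""
    return edit_to_pageview_ratio(edits, pageviews) >= threshold


# Example from the text: E = 123 edits, P = 130 pageviews.
print(looks_risky(123, 130))     # ratio is roughly 0.95, so this is flagged
# Hypothetical large tuple: the edits are lost in the pageview volume.
print(looks_risky(123, 50_000))
```

The closer the ratio gets to 1, the more confidently an attacker could attribute the editors of that day to the country in question.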

We carried out a study on the ratio between E and P within a given project and day. The results are the following:

[Chart: Daily ratio between E and P for a given project and day]

Note that E getting close to P is very unlikely (the chart uses a log scale). However, outliers exist, so we decided to drop daily granularity in favor of monthly. The following chart shows the ratio between E and P for a given project and month:

[Chart: Monthly ratio between E and P for a given project and month]

In this case, even the outliers in E remain far enough from P that we can consider the data set safe from the privacy attack described above. We also consider that monthly granularity still delivers a lot of value for this data set, especially since its standard deviation is very high (see below).

K-anonymization

Another possible attack on users' privacy is to use statistical algorithms to model the behavior of both the pageviews metric (PBC) and the edits metric (REV), and to identify matching spikes in both metrics for the same time span and project. Those spikes could help associate a group of editors with a set of pageviews, and ultimately with a country.

We carried out a study on the standard deviation of the PBC metric to see how likely it was that reliable statistical models could be built. It turns out that the standard deviation of the PBC data set is very high, even with monthly granularity:

Simple chart about the ratio between standard deviation and average of pageviews per country with monthly granularity.

In this chart, the x-axis represents the average number of pageviews for a given [project/country] tuple. On the left are tuples with few pageviews, such as az.wiktionary/Armenia; on the right are tuples with many pageviews, such as en.wikipedia/England. The tuples are bucketed, so the x-axis point at 32 pageviews represents several tuples, for example az.wiktionary/Armenia but also Aragonese Wikipedia/Andorra. The y-axis represents the standard deviation of the pageviews for that tuple over time, expressed relative to the average. A value of 1 means that the standard deviation equals the average (100%), that is, the data is highly variable.
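
As a rough illustration of the y-axis quantity, the ratio between standard deviation and average can be computed as below. This is a sketch under assumptions: the function name and the sample series are invented here, and the use of the population standard deviation (`pstdev`) is a guess, since the text does not say which estimator the study used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(monthly_pageviews: list[int]) -> float:
    """Standard deviation of a pageview series divided by its average."""
    avg = mean(monthly_pageviews)
    if avg == 0:
        return 0.0
    return pstdev(monthly_pageviews) / avg


# Hypothetical monthly pageview counts for two project/country tuples.
volatile = [10, 50, 5, 60, 20, 35]
stable = [1000, 1010, 990, 1005, 995, 1000]

print(coefficient_of_variation(volatile))  # large relative to the average
print(coefficient_of_variation(stable))    # close to zero
```

A series with a high ratio is hard to model statistically, which is the property the chart is used to assess.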

It is clear that the standard deviation for small and medium project/country tuples is very high, which makes statistical modelling attacks on those tuples very unlikely. The deviation for very large tuples is low, but those tuples are so big that editing spikes are lost in the high volume of pageviews. For the tiniest tuples (fewer than 10 pageviews), however, the standard deviation is low enough that there is a risk of matching spikes.

To avoid that, we decided to use K-anonymization: choose a threshold K and report any PBC value below K as "<K". This rules out statistical modelling for tuples that are smaller than K. Looking at the chart, K could be as small as 10, but to be safer we chose K=100.
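
The thresholding rule described above can be sketched as follows. The function name is hypothetical; K = 100 and the "<K" notation come from the text, and the sample values are taken from the prototype table.

```python
K = 100  # threshold chosen in the text


def k_anonymize(pageviews: int, k: int = K) -> str:
    """Report the exact value only when it is at least k; otherwise report '<k'."""
    if pageviews < k:
        return f"<{k}"
    return str(pageviews)


print(k_anonymize(13))    # <100
print(k_anonymize(4321))  # 4321
```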

Buckets

The problem with bucketed data is that it doesn't provide much value as a Wikistats metric. With the buckets we were unable to answer basic questions such as "how many pageviews were there from the U.S. on English Wikipedia?" So we had to come up with a solution that prevented privacy attacks on small wikis/countries without harming the value of the metric for bigger wikis, where there is no such threat. We decided to round all pageview values up to the next multiple of 1,000 (the "thousand ceiling") using the following SQL expression:

CEIL(SUM(view_count) / 1000) * 1000
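
For readers who want to check the rounding outside SQL, here is an equivalent sketch in Python. The function name is hypothetical, and the inputs are taken from the prototype table above. It uses integer ceiling division, so the result does not depend on floating-point rounding.

```python
def thousand_ceiling(view_count: int) -> int:
    """Round a non-negative count up to the next multiple of 1000."""
    # -(-n // 1000) is integer ceiling division.
    return -(-view_count // 1000) * 1000


print(thousand_ceiling(1234567))  # 1235000
print(thousand_ceiling(4321))     # 5000
print(thousand_ceiling(13))       # 1000
```

One caveat about the SQL form: in dialects where `/` performs integer division on integer operands, the division would truncate before CEIL is applied, so the expression relies on `/` returning a fractional result.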

Final data set

After all the sanitization measures discussed above, the final public Pageviews by country data set should look like this:

month    project        country  pageviews               views_ceil
2017-01  fr.wikipedia   ES       from 10,000 to 100,000  52,000
2017-02  de.wiktionary  BR       from 100 to 1,000       1,000
...      ...            ...      ...                     ...
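
A row of this shape could be assembled as in the sketch below. Several things here are assumptions rather than documented behavior: the function names, the raw count of 51,234 (the real underlying value is not published), and the bucket boundaries, which are inferred from the two example rows to be powers of ten with an inclusive lower bound.

```python
def magnitude_bucket(pageviews: int) -> str:
    """Label a positive count with its power-of-ten range (assumed boundaries)."""
    lower = 1
    while lower * 10 <= pageviews:
        lower *= 10
    return f"from {lower:,} to {lower * 10:,}"


def final_row(month: str, project: str, country: str, view_count: int) -> tuple:
    """Build one (month, project, country, pageviews, views_ceil) row."""
    views_ceil = -(-view_count // 1000) * 1000  # round up to a multiple of 1000
    return (month, project, country, magnitude_bucket(view_count), views_ceil)


# Hypothetical raw monthly count for fr.wikipedia from ES.
print(final_row("2017-01", "fr.wikipedia", "ES", 51234))
```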