Analytics/Data Lake/Traffic/Pageview hourly/Sanitization

From Wikitech

In an effort toward privacy with regard to reader pageview data, we aim to sanitize the aggregate logs that we store long-term. There are two reasons for sanitizing the dataset. The first is to protect our users from having their browsing patterns reconstructed if somebody hacks our cluster. The second is to publicly release aggregated datasets on interesting dimensions (user agent and geography, to be precise) without putting our users at risk. The approach chosen to sanitize the dataset is to anonymize (set the value to unknown) certain values on rows, when the row is subject to identification. This page summarizes and links to details about the Analytics Team's approach, research and results on sanitizing the pageview_hourly dataset. Our analysis shows that the strategy we chose provides a strong level of resistance to attacks while still keeping a lot of value in our dataset.

Problems: Reconstruction of browsing patterns & safe data publication

Browsing patterns reconstruction

As we found in our Identity reconstruction analysis, an attacker with access to our cluster could follow user browsing patterns by combining two datasets: pageview_hourly and webrequest. More precisely, users with a rare combination of values in various fields, especially user-agent and geographical location, are at risk of first being identified in the rawer webrequest dataset, and then followed in pageview_hourly. We only keep data in the webrequest dataset for a short period of time, but we would like to keep pageview_hourly indefinitely, and so we need to make it safe against this type of attack.

An analysis of the potential decay of fingerprinting data (see this page for more details) shows that as data ages, user information does not change enough to make sanitization unnecessary.

Safe pageview data publication

Our team's core mission is to release publicly as much data as we can. The pageview_hourly dataset is no exception, and we'd like to use it to provide more data to our users. However, the sensitivity of the data it contains requires us to be very careful about how we publish its content. This means:

  • Separate page_title (or page_id) from other non-global dimensions (particularly geo data and user agent data) - Example of attack: a user modifies a page, thereby generating a hit for that page_title in the pageview_hourly dataset[1], and nobody else accesses that page in the given hour --> If we keep geo data and user agent data associated with page_title, that user's geography and user agent become easily known.
  • Ensure published aggregated data is not easily reconcilable with page_title-level traffic - We have already released pageview data at page_title granularity (see the pageview API). We want to be sure that newly published geo and user agent data aggregated at project / access / agent_type level will not be linkable to page_title traffic (or, to be more precise, could only be linked to anonymized traffic where geo and user_agent data are no longer present).

Solution: Sanitizing using K-Anonymity over multiple fields

See this page for a detailed version of the algorithm we propose.

Very briefly, the idea is to group pageviews into buckets by sensitive fields, such as user agent and location. When a bucket has fewer than Kip distinct IPs or fewer than Kpv distinct pages viewed, we anonymize one of its sensitive fields and repeat, so that every bucket ends up with at least Kip distinct IPs and at least Kpv distinct pages viewed. Fields with values that are unlikely to show up often are anonymized first.
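As an illustration, the bucket-and-anonymize loop described above can be sketched as follows. This is a simplified sketch, not the production implementation: the field names, thresholds, and the fixed field-anonymization order are assumptions (the real algorithm picks the rarest values first, as described on the linked page).

```python
from collections import defaultdict

# Thresholds chosen by the team (see "The good Ks" below).
K_IP, K_PV = 3, 5

# Sensitive fields, in the (assumed, simplified) order they get anonymized.
SENSITIVE_FIELDS = ["user_agent", "city", "country"]  # illustrative names

def sanitize(rows, fields=SENSITIVE_FIELDS):
    """Repeatedly anonymize a sensitive field of under-populated buckets
    until every bucket has >= K_IP distinct IPs and >= K_PV distinct pages."""
    while True:
        # Group rows into buckets keyed by the sensitive-field values.
        buckets = defaultdict(list)
        for row in rows:
            buckets[tuple(row[f] for f in fields)].append(row)
        changed = False
        for members in buckets.values():
            distinct_ips = {r["ip"] for r in members}
            distinct_pages = {r["page_title"] for r in members}
            if len(distinct_ips) < K_IP or len(distinct_pages) < K_PV:
                # Anonymize the first still-known sensitive field of the bucket.
                for r in members:
                    for f in fields:
                        if r[f] != "unknown":
                            r[f] = "unknown"
                            changed = True
                            break
        if not changed:
            return rows
```

Anonymizing a field changes a bucket's key, so rows merge into larger buckets on the next pass; the loop terminates once every bucket meets both thresholds or has all sensitive fields set to unknown.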


The good Ks

We (the Analytics Team) did a manual/qualitative review of browsing patterns over an hour with various distinct IPs, distinct pages and settings. Detailed data on this exercise can be found on this dedicated page, along with the Hive code.

We found that:

  • When looking at groups of pages viewed by multiple people, it is sometimes easy to guess which sub-groups of pages could have been viewed together based on topics.
    • It is however not possible to re-attach sub-groups to the underlying people with certainty.
    • It could be feasible to reattach subgroups to the underlying people with some probability of being right using prior knowledge of browsing habits of those people.
  • When looking at groups of pages with a small number of distinct pages, even a very small number (2, 3, 4, 5), we almost never identified those sets as single sessions, though we could somewhat regularly identify them as two sessions.

It means that:

  • The minimum anonymization we could go for would make sure that at least 2 distinct IPs and 2 distinct pages occur per bucket.
  • We prefer to stay on the safer side and add more variability to our buckets, ensuring that at least 3 distinct IPs and 5 distinct pages occur per bucket. This involves anonymizing 91.28% of buckets, accounting for 35.11% of requests.

Choosing hourly or longer term data to establish the "uniqueness" of values in sensitive fields

We want to anonymize the most rare values first, because they are the most identifying. We can establish the "rareness" of each value by looking at either hourly statistics or longer term, such as monthly statistics:

  • Using hourly statistics would establish a Local probability. This should reduce the processing time and the number of steps needed to terminate the algorithm (because locally rare values are anonymized first, leading to faster progress towards buckets of size greater than K).
  • Using monthly statistics would establish a more Global probability. This normalizes any temporal patterns (such as hourly or weekly seasonality) and accounts for differences across time zones. This approach gives more value to global data quality but would run slower.

We decided as a team to use hourly statistics, for two reasons. The first is computational cost: computing statistics over a full month would be very expensive. The second is data preservation: using monthly statistics would mean anonymizing more data.
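The ranking step can be sketched as follows: count how often each value of a sensitive field occurs within the hour, and anonymize the rarest values first. The field and function names here are illustrative, not the production code.

```python
from collections import Counter

def rareness_order(rows, field):
    """Rank the values of a sensitive field by hourly frequency,
    rarest first: locally rare values get anonymized first."""
    counts = Counter(row[field] for row in rows)
    return sorted(counts, key=lambda value: counts[value])
```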

Information loss analysis

Entropy analysis - Definition

Per dimension

We used the Shannon entropy definition to measure how much information was lost in the process of anonymizing the pageview dataset.

For each fingerprinting dimension we computed the probability of a value (Pval) in the dimension as:

    Pval = view_count(val) / Σv view_count(v)

Entropy for the dimension was then computed using the formula:

    H(dimension) = − Σval Pval · log2(Pval)

An interesting point to notice is how to count unknown values in this definition. We have tried 3 methods:

  • unknown as a regular value
  • unknown adds to the total sum of view_counts for the dataset, but is not counted as a value with its own probability
  • not counting unknown at all

Results have shown that the last method gives the best results: it is more coherent in terms of the entropy definition, and it provides a better view of how much data has been lost.


Finally, we computed a global value of entropy for the whole dataset by combining the per-dimension entropies.
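A minimal sketch of the per-dimension computation, using the third counting method (unknown values ignored entirely). The column names are assumptions, and the actual computation runs on the cluster; this only illustrates the definition.

```python
import math
from collections import Counter

def dimension_entropy(rows, dim):
    """Shannon entropy of one dimension, weighting each value by its
    view_count and ignoring 'unknown' values entirely."""
    counts = Counter()
    for row in rows:
        if row[dim] != "unknown":
            counts[row[dim]] += row["view_count"]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, a dimension with two equally likely known values yields exactly 1 bit of entropy, regardless of how many unknown rows are present.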

Entropy analysis - Results

Per dimension


Tests were run over 3 hours at different times of the day (see the next section for circadian pattern details):

  Hour (UTC)   Entropy (default dataset)   Entropy (anonymized dataset)   Data loss
  1            65525642.24                 64143222.07                    2.11%
  8            65024942.53                 63503727.41                    2.34%
  17           87047745.81                 85519908.15                    1.76%
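The data-loss column is simply the relative drop in entropy between the default and the anonymized dataset; a quick check reproduces the percentages:

```python
measurements = [
    # (hour UTC, entropy of default dataset, entropy of anonymized dataset)
    (1, 65525642.24, 64143222.07),
    (8, 65024942.53, 63503727.41),
    (17, 87047745.81, 85519908.15),
]
for hour, default_entropy, anonymized_entropy in measurements:
    loss_pct = 100 * (1 - anonymized_entropy / default_entropy)
    print(f"hour {hour}: {loss_pct:.2f}% loss")  # 2.11%, 2.34%, 1.76%
```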

Circadian Rhythm



  1. pageview_hourly doesn't contain edits, but when a user makes an edit, the probability that they also generate a pageview for that page (either before or after editing) is very high.