Analytics/Data Lake/Traffic/Pageview hourly/Sanitization algorithm proposal

As part of the plan to protect our users' data privacy, the pageview_hourly dataset needs to be sanitized in such a way that it does not allow tracking of a user's path.

See this page for a broader view on the pageview_hourly sanitization project.

Algorithm using COUNT(DISTINCT ip) OR COUNT(DISTINCT page_title) as trigger

Plain English

Three input parameters:

  1. Ks, a pair of thresholds (Kip, Kpv): the numbers of distinct IPs and distinct page_titles below which groups get anonymized
  2. dimensions, the list of fields on which to group and possibly anonymize
  3. dataset, an hour of rows of the pageview_hourly table with IPs

The first computation step is to generate a statistics table that will be reused to decide which value to anonymize when needed. This statistics table is a Map whose keys are (field name, field value) pairs and whose values are (count of distinct IPs, count of distinct page_titles) pairs. For example: ("city", "New York") -> (203954, 3258796), ("os_family", "Android") -> (874645, 1257645), ....
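
As a minimal runnable sketch (Python here; rows are assumed to be plain dictionaries carrying ip, page_title and the grouping fields, with all names illustrative), the statistics table can be built like this:

from collections import defaultdict

def build_statistics(dimensions, dataset):
    # Collect, for each (field, value) pair, the sets of distinct IPs
    # and distinct page_titles seen anywhere in the dataset.
    distinct_ips = defaultdict(set)
    distinct_titles = defaultdict(set)
    for row in dataset:
        for field in dimensions:
            key = (field, row[field])
            distinct_ips[key].add(row["ip"])
            distinct_titles[key].add(row["page_title"])
    # Keep only the cardinalities: (count of distinct IPs, count of distinct page_titles).
    return {key: (len(distinct_ips[key]), len(distinct_titles[key]))
            for key in distinct_ips}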

Then the anonymization loop starts:

  1. group the dataset by the dimensions, and count distinct IPs and page_titles for each group.
  2. If no group has a count of distinct IPs lower than Kip or a count of distinct page_titles lower than Kpv, anonymization is finished; return the dataset.
  3. Else, for every group whose count of distinct IPs or distinct page_titles is lower than Kip or Kpv respectively, choose the (field, value) pair having the smallest statistics value (for IPs or page_titles respectively) and anonymize that field in all the group's rows.

Pseudo-code

function build_statistics(dimensions: Set[String], dataset: List[Rows]) returns Map[(String, String), (Long, Long)]:
  var statistics_table = new Map
  for (field in dimensions):
    for (value, d_ips, d_pvs) in dataset.groupBy(field).agg(count(distinct ip) as d_ips, count(distinct page_title) as d_pvs):
      statistics_table[(field, value)] = (d_ips, d_pvs)
  return statistics_table

function getFieldToAnonymize(row: Row, d_idx: Int, dimensions: Set[String], statistics_table: Map[(String, String), (Long, Long)]) returns String:
  var result_field = ""
  var min_d = -1
  for (field in dimensions):
    // skip fields already anonymized in a previous pass
    if (row[field] == "dummy_value"):
      continue
    var field_d = statistics_table[(field, row[field])].get(d_idx)
    if ((min_d == -1) || (field_d < min_d)):
      result_field = field
      min_d = field_d
  return result_field

function anonymize(Ks: (Long, Long), dimensions: Set[String], dataset: List[Rows]) returns List[Rows]:
  statistics_table = build_statistics(dimensions, dataset)
  do:
    var grouped_dataset = dataset.groupBy(dimensions).agg(count(distinct ip) as group_d_ips, count(distinct page_title) as group_d_pvs)
    var anonymized_rows = 0
    for (group in grouped_dataset):
      if (group.group_d_ips < Ks._1):
        for (row in group.rows):
          row[getFieldToAnonymize(row, 0, dimensions, statistics_table)] = "dummy_value"
          anonymized_rows++
      else if (group.group_d_pvs < Ks._2):
        for (row in group.rows):
          row[getFieldToAnonymize(row, 1, dimensions, statistics_table)] = "dummy_value"
          anonymized_rows++
  while (anonymized_rows > 0)
  return dataset
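
For an end-to-end check, here is a hedged Python counterpart of the pseudo-code above (a sketch, not the production implementation; it reuses build_statistics from the earlier sketch, rows are plain dictionaries, dimensions is an ordered list, and "dummy_value" is the anonymization marker):

from collections import defaultdict

DUMMY = "dummy_value"

def field_to_anonymize(row, d_idx, dimensions, statistics_table):
    # Among fields not yet anonymized, pick the one whose (field, value)
    # pair has the smallest distinct count (d_idx 0 = IPs, 1 = page_titles).
    candidates = [f for f in dimensions if row[f] != DUMMY]
    if not candidates:
        return None  # every grouping field is already anonymized
    return min(candidates, key=lambda f: statistics_table[(f, row[f])][d_idx])

def anonymize(ks, dimensions, dataset):
    k_ip, k_pv = ks
    statistics_table = build_statistics(dimensions, dataset)
    while True:
        # Step 1: group rows by their full tuple of dimension values.
        groups = defaultdict(list)
        for row in dataset:
            groups[tuple(row[f] for f in dimensions)].append(row)
        anonymized_rows = 0
        for rows in groups.values():
            d_ips = len({r["ip"] for r in rows})
            d_pvs = len({r["page_title"] for r in rows})
            # Steps 2 and 3: anonymize groups below either threshold (IPs checked first).
            d_idx = 0 if d_ips < k_ip else (1 if d_pvs < k_pv else None)
            if d_idx is None:
                continue
            for row in rows:
                field = field_to_anonymize(row, d_idx, dimensions, statistics_table)
                if field is not None:
                    row[field] = DUMMY
                    anonymized_rows += 1
        if anonymized_rows == 0:
            return dataset

With toy data, anonymize((10, 10), ["city", "os_family"], rows) keeps generalizing the globally rarest field of each under-threshold group until every remaining group clears both thresholds (or has every grouping field anonymized).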

Previous Idea - Using SUM(views_count) as trigger

The original proposal was to use SUM(views_count) as the trigger, with a K value that made sense at that scale. This approach was (a lot) less precise about which rows to anonymize: the rows with the most value to an attacker are those that, within the same finger-printing group, have a very small number of distinct IPs, especially in groups with a high number of views. Using SUM(views_count) as the trigger would have prevented us from distinguishing between groups with a small number of distinct IPs but quite a lot of requests (high value to an attacker) and groups with a high number of distinct IPs but a small number of requests per IP (low value to an attacker). Changing the anonymization trigger to the number of distinct IPs per finger-printing group ensures that we anonymize the dangerous groups, whatever their views_count.
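
As a toy illustration (hypothetical numbers, Python as above), two groups with the same total view count can carry very different risk, which a SUM trigger cannot see:

# Same SUM(views_count), very different risk to users.
group_a = {"views_count": 1000, "distinct_ips": 2}    # 500 requests per IP: high value to an attacker
group_b = {"views_count": 1000, "distinct_ips": 900}  # ~1 request per IP: low value to an attacker
# A SUM(views_count) trigger with K = 1000 treats both groups identically,
# while a COUNT(DISTINCT ip) trigger with Kip = 10 flags only group_a.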

In addition to ensuring the anonymization of groups with a big number of views, this approach also avoids anonymizing groups with a reasonably small number of distinct IPs and a small number of views, preserving more data.