# Analytics/Data Lake/Traffic/Pageview hourly/Sanitization algorithm proposal

To protect our users' privacy, the pageview_hourly dataset needs to be sanitized so that it cannot be used to track a user's browsing path.

# Algorithm using COUNT(DISTINCT ip) OR COUNT(DISTINCT page_title) as trigger

## Plain English

Three input parameters:

1. Ks, a pair of thresholds (Kip, Kpv): the numbers of distinct IPs and distinct page_titles below which a group gets anonymized
2. dimensions, the list of fields on which to group and possibly anonymize
3. dataset, an hour of rows of the pageview_hourly table with IPs

The first computation step generates a statistics table that is reused to decide which value to anonymize when needed. This statistics table is a Map with keys `(field name, field value)` and values `(count of distinct IPs, count of distinct page_titles)`. For example: `("city", "New York") -> (203954, 3258796), ("os_family", "Android") -> (874645, 1257645), ...`.
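A minimal Python sketch of this step, assuming rows are plain dictionaries with `ip` and `page_title` keys. The sample rows, field values and addresses are invented for illustration and are not taken from the real table.

```python
from collections import defaultdict

def build_statistics(dimensions, rows):
    """Map (field, value) -> (distinct IP count, distinct page_title count)."""
    ips = defaultdict(set)
    titles = defaultdict(set)
    for row in rows:
        for field in dimensions:
            key = (field, row[field])
            ips[key].add(row["ip"])
            titles[key].add(row["page_title"])
    return {key: (len(ips[key]), len(titles[key])) for key in ips}

# Hypothetical sample rows, for illustration only.
rows = [
    {"ip": "10.0.0.1", "page_title": "Main_Page", "city": "New York", "os_family": "Android"},
    {"ip": "10.0.0.2", "page_title": "Main_Page", "city": "New York", "os_family": "iOS"},
    {"ip": "10.0.0.1", "page_title": "Cat", "city": "New York", "os_family": "Android"},
    {"ip": "10.0.0.3", "page_title": "Dog", "city": "Paris", "os_family": "Android"},
]

stats = build_statistics(["city", "os_family"], rows)
for key, counts in sorted(stats.items()):
    print(key, "->", counts)
```

The table is computed once, on the untouched dataset, so the counts describe how common each value is overall rather than within a single group.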

Then the anonymization loop starts:

1. Group the dataset by the dimensions, and count distinct IPs and page_titles for each group.
2. If no group has a count of distinct IPs lower than Kip or a count of distinct page_titles lower than Kpv, anonymization is finished: return the dataset.
3. Otherwise, for every group whose count of distinct IPs is lower than Kip, or whose count of distinct page_titles is lower than Kpv, choose the pair (field, value) with the smallest statistics value (for IPs or page_titles, whichever triggered) and anonymize that field for all the group's rows. Then go back to step 1.
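As an illustration of the choice made in step 3, here is a small Python sketch that reuses the example statistics values given above. The function name `field_to_anonymize` and the dictionary-based row are assumptions made for this example.

```python
def field_to_anonymize(row, d_idx, dimensions, statistics):
    """Return the dimension whose (field, value) pair has the smallest statistic.

    d_idx is 0 to compare distinct-IP counts, 1 to compare distinct page_title counts.
    """
    return min(dimensions, key=lambda field: statistics[(field, row[field])][d_idx])

# Statistics values copied from the example in the text.
statistics = {
    ("city", "New York"): (203954, 3258796),
    ("os_family", "Android"): (874645, 1257645),
}
row = {"city": "New York", "os_family": "Android"}
dimensions = ["city", "os_family"]

by_ip = field_to_anonymize(row, 0, dimensions, statistics)
by_title = field_to_anonymize(row, 1, dimensions, statistics)
print(by_ip, by_title)
```

With these numbers, a group triggered by its IP count would lose its `city` value, while a group triggered by its page_title count would lose its `os_family` value, since in each case that is the rarer value according to the statistics table.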

## Pseudo-code

```
function build_statistics(dimensions: Set[String], dataset: List[Row]) returns Map[(String, String), (Long, Long)]:
    // (field, value) -> (count of distinct IPs, count of distinct page_titles)
    var statistics_table = new Map
    for (field in dimensions):
        for ((value, d_ips, d_pvs) in dataset.groupBy(field).agg(count(distinct ip) as d_ips, count(distinct page_title) as d_pvs)):
            statistics_table[(field, value)] = (d_ips, d_pvs)
    return statistics_table

function getFieldToAnonymize(row: Row, d_idx: Int, dimensions: Set[String], statistics_table: Map[(String, String), (Long, Long)]) returns String:
    // d_idx selects the statistic to compare: 0 for distinct IPs, 1 for distinct page_titles
    var result_field = ""
    var min_d = -1
    for (field in dimensions):
        // Skip fields that are already anonymized: "dummy_value" has no entry in statistics_table
        if (row[field] != "dummy_value"):
            var field_d = statistics_table[(field, row[field])].get(d_idx)
            if ((min_d == -1) || (field_d < min_d)):
                result_field = field
                min_d = field_d
    // "" means every dimension of this row is already anonymized
    return result_field

function anonymize(Ks: (Long, Long), dimensions: Set[String], dataset: List[Row]):
    var statistics_table = build_statistics(dimensions, dataset)
    do:
        var grouped_dataset = dataset.groupBy(dimensions).agg(count(distinct ip) as group_d_ips, count(distinct page_title) as group_d_pvs)
        var anonymized_rows = 0
        for (group in grouped_dataset):
            // The distinct-IP threshold is checked first, then the distinct-page_title one
            var d_idx = -1
            if (group.group_d_ips < Ks._1):
                d_idx = 0
            else if (group.group_d_pvs < Ks._2):
                d_idx = 1
            if (d_idx != -1):
                for (row in group.rows):
                    var field = getFieldToAnonymize(row, d_idx, dimensions, statistics_table)
                    // Only count rows that actually changed, so the loop terminates
                    if (field != ""):
                        row[field] = "dummy_value"
                        anonymized_rows++
    while (anonymized_rows > 0)
    return dataset
```
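The pseudo-code can be turned into a runnable Python sketch as shown here. It is an illustration, not the production implementation: rows are assumed to be dictionaries mutated in place, the sample data is invented, and the sketch assumes that fields already set to `dummy_value` are skipped so that the loop stops once a row has nothing left to anonymize.

```python
from collections import defaultdict

DUMMY = "dummy_value"

def build_statistics(dimensions, rows):
    """Map (field, value) -> (distinct IP count, distinct page_title count)."""
    ips, titles = defaultdict(set), defaultdict(set)
    for row in rows:
        for field in dimensions:
            ips[(field, row[field])].add(row["ip"])
            titles[(field, row[field])].add(row["page_title"])
    return {key: (len(ips[key]), len(titles[key])) for key in ips}

def pick_field(row, d_idx, dimensions, statistics):
    """Rarest not-yet-anonymized field of the row, or None if none is left."""
    candidates = [f for f in dimensions if row[f] != DUMMY]
    if not candidates:
        return None
    return min(candidates, key=lambda f: statistics[(f, row[f])][d_idx])

def anonymize(ks, dimensions, rows):
    k_ip, k_title = ks
    statistics = build_statistics(dimensions, rows)
    while True:
        # Step 1: group rows by their current dimension values.
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[f] for f in dimensions)].append(row)
        changed = 0
        for group_rows in groups.values():
            d_ips = len({r["ip"] for r in group_rows})
            d_titles = len({r["page_title"] for r in group_rows})
            if d_ips < k_ip:
                d_idx = 0          # triggered by distinct IPs
            elif d_titles < k_title:
                d_idx = 1          # triggered by distinct page_titles
            else:
                continue           # group is large enough, leave it alone
            # Step 3: anonymize the rarest remaining field of each row.
            for row in group_rows:
                field = pick_field(row, d_idx, dimensions, statistics)
                if field is not None:
                    row[field] = DUMMY
                    changed += 1
        # Step 2: stop once a full pass changes nothing.
        if changed == 0:
            return rows

# Hypothetical sample rows, for illustration only.
rows = [
    {"ip": "10.0.0.1", "page_title": "Main_Page", "city": "New York", "os_family": "Android"},
    {"ip": "10.0.0.2", "page_title": "Main_Page", "city": "New York", "os_family": "iOS"},
    {"ip": "10.0.0.1", "page_title": "Cat", "city": "New York", "os_family": "Android"},
    {"ip": "10.0.0.3", "page_title": "Dog", "city": "Paris", "os_family": "Android"},
]
result = anonymize((2, 2), ["city", "os_family"], rows)
for row in result:
    print(row["city"], row["os_family"])
```

Each pass can merge small groups into larger ones as fields are replaced by the dummy value, which is why the grouping is recomputed at the start of every iteration.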

# Previous Idea - Using SUM(views_count) as trigger

The original proposal was to use SUM(views_count) as a trigger, with a K value making sense for that scale. This approach was (a lot) less precise in selecting the rows to anonymize: the rows where hacking value resides are the ones that, for the same fingerprinting group, have a very small number of distinct IPs, in particular when the group also has a high number of views. Using SUM(views_count) as a trigger would have prevented us from distinguishing between groups with a small number of distinct IPs but quite a lot of requests (high hacking value) and groups with a high number of distinct IPs but a small number of requests per IP (low hacking value). Changing the anonymization trigger to use the number of distinct IPs for each fingerprinting group ensures that we anonymize the dangerous groups, whatever their view count.

In addition to ensuring the anonymization of risky groups that have a large number of views, this approach also avoids anonymizing rows with a reasonably small number of IPs and a small number of views, preserving more data.
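The difference between the two triggers can be illustrated with a small Python sketch. The group aggregates and both threshold values are invented for this example, and only the distinct-IP half of the new trigger is shown.

```python
# Assumed thresholds, for illustration only.
K_VIEWS = 100  # threshold for the previous SUM(views_count) trigger
K_IPS = 50     # threshold for the COUNT(DISTINCT ip) trigger

# Hypothetical per-group aggregates.
groups = {
    "few IPs, many views": {"distinct_ips": 2, "views_count": 500},
    "many IPs, many views": {"distinct_ips": 400, "views_count": 500},
    "enough IPs, few views": {"distinct_ips": 60, "views_count": 80},
}

def anonymized_by_views(group):
    return group["views_count"] < K_VIEWS

def anonymized_by_ips(group):
    return group["distinct_ips"] < K_IPS

decisions = {
    name: (anonymized_by_views(group), anonymized_by_ips(group))
    for name, group in groups.items()
}
for name, (by_views, by_ips) in decisions.items():
    print(f"{name}: SUM trigger={by_views}, distinct-IP trigger={by_ips}")
```

The first two groups have the same view count, so the SUM(views_count) trigger cannot tell them apart, whereas the distinct-IP trigger anonymizes only the group with two IPs. The third group would be anonymized by the SUM(views_count) trigger but is kept by the distinct-IP trigger.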