Data Platform/Data Lake/Traffic/Pageview hourly/Sanitization algorithm proposal
As part of the plan to protect our users' data privacy, the pageview_hourly dataset needs to be sanitized so that it does not allow tracking of user paths.
See this page for a broader view on the pageview_hourly sanitization project.
Algorithm using COUNT(DISTINCT ip) OR COUNT(DISTINCT page_title) as trigger
Plain English
Three input parameters:
- Ks = (Kip, Kpv), the two thresholds on the number of distinct IPs and distinct page_titles below which groups get anonymized
- dimensions, the list of fields on which to group and possibly anonymize
- dataset, an hour of rows of the pageview_hourly table with IPs
The first computation step is to generate a statistics table that will be reused to decide which value to anonymize when needed. This statistics table is a Map whose keys are (field name, field value) pairs and whose values are (count of distinct IPs, count of distinct page_titles) pairs. For example: ("city", "New York") -> (203954, 3258796), ("os_family", "Android") -> (874645, 1257645), ...
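As an illustration, here is a minimal Python sketch of how such a statistics table could be built from an in-memory list of rows. The plain-dict row representation and the "ip"/"page_title" key names are assumptions made for the example; the real computation would run as groupBy/agg aggregations over the dataset, as in the pseudo-code below.

from collections import defaultdict

def build_statistics(dimensions, dataset):
    # dataset is assumed to be a list of dicts, each holding the projection
    # dimensions plus the raw "ip" and "page_title" of one pageview row.
    distinct_ips = defaultdict(set)
    distinct_titles = defaultdict(set)
    for row in dataset:
        for field in dimensions:
            key = (field, row[field])
            distinct_ips[key].add(row["ip"])
            distinct_titles[key].add(row["page_title"])
    # (field name, field value) -> (count of distinct IPs, count of distinct page_titles)
    return {key: (len(distinct_ips[key]), len(distinct_titles[key])) for key in distinct_ips}

With this sketch, statistics_table[("city", "New York")] would hold the pair of distinct counts used later to decide which field of a risky group to blank out.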
Then the anonymization loop starts:
- group the dataset by the dimensions, and count distinct IPs and page_titles for each group.
- If no group has a count of distinct IPs lower than Kip or a count of distinct page_titles lower than Kpv, anonymization is finished: return the dataset.
- Otherwise, for every group whose count of distinct IPs or distinct page_titles is lower than Kip or Kpv respectively, choose the (field, value) pair having the smallest statistics value (for IPs or page_titles respectively) and anonymize that field in all of the group's rows.
Pseudo-code
function build_statistics(dimensions: Set[String], dataset: List[Row]) returns Map[(String, String), (Long, Long)]:
    var statistics_table = new Map
    for (field in dimensions):
        for (value, d_ips, d_pvs) in dataset.groupBy(field).agg(count(distinct ip) as d_ips, count(distinct page_title) as d_pvs):
            statistics_table[(field, value)] = (d_ips, d_pvs)
    return statistics_table

function getFieldToAnonymize(row: Row, d_idx: Int, dimensions: Set[String], statistics_table: Map[(String, String), (Long, Long)]) returns String:
    var result_field = ""
    var min_d = -1
    for (field in dimensions):
        var field_d = statistics_table[(field, row[field])].get(d_idx)
        if ((min_d == -1) || (field_d < min_d)):
            result_field = field
            min_d = field_d
    return result_field

function anonymize(Ks: (Long, Long), dimensions: Set[String], dataset: List[Row]):
    statistics_table = build_statistics(dimensions, dataset)
    do:
        var grouped_dataset = dataset.groupBy(dimensions).agg(count(distinct ip) as group_d_ips, count(distinct page_title) as group_d_pvs)
        var anonymized_rows = 0
        for (group in grouped_dataset):
            if (group.group_d_ips < Ks._1):
                for (row in group.rows):
                    row[getFieldToAnonymize(row, 0, dimensions, statistics_table)] = "dummy_value"
                    anonymized_rows++
            else if (group.group_d_pvs < Ks._2):
                for (row in group.rows):
                    row[getFieldToAnonymize(row, 1, dimensions, statistics_table)] = "dummy_value"
                    anonymized_rows++
    while (anonymized_rows > 0)
    return dataset
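For concreteness, the sketch below translates the pseudo-code into self-contained Python, reusing build_statistics from the earlier sketch and operating on an in-memory list of dicts. Names such as "dummy_value" follow the pseudo-code; everything else is an assumption for illustration, and the one deliberate deviation, noted in the comments, is to count only rows that actually change so the loop is guaranteed to terminate.

def field_to_anonymize(row, d_idx, dimensions, statistics_table):
    # Pick the row's field whose (field, value) pair has the smallest distinct-IP
    # (d_idx == 0) or distinct-page_title (d_idx == 1) count. Values already
    # replaced by "dummy_value" are missing from the table and are treated as
    # infinitely common so they are never picked again (an assumption added here).
    return min(dimensions,
               key=lambda f: statistics_table.get((f, row[f]),
                                                  (float("inf"), float("inf")))[d_idx])

def anonymize(ks, dimensions, dataset):
    k_ip, k_pv = ks
    statistics_table = build_statistics(dimensions, dataset)
    while True:
        # Re-group the rows by their (possibly already anonymized) dimension values.
        groups = defaultdict(list)
        for row in dataset:
            groups[tuple(row[f] for f in dimensions)].append(row)
        anonymized_rows = 0
        for rows in groups.values():
            group_d_ips = len({r["ip"] for r in rows})
            group_d_pvs = len({r["page_title"] for r in rows})
            if group_d_ips < k_ip:
                d_idx = 0
            elif group_d_pvs < k_pv:
                d_idx = 1
            else:
                continue
            for row in rows:
                field = field_to_anonymize(row, d_idx, dimensions, statistics_table)
                if row[field] != "dummy_value":
                    row[field] = "dummy_value"
                    anonymized_rows += 1  # count only real changes so the loop terminates
        if anonymized_rows == 0:
            return dataset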
Previous Idea - Using SUM(views_count) as trigger
The original proposal was to use SUM(views_count) as a trigger, with a K value that made sense at that scale. This approach was a lot less precise in choosing the rows to anonymize: the rows where the hacking value resides are the ones that, for the same fingerprinting group, have a very small number of distinct IPs, and in particular groups having a high number of views. Using SUM(views_count) as a trigger would have prevented us from distinguishing between groups with a small number of distinct IPs but quite a lot of requests (high hacking value) and groups with a high number of distinct IPs but a small number of requests per IP (low hacking value). Changing the anonymization trigger to the number of distinct IPs for each fingerprinting group ensures that we anonymize the dangerous groups instead, whatever their total number of views.
In addition to ensuring anonymization of groups having a large number of views, this approach also keeps us from anonymizing rows with a reasonably small number of distinct IPs and a small number of views, preserving more data.
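A toy example with made-up numbers and thresholds illustrates the difference between the two triggers:

# Two hypothetical fingerprinting groups for the same hour (made-up numbers):
#   group A: 3 distinct IPs, 1200 views  -> few users generating many requests: high hacking value
#   group B: 400 distinct IPs, 800 views -> many users, ~2 views each: low hacking value
groups = {"A": {"distinct_ips": 3, "views": 1200},
          "B": {"distinct_ips": 400, "views": 800}}

K_views = 1000  # hypothetical SUM(views_count) threshold
K_ip = 50       # hypothetical COUNT(DISTINCT ip) threshold

flagged_by_views = [g for g, s in groups.items() if s["views"] < K_views]    # ['B']: misses the risky group
flagged_by_ips = [g for g, s in groups.items() if s["distinct_ips"] < K_ip]  # ['A']: catches it, spares B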