Data Platform/Data Lake/Traffic/Pageview hourly/Fingerprinting Over Time
Appearance
In a plan to secure our user data, the pageview_hourly dataset needs to be sanitized in such a way that it does not allow to track user path. See this page for a broad view on the pageview_hourly sanitization project.
This page go through as fingerprinting analysis over time, using webrequest and pageview_hourly datasets.
Data Preparation
CREATE TABLE hashes1 STORED AS PARQUET AS SELECT hash( pageview_info['project'], pageview_info['language_variant'], access_method, x_analytics_map['zero'], agent_type, geocoded_data['continent'], geocoded_data['country_code'], geocoded_data['country'], geocoded_data['subdivision'], geocoded_data['city'], user_agent_map) as h, COUNT(DISTINCT ip) as dip, COUNT(DISTINCT pageview_info['page_title']) AS dpv, COUNT(1) as c FROM wmf.webrequest WHERE webrequest_source = 'text' AND year = 2016 AND month = 1 AND day = 11 AND agent_type = 'user' GROUP BY hash( pageview_info['project'], pageview_info['language_variant'], access_method, x_analytics_map['zero'], agent_type, geocoded_data['continent'], geocoded_data['country_code'], geocoded_data['country'], geocoded_data['subdivision'], geocoded_data['city'], user_agent_map);
WITH pv_hashes1 AS ( SELECT year, month, day, hash(project, language_variant, access_method, zero_carrier, agent_type, continent, country_code, country, subdivision, city, user_agent_map) as pvh, SUM(view_count) as vc FROM wmf.pageview_hourly WHERE ((year = 2015 AND month > 5) OR (year = 2016 AND month = 1 AND day < 12)) AND agent_type = 'user' GROUP BY year, month, day, hash(project, language_variant, access_method, zero_carrier, agent_type, continent, country_code, country, subdivision, city, user_agent_map) ) SELECT year, month, day, SUM(CASE WHEN dip = 1 THEN 1 ELSE 0 END) as dip1_n, SUM(CASE WHEN dip = 2 THEN 1 ELSE 0 END) as dip2_n, SUM(CASE WHEN dip = 3 THEN 1 ELSE 0 END) as dip3_n, SUM(CASE WHEN dip = 4 THEN 1 ELSE 0 END) as dip4_n, SUM(CASE WHEN dip = 5 THEN 1 ELSE 0 END) as dip5_n, SUM(CASE WHEN dip > 5 THEN 1 ELSE 0 END) as dipPlus_n, SUM(CASE WHEN dip = 1 THEN pvh1.vc ELSE 0 END) as dip1_vc, SUM(CASE WHEN dip = 2 THEN pvh1.vc ELSE 0 END) as dip2_vc, SUM(CASE WHEN dip = 3 THEN pvh1.vc ELSE 0 END) as dip3_vc, SUM(CASE WHEN dip = 4 THEN pvh1.vc ELSE 0 END) as dip4_vc, SUM(CASE WHEN dip = 5 THEN pvh1.vc ELSE 0 END) as dip5_vc, SUM(CASE WHEN dip > 5 THEN pvh1.vc ELSE 0 END) as dipPlus_vc FROM pv_hashes1 pvh1 INNER JOIN pv_san.hashes1 h1 ON (pvh1.pvh = h1.h) GROUP BY year, month, day ORDER BY year, month, day LIMIT 1000;
Results
As shown below, the decrease of number of hashes match is almost linear. Nonetheless the remaining number of hashes from December still existing in June is very substantial (over a million distinct hashes), particularly with hashes involving only one IP, the most dangerous bucket in term of browsing pattern reconstruction.