Analytics/Data Lake/Traffic/Pageview hourly/Fingerprinting Over Time

From Wikitech

In a plan to secure our user data, the pageview_hourly dataset needs to be sanitized in such a way that it does not allow to track user path. See this page for a broad view on the pageview_hourly sanitization project.

This page go through as fingerprinting analysis over time, using webrequest and pageview_hourly datasets.

Data Preparation

CREATE TABLE hashes1 STORED AS PARQUET AS
  SELECT
    hash(
      pageview_info['project'],
      pageview_info['language_variant'],
      access_method,
      x_analytics_map['zero'],
      agent_type,
      geocoded_data['continent'],
      geocoded_data['country_code'],
      geocoded_data['country'],
      geocoded_data['subdivision'],
      geocoded_data['city'],
      user_agent_map) as h,
    COUNT(DISTINCT ip) as dip,
    COUNT(DISTINCT pageview_info['page_title']) AS dpv,
    COUNT(1) as c FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2016
    AND month = 1
    AND day = 11
    AND agent_type = 'user'
 GROUP BY
    hash(
      pageview_info['project'],
      pageview_info['language_variant'],
      access_method,
      x_analytics_map['zero'],
      agent_type,
      geocoded_data['continent'],
      geocoded_data['country_code'],
      geocoded_data['country'],
      geocoded_data['subdivision'],
      geocoded_data['city'],
      user_agent_map);
WITH pv_hashes1 AS
(
  SELECT
    year, month, day,
    hash(project, language_variant, access_method, zero_carrier, agent_type,
         continent, country_code, country, subdivision, city, user_agent_map) as pvh,
    SUM(view_count) as vc
  FROM wmf.pageview_hourly
  WHERE ((year = 2015 AND month > 5) OR (year = 2016 AND month = 1 AND day < 12))
    AND agent_type = 'user'
  GROUP BY
    year,
    month,
    day,
    hash(project, language_variant, access_method, zero_carrier, agent_type,
         continent, country_code, country, subdivision, city, user_agent_map)
)
SELECT
  year, month, day,
  SUM(CASE WHEN dip = 1 THEN 1 ELSE 0 END) as dip1_n,
  SUM(CASE WHEN dip = 2 THEN 1 ELSE 0 END) as dip2_n,
  SUM(CASE WHEN dip = 3 THEN 1 ELSE 0 END) as dip3_n,
  SUM(CASE WHEN dip = 4 THEN 1 ELSE 0 END) as dip4_n,
  SUM(CASE WHEN dip = 5 THEN 1 ELSE 0 END) as dip5_n,
  SUM(CASE WHEN dip > 5 THEN 1 ELSE 0 END) as dipPlus_n,
  SUM(CASE WHEN dip = 1 THEN pvh1.vc ELSE 0 END) as dip1_vc,
  SUM(CASE WHEN dip = 2 THEN pvh1.vc ELSE 0 END) as dip2_vc,
  SUM(CASE WHEN dip = 3 THEN pvh1.vc ELSE 0 END) as dip3_vc,
  SUM(CASE WHEN dip = 4 THEN pvh1.vc ELSE 0 END) as dip4_vc,
  SUM(CASE WHEN dip = 5 THEN pvh1.vc ELSE 0 END) as dip5_vc,
  SUM(CASE WHEN dip > 5 THEN pvh1.vc ELSE 0 END) as dipPlus_vc
FROM pv_hashes1 pvh1
  INNER JOIN pv_san.hashes1 h1
    ON (pvh1.pvh = h1.h)
GROUP BY
  year, month, day
ORDER BY year, month, day
LIMIT 1000;

Results

As shown below, the decrease of number of hashes match is almost linear. Nonetheless the remaining number of hashes from December still existing in June is very substantial (over a million distinct hashes), particularly with hashes involving only one IP, the most dangerous bucket in term of browsing pattern reconstruction.