Analytics/Data Lake/Traffic/Pageview hourly/Fingerprinting Over Time

As part of the plan to secure our users' data, the pageview_hourly dataset needs to be sanitized in such a way that it does not allow reconstructing a user's browsing path. See this page for a broad view of the pageview_hourly sanitization project.

This page goes through a fingerprinting analysis over time, using the webrequest and pageview_hourly datasets.

Data Preparation

The first query hashes, for one day of webrequest data (2016-01-11), the dimensions kept in pageview_hourly, and records for each hash the number of distinct IPs (dip), distinct page titles (dpv) and requests (c) seen behind it.

-- The table is created in the pv_san database, as it is referenced below as pv_san.hashes1.
CREATE TABLE pv_san.hashes1 STORED AS PARQUET AS
  SELECT
    hash(
      pageview_info['project'],
      pageview_info['language_variant'],
      access_method,
      x_analytics_map['zero'],
      agent_type,
      geocoded_data['continent'],
      geocoded_data['country_code'],
      geocoded_data['country'],
      geocoded_data['subdivision'],
      geocoded_data['city'],
      user_agent_map) as h,
    COUNT(DISTINCT ip) as dip,
    COUNT(DISTINCT pageview_info['page_title']) AS dpv,
    COUNT(1) as c FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2016
    AND month = 1
    AND day = 11
    AND agent_type = 'user'
 GROUP BY
    hash(
      pageview_info['project'],
      pageview_info['language_variant'],
      access_method,
      x_analytics_map['zero'],
      agent_type,
      geocoded_data['continent'],
      geocoded_data['country_code'],
      geocoded_data['country'],
      geocoded_data['subdivision'],
      geocoded_data['city'],
      user_agent_map);
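
As a sanity check on how identifying these hashes already are on the reference day, one could bucket them by the number of distinct IPs they cover. This is only a sketch (not part of the original analysis), assuming the hashes1 table created above:

-- Number of hashes and requests per distinct-IP bucket on the reference day
SELECT
  CASE WHEN dip > 5 THEN 6 ELSE dip END AS dip_bucket, -- 6 stands for "more than 5 IPs"
  COUNT(1) AS hashes,
  SUM(c) AS requests
FROM pv_san.hashes1
GROUP BY CASE WHEN dip > 5 THEN 6 ELSE dip END
ORDER BY dip_bucket
LIMIT 10;

The second query then computes the same hash over pageview_hourly for every day from June 2015 to 2016-01-11, joins it against the reference-day hashes, and sums per day the number of matching hashes and their view counts, bucketed by the number of distinct IPs behind each hash:
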
WITH pv_hashes1 AS
(
  SELECT
    year, month, day,
    hash(project, language_variant, access_method, zero_carrier, agent_type,
         continent, country_code, country, subdivision, city, user_agent_map) as pvh,
    SUM(view_count) as vc
  FROM wmf.pageview_hourly
  WHERE ((year = 2015 AND month > 5) OR (year = 2016 AND month = 1 AND day < 12))
    AND agent_type = 'user'
  GROUP BY
    year,
    month,
    day,
    hash(project, language_variant, access_method, zero_carrier, agent_type,
         continent, country_code, country, subdivision, city, user_agent_map)
)
SELECT
  year, month, day,
  SUM(CASE WHEN dip = 1 THEN 1 ELSE 0 END) as dip1_n,
  SUM(CASE WHEN dip = 2 THEN 1 ELSE 0 END) as dip2_n,
  SUM(CASE WHEN dip = 3 THEN 1 ELSE 0 END) as dip3_n,
  SUM(CASE WHEN dip = 4 THEN 1 ELSE 0 END) as dip4_n,
  SUM(CASE WHEN dip = 5 THEN 1 ELSE 0 END) as dip5_n,
  SUM(CASE WHEN dip > 5 THEN 1 ELSE 0 END) as dipPlus_n,
  SUM(CASE WHEN dip = 1 THEN pvh1.vc ELSE 0 END) as dip1_vc,
  SUM(CASE WHEN dip = 2 THEN pvh1.vc ELSE 0 END) as dip2_vc,
  SUM(CASE WHEN dip = 3 THEN pvh1.vc ELSE 0 END) as dip3_vc,
  SUM(CASE WHEN dip = 4 THEN pvh1.vc ELSE 0 END) as dip4_vc,
  SUM(CASE WHEN dip = 5 THEN pvh1.vc ELSE 0 END) as dip5_vc,
  SUM(CASE WHEN dip > 5 THEN pvh1.vc ELSE 0 END) as dipPlus_vc
FROM pv_hashes1 pvh1
  INNER JOIN pv_san.hashes1 h1
    ON (pvh1.pvh = h1.h)
GROUP BY
  year, month, day
ORDER BY year, month, day
LIMIT 1000;
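
The same join can also be read as a per-day share instead of absolute counts. As a sketch (not part of the original analysis), the fraction of matched pageviews whose hash covers a single IP, the bucket discussed in the results below, could be computed like this:

WITH pv_hashes1 AS
(
  SELECT
    year, month, day,
    hash(project, language_variant, access_method, zero_carrier, agent_type,
         continent, country_code, country, subdivision, city, user_agent_map) as pvh,
    SUM(view_count) as vc
  FROM wmf.pageview_hourly
  WHERE ((year = 2015 AND month > 5) OR (year = 2016 AND month = 1 AND day < 12))
    AND agent_type = 'user'
  GROUP BY
    year, month, day,
    hash(project, language_variant, access_method, zero_carrier, agent_type,
         continent, country_code, country, subdivision, city, user_agent_map)
)
SELECT
  year, month, day,
  -- Share of matched pageviews falling into the single-IP bucket
  SUM(CASE WHEN dip = 1 THEN pvh1.vc ELSE 0 END) / SUM(pvh1.vc) AS dip1_vc_share
FROM pv_hashes1 pvh1
  INNER JOIN pv_san.hashes1 h1
    ON (pvh1.pvh = h1.h)
GROUP BY year, month, day
ORDER BY year, month, day
LIMIT 1000;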

Results

[Figure: Fingerprinting stats.png]

As the chart shows, the decrease in the number of matching hashes is almost linear. Nonetheless, the number of hashes from December that still exist in June is very substantial (over a million distinct hashes), particularly among hashes involving only one IP, the most dangerous bucket in terms of browsing-pattern reconstruction.