Analytics/Data Lake/Traffic/Pageview hourly/K Anonymity Threshold Analysis

From Wikitech

In a plan to secure our user data, the pageview_hourly dataset needs to be sanitized in such a way that it does not allow to track user path.

See this page for a broad view on the pageview_hourly sanitization project, and this one for a detailed description of the algorithm proposal.

This page go through the various analysis we have been doing in defining K, the minimum number of distinct IPs a finger-printing group must have not to go through anonymization (see an introduction on k-anonymity).

Dataset specification

  • One hour of pageviews (${YEAR},${MONTH},${DAY},${HOUR})
  • filtered from
    • IPs having more than IP_UA_THRESHOLD different user agents
      (trying to remove mobile "many devices for a single IP" artifact)
    • IPs having more than IP_REQ_THRESHOLD views
      (trying to remove u "many devices for a single IP" artifact)
  • finger-printing Dimensions we took as defining a group are:
    • project (not to be anonymized)
    • language_variant (not to be anonymized)
    • access method (not to be anonymized)
    • agent type, (not to be anonymized)
    • zero carrier,
    • geocoding hierarchical data (continent, country, subdivision, city)
    • user_agent_map
  • Data to analyze
    • Distinct IPs (count and list)
    • Titles of page viewed
    • Number of requests
CREATE TABLE pv_san.idents_buckets STORED AS PARQUET AS
WITH wrong_ips AS  (
    SELECT
        client_ip AS w_ip,
        COUNT(DISTINCT user_agent) AS user_agent_count,
        COUNT(1) AS request_count
    FROM wmf.webrequest
    WHERE webrequest_source IN ('mobile', 'text')
        AND year=${YEAR} AND month=${MONTH} AND day=${DAY} AND hour=${HOUR}
        AND is_pageview
    GROUP BY client_ip
    HAVING (COUNT(DISTINCT user_agent) >= ${IP_UA_THRESHOLD}
            OR COUNT(1) >= ${IP_REQ_THRESHOLD} )
)
SELECT
    pageview_info['project'] as project,
    pageview_info['language_variant'] as language,
    agent_type,
    access_method,
    agent_type,
    x_analytics_map['zero'] AS zero_carrier,
    geocoded_data['continent'] AS continent,
    geocoded_data['country'] AS country,
    geocoded_data['subdivision'] AS subdivision,
    geocoded_data['city'] AS city,
    user_agent_map,
    COUNT(DISTINCT client_ip) AS ips_count,
    COLLECT_SET(client_ip) AS unique_ips,
    COUNT(DISTINCT pageview_info['page_title']) AS pages_count,
    COLLECT_SET(pageview_info['page_title']) AS unique_pages,
    COUNT(1) AS request_count
FROM wmf.webrequest webrequest
  LEFT OUTER JOIN wrong_ips ON (webrequest.client_ip = wrong_ips.w_ip)
WHERE wrong_ips.w_ip IS NULL and
    AND webrequest_source IN ('mobile', 'text')
    AND year=${YEAR} and month=${MONTH} and day=${DAY} and hour=${HOUR}
    AND geocoded_data['continent'] not in ('Unknown', '--', '-')
    AND geocoded_data['country'] not in ('Unknown', '--', '-')
    AND geocoded_data['subdivision'] not in ('Unknown', '--', '-')
    AND geocoded_data['city'] not in ('Unknown', '--', '-')
    AND is_pageview
    AND agent_type = 'user'
GROUP BY
    pageview_info['project'],
    pageview_info['language_variant'],
    access_method,
    agent_type,
    x_analytics_map['zero'],
    geocoded_data['continent'],
    geocoded_data['country'],
    geocoded_data['subdivision'],
    geocoded_data['city'],
    user_agent_map;

General shape of data

bucket_count
The number of unique sets of bucketed values
request_count
The total number of requests
bucket_count request_count
2865382 21135520
SELECT
     COUNT(1) AS bucket_count,
     SUM(request_count) AS request_count
FROM pv_san.idents_buckets
LIMIT 100000;

It looks like we see ~10 requests per unique bucket on average, but the distribution is highly skewed as can be seen below.

Finger-printing groups, distinct ips, distinct pages and requests

We only dig into finger-printing groups that have 10 or less distinct IPs, or 10 or less distinct pages since those are where the anonymization problem exists.

Tables below show show two results:

  • Number of request per IP is stable at 2.4 and number of request per page is stable at 1.2, making number of pages per IP stable at 2.
  • Both for distinct ips and distinct pages, number of buckets and number of requests follow power laws (more or less steep).

Distinct IPs

ips_count bucket_count % %_cumul request_count % %_cumul request_per_ip %_difference_to_average
1 2127846 74.26% 74.26% 4617515 21.85% 21.85% 2.17 4.74%
2 315539 11.01% 85.27% 1416920 6.70% 28.55% 2.25 1.43%
3 125822 4.39% 89.66% 868266 4.11% 32.66% 2.30 -0.98%
4 68535 2.39% 92.06% 623021 2.95% 35.61% 2.27 0.23%
5 42663 1.49% 93.54% 488106 2.31% 37.92% 2.29 -0.45%
6 29511 1.03% 94.57% 402144 1.90% 39.82% 2.27 0.30%
7 21460 0.75% 95.32% 343318 1.62% 41.44% 2.29 -0.33%
8 16422 0.57% 95.90% 299057 1.41% 42.86% 2.28 0.07%
9 12803 0.45% 96.34% 267363 1.26% 44.12% 2.32 -1.86%
10 10465 0.37% 96.71% 245865 1.16% 45.29% 2.35 -3.14%
SELECT
     ips_count,
     COUNT(1) AS bucket_count,
     SUM(request_count) AS request_count
FROM pv_san.idents_buckets
WHERE ips_count <= 10
GROUP BY ips_count
ORDER BY ips_count
LIMIT 10;


Distinct pages

pages_count bucket_count % %_cumul request_count % %_cumul request_per_page %_difference_to_average
1 1551522 54.15% 54.15% 2065809 9.77% 9.77% 1.33 -7.42%
2 463215 16.17% 70.31% 1164015 5.51% 15.28% 1.26 -1.37%
3 223403 7.80% 78.11% 822750 3.89% 19.17% 1.23 0.96%
4 131879 4.60% 82.71% 650671 3.08% 22.25% 1.23 0.49%
5 87189 3.04% 85.75% 533933 2.53% 24.78% 1.22 1.19%
6 61462 2.14% 87.90% 451024 2.13% 26.91% 1.22 1.33%
7 45987 1.60% 89.50% 389742 1.84% 28.76% 1.21 2.32%
8 35842 1.25% 90.76% 353985 1.67% 30.43% 1.23 0.40%
9 28092 0.98% 91.74% 313424 1.48% 31.91% 1.24 -0.01%
10 23060 0.80% 92.54% 279851 1.32% 33.24% 1.21 2.09%
SELECT
     pages_count,
     COUNT(1) AS bucket_count,
     SUM(request_count) AS request_count
FROM pv_san.idents_buckets
WHERE pages_count <= 10
GROUP BY pages_count
ORDER BY pages_count
LIMIT 10;

Example of pages viewed together

By number of distinct IPs

2 distinct IPs

  • English_words_without_vowels, Davido
  • Main_Page, Jay_Chou
  • Flunitrazepam, Photograph, Main_Page
  • Derivative, Leibniz_integral_rule, Fundamental_theorem_of_calculus,
    Sir_John_Mandeville, Crocodile
  • Main_Page, Hamster_wheel
  • List_of_countries_by_GDP_(PPP), Luxembourg, List_of_countries_by_GDP_(nominal),
    List_of_countries_by_GDP_(nominal)_per_capita, List_of_countries_by_GDP_(PPP)_per_capita,
    The_Curious_Case_of_Benjamin_Button_(film), The_Social_Network
  • Substance_over_form, SsangYong_Rexton
  • Barkha_Dutt, Arnab_Goswami, Sagarika_Ghose, Rajdeep_Sardesai, Rosacea
  • Bigg_Boss_9, Debtor, Sovereign_debt_crisis, Great_Recession, Sovereign_default,
    European_sovereign-debt_crisis, Credit_worthiness, Greek_government-debt_crisis,
    Secondary_market, Loan_covenant, Credit_rating, Default_(finance), Main_Page,
    Tom_Enders, Airbus_Group, Financial_crisis_of_2007–08, Economic_bubble,
    Credit_rating_agency, * Intrinsic_value_(finance), Bailout
  • Trilateration, Ilya_Ilyin
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE ips_count = 2
    AND project = 'en.wikipedia'
LIMIT 10;

3 distinct IPs

  • Sharman_Joshi, Amendment_of_the_Constitution_of_India,
    List_of_amendments_of_the_Constitution_of_India,
    Ninety-third_Amendment_of_the_Constitution_of_India, Abu_Bakr_al-Baghdadi,
    Abu_Ahmed_al-Kuwaiti, Alfreda_Frances_Bikowsky, Ayman_al-Zawahiri, Main_Page,
    Kelly_Hu, Battlefield_Hardline
  • Pahlaj_Nihalani, Main_Page, Okariki_Okaru, Johnson_&_Johnson
  • GDP, List_of_countries_by_GDP_(PPP), Gross_domestic_product, GDP_per_capita,
    List_of_countries_by_GDP_(nominal)_per_capita, List_of_countries_by_GDP_(nominal),
    Purchasing_power_parity, Nauru, Turkey, NATO, Main_Page, BATE_Borisov, FC_BATE_Borisov
  • Main_Page, The_Revenant_(2015_film), Brandade, Bruschetta
  • Jon_Hamm, Minions_(film), Romulan, Vitruvian_Man
  • The_Late_Show_with_Stephen_Colbert, Smart_casual, Alice_in_Chains, Main_Page
  • Howard_Schultz, Starbucks, Influenza, Rhinorrhea, 1918_flu_pandemic, Peet's_Coffee_&_Tea
  • Triple_H, Triple_X_(professional_wrestling), Stephanie_McMahon, Nauru, Tuvalu,
    Funafuti, Palau, President_of_the_United_States, Rolex
  • Sargassum_fish, Kim_Richards, Kyle_Richards, Jamiroquai, Jay_Kay
  • Jamie_Vardy, Mad_Season_(band), Greg_Graffin, General_Dynamics_F-16_Fighting_Falcon_variants,
    Sukhoi_Su-24, Academy_Award_for_Best_Picture, The_Departed, Eurofighter_Typhoon
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE ips_count = 3
LIMIT 100;

By number of distinct pages (with more than only 1 IP)

Showing 20 random samples having X distinct pages and more than one IP, we can see that even when only 2 distinct pages are seen, they are from different topics, preventing having a (single) section reconstructed.

2 distinct pages

SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 2
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;

3 distinct pages

SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 3
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;

4 distinct pages

SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 4
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;

5 distinct pages - Detailed analysis

Last three seems session related, but the other two not.
The last four seems electronic related, the first is not.
The last three are related (Sister sister TV show), the first two not.
The last two are related, the other three seem not.
The middle three are related, the other two are not.
Number 3 and 4 are related, but other three seem not.
Number 2 and three relates, other three seem not.
The four first are related, the last one is not.
The last two relate (same movie name), but the other three seem not.
No strong link between the five pages.
First two are related, other three are unrelated.
No strong link between the five pages.
First three are related, other two are not.
Last four are related, first is not.
Last four are related, first is not.
Last two are related, other three are not.
First two and last two form groups , third alone.
Last four relate, first doesn't.
Unrelated pages.
Last three are related, first two are not.
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 5
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;