Analytics/Data Lake/Traffic/Pageview hourly/Identity reconstruction analysis

From Wikitech

Summary

Pageview_hourly dataset

Refers to data in our pageview_hourly table on hive. This data is private at this time, once we feel confident about deidentifying scheme we will make it public.

Goal

Identify possible identity reconstruction with only pageview_hourly dataset (long lived, we keep it forever) assuming that you have access to cluster and can use some of webrequest data (short lived, only 60 days). Explore whether is possible to reconstruct identity with pageview_hourly data alone. This exercise is to help us reduce the bits of information in the datasets we release so as to protect the privacy of our users.

Results

Using city, country, user_agent_map and method of access it is possible to reconstruct an identity and trace it through our pageview_hourly dataset. It is important to point out that the user_agent_map is not the raw user agent but rather a processed version. While having access to webrequest table makes this reconstruction easier, we could do it just as well using only data in the pageview_hourly table.

It is easier to narrow an identity if the user_agent_map is a rare one, we have successfully done it for blackberry user agents and user agents for iOS devices that report the version of wikipedia mobile app installed in the device. This might not necessarily pinpoint a user but rather a device plus geographical location. In some instances the identity could be a library computer, in others it will be a user's mobile device. The bits of our dataset that carry most information are the user_agent_map and the geo-location fields, particularly when they both hold rare values for the same row.

Next Steps

To de-identify the dataset further we can:

1. Remove user_agent_map entirely

2. Remove user agents that do not appear more than a certain percentage of times


Explore whether without user_agent_map and using only geographical location we can reconstruct the identity too.

Details

Methodology

Plain English

1. We look in webrequest table for an a hour of data cities in which we have less than 10 different IPs coming for the hour.

2. For cities in which on that hour we only have one IP we calculate a hash with pageview_hourly data using:

  • user_agent_map
  • city
  • country
  • access_method

3. We look in pageview_hourly table for a timeperiod couple months back whether any of those hashes we have computed are present

4. We inspect by hand the browsing history of some of the hashes and try to establish whether it is the same identity behind both sessions

Caveats

Having one IP for a given city within an hour can indicate a very small number of users accessing the site but also - as we have seen - might indicate that the whole town is using an internet service that makes all users appear under the same IP. Also, it is worth mentioning that mobile carriers in many instances have a very small range of IPS for their whole user base.

Even in big towns we can find very small numbers of IPs if a small set of people have access to internet there.

Examples

Hourly Dataset: 1 hour of traffic on September 1st, available in webrequest table.

Monthly Dataset: 1 month of traffic (July) for hashes that are present on hourly dataset.

Example 1, same set of pages visited. iOS App

Data from hourly dataset, note that hash computed includes IOS application version:


Hash Page City Country user_agent_map access_method
1309608194 Pagina_principale SomeTown Italy {browser_major":"-","os_family":"iOS","os_major":"8","device_family":"Other","browser_family":"Mobile Safari","os_minor":"3","wmf_app_version":"some app version"} mobile app
1309608194 Forum_(programma_televisivo) SomeTown Italy {"browser_major":"-","os_family":"iOS","os_major":"8","device_family":"Other","browser_family":"Mobile Safari","os_minor":"3","wmf_app_version":"some app version"} mobile app

Data from monthly dataset, note pages viewed are the same than in hourly dataset (which means that we could follow this hash and possibly find additional pages viewed by this identity)

Hash Page City Country date of access
1309608194 Forum_(programma_televisivo) SomeTown Italy 2015-7-30
1309608194 Pagina_principale SomeTown Italy 2015-7-30

Example 2, similar set of pages. Blackberry device

Data from hourly dataset, note that hash computed includes not very common blackberry device. Pages searched seem academical topics in nature.

Hash Page City Country user_agent_map access_method
989888942 Feeble-minded SomeBigCity Nigeria {"browser_major":"7","os_family":"BlackBerry OS","os_major":"7","device_family":"BlackBerry 9810","browser_family":"BlackBerry WebKit","os_minor":"0","wmf_app_version":"-"} mobile web
989888942 The_School_of_Athens SomeBigCity Nigeria {"browser_major":"7","os_family":"BlackBerry OS","os_major":"7","device_family":"BlackBerry 9810","browser_family":"BlackBerry WebKit","os_minor":"0","wmf_app_version":"-"} mobile web



Data from monthly dataset, more academical topics:

Hash Page City Country date of access
989888942 Friction_(disambiguation) SomeBigCity Nigeria 2015-7-10
989888942 Dating SomeBigCity Nigeria 2015-7-11
989888942 Jerk_(physics) SomeBigCity Nigeria 2015-7-16
989888942 Dating SomeBigCity Nigeria 2015-7-11
989888942 Videos SomeBigCity Nigeria 2015-7-11
989888942 Quotient_rule SomeBigCity Nigeria 2015-7-11
989888942 List_of_Nigerian_films SomeBigCity Nigeria 2015-7-9
989888942 Youtube SomeBigCity Nigeria 2015-7-9

SQL

 -- geocoded data on webrequest is like:
 -- {"city":"Unknown","country_code":"--","longitude":"-1","postal_code":"Unknown","timezone":"Unknown","subdivision":"Unknown","continent":"Unknown","latitude":"-1","country":" Unknown"}

    use wmf;
    -- find cities with less than 10 ips
    -- Keep only IPs with reasonable number of hits

    with cities as (
    select geocoded_data['country'] as country, geocoded_data['city'] as city,
            count(distinct client_ip) as cnt,
            collect_set(client_ip) as unique_ips_for_this_city
       from webrequest
      where webrequest_source in('mobile', 'text') and month=09 and day=01 and hour=01
        and geocoded_data['country'] not in ('Unknown', '--', '-')
        and geocoded_data['city'] not in ('Unknown', '--', '-')
      group by geocoded_data['country'], geocoded_data['city']
     having count(distinct client_ip) < 10
    ),

    -- use cities that only have 1 IP for the hour, compute a signature
    unique_signatures_plus_pageview as (
    select hash(ph.user_agent_map, ph.city, ph.country, ph.access_method) ua_geo_hash, 
              ph.page_title, cities.city, cities.country,
              ph.user_agent_map, ph.access_method
        from cities
        join pageview_hourly ph 
       on
       cities.city = ph.city 
       and
       cities.country = ph.country
       where cnt = 1
       and  (ph.year=2015 and ph.month=09 and ph.day=01 and ph.hour=01)
       and ph.agent_type="user" and ph.user_agent_map['os_family'] not in ("Other", "-")),

    -- get all distinct signatures
    uniques as (
    select distinct ua_geo_hash, city, country,
              access_method, user_agent_map
          from unique_signatures_plus_pageview
    )

    -- go back one month
    select uniques.ua_geo_hash, ph2.page_title,ph2.city, ph2.country, 
             concat_ws("-",cast(year as string), cast(month as string), cast(day as string))
       from uniques
       join pageview_hourly as ph2
       on
       uniques.city = ph2.city
       and
       uniques.country = ph2.country
       and
      uniques.user_agent_map = ph2.user_agent_map
      and
     uniques.access_method = ph2.access_method
     where (ph2.year=2015 and ph2.month=07 )
     and ph2.agent_type="user" and ph2.user_agent_map['os_family'] not in ("Other", "-")


Checks And Balances

We also took at our computed signatures that had 1 IP associated in a 1 hour timestamp (25.000) and tried to establish whether they had a few or a lot of addressed associated in a bigger time period, a day. We found about 6000 of our signatures present the following day and of those 3000 still only had 1 IP associated with the signature.

    -- now find how many IPS there are associated with this signature
    -- in the last couple of days in webrequest
    -- we know is at least 1
    select hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method) as ua_geo_hash_w, count(distinct client_ip) as cnt,
        collect_set(client_ip) as unique_ips_for_this_city
    from uniques as u
    join webrequest as w
    on
        u.ua_geo_hash = hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method)
    where
        webrequest_source in('mobile', 'text') and
        agent_type='user' and
        (month=09  and day=02 and year=2015)
        group by hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method)

Other exploits

Will be easy to geo-locate editors for pages with very few pageviews as an edit is normally preceded by a view