Data Platform/Data Lake/Traffic/Pageview hourly/Identity reconstruction analysis
Summary
Pageview_hourly dataset
Refers to data in our pageview_hourly table in Hive. This data is private at this time; once we feel confident about the de-identification scheme we will make it public.
Goal
Identify possible identity reconstruction using only the pageview_hourly dataset (long lived, we keep it forever), assuming that you have access to the cluster and can use some of the webrequest data (short lived, only 60 days). Explore whether it is possible to reconstruct an identity with pageview_hourly data alone. This exercise is to help us reduce the bits of information in the datasets we release so as to protect the privacy of our users.
Results
Using city, country, user_agent_map and method of access it is possible to reconstruct an identity and trace it through our pageview_hourly dataset. It is important to point out that user_agent_map is not the raw user agent but a processed version of it. While having access to the webrequest table makes this reconstruction easier, we could do it just as well using only data in the pageview_hourly table.
It is easier to narrow down an identity if the user_agent_map is a rare one; we have successfully done it for BlackBerry user agents and for iOS user agents that report the version of the Wikipedia mobile app installed on the device. This might not necessarily pinpoint a user but rather a device plus a geographical location: in some instances the identity could be a library computer, in others a user's mobile device. The fields of our dataset that carry the most information are user_agent_map and the geo-location fields, particularly when both hold rare values for the same row.
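One rough way to quantify which fields "carry the most information" is to compute the empirical Shannon entropy of each column: a column dominated by one common value contributes few bits, while rare values add many. A minimal sketch, using made-up toy counts rather than real data:

```python
from collections import Counter
from math import log2

def entropy_bits(values):
    """Empirical Shannon entropy of a column of field values, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Illustrative toy column: one common user agent plus a rare one.
# Rare values like "BlackBerry 9810" are exactly what makes a row linkable.
uas = ["Chrome"] * 96 + ["BlackBerry 9810"] * 4
print(round(entropy_bits(uas), 3))
```

The same function run over city or country columns would show how much additional identifying power the geo-location fields contribute.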
Next Steps
To de-identify the dataset further we can:
1. Remove user_agent_map entirely
2. Remove user agents that appear in less than a certain percentage of rows
We should also explore whether, without user_agent_map and using only geographical location, we can still reconstruct an identity.
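Option 2 above amounts to a frequency filter: suppress rows whose user_agent_map value accounts for less than some threshold share of all rows. A minimal sketch; the field name and threshold are illustrative, not a decided policy:

```python
from collections import Counter

def drop_rare_values(rows, field, min_share):
    """Drop rows whose value in `field` appears in less than
    `min_share` of all rows (rare values are the most identifying)."""
    counts = Counter(r[field] for r in rows)
    total = len(rows)
    return [r for r in rows if counts[r[field]] / total >= min_share]

# Toy data: 98 rows with a common UA, 2 with a rare one.
rows = [{"ua": "Chrome"}] * 98 + [{"ua": "BlackBerry 9810"}] * 2
kept = drop_rare_values(rows, "ua", min_share=0.05)
print(len(kept))  # the 2 rare-UA rows are removed
```

In production this would be a Hive aggregation over the full dataset rather than an in-memory pass, but the cutoff logic is the same.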
Details
Methodology
Plain English
1. We look in the webrequest table, over one hour of data, for cities with fewer than 10 distinct IPs in that hour.
2. For cities that have only one IP in that hour, we compute a hash over the following pageview_hourly fields:
- user_agent_map
- city
- country
- access_method
3. We look in the pageview_hourly table, over a time period a couple of months back, for rows whose hash matches any of the hashes we have computed.
4. We inspect by hand the browsing history of some of the matching hashes and try to establish whether the same identity is behind both sessions.
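The steps above can be sketched end to end. Here `signature` is a stand-in for the Hive `hash()` call used in the SQL below (the exact hash function and the toy rows are illustrative):

```python
import hashlib

def signature(user_agent_map, city, country, access_method):
    """Stable signature over the four fields used to link rows
    across datasets (stand-in for Hive's hash())."""
    key = "|".join([user_agent_map, city, country, access_method])
    return hashlib.sha256(key.encode()).hexdigest()[:16]

# Step 2: signatures for rows from single-IP cities in the hourly data.
hourly = [
    {"ua": "iOS 8.3 app", "city": "SomeTown", "country": "Italy",
     "access": "mobile app"},
]
sigs = {signature(r["ua"], r["city"], r["country"], r["access"]) for r in hourly}

# Step 3: look for the same signatures in an older month of data.
monthly = [
    {"ua": "iOS 8.3 app", "city": "SomeTown", "country": "Italy",
     "access": "mobile app", "page": "Pagina_principale"},
    {"ua": "Android", "city": "OtherTown", "country": "Italy",
     "access": "mobile web", "page": "Forum"},
]
matches = [r for r in monthly
           if signature(r["ua"], r["city"], r["country"], r["access"]) in sigs]
print([m["page"] for m in matches])  # only the matching identity's pages
```

Step 4 (manual inspection of the matched browsing history) is exactly what the examples below walk through.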
Caveats
Having one IP for a given city within an hour can indicate a very small number of users accessing the site, but - as we have seen - it might also indicate that the whole town uses an internet service that makes all users appear under the same IP. It is also worth mentioning that mobile carriers in many instances have a very small range of IPs for their whole user base.
Even in big towns we can find very small numbers of IPs if only a small set of people there have internet access.
Examples
Hourly Dataset: 1 hour of traffic on September 1st, available in webrequest table.
Monthly Dataset: 1 month of traffic (July) for hashes that are present on hourly dataset.
Example 1, same set of pages visited. iOS App
Data from the hourly dataset; note that the computed hash includes the iOS application version:
Hash | Page | City | Country | user_agent_map | access_method |
---|---|---|---|---|---|
1309608194 | Pagina_principale | SomeTown | Italy | {"browser_major":"-","os_family":"iOS","os_major":"8","device_family":"Other","browser_family":"Mobile Safari","os_minor":"3","wmf_app_version":"some app version"} | mobile app |
1309608194 | Forum_(programma_televisivo) | SomeTown | Italy | {"browser_major":"-","os_family":"iOS","os_major":"8","device_family":"Other","browser_family":"Mobile Safari","os_minor":"3","wmf_app_version":"some app version"} | mobile app |
Data from the monthly dataset; note the pages viewed are the same as in the hourly dataset (which means that we could follow this hash and possibly find additional pages viewed by this identity):
Hash | Page | City | Country | date of access |
---|---|---|---|---|
1309608194 | Forum_(programma_televisivo) | SomeTown | Italy | 2015-7-30 |
1309608194 | Pagina_principale | SomeTown | Italy | 2015-7-30 |
Example 2, similar set of pages. Blackberry device
Data from the hourly dataset; note that the computed hash includes a fairly uncommon BlackBerry device. The pages searched seem academic in nature.
Hash | Page | City | Country | user_agent_map | access_method |
---|---|---|---|---|---|
989888942 | Feeble-minded | SomeBigCity | Nigeria | {"browser_major":"7","os_family":"BlackBerry OS","os_major":"7","device_family":"BlackBerry 9810","browser_family":"BlackBerry WebKit","os_minor":"0","wmf_app_version":"-"} | mobile web |
989888942 | The_School_of_Athens | SomeBigCity | Nigeria | {"browser_major":"7","os_family":"BlackBerry OS","os_major":"7","device_family":"BlackBerry 9810","browser_family":"BlackBerry WebKit","os_minor":"0","wmf_app_version":"-"} | mobile web |
Data from the monthly dataset, with more academic topics:
Hash | Page | City | Country | date of access |
---|---|---|---|---|
989888942 | Friction_(disambiguation) | SomeBigCity | Nigeria | 2015-7-10 |
989888942 | Dating | SomeBigCity | Nigeria | 2015-7-11 |
989888942 | Jerk_(physics) | SomeBigCity | Nigeria | 2015-7-16 |
989888942 | Dating | SomeBigCity | Nigeria | 2015-7-11 |
989888942 | Videos | SomeBigCity | Nigeria | 2015-7-11 |
989888942 | Quotient_rule | SomeBigCity | Nigeria | 2015-7-11 |
989888942 | List_of_Nigerian_films | SomeBigCity | Nigeria | 2015-7-9 |
989888942 | Youtube | SomeBigCity | Nigeria | 2015-7-9 |
SQL
-- geocoded data on webrequest is like:
-- {"city":"Unknown","country_code":"--","longitude":"-1","postal_code":"Unknown","timezone":"Unknown","subdivision":"Unknown","continent":"Unknown","latitude":"-1","country":" Unknown"}
use wmf;
-- find cities with less than 10 ips
-- Keep only IPs with reasonable number of hits
with cities as (
select geocoded_data['country'] as country, geocoded_data['city'] as city,
count(distinct client_ip) as cnt,
collect_set(client_ip) as unique_ips_for_this_city
from webrequest
where webrequest_source in('mobile', 'text') and month=09 and day=01 and hour=01
and geocoded_data['country'] not in ('Unknown', '--', '-')
and geocoded_data['city'] not in ('Unknown', '--', '-')
group by geocoded_data['country'], geocoded_data['city']
having count(distinct client_ip) < 10
),
-- use cities that only have 1 IP for the hour, compute a signature
unique_signatures_plus_pageview as (
select hash(ph.user_agent_map, ph.city, ph.country, ph.access_method) ua_geo_hash,
ph.page_title, cities.city, cities.country,
ph.user_agent_map, ph.access_method
from cities
join pageview_hourly ph
on
cities.city = ph.city
and
cities.country = ph.country
where cnt = 1
and (ph.year=2015 and ph.month=09 and ph.day=01 and ph.hour=01)
and ph.agent_type="user" and ph.user_agent_map['os_family'] not in ("Other", "-")),
-- get all distinct signatures
uniques as (
select distinct ua_geo_hash, city, country,
access_method, user_agent_map
from unique_signatures_plus_pageview
)
-- go back one month
select uniques.ua_geo_hash, ph2.page_title,ph2.city, ph2.country,
concat_ws("-",cast(year as string), cast(month as string), cast(day as string))
from uniques
join pageview_hourly as ph2
on
uniques.city = ph2.city
and
uniques.country = ph2.country
and
uniques.user_agent_map = ph2.user_agent_map
and
uniques.access_method = ph2.access_method
where (ph2.year=2015 and ph2.month=07 )
and ph2.agent_type="user" and ph2.user_agent_map['os_family'] not in ("Other", "-")
Checks And Balances
We also took our computed signatures that had 1 IP associated within a one-hour window (about 25,000 of them) and tried to establish whether they had few or many addresses associated over a bigger time period, a day. We found about 6,000 of our signatures present the following day, and of those about 3,000 still had only 1 IP associated with the signature.
-- now find how many IPs are associated with each signature
-- over the following day in webrequest (we know it is at least 1);
-- `uniques` refers to the CTE defined in the query above
select hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method) as ua_geo_hash_w, count(distinct client_ip) as cnt,
collect_set(client_ip) as unique_ips_for_this_city
from uniques as u
join webrequest as w
on
u.ua_geo_hash = hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method)
where
webrequest_source in('mobile', 'text') and
agent_type='user' and
(month=09 and day=02 and year=2015)
group by hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method)
Other exploits
It would be easy to geo-locate the editors of pages with very few pageviews, as an edit is normally preceded by a view.