User:HTriedman (WMF)/Pageviews by Country Dataset (Differential Privacy)
This dataset (country_may_dp) describes the number of pageviews by country on Wikipedia articles for each day within May 1-10, 2024. This dataset was created to highlight trends of Wikipedia traffic patterns on a regional or national level that may not be visible at a global level. This dataset contains 4,280,528 records.
Dataset Schema
| Variable name | Type | Description | Example value |
|---|---|---|---|
| project | string | The project (site) on which the views occurred | en.wikipedia |
| page_title | string | The title of the Wikipedia page on which the views occurred | Taylor_Swift |
| country | string | The name of the country where these views occurred | France |
| country_code | string | The 2-letter ISO code of the country where these views occurred | FR |
| privacy_protected_views | integer | The number of pageviews reported using differential privacy | 1,250 |
| year | integer | The year in which pageviews were made | 2024 |
| month | integer | The month in which pageviews were made; 1 = January … 12 = December | 5 |
| day | integer | The day of the month on which pageviews were made | 9 |
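As a rough illustration of the schema above, the following Python sketch models one record as a typed object. The class name `PageviewRow`, the helper `parse_row`, and the assumption that rows arrive as dictionaries of strings (for example from `csv.DictReader`) are hypothetical and not part of the dataset's documentation; stripping thousands separators such as the one in the example value 1,250 is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PageviewRow:
    """One record of the dataset, mirroring the schema table."""
    project: str
    page_title: str
    country: str
    country_code: str
    privacy_protected_views: int
    year: int
    month: int
    day: int

    @property
    def view_date(self) -> date:
        # Combine the three integer columns into a single date.
        return date(self.year, self.month, self.day)


def parse_row(raw: dict) -> PageviewRow:
    """Convert a dict of strings (hypothetical input format) to a typed row."""
    return PageviewRow(
        project=raw["project"],
        page_title=raw["page_title"],
        country=raw["country"],
        country_code=raw["country_code"],
        # Assumption: counts may carry thousands separators, e.g. "1,250".
        privacy_protected_views=int(raw["privacy_protected_views"].replace(",", "")),
        year=int(raw["year"]),
        month=int(raw["month"]),
        day=int(raw["day"]),
    )
```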
In this dataset, the privacy_protected_views variable is protected under differential privacy. Differential privacy is an approach to data sharing which involves injecting statistical noise into released statistics—like counts of pageviews—while maintaining overall patterns in the data. This added statistical noise limits anyone’s ability to recover information about specific people using the published dataset.
How was the data pre-processed?
The country_may_dp dataset contains at most 10 unique pageviews from each device per day. This is done to reduce the impact of bots and to protect user privacy.
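The per-device cap can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function name `cap_pageviews`, the `(device_id, day, page_title)` event format, the reading of "unique" as "distinct pages", and the choice to keep the first 10 pages seen (rather than, say, a random sample) are all assumptions.

```python
from collections import defaultdict

MAX_VIEWS_PER_DEVICE_PER_DAY = 10  # cap stated in the text


def cap_pageviews(events, cap=MAX_VIEWS_PER_DEVICE_PER_DAY):
    """Keep at most `cap` distinct pages per (device, day).

    `events` is an iterable of (device_id, day, page_title) tuples;
    this input format is hypothetical.
    """
    seen = defaultdict(set)  # (device_id, day) -> pages already kept
    kept = []
    for device_id, day, page_title in events:
        pages = seen[(device_id, day)]
        if page_title in pages:
            continue  # repeat view of the same page: not counted again
        if len(pages) >= cap:
            continue  # device already reached its daily cap
        pages.add(page_title)
        kept.append((device_id, day, page_title))
    return kept
```

Bounding each device's contribution in this way also bounds how much any single device can change a published count, which is what lets a fixed amount of noise provide a privacy guarantee.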
How was privacy protection applied?
The original counts of pageviews are not included in the dataset for privacy reasons. Instead, the original counts are replaced by privacy-protected counts (i.e., privacy_protected_views). Specifically, each original count is replaced with a value drawn at random from a Gaussian distribution centered on the original, unprotected count. Figure 1 depicts this process:
tk figure
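The noise-addition step described above can be sketched in a few lines of Python. The standard deviation is not stated on this page; the value below is back-calculated from the 95% margin of error of 35.7 pageviews reported for this dataset, assuming a plain Gaussian, and should be treated as an assumption. Rounding to an integer and the function name `protect` are also assumptions made for illustration.

```python
import random

MARGIN_95 = 35.7           # 95% margin of error reported for this dataset
Z_95 = 1.959964            # two-sided 95% quantile of the standard normal
SIGMA = MARGIN_95 / Z_95   # assumed noise scale, roughly 18.2 pageviews


def protect(original_views, rng):
    """Return a noisy count drawn from a Gaussian centered on the original."""
    # Assumption: the published value is rounded to an integer.
    return round(rng.gauss(original_views, SIGMA))


rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [protect(200, rng) for _ in range(20000)]
share_within_margin = sum(abs(s - 200) <= MARGIN_95 for s in samples) / len(samples)
print(f"share of samples within the margin: {share_within_margin:.3f}")
```

Under these assumptions, about 95% of the simulated values should fall within 35.7 pageviews of the original count of 200.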
How does privacy protection affect the accuracy of the published data?
The 95% margin of error for each reported value in the privacy_protected_views variable is 35.7 pageviews.
What does this mean? The reported privacy-protected count is within 35.7 pageviews of the (unreported) original count, with 95% probability over the random sampling of the privacy-protected count (see Figure 1).
Only rows with a privacy_protected_views count greater than or equal to 90 are reported in the dataset. This is done to reduce the number of spurious records (records with an original count of 0 that are reported to be above 0 in the privacy_protected_views variable).
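To see why a release threshold of 90 suppresses spurious records, the following sketch estimates how likely it is that a page with an original count of 0 would be published. It assumes plain Gaussian noise whose standard deviation is inferred from the 35.7 margin of error; that inference, and ignoring any rounding, are assumptions rather than documented details of the mechanism.

```python
from statistics import NormalDist

MARGIN_95 = 35.7   # 95% margin of error stated in the text
THRESHOLD = 90     # minimum published privacy_protected_views value

# Assumed noise scale: the margin divided by the two-sided 95% normal quantile.
sigma = MARGIN_95 / NormalDist().inv_cdf(0.975)

# Probability that a page with an original count of 0 lands at or above 90.
p_spurious = 1 - NormalDist(mu=0, sigma=sigma).cdf(THRESHOLD)

print(f"assumed sigma: {sigma:.2f}")
print(f"probability that a zero-count page is published: {p_spurious:.2e}")
```

Because the threshold sits several standard deviations above zero, this probability comes out well below one in a million for any single page under these assumptions.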
How can you use the published data?
You can use this dataset to understand country-specific patterns and trends regarding pageviews. The accuracy metric described above (the 95% margin of error for each reported pageview) can help you calibrate your confidence in the results of your analysis.
Let’s walk through an example. Suppose the country_may_dp dataset reports that for a specific project, country, and date (i.e., year, month, day), page A has a value of 203, page B has a value of 203, and page C has a value of 298, where each value corresponds to the page’s respective privacy_protected_views variable.
How can you use this information to understand each page’s original_views variable for that same project, country, and date?
- You can use the 95% margin of error provided for the privacy_protected_views value to make deductions about page A’s original_views value. For example, you can deduce that, with 95% probability, the value of page A’s original_views variable is between 167.3 and 238.7 (that is, 203 ± 35.7).
- You cannot say anything for certain (i.e., with 100% confidence) about the value of page A’s original_views variable. For example, you cannot know for certain whether page A’s original_views value is above or below 200.
- You can use the 95% margin of error provided for the privacy_protected_views value to make some deductions when comparing page B’s original_views value and page C’s original_views value. For example, you can deduce that, with at least 95% probability, the value of page B’s original_views variable is less than the value of page C’s original_views variable.
- You cannot say anything for certain (i.e., with 100% confidence) when comparing the value of page A’s original_views variable with the value of page B’s original_views variable. For example, you cannot know for certain whether the value of page A’s original_views variable is the same as the value of page B’s original_views variable.
- You cannot make any statements or deductions about pages with privacy_protected_views counts less than 90.
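The arithmetic behind this walkthrough can be reproduced with a short Python sketch. The helper names `interval_95` and `clearly_less` are hypothetical, and treating non-overlapping 95% ranges as evidence that one original count is smaller than another is a conservative rule of thumb rather than a method prescribed by the dataset's authors.

```python
MARGIN_95 = 35.7  # 95% margin of error stated for this dataset


def interval_95(reported_views):
    """Range expected to contain the original count with 95% probability."""
    return (reported_views - MARGIN_95, reported_views + MARGIN_95)


def clearly_less(reported_a, reported_b):
    """True if the 95% range of a lies entirely below the 95% range of b."""
    return interval_95(reported_a)[1] < interval_95(reported_b)[0]


pages = {"A": 203, "B": 203, "C": 298}
for name, views in pages.items():
    low, high = interval_95(views)
    print(f"page {name}: {low:.1f} to {high:.1f}")

print("B below C:", clearly_less(pages["B"], pages["C"]))
print("A below B:", clearly_less(pages["A"], pages["B"]))
```

Pages A and B receive identical ranges, so the check cannot separate them, while the range for page C lies entirely above the range for page B.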