User:HTriedman (WMF)/Pageviews by Country Dataset (Differential Privacy)

From Wikitech
This is a draft page intended for use in an ongoing academic study. The page is not WMF’s official documentation for this dataset, nor is it reflective of WMF’s views or policy.

This dataset (country_may_dp) describes the number of pageviews by country on Wikipedia articles for each day within May 1-10, 2024. This dataset was created to highlight trends of Wikipedia traffic patterns on a regional or national level that may not be visible at a global level. This dataset contains 4,280,528 records.

Dataset Schema

Variable name | Type | Description | Example value
project | string | The site on which the views occurred | en.wikipedia
page_title | string | The title of the Wikipedia page on which the views occurred | Taylor_Swift
country | string | The name of the country where these views occurred | France
country_code | string | The 2-letter ISO code of the country where these views occurred | FR
privacy_protected_views | integer | The number of pageviews reported using differential privacy | 1,250
year | integer | The year in which pageviews were made | 2024
month | integer | The month in which pageviews were made; 1 = January … 12 = December | 5
day | integer | The day of the month on which pageviews were made | 9

In this dataset, the privacy_protected_views variable is protected under differential privacy. Differential privacy is an approach to data sharing which involves injecting statistical noise into released statistics—like counts of pageviews—while maintaining overall patterns in the data. This added statistical noise limits anyone’s ability to recover information about specific people using the published dataset.

How was the data pre-processed?

The country_may_dp dataset contains at most 10 unique pageviews from each device per day. This is done to reduce the impact of bots and to protect user privacy.
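
As a rough illustration of this kind of per-device cap, the sketch below keeps at most 10 distinct pages per device per day. The input layout, the `device_id` field, the reading of "unique pageviews" as distinct page titles, and the choice to keep the first pages seen are all assumptions made for illustration; this page does not describe how the actual pipeline selects which pageviews to keep.

```python
from collections import defaultdict

MAX_PAGES_PER_DEVICE_PER_DAY = 10  # cap stated in the text above


def bound_contributions(pageviews, cap=MAX_PAGES_PER_DEVICE_PER_DAY):
    """Keep at most `cap` distinct pages per (device, day).

    `pageviews` is an iterable of (device_id, day, page_title) tuples.
    `device_id` is a hypothetical identifier, and keeping the first `cap`
    pages seen is an assumption, not the documented selection rule.
    """
    kept = defaultdict(list)  # (device_id, day) -> distinct page titles kept so far
    for device_id, day, page_title in pageviews:
        pages = kept[(device_id, day)]
        if page_title not in pages and len(pages) < cap:
            pages.append(page_title)
    return [
        (device_id, day, page_title)
        for (device_id, day), pages in kept.items()
        for page_title in pages
    ]
```

Capping each device's daily contribution bounds how much any one device can change the counts, which is what allows a fixed amount of noise to protect every device equally.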

How was privacy protection applied?

The original counts of pageviews are not included in the dataset for privacy reasons. Instead, the original counts are replaced by privacy-protected counts (i.e., privacy_protected_views). Specifically, each original count is replaced with a randomly drawn value from a Gaussian distribution centered around the original, unprotected count: privacy_protected_views ~ N(original count, σ²), where σ is the standard deviation of the added noise. Figure 1 depicts this process:

[Figure 1: to come]
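
The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the noise scale `SIGMA` is not stated on this page and is inferred here from the 95% margin of error of 35.7 pageviews given in the accuracy section (35.7 / 1.96, roughly 18.2), and rounding to an integer is an assumption based on the integer type in the schema. The function name `protect` is hypothetical.

```python
import random

MARGIN_OF_ERROR_95 = 35.7          # stated in the accuracy section of this page
SIGMA = MARGIN_OF_ERROR_95 / 1.96  # assumption: inferred noise scale, about 18.2


def protect(original_views, sigma=SIGMA, rng=random):
    """Return one noisy draw from a Gaussian centered on `original_views`.

    Rounding to an integer is an assumption; the page only says the
    published column has integer type.
    """
    return round(rng.gauss(original_views, sigma))


# Example: the same original count gives a different published value per draw.
rng = random.Random(2024)
draws = [protect(200, rng=rng) for _ in range(5)]
```

Each call draws fresh noise, so repeated runs on the same original count scatter around it, mostly within the stated margin of error.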

How does privacy protection affect the accuracy of the published data?

The 95% margin of error for each reported value in the privacy_protected_views variable is 35.7 pageviews.

What does this mean? The reported privacy-protected count is within 35.7 pageviews of the (unreported) original count, with 95% probability over the random sampling of the privacy-protected count (see Figure 1).

Only rows with a privacy_protected_views count greater than or equal to 90 are reported in the dataset. This is done to reduce the number of spurious records (records with an original count of 0 that are reported to be above 0 in the privacy_protected_views variable).
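
To see why a release threshold of 90 suppresses spurious records, the sketch below filters rows at the threshold and estimates how likely a page with an original count of 0 would be to clear it. The noise scale `SIGMA` is an assumption inferred from the 95% margin of error (35.7 / 1.96), and the function names and row layout are hypothetical.

```python
from statistics import NormalDist

RELEASE_THRESHOLD = 90  # stated in the text above
SIGMA = 35.7 / 1.96     # assumption: noise scale inferred from the margin of error


def release(noisy_rows, threshold=RELEASE_THRESHOLD):
    """Keep only rows whose noisy count is at or above the threshold."""
    return [row for row in noisy_rows if row["privacy_protected_views"] >= threshold]


def spurious_probability(original_views=0, sigma=SIGMA, threshold=RELEASE_THRESHOLD):
    """Probability that Gaussian noise lifts `original_views` to the threshold or above."""
    return 1.0 - NormalDist(mu=original_views, sigma=sigma).cdf(threshold)


rows = [
    {"page_title": "Page_X", "privacy_protected_views": 1250},
    {"page_title": "Page_Y", "privacy_protected_views": 42},
]
published = release(rows)            # only Page_X is kept
p_zero = spurious_probability(0)     # chance that a zero-count page is published
```

Under these assumptions the threshold sits almost five standard deviations above zero, so `p_zero` comes out below one in a million. The trade-off is that pages with small but real counts are also dropped.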

How can you use the published data?

You can use this dataset to understand country-specific patterns and trends regarding pageviews. The accuracy metric described above (the 95% margin of error for each reported pageview) can help you calibrate your confidence in the results of your analysis.

Let’s walk through an example. Suppose the country_may_dp dataset reports that for a specific project, country, and date (i.e., year, month, day), page A has a value of 203, page B has a value of 203, and page C has a value of 298, where each value corresponds to the page’s respective privacy_protected_views variable.

How can you use this information to understand each page’s original, unreported count (referred to below as its original_views variable) for that same project, country, and date?

  • You can use the 95% margin of error provided for the privacy_protected_views value to make deductions about page A’s original_views value. For example, you can deduce that with 95% probability, the value of page A’s original_views variable is between 167.3 and 238.7 (that is, 203 ± 35.7).
  • You cannot say anything for certain (i.e., with 100% confidence) about the value of page A’s original_views variable. For example, you cannot know for certain whether page A’s original_views value is above or below 200.
  • You can use the 95% margin of error provided for the privacy_protected_views value to make some deductions when comparing page B’s original_views value and page C’s original_views value. For example, you can deduce that with at least 95% probability, the value of page B’s original_views variable is less than the value of page C’s original_views variable.
  • You cannot say anything for certain (i.e., with 100% confidence) when comparing the value of page A’s original_views variable with the value of page B’s original_views variable. For example, you cannot know for certain whether the value of page A’s original_views variable is the same as the value of page B’s original_views variable.
  • You cannot make any statements or deductions about pages with privacy_protected_views counts less than 90.
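
The reasoning in the list above can be reproduced with a short calculation. The sketch below builds the 95% interval around each reported value and checks whether two intervals are separated. The helper names are hypothetical. Treating non-overlapping intervals as evidence of an ordering is a simple, conservative rule of thumb rather than a method prescribed by this page.

```python
MARGIN_OF_ERROR_95 = 35.7  # stated in the accuracy section of this page


def interval_95(protected_views, margin=MARGIN_OF_ERROR_95):
    """Return (low, high) bounds for the original count at the 95% level."""
    return (protected_views - margin, protected_views + margin)


def clearly_less(first, second):
    """True if the whole interval `first` lies below the interval `second`."""
    return first[1] < second[0]


reported = {"A": 203, "B": 203, "C": 298}
intervals = {page: interval_95(views) for page, views in reported.items()}

for page, (low, high) in intervals.items():
    print(f"page {page}: original_views likely between {low:.1f} and {high:.1f}")

# B's interval ends below the start of C's, so B < C is well supported.
b_below_c = clearly_less(intervals["B"], intervals["C"])
# A and B have identical intervals, so neither ordering can be claimed.
a_below_b = clearly_less(intervals["A"], intervals["B"])
```

Here `b_below_c` is true and `a_below_b` is false, matching the statements above: the data support ranking page B below page C, but say nothing decisive about page A versus page B.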