Russian editor information (2022-23)

Following a request from Russian Wikimedia community members to help better understand the editor community in the country, the Wikimedia Foundation has released one year of data quantifying the number of Wikipedia editors editing from Russia. This is a break from standard practice at the Foundation — data releases typically shown on Wikistats exclude countries on the country protection list, which includes Russia — but we’ve taken extra privacy precautions (specifically, using differential privacy) with this data release to protect editor safety and to quantify the risk of releasing this data.

This page includes brief explanations of the following:

  • data and initial findings
  • differential privacy data release process
  • privacy loss and risk

Data and initial findings

Data

Month      1 to 4 edits   5 to 99 edits   100 or more edits   Total editors
2022-09    24,723         3,879           503                 29,105
2022-10    26,354         4,008           499                 30,861
2022-11    26,431         3,994           538                 30,963
2022-12    26,259         4,012           541                 30,812
2023-01    28,826         4,352           555                 33,733
2023-02    25,730         3,805           505                 30,040
2023-03    26,588         3,938           527                 31,053
2023-04    25,601         3,843           499                 29,943
2023-05    26,165         4,062           475                 30,702
2023-06    23,171         3,772           447                 27,390
2023-07    20,579         3,625           464                 24,668
2023-08    20,938         3,525           467                 24,930

Initial findings

The data release shows a large and relatively stable editor population across all activity levels, with roughly 25,000 to 34,000 editors in Russia contributing to Russian Wikipedia each month.

Editor activity decreased by roughly 10-20% over the 12-month period, with the decline concentrated in June, July, and August 2023. This holds for beginner (1 to 4 edits / month), intermediate (5 to 99 edits / month), and advanced (100 or more edits / month) editors alike.

Despite these fluctuations, Wikipedia editors in Russia consistently rank among the ten most active country-specific Wikipedia editor communities.

Differential privacy data release process

Differential privacy is a statistical definition of privacy that allows data owners to release sensitive data more safely. Think of it like adding blur to a picture: the details become hard to make out, but you can still see the broad shape of the image. Here, the "blur" is a controlled amount of random noise added to the dataset under a mathematical framework. The noise impedes attempts to recover information about any single individual in the dataset, while still allowing us to see the bigger picture.

The source data for this release was pre-aggregated in the geoeditors_monthly table (code for table building). For each month, each editor account is sorted into exactly one activity-level bucket based on the number of edits it made that month; no account appears in more than one bucket in a given month. We extracted the rows for Russia/Russian Wikipedia from geoeditors_monthly and added Laplace noise (epsilon = 0.1, sensitivity = 1, noise scale = sensitivity / epsilon = 10) to each count to privatize it (code for private data release).
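
To make the mechanism concrete, here is a minimal Python sketch of this kind of privatization step. It is illustrative only: the function name, the example counts, and the rounding and clamping of the output are assumptions for the sketch, not the Foundation's release code (linked above).

```python
import numpy as np

def privatize_count(true_count: int, epsilon: float = 0.1, sensitivity: float = 1.0) -> int:
    """Release one count with Laplace noise of scale sensitivity / epsilon (= 10 here)."""
    noisy = true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    # Rounding and clamping to a non-negative integer are post-processing,
    # which does not weaken the differential privacy guarantee.
    return max(0, int(round(noisy)))

# Hypothetical pre-aggregated bucket counts for one month:
raw_counts = {"1 to 4 edits": 25000, "5 to 99 edits": 4000, "100 or more edits": 500}
released = {bucket: privatize_count(count) for bucket, count in raw_counts.items()}
print(released)
```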

This algorithm preserves accuracy well: in this release, the difference between the true value and the noisy value in any given row is under 10% of the true value.
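
As a rough, illustrative check of why that holds (our arithmetic, not the release code): Laplace noise with scale b exceeds a threshold t in absolute value with probability exp(-t / b). The smallest counts in this release are around 450-550 (the 100-or-more-edits bucket), so a 10% error there corresponds to an absolute error of roughly 45-55.

```python
import math

b = 10  # noise scale = sensitivity / epsilon = 1 / 0.1

# Laplace tail probability: P(|noise| >= t) = exp(-t / b)
for t in (45, 50, 55):
    print(f"P(|noise| >= {t}) = {math.exp(-t / b):.2%}")
# -> roughly 1.1%, 0.7%, and 0.4%. For the larger buckets (thousands to
#    tens of thousands of editors), a 10% error is effectively impossible.
```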

Privacy loss and risk

Privacy loss

We anticipate the risk of a privacy leak from this data release to be relatively low. But differential privacy provides a guarantee about safety in the worst-case scenario, so we’ve sketched out that worst-case privacy loss below.

Because each account is sorted into exactly one activity-level bucket per month, this data release guarantees privacy at the account-month level. This means that for each account that someone uses, and for each month they use it, their worst-case privacy loss increases by 0.1 (the epsilon of the release). Because each account appears at most once in a given month, a person's loss in that month is directly proportional to the number of editor accounts they use in it.

Some examples:

  • If someone edited using one account in every month of the data release, they would have a worst-case privacy loss of 0.1 for each month.

  • If someone edited using two accounts for five months of the data release, they would have a worst-case privacy loss of 0.2 for each of those five months.

  • If someone edited using ten accounts for the first six months of the data release and one account for the second six months of the data release, they would have a worst-case privacy loss of 1.0 for the first six months and 0.1 for the second six months.
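
The pattern in these examples reduces to a simple rule: a person's worst-case loss in a month is the per-release epsilon (0.1) multiplied by the number of accounts they used that month. A trivial sketch (the helper name is ours):

```python
def worst_case_monthly_loss(accounts_used: int, epsilon: float = 0.1) -> float:
    """Worst-case privacy loss for one person in one month of this release.

    Each account contributes at most one row per month, so the loss
    composes only across the accounts a person used in that month.
    """
    return accounts_used * epsilon

print(worst_case_monthly_loss(1))   # 0.1 (first example above)
print(worst_case_monthly_loss(2))   # 0.2 (second example)
print(worst_case_monthly_loss(10))  # 1.0 (third example, first six months)
```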

Risk

Privacy loss is unitless, but it can be understood probabilistically as a worst-case Bayesian update to an observer's beliefs caused by releasing this data. Say, for example, that an observer is completely uncertain about a given account's presence or absence in the dataset (50-50 odds). With a privacy loss of 0.1, they could become at most ~2.5% more certain of that account's presence or absence. With a privacy loss of 1, their certainty could increase by at most ~23.1%. These are worst-case bounds; in the average case, far less information leaks about any given account.

Below is a table of privacy losses and corresponding worst-case Bayesian updates with a prior of 50%:

Privacy loss   Maximum increase in certainty about a person's presence in a given month
0.1            ±2.50%
0.2            ±4.98%
0.3            ±7.44%
0.4            ±9.87%
0.5            ±12.25%
0.6            ±14.57%
0.7            ±16.82%
0.8            ±19.00%
0.9            ±21.09%
1.0            ±23.11%
1.1            ±25.03%
1.2            ±26.85%
1.3            ±28.58%
1.4            ±30.22%
1.5            ±31.76%
The WMF Privacy Engineering and Legal teams have reviewed the potential risks of this release and accepted them.