This is a public version of the Geoeditors Monthly dataset. It reports the number of active editors per country per month for a set number of countries.
The data is released monthly as a flat file at our Dumps site. The file will have the following columns:
- wiki db: the code name for the wiki, for example "enwiki" for English Wikipedia. At this time the dataset is available only for Wikipedias.
- country: the name of the country with editors of this wiki
- activity level: how many edits this group of editors has made in the past month (either 5 to 99, or 100 or more)
- lower bound: at least this many editors in this group
- upper bound: at most this many editors in this group
Note: The final count of editors aggregates both registered and anonymous editors. An editor may edit both as a registered user and anonymously in the same month. If so, that editor is counted twice.
Note: Bots are identified as well as possible and filtered out, so this data does not include any bot activity we can identify.
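As a rough illustration of the column layout, the sketch below parses rows into typed records. It assumes the flat file is tab-separated, has no header row, and lists the columns in the order given above; none of that is confirmed here, so check the actual file first. The names GeoeditorsRow and parse_geoeditors are hypothetical, and the sample rows are made up.

```python
import csv
import io
from typing import NamedTuple


class GeoeditorsRow(NamedTuple):
    wiki_db: str
    country: str
    activity_level: str
    lower_bound: int
    upper_bound: int


def parse_geoeditors(handle) -> list[GeoeditorsRow]:
    """Parse rows, ASSUMING tab-separated values with no header row,
    in the column order described in the text."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if len(record) != 5:
            continue  # skip blank or malformed lines
        wiki_db, country, activity_level, lower, upper = record
        rows.append(
            GeoeditorsRow(wiki_db, country, activity_level, int(lower), int(upper))
        )
    return rows


# Made-up sample rows for illustration only, not real data.
SAMPLE = "etwiki\tRomania\t5 to 99\t1\t10\nenwiki\tFrance\t5 to 99\t11\t20\n"
rows = parse_geoeditors(io.StringIO(SAMPLE))
for row in rows:
    print(row.wiki_db, row.country, row.lower_bound, row.upper_bound)
```

To read a downloaded file instead of the sample string, pass an open file handle (opened with newline="") to parse_geoeditors.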
Because this data raises privacy concerns, this public release applies the following changes so that the data reveals less while still providing value for a public audience:
Country Protection List
WMF does not release aggregations of sensitive data in countries identified by independent organizations as potentially dangerous for journalists or internet freedom. Each year we will look at lists published by organizations like Reporters Without Borders and Freedom on the Net and combine the lowest rated countries into the protection list. For 2023, the list is as follows:
- Countries deemed "Not Free" in Freedom on the Net's 2022 report: Azerbaijan, Bahrain, Belarus, China, Cuba, Egypt, Ethiopia, Iran, Kazakhstan, Myanmar, Pakistan, Russia, Rwanda, Saudi Arabia, Sudan, Thailand, Turkey, United Arab Emirates, Uzbekistan, Venezuela, Vietnam
- Countries with the lowest scores according to Reporters Without Borders: Afghanistan, Azerbaijan, Bahrain, Bangladesh, Belarus, China, Cuba, Djibouti, Egypt, Eritrea, Honduras, Iran, Iraq, Kuwait, Laos, Myanmar, Nicaragua, North Korea, Oman, Pakistan, Russia, Saudi Arabia, Syria, Turkmenistan, Venezuela, Vietnam, Yemen
Combining the two lists together yields a list of 35 countries:
- Afghanistan, Azerbaijan, Bahrain, Bangladesh, Belarus, China, Cuba, Djibouti, Egypt, Eritrea, Ethiopia, Honduras, Iran, Iraq, Kazakhstan, Kuwait, Laos, Myanmar, Nicaragua, North Korea, Oman, Pakistan, Russia, Rwanda, Saudi Arabia, Sudan, Syria, Thailand, Turkey, Turkmenistan, United Arab Emirates, Uzbekistan, Venezuela, Vietnam, Yemen
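The combination is a plain set union, which can be checked with a short sketch. The two sets below are copied from the lists above; the variable names are illustrative only and do not refer to any WMF code.

```python
# Countries deemed "Not Free" in Freedom on the Net's 2022 report.
FREEDOM_ON_THE_NET_NOT_FREE = {
    "Azerbaijan", "Bahrain", "Belarus", "China", "Cuba", "Egypt", "Ethiopia",
    "Iran", "Kazakhstan", "Myanmar", "Pakistan", "Russia", "Rwanda",
    "Saudi Arabia", "Sudan", "Thailand", "Turkey", "United Arab Emirates",
    "Uzbekistan", "Venezuela", "Vietnam",
}

# Countries with the lowest scores according to Reporters Without Borders.
RSF_LOWEST_SCORES = {
    "Afghanistan", "Azerbaijan", "Bahrain", "Bangladesh", "Belarus", "China",
    "Cuba", "Djibouti", "Egypt", "Eritrea", "Honduras", "Iran", "Iraq",
    "Kuwait", "Laos", "Myanmar", "Nicaragua", "North Korea", "Oman",
    "Pakistan", "Russia", "Saudi Arabia", "Syria", "Turkmenistan",
    "Venezuela", "Vietnam", "Yemen",
}

# Union of the two sources, sorted for stable output.
protection_list = sorted(FREEDOM_ON_THE_NET_NOT_FREE | RSF_LOWEST_SCORES)
# Countries that appear on both source lists.
on_both_lists = sorted(FREEDOM_ON_THE_NET_NOT_FREE & RSF_LOWEST_SCORES)

print(len(protection_list))  # 35
print(len(on_both_lists))    # 13
```

The 21 and 27 source entries share 13 countries, so the union has 21 + 27 - 13 = 35 entries.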
You can find the country protection list in Hive as the htriedman.non_country_protection_list, as well as in the is_protected column of
No exact counts
To add a small amount of imprecision to the data, we report ranges instead of exact counts. For example, instead of saying there are 5 editors editing Estonian Wikipedia from Romania, we say there are between 1 and 10. This does not dramatically improve the privacy of the dataset, but it adds a small amount of uncertainty for someone trying to guess the country of an editor. The amount of uncertainty depends not on the bucket size but on the number of countries that have editors for a given project.
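A minimal sketch of this kind of bucketing follows. The only bucket the text confirms is 1 to 10; the fixed width of 10 and the continuation into 11 to 20, 21 to 30, and so on are assumptions made for illustration, and bucket_bounds is a hypothetical helper rather than the function used to build the dataset.

```python
def bucket_bounds(exact_count: int, width: int = 10) -> tuple[int, int]:
    """Map an exact editor count to (lower bound, upper bound).

    ASSUMPTION: fixed-width buckets starting at 1 (1-10, 11-20, ...).
    Only the 1-10 bucket is confirmed by the text.
    """
    if exact_count < 1:
        raise ValueError("exact_count must be at least 1")
    lower = ((exact_count - 1) // width) * width + 1
    upper = lower + width - 1
    return lower, upper


print(bucket_bounds(5))   # (1, 10)
print(bucket_bounds(10))  # (1, 10)
print(bucket_bounds(11))  # (11, 20)
```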
Only active wikis
We only release data for wikis with at least 3 active editors in any given month, that is, three distinct editors each making 5 or more edits in that month. Past research indicates that less activity than that cannot support the healthy collaboration and exchange of ideas essential to wikis.
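The threshold can be sketched as a filter over per-editor edit counts for a single month. The input shape (wiki, editor id, edits) and the name active_wikis are hypothetical and chosen for illustration; the real pipeline is not described here, and the sample records are made up.

```python
from collections import defaultdict

MIN_EDITS = 5           # edits per month for an editor to count as active
MIN_ACTIVE_EDITORS = 3  # distinct active editors required for release


def active_wikis(edit_counts):
    """Return the wikis that meet the release threshold for one month.

    edit_counts: iterable of (wiki_db, editor_id, edits_in_month) tuples.
    """
    active_editors = defaultdict(set)
    for wiki_db, editor_id, edits in edit_counts:
        if edits >= MIN_EDITS:
            active_editors[wiki_db].add(editor_id)
    return {
        wiki_db
        for wiki_db, editors in active_editors.items()
        if len(editors) >= MIN_ACTIVE_EDITORS
    }


# Made-up records: "aawiki" has three active editors, "bbwiki" only two.
sample = [
    ("aawiki", "e1", 5), ("aawiki", "e2", 12), ("aawiki", "e3", 140),
    ("bbwiki", "e4", 7), ("bbwiki", "e5", 9), ("bbwiki", "e6", 4),
]
print(active_wikis(sample))  # {'aawiki'}
```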
Initial Risk: Medium
Mitigations: Aggregation, Protection List
Residual Risk: Low
The Wikimedia Foundation has developed a process for reviewing datasets prior to release in order to determine a privacy risk level, appropriate mitigations, and a residual risk level. WMF takes privacy very seriously, and seeks to be as transparent as possible while still respecting the privacy of our readers and editors.
Our Privacy Risk Review process first documents the anticipated benefits of releasing a dataset. Because we feel transparency is so crucial to free information, generally WMF takes a release-by-default approach - that is, release unless there is a compelling reason not to. Often, however, there are additional reasons for releasing a particular dataset, such as supporting research. We want to capture those reasons and account for them.
Second, WMF identifies populations that might be impacted by the release of a dataset. We also specifically identify potential impacts to particularly vulnerable populations, such as political dissidents, ethnic minorities, and religious minorities.
Next, we catalog potential threat actors, such as organized crime, data aggregators, or other malicious actors that might seek to violate a user’s privacy. We work to identify the potential motivations of these actors and the populations they may target.
Finally, we analyze the Opportunity, Ease, and Probability of action by a threat actor against a potential target, along with the Magnitude of privacy harm to arrive at an initial risk score. Once we have identified our initial risks, we develop a mitigation strategy to minimize the risks we can, resulting in a residual (or post-mitigation) risk level.
WMF does not publish this information because we do not want to motivate threat actors or give them additional ideas for potential abuse of data. Unlike a published security vulnerability in code, which can be patched, a publicly released dataset cannot be “patched” - it has already been made public.
Any dataset that contains this notice has been reviewed using this process.