Analytics/Data Lake/Edits/Geoeditors/Public


This is the public version of the Geoeditors Monthly dataset. It reports the number of active editors per country, per wiki, per month, for a defined set of countries (see the Country Protection List below).

Data Format

The data is released monthly as a flat file on our Dumps site. The file has the following columns (a short loading sketch follows the notes below):

  • wiki db: the database code of the wiki, for example "enwiki" for English Wikipedia; at this time the dataset covers only Wikipedias
  • country: the name of the country with editors of this wiki
  • activity level: how many edits each editor in this group made in the past month (either 5 to 99, or 100 or more)
  • lower bound: at least this many editors are in this group
  • upper bound: at most this many editors are in this group

Note: The final count of editors is an aggregate of both registered and anonymous editors. If an editor edits both while registered and anonymously in the same month, that editor will be counted twice.

Note: Bots are identified as well as possible and filtered out, so this data does not include any bot activity we can identify.
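
For illustration, here is a minimal Python sketch for reading one monthly file. The file name, the tab delimiter, and the absence of a header row are assumptions, not guarantees; check the actual listing on the Dumps site before relying on it.

  import csv

  # Minimal sketch for reading one monthly Geoeditors Public dump file.
  # The file name below is hypothetical and the delimiter is assumed to be a tab.
  DUMP_FILE = "geoeditors-monthly-2024-01.tsv"

  with open(DUMP_FILE, newline="", encoding="utf-8") as f:
      reader = csv.reader(f, delimiter="\t")
      for row in reader:
          wiki_db, country, activity_level, lower_bound, upper_bound = row
          # Each row is one wiki/country/activity-level bucket with a bounded
          # count of distinct editors for the month.
          print(f"{wiki_db} {country} [{activity_level}]: "
                f"between {lower_bound} and {upper_bound} editors")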

Privacy

Because this data raises many privacy concerns, this public release applies the following changes so that the data reveals less while still providing value to a public audience:

Country Protection List

Countries and territories on the Wikimedia Foundation's Country and Territory Protection List (foundation:Legal:Country and Territory Protection List) are excluded from this dataset.


No exact counts

To add a small amount of imprecision to the data, instead of saying, for example, that there are 5 editors editing Estonian Wikipedia from Romania, we say there are between 1 and 10. This does not dramatically improve the privacy of the dataset, but it adds a small amount of uncertainty if someone is trying to guess the country of an editor. The amount of uncertainty does not depend on the bucket size but rather on the number of countries for which a given project has editors.
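
As a rough illustration of this bucketing, here is a small Python sketch. Only the 1-to-10 bucket comes from the example above; the other boundaries are hypothetical and included purely to make the sketch runnable.

  # Map an exact editor count to a (lower, upper) bucket.
  # Only the 1-to-10 bucket is documented in this section; the other
  # boundaries below are assumptions for illustration.
  BUCKETS = [(1, 10), (11, 100), (101, 1000)]

  def bucket_count(exact_count):
      """Return the (lower_bound, upper_bound) bucket containing exact_count."""
      for lower, upper in BUCKETS:
          if lower <= exact_count <= upper:
              return lower, upper
      raise ValueError(f"no bucket defined for count {exact_count}")

  print(bucket_count(5))  # (1, 10), matching the Estonian Wikipedia example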

Only active wikis

We are only releasing data for wikis with at least 3 active editors in any given month, that is, three distinct editors each making 5 or more edits in the month. Past research indicates that less activity than that cannot support the healthy collaboration and exchange of ideas essential to wikis.
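
The sketch below shows this rule as a simple Python filter. The input shape (per-editor edit counts keyed by wiki) is hypothetical and only serves to illustrate the threshold.

  # Keep a wiki for a given month only if at least 3 distinct editors
  # each made 5 or more edits that month.
  MIN_EDITS_PER_EDITOR = 5
  MIN_ACTIVE_EDITORS = 3

  def active_wikis(edit_counts):
      """edit_counts maps wiki_db -> {editor_id: edits_this_month} (hypothetical shape)."""
      kept = set()
      for wiki_db, editors in edit_counts.items():
          active = sum(1 for edits in editors.values() if edits >= MIN_EDITS_PER_EDITOR)
          if active >= MIN_ACTIVE_EDITORS:
              kept.add(wiki_db)
      return kept

  sample = {
      "etwiki": {"a": 7, "b": 12, "c": 5},  # three editors with 5+ edits: kept
      "xxwiki": {"a": 9, "b": 2},           # only one active editor: dropped
  }
  print(active_wikis(sample))  # {'etwiki'}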

Risk Assessment

Initial Risk: Medium

Mitigations: Aggregation, Protection List

Residual Risk: Low

The Wikimedia Foundation has developed a process for reviewing datasets prior to release in order to determine a privacy risk level, appropriate mitigations, and a residual risk level. WMF takes privacy very seriously, and seeks to be as transparent as possible while still respecting the privacy of our readers and editors.

Our Privacy Risk Review process first documents the anticipated benefits of releasing a dataset. Because we feel transparency is crucial to free information, WMF generally takes a release-by-default approach: release unless there is a compelling reason not to. Often there are additional reasons for releasing a particular dataset, such as supporting research, and we want to capture those reasons and account for them.

Second, WMF identifies populations that might possibly be impacted by the release of a dataset. We also specifically identify potential impacts to particularly vulnerable populations, such as political dissidents, ethnic minorities, religious minorities, etc.

Next, we catalog potential threat actors, such as organized crime, data aggregators, or other malicious actors that might potentially seek to violate a user’s privacy. We work to identify the potential motivations of these actors and populations they may target.

Finally, we analyze the Opportunity, Ease, and Probability of action by a threat actor against a potential target, along with the Magnitude of privacy harm to arrive at an initial risk score. Once we have identified our initial risks, we develop a mitigation strategy to minimize the risks we can, resulting in a residual (or post-mitigation) risk level.

WMF does not publicly publish this information because we do not want to motivate threat actors, or give them additional ideas for potential abuse of data. Unlike publishing a security vulnerability for code that could be patched, a publicly released dataset cannot be “patched” - it has already been made public.

Any dataset that contains this notice has been reviewed using this process.