Data Platform/Systems/EventLogging/Publishing

From Wikitech
This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.
See WMF's official data publication guidelines at https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines

WMF's EventLogging database is private because it may hold sensitive information for a certain period of time. Access requires being an employee of the Wikimedia Foundation or having signed a non-disclosure agreement (NDA). Hence, any report or data set based on EventLogging data is potentially harmful and must be reviewed before it is published.

Publishing reports

A report is any collection of prose, graphs, and statistics that a human has drafted to communicate a finding. It typically does NOT contain actual database records or parts of them.

Before publishing any report, ensure that it does not contain any potentially sensitive information, as defined below. If you are unsure whether your report contains private data, please consult the Research or Analytics teams.

Publishing data sets

A data set is a collection of whole or partial records extracted from the database to enable future analyses.

The preferred option is NOT to release any such data sets publicly. To request an exception, please contact both the Legal team and the Community Advocacy team so that they can review your data set and confirm that it contains no sensitive data. For other questions, please ask the Analytics team or the Research team.

Potentially sensitive data

  • PII (personally identifiable information), such as clientIp, userAgent, userName, userId, editCount, and, in general, any piece of information that can uniquely identify a physical or virtual person.
  • User-entered text fields, such as pageTitle, imageTitle, summary, userName, userText, etc. Schemas containing this kind of data are marked as such on the schema's talk page.
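
As a rough illustration of how the field names above could be used for a first-pass check before a review, here is a minimal Python sketch. It assumes each record is available as a (possibly nested) dictionary keyed by field name; the function names `find_sensitive_fields` and `drop_sensitive_fields` are hypothetical and not part of any EventLogging tooling. The field list only covers the examples named on this page, so a clean result does not mean a data set is safe to publish, and it does not replace the reviews described above.

```python
# Field names taken from the examples on this page; the list is NOT exhaustive.
PII_FIELDS = {"clientIp", "userAgent", "userName", "userId", "editCount"}
USER_TEXT_FIELDS = {"pageTitle", "imageTitle", "summary", "userName", "userText"}
SENSITIVE_FIELDS = PII_FIELDS | USER_TEXT_FIELDS


def find_sensitive_fields(record):
    """Return the sorted names of listed fields present in record (hypothetical helper).

    Nested dictionaries are searched recursively.
    """
    found = set()
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            found.add(key)
        if isinstance(value, dict):
            found.update(find_sensitive_fields(value))
    return sorted(found)


def drop_sensitive_fields(record):
    """Return a copy of record without the listed fields (hypothetical helper).

    Nested dictionaries are cleaned recursively; the input is not modified.
    """
    cleaned = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            continue
        if isinstance(value, dict):
            cleaned[key] = drop_sensitive_fields(value)
        else:
            cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    # Made-up record for illustration; 192.0.2.1 is a documentation-only address.
    example = {
        "clientIp": "192.0.2.1",
        "wiki": "enwiki",
        "event": {"action": "click", "pageTitle": "Example"},
    }
    print(find_sensitive_fields(example))
    print(drop_sensitive_fields(example))
```

Matching on field names alone cannot detect identifying information stored under other names or combinations of fields that identify someone together, which is one reason the human review remains necessary.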