Data Engineering/Systems/Event Data retention

To comply with WMF's Privacy Policy and Data Retention Guidelines, event data goes through an automatic sanitization process. In a nutshell, this process deletes all sensitive information contained in the Hive event database older than 90 days. Sanitization is necessary because event data can contain several forms of sensitive data, like PII or sensitive browsing information. The "Definitions" section of the Wikimedia Privacy Policy is the authority on these concepts, but the next sections describe them in the context of events.

Definitions

Identifying information
Identifying information is any field that uniquely identifies a physical person, or can potentially be used to identify a person (or narrow down the pool of possible persons enough) given a certain situation and context. The following subsections explain some types of identifying fields:
Personally identifying information (PII)
PII is any field that could be used to uniquely identify an individual user. Examples include name, email, phone number, credit card number, and government ID number. Note that EventLogging does not store any of these fields under any circumstance.
Potential PII
Other fields that are not strictly PII can still identify a user in certain circumstances. For example, the field edit_count looks harmless at first sight, but a very prolific editor's number of edits (e.g. 37296) may be unique in its context. Most likely that editor is the only one with this number of edits, so the field edit_count can be identifying and should be treated as such. Other potential PII fields include user agent, country/region/city, and IP address.
Reverse identifiers
"Reverse identifiers" are random IDs generated and stored by the user's device, such as long-lived cookies or app_install_ids. Given a data set with reverse identifiers, there is no way to trace the data back to the user who generated it, because no context (internal or external) associates the random identifier with any PII like address, name, or government ID. However, because the identifier is stored on the user's device, anyone with access to that device can retrieve the identifier and trace it back to that person's data. Thus, reverse identifiers are also considered privacy sensitive and should be avoided.
Persistent tokens
Some schemas have persistent identifiers like usernames, user IDs and other persistent tokens. While these identifiers are not PII as described in the Privacy Policy, they can help in identifying a physical person by observing their whole history of events. Thus, persistent tokens are considered sensitive because they add to the privacy risk of the dataset. Note that non-persistent tokens that are reset periodically, like session tokens or one-time tokens, are considered non-sensitive.
Cross-schema tokens
Even when a token is non-persistent, it can still create privacy-sensitive structures if it is cross-schema. A cross-schema token corresponds to the same user in more than one table, thus linking events of different tables together. Two tables that are non-sensitive by themselves can become a sensitive data set when linked together by a cross-schema token. Thus, cross-schema tokens potentially add to the privacy risk of a data set.
Browsing information
Any field that contains personal information about topics such as racial or ethnic origin, sexual orientation, marital or familial status, religion, or political affiliation is highly sensitive. Usually, in the context of events, this corresponds to browsing information: the pages visited by a user, the pages watched, the recommendations clicked, and so on. All of these can reveal the personal status or preferences of the user. In practice, this means that page IDs and session tokens cannot be kept together for more than 90 days.

Rationale for purging

The privacy threat exists in data sets that contain both identifying information and browsing information. In such data sets, the browsing information can be linked to a specific physical person, and the personal status or preferences of that user can be exposed. Data sets that contain browsing information but not identifying information might be non-sensitive, for example the Pageview API, which has pageview counts per wiki article. Similarly, data sets that contain identifying information but not browsing information can also be non-sensitive, like the Browser Statistics dashboard, which has usage stats broken down by OS and browser versions (data derived from the identifying user agent field). When both elements are combined in the same schema, however, the data set becomes highly sensitive.
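The rule described above can be sketched as a simple check. This is only an illustration: the field names in IDENTIFYING_FIELDS and BROWSING_FIELDS are hypothetical examples, and deciding which category a real field belongs to still requires a review of each schema.

```python
# Hypothetical field categories, for illustration only.
# A real review classifies fields schema by schema.
IDENTIFYING_FIELDS = {"user_agent", "ip", "user_id", "session_token"}
BROWSING_FIELDS = {"page_id", "page_title"}


def is_highly_sensitive(schema_fields):
    """Return True when a schema mixes identifying and browsing fields."""
    fields = set(schema_fields)
    has_identifying = bool(fields & IDENTIFYING_FIELDS)
    has_browsing = bool(fields & BROWSING_FIELDS)
    return has_identifying and has_browsing
```

A schema with only page_id and a view count, or only user_agent and an OS family, would not be flagged by this check, while one that combines page_id with session_token would.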

Data semantics revealing facts

Some tables that contain identifying fields (but not browsing information) can still be sensitive if the semantics of the data reveal aspects of the users. For example, imagine a table called 'pageviews_by_women' that stores information about all pageviews performed by women. Even if there is no browsing information in the table, the table name alone reveals the gender of its users and renders the data set sensitive.

Exceptions

Tables that have only edit-related information (logged-in users) are considered non-sensitive, since MediaWiki already makes this information publicly available, as specified in the Privacy Policy and Data Retention Guidelines. Note that these schemas are non-sensitive even if they contain identifiers like username or userId, as well as browsing information like the URLs of the pages being edited.

What do the data retention guidelines recommend?

Given these definitions and situations, the data retention guidelines recommend the following (the "How long do we retain non-public data" section of the Data Retention Guidelines is the authority on these concepts: https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data.3F):

  • Non-sensitive information: Keep it indefinitely.
  • Sensitive information: After at most 90 days, delete, aggregate, or anonymize.

Purging strategies

The purging strategy for each table in the Hive event database, and the fields that are kept in each case, are defined in allowlist YAML files. For more details on sanitization and allowlisting see this page.
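As a rough illustration of how an allowlist drives sanitization, the sketch below keeps only allowlisted fields for events older than 90 days and fully purges events from tables with no allowlist entry. It is not the actual sanitization job: the ALLOWLIST structure, the table and field names, the "dt" timestamp field, and the sanitize function are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)

# Hypothetical allowlist: table name -> fields kept after the retention period.
# The real allowlists are YAML files; this dict only mimics the idea.
ALLOWLIST = {
    "example_table": {"dt", "wiki", "edit_count_bucket"},
}


def sanitize(table, event, now):
    """Return the event as it would look after sanitization.

    Events within the retention period are returned unchanged.
    Older events keep only allowlisted fields. Events from tables
    without an allowlist entry are fully purged (None is returned).
    """
    # Assumes each event carries an ISO 8601 timestamp in a "dt" field.
    event_time = datetime.fromisoformat(event["dt"])
    if now - event_time <= RETENTION:
        return dict(event)
    keep = ALLOWLIST.get(table)
    if not keep:
        return None
    return {name: value for name, value in event.items() if name in keep}
```

For example, an event older than 90 days in example_table would lose a user_agent field but keep wiki and edit_count_bucket, while the same event in a table that is not allowlisted would be dropped entirely.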

Best privacy practices when creating or modifying event schemas

These practices assume you want to keep your data indefinitely. If you don't need the data produced by your event stream for historical querying, consider sticking to the full-purging strategy (default).

  • Use short/simple/single-purpose schemas as opposed to giant/all-aware/complex schemas. Simple schemas are more likely to be non-sensitive.
  • Avoid using persistent tokens, reverse identifiers or any kind of fingerprint.
  • Avoid using personally identifying fields such as username or userId. Unless the rest of the schema contains no browsing data or personal context, they will very likely need to be purged.
  • Avoid fields that contain text entered by users. Such fields can end up containing private information that users entered by mistake, for example by copy-pasting their credit card number.
  • Avoid using cross-schema tokens that can associate events from one event stream with events from another stream. This can turn two non-sensitive event streams / tables into one combined sensitive data set.
  • Bucketize potential identifiers. For example, instead of emitting the edit_count (integer) of a user, emit a bucketized version ("0 edits"|"1-4 edits"|"5-99 edits"|"100-999 edits"|"1000+ edits"). This way the field becomes non-identifying and can be combined with other data safely.
  • When logging the skin of the current user, bear in mind that, combined with fields such as wiki or webhost, it may make it possible to identify people. For instance, only a small number of users may be using the mobile "Minerva" skin on desktop. Consider bucketing skins into popular skins and an "other" value.
  • When logging the page title, bear in mind that, combined with other fields, it can effectively create a list of a user's reading history. When logging the page title, consider not logging additional data such as skin, webhost or user.
  • Avoid changing field names.
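
The bucketizing advice in the list above can be sketched as follows. The bucket labels are the ones given in the edit_count example; the function name bucketize_edit_count is hypothetical, and the client emitting the event would call something like it before sending the field.

```python
def bucketize_edit_count(edit_count):
    """Map an exact edit count to a coarse, non-identifying bucket label."""
    if edit_count < 0:
        raise ValueError("edit_count must be non-negative")
    if edit_count == 0:
        return "0 edits"
    if edit_count <= 4:
        return "1-4 edits"
    if edit_count <= 99:
        return "5-99 edits"
    if edit_count <= 999:
        return "100-999 edits"
    return "1000+ edits"
```

With this mapping, the prolific editor with 37296 edits is reported as "1000+ edits", the same value as every other editor with at least 1000 edits.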