Analytics/Systems/EventLogging/Data retention and auto-purging

To comply with WMF's Privacy Policy and Data Retention Guidelines, EventLogging data goes through an automatic purging process. In a nutshell, this process deletes all sensitive information contained in EventLogging events older than 90 days. Purging is necessary because EventLogging data can contain several forms of sensitive data, like PII or sensitive personal information. The "Definitions" section of the Wikimedia Privacy Policy is the authority on these concepts, but the next sections describe them in the context of EventLogging.

Definitions

Identifying information

Identifying information is any field that uniquely identifies a physical person, or can potentially be used to identify a person (or narrow down the pool of possible persons enough) given a certain situation and context. The following subsections explain some types of identifying fields:

Personally identifying information (PII)

PII is any field that could be used to uniquely identify an individual user. Examples include name, email, phone number, credit card number, and government ID number. Note that EventLogging does not store any of these fields under any circumstances.

Potential PII

Other fields that are not strictly PII can still identify a user in certain circumstances. For example, the field 'editCount' looks harmless at first sight, but a very prolific editor's edit count, say 37296 edits, may well be unique in its context: most likely they are the only user with that exact number of edits, so 'editCount' can be identifying and should be treated as such. Other potential PII fields include 'userAgent', 'country/region/city', 'ipAddress', etc.

Persistent tokens

Some schemas have persistent identifiers like usernames, user IDs, appInstallIDs and other persistent tokens. While these identifiers are not strictly PII as described in the Privacy Policy, they can still identify a physical person by observing the whole history of their events or by cross-referencing the data with external data sources. Thus, persistent tokens are also treated as sensitive and identifying. Note that non-persistent tokens that are reset periodically, like session tokens or one-time tokens, are considered non-sensitive (provided they are not cross-schema tokens).

Cross-schema tokens

Even when a token is non-persistent, it can still be dangerous if it is cross-schema. A cross-schema token corresponds to the same user in more than one table, thus linking events of different tables together. Two tables that are non-sensitive by themselves can become a sensitive dataset when linked together with a cross-schema token. Thus, cross-schema tokens are considered sensitive as well.
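
For instance, a shared token lets two tables be joined back into a single dataset. A minimal sketch, with made-up table names and a hypothetical shared field 'event_token':

  # Hypothetical: SchemaA_11111111 and SchemaB_22222222 are two EL tables
  # that are harmless in isolation. A shared, non-persistent token still
  # links each user's rows across them, producing a sensitive joined set.
  LINKING_QUERY = """
      SELECT a.*, b.*
      FROM SchemaA_11111111 a
      JOIN SchemaB_22222222 b ON a.event_token = b.event_token
  """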

Browsing information

Any field that contains personal information about topics such as racial or ethnic origin, sexual orientation, marital or familial status, religion, political affiliation, etc. is highly sensitive. Usually, in the context of EventLogging, this corresponds to browsing information: the pages visited by a user, the pages watched, the recommendations clicked, and so on. All of these can potentially reveal the personal status or preferences of the users.

Rationale for purging

The privacy threat exists in data sets that contain both identifying information and browsing information: the browsing information can then be linked to a specific physical person, exposing that person's personal status or preferences. Data sets that contain browsing information but not identifying information might be non-sensitive, for example the Pageview API, which has pageview counts per wiki article. Similarly, data sets that contain identifying information but not browsing information can also be non-sensitive, like the Browser Statistics dashboard, which has usage stats broken down by OS and browser versions (data coming from the identifying user agent field). But when both elements are combined in the same schema, the data set becomes highly sensitive.

Schema semantics revealing facts

Some schemas that contain identifying fields (but not browsing information) can still be sensitive if the semantics of the schema reveals aspects of the users. For example, imagine a schema called 'PageviewsByWomen' that stores information about all pageviews performed by women. Even if there's no browsing information in the schema, the sole schema name reveals the gender of its users, and renders the data set sensitive.

Exceptions

Schemas that have only edit-related information (logged-in users) are always considered non-sensitive, since MediaWiki already makes this information publicly available, as specified in the Privacy Policy and Data Retention Guidelines. Note that these schemas are non-sensitive even if they contain identifiers like username or userId, and also browsing information like URLs of the pages being edited.

What do the data retention guidelines recommend?

So, given all those definitions and situations, the data retention guidelines recommend the following (please treat the "How long do we retain non-public data" section of the Data Retention Guidelines as the authority on these concepts: https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data.3F):

  • Non-sensitive information: Keep it indefinitely.
  • Sensitive information: After at most 90 days, delete, aggregate, or anonymize.

Purging Strategies

There are 3 purging strategies in EventLogging, ranging from strictest to most permissive.

Full purge

It permanently deletes entire event records from the database when they reach the age of 90 days. This is suited for schemas that are sensitive (see the types of sensitive data sets above) or for schemas whose information doesn't need to be kept for a longer period of time. Note that this is the default strategy for new schemas and for new fields in existing schemas.

Partial purge

It permanently assigns a NULL value to the subset of the event's fields that are sensitive when the event reaches the age of 90 days. The rest of the fields (non-sensitive) are kept indefinitely. This is suited for schemas that can be easily sanitized and whose information is of great value and needs to be kept for a longer period of time.

Minimal purge

It permanently assigns a NULL value to the EventCapsule's userAgent field when the event reaches the age of 90 days. The EventCapsule is a wrapper schema common to all EventLogging schemas. All the other fields in the schema are kept indefinitely. This is suited for totally non-sensitive schemas.
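
As a rough illustration, the three strategies boil down to statements like the following sketch (the table name 'Example_12345678', the sensitive field 'event_sessionToken' and the timestamp handling are made up for illustration; the actual script is described under Implementation below):

  # Sketch only: hypothetical table and field names.
  # Full purge: delete entire rows once they are 90 days old.
  FULL = ("DELETE FROM Example_12345678 "
          "WHERE timestamp < NOW() - INTERVAL 90 DAY")
  # Partial purge: NULL out only the sensitive fields, keep the rest.
  PARTIAL = ("UPDATE Example_12345678 SET event_sessionToken = NULL "
             "WHERE timestamp < NOW() - INTERVAL 90 DAY")
  # Minimal purge: NULL out only the capsule's userAgent field.
  MINIMAL = ("UPDATE Example_12345678 SET userAgent = NULL "
             "WHERE timestamp < NOW() - INTERVAL 90 DAY")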

Implementation

EventLogging data is kept in 2 separate storage systems: MariaDB and Hadoop. Both contain a copy of the same data, with some exceptions explained below.

Hadoop store

The Analytics Hadoop cluster stores all EventLogging schemas, including those with a very high volume. However, it stores only the last 90 days of events for all of them, regardless of the agreed purging strategy. On a daily basis, the partitions that are older than 90 days are deleted by a script. If you want to access EL historical data (data that has been kept for longer than 90 days), you'll find it in the MariaDB hosts. Note that there are no Hive external tables created on top of these data sets; if you want to query the data in Hive, you'll have to create them yourself (see Work in progress).
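
The daily partition cleanup amounts to something like the following sketch (the base path and partition layout are assumptions for illustration, not the script's actual values):

  import subprocess
  from datetime import datetime, timedelta

  RETENTION_DAYS = 90
  CUTOFF = datetime.utcnow() - timedelta(days=RETENTION_DAYS)

  def drop_if_expired(partition_path, partition_date):
      """Delete a daily HDFS partition directory once it is older
      than the 90-day retention window."""
      if partition_date < CUTOFF:
          subprocess.check_call(
              ['hdfs', 'dfs', '-rm', '-r', '-skipTrash', partition_path])

  # Hypothetical usage, assuming a .../year=YYYY/month=MM/day=DD layout:
  drop_if_expired('/wmf/data/raw/eventlogging/Example_12345678'
                  '/year=2017/month=01/day=15', datetime(2017, 1, 15))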

Work in progress

In the Analytics team, we are developing a way to automatically create Hive external tables on top of EL data files, so that Hadoop will also be an easy way to access EL data. When we do this, we'll also implement partial and minimal purging for Hadoop, so that historical data can be accessed from there.

MariaDB

The MariaDB auto-purging system is implemented by a script and a white-list, and it runs on 4 hosts: 2 master hosts and 2 replicas.

Purging on master hosts

To guarantee the performance of the masters, they only hold events from the last 90 days. The purging script ensures that all events older than 90 days are deleted on a daily basis, regardless of what the white-list says. So the only hosts keeping the whole history of non-sensitive EL events are the MariaDB replicas. If you want to query the data, you should use one of the replicas.

Purging on replica hosts

This is where the actual white-list purging happens. Note: the replica hosts are, at the moment, the only stores that keep non-sensitive EL data for more than 90 days.

The purging script

It's a Python script that lives in Puppet and is run daily by a cron job. The purging script reads the white-list, checks one by one the tables and fields that need to be deleted or set to NULL after 90 days, and does so. It deletes or updates MariaDB records in batches, so that a table is not blocked for a long time, which would interfere with user queries.
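
The batching idea can be sketched like this (assuming a Python DB-API connection; the batch size, function name and timestamp handling are illustrative, not the script's actual values):

  def purge_full(conn, table, cutoff_ts, batch_size=1000):
      """Delete expired rows in small batches, committing between
      batches so locks are released and user queries can proceed."""
      cur = conn.cursor()
      while True:
          # The table name comes from the trusted white-list, never
          # from user input, so interpolating it here is acceptable.
          cur.execute(
              'DELETE FROM {} WHERE timestamp < %s LIMIT %s'.format(table),
              (cutoff_ts, batch_size))
          conn.commit()
          if cur.rowcount < batch_size:
              break  # last batch was smaller: nothing left to delete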

The white-list

The EL purging white-list is a TSV file with 2 columns: schema and field. All schema-field pairs listed there will be kept indefinitely in the MariaDB replicas. Schema-field pairs that are not listed will be deleted after 90 days. Note that the white-list is schema-centric (not table-centric), meaning it applies to all revisions of a given schema. This way, when a schema is altered, the white-list continues to work.
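
For illustration, the white-list can be read into a mapping from schema to its kept fields roughly like this (the file name and the schema/field names in the comment are made up):

  import csv
  from collections import defaultdict

  # Hypothetical white-list contents (schema<TAB>field):
  #   Edit         event_action
  #   Edit         event_editCountBucket
  #   Navigation   event_pageNamespace
  kept_fields = defaultdict(set)
  with open('eventlogging_purging_whitelist.tsv') as f:
      for schema, field in csv.reader(f, delimiter='\t'):
          kept_fields[schema].add(field)
  # Any (schema, field) pair not in kept_fields is purged after 90 days.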

Black-listed schemas

Warning: this black-list should not be confused with the auto-purging white-list discussed in the rest of this page. Some EL schemas receive too many events for MariaDB to handle. Those are black-listed through Puppet (using Hiera) and are not stored in MariaDB. They are still stored in the Hadoop cluster, though. Note that schemas that are black-listed like this do not currently have the option of partial or minimal purging: they will be fully purged after 90 days. If you plan on adding a schema that will produce more than 10 events per second (on average), please let the Analytics team know; it might be necessary to black-list it.

F.A.Q.

Which schemas and fields are being purged?

The single source of truth regarding the purging strategy of each schema, and the fields that will be kept in each case, is the EventLogging purging white-list. For more details, see the white-list section above.

Is the information about purging that lives in the schema talk pages correct?

We can *not* ensure that the purging strategy mentioned in the schema talk pages is the one actually implemented in the white-list. Listing the purging strategy in the talk pages was a decision that turned out to be impractical, and in the end we decided that the white-list would be the single place for that information.

Will the schema talk pages ever have correct purging info?

There's a task in Analytics' backlog to write a script that automatically updates the talk pages with the changes to the white-list.

What is the default purging strategy for new schemas?

The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside EventLogging databases. This means that if you create a new schema and don't take action to white-list its fields, the events produced to that schema are going to be purged after 90 days.

Should I white-list all fields of my schema every time I modify it?

No. The white-list is schema-centric, meaning it is not tied to specific revisions. All the fields that are in the white-list for previous revisions of your schema will also apply to the new revision.

If I add a new field to an existing schema, what will happen?

The new field won't be in the white-list, because it's new. So, by default, it will be purged after 90 days. Note that all other white-listed fields will still be kept. If you want to keep the new field, follow the steps to white-list it described below.

If I remove fields from my schema, should I remove them from the white-list?

Normally, no. The older fields will continue to white-list older revisions of your schema. If you do not need the data contained in older revisions of your schema, feel free to remove their fields from the white-list.

Best privacy practices when creating or modifying schemas

  • If you don't need the data produced by your schema for historical querying, consider sticking to the full-purging strategy (default).
  • Use short/simple/single-purpose schemas as opposed to giant/all-aware/complex schemas. Simple schemas are more likely to be non-sensitive.
  • Do not use persistent tokens or any kind of fingerprint.
  • Try to avoid personally identifying fields, like: username, userId, editCount, appInstallID, etc. Unless the rest of the schema is totally non-sensitive, it's very likely that they will need to be purged.
  • Avoid fields that contain free text entered by users. Such fields can end up containing private information that users entered by mistake; for example, by copy-pasting their credit card number.
  • Do not use cross-schema tokens that can associate events from one schema with events from another schema. This can turn 2 non-sensitive schemas into 1 combined sensitive data set.
  • Bucketize potential identifiers. For example, instead of emitting a user's editCount (integer), emit a bucketized version ("0 edits"|"1-4 edits"|"5-99 edits"|"100-999 edits"|"1000+ edits"). This way the field becomes non-identifying and can be combined with other data safely (see the sketch after this list).
  • When logging the current user's skin, bear in mind that, combined with fields such as "wiki" or "webhost", it can make it possible to identify people. For instance, there may be a low number of users using the mobile "Minerva" skin on desktop. Consider bucketing skins into popular skins and an "other" bucket.
  • When logging page titles, bear in mind that, combined with other fields, you are potentially creating a list of a user's reading history. When logging page titles, consider not logging additional data such as skin, webhost or user.
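
A minimal sketch of the editCount bucketization suggested above, using the bucket boundaries from the example (the function name is ours):

  def bucketize_edit_count(count):
      """Map a raw edit count to a coarse, non-identifying bucket."""
      if count == 0:
          return '0 edits'
      elif count <= 4:
          return '1-4 edits'
      elif count <= 99:
          return '5-99 edits'
      elif count <= 999:
          return '100-999 edits'
      return '1000+ edits'

  # Example: a prolific editor's exact count is no longer identifying.
  assert bucketize_edit_count(37296) == '1000+ edits'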

How to change the purging strategy of a schema

Assuming that you already:

  1. Created a new schema or modified an existing one.
  2. Instrumented your code to emit events to that new schema/revision and deployed it.

At this point the new events are already flowing into EventLogging. You have 90 days to alter the default purging strategy of the new schema/fields before they start getting dropped. To do so, please choose one of the following options:

  • Submit a Gerrit patch to the EL purging white-list in puppet, adding the schema and fields you want to keep indefinitely. Please take the purging rationale into account when selecting them. Then add someone from the Analytics team to review the patch. We'll review and merge it, and that's it.
  • Alternatively, create a Phabricator task named e.g. "Add <SchemaName_123> fields to EL purging white-list" and tag it with the "Analytics" project. In the task description, mention which fields you'd like to keep. The Analytics team will update the white-list, and that's it. This option might take a bit longer, because it might take a couple of days until we groom the task from our backlog.