Analytics/Systems/EventLogging/Data retention and auto-purging
- 1 Definitions
- 2 Rationale for purging
- 3 Purging Strategies
- 4 Implementation
- 5 F.A.Q.
- 5.1 Which schemas and fields are being purged?
- 5.2 Is the information about purging that lives in the schema talk pages correct?
- 5.3 Will the schema talk pages ever have correct purging info?
- 5.4 What is the default purging strategy for new schemas?
- 5.5 Should I white-list all fields of my schema every time I modify it?
- 5.6 If I add a new field to an existing schema, what will happen?
- 5.7 If I remove fields from my schema, should I remove them from the white-list?
- 6 Best privacy practices when creating or modifying schemas
- 7 How to change the purging strategy of a schema
Identifying information is any field that uniquely identifies a physical person, or can potentially be used to identify a person (or narrow down the pool of possible persons enough) given a certain situation and context. The following subsections explain some types of identifying fields:
Personally identifying information (PII)
PII is any field that could be used to uniquely identify an individual user. Examples include name, email, phone number, credit card number, and government ID number. Note that EventLogging does not store any of these fields under any circumstance.
Other fields that are not strictly PII, can still identify a user in certain circumstances. For example, the field 'editCount' looks OK at first sight, but for very prolific editors, their number of edits may be unique in their context, i.e. 37296 edits. Most likely they are the only one that have this number of edits, and thus the field 'editCount' can be identifying, and should be treated as such. Other potential PII fields are: 'userAgent', 'country/region/city', 'ipAdress' etc.
"Reverse identifiers" are random ids generated and stored by the user's device as long time cookies or appInstallIds. Given a data set with reverse identifiers, there's no way to trace back to the user that generated that data, because there's no context (internal or external) that associates that random identifier to any PII like address, name, government ID, etc. However, as the identifier is stored in the user's device, if someone has access to the person's device, they will be able to retrieve their identifier and trace back to that person's data. Thus, reverse identifiers are also considered privacy sensitive and should be avoided.
Even when the token is non-persistent, it can still create privacy sensitive structures if it is cross-schema. A cross-schema token corresponds to the same user in more than one table, thus linking events of different tables together. Two tables that are non-sensitive by themselves, can become a sensitive dataset when linked together with a cross-schema token. Thus, cross-schema tokens potentially add to the privacy risk of a data set.
Any field that contains personal information about topics such as racial or ethnic origins, sexual orientation, marital or familial status, religion, political affiliation, etc... is highly sensitive. Usually, in the context of EventLogging, this corresponds to browsing information: the pages visited by a user, or the pages watched, or the recommendations clicked, and so on. All of them can potentially contain information about the state or personal preference of their users.
Rationale for purging
The privacy threat exists on data sets that contain both identifying information and browsing information. This way, the browsing information can be linked to a specific physical person, and the personal status or preference of that user can be exposed. Data sets that contain browsing information but not identifying information might be non-sensitive, for exaple the Pageview API which has pageview counts per wiki article. Similarly, data sets that contain identifying information but not browsing information can also be non-sensitive, like the Browser Statistics dashboard, which has usage stats broken down by OS and Browser versions (data coming from identifying user agent field). Now, certainly, when both elements are combined in the same schema, the data set becomes highly sensitive.
Schema semantics revealing facts
Some schemas that contain identifying fields (but not browsing information) can still be sensitive if the semantics of the schema reveals aspects of the users. For example, imagine a schema called 'PageviewsByWomen' that stores information about all pageviews performed by women. Even if there's no browsing information in the schema, the sole schema name reveals the gender of its users, and renders the data set sensitive.
What do the data retention guidelines recommend?
So, given all those definitions and situations, the data retention guidelines recommend the following (please read the "How long do we retain non-public data" section of the Data Retention Guidelines as the true authority on those concepts: https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data.3F):
- Non-sensitive information: Keep it indefinitely.
- Sensitive information: After at most 90 days, delete, aggregate, or anonymize.
There are 3 purging strategies in EventLogging, ranging from more strict to more permissive.
It permanantly deletes the whole event records from the database when they reach the age of 90 days. This is suited for schemas that are sensitive (see all types of sensitive data sets above) or for schemas whose information doesn't need to be kept for a longer period of time. Note that this is the default strategy for new schemas and new fields in existing schemas.
It permanantly assigns a NULL value to to subset of the event's fields that are sensitive when the event reaches the age of 90 days. The rest of the fields (non-sensitive) are kept indefinitely. This is suited for schemas that can be easily sanitized and whose information is of great value and needs to be kept for a longer period of time.
It permanently assigns a NULL value to the EventCapsule's
userAgent field when the event reaches the age of 90 days. The EventCapsule is a wrapper schema that is common in all EventLogging schemas. All the other fields in the schema, are kept indefinitely. This is suited for totally non-sensitive schemas.
The Analytics' Hive cluster stores all EventLogging schemas, including those with a very high volume. It uses 2 databases: event and event_sanitized. The event database stores original (unsanitized) events, while event_database stores sanitized events. Sanitization happens right after events are generated (with a couple hours lag). So, unsanitized and sanitized events co-exist in 2 different databases during 90 days. After that, however, unsanitized events older than 90 days are automatically deleted from the event database, and the only events that persist indefinitely are those in event_sanitized.
Hive sanitization job
It's a job that lives in refinery and it's run every hour by a cron job. It reads the new unsanitized events from the event database, sanitizes them using the white-list and copies them over to the event_sanitized database. Only schemas that are present in the white-list will be copied over the event_sanitized database.
MariaDB events are stored in the log database of db1107.eqiad.wmnet (master) and db1108.eqiad.wmnet (replica). A sanitization script runs in both hosts every day and sanitizes (in place) events older than 90 days. In MariaDB both un-sanitized and sanitized events share the same database.
MariaDB sanitization script
It's a python script that lives in puppet and it's run daily by a cron job. The purging script reads the white-list and one by one, checks the tables and fields that need to be deleted or set to NULL after 90 days, and does so. It deletes or updates MariaDB's records in batches so that the table is not blocked for a long time, and interferes with user queries.
The EL sanitization white-list is a YAML file with the following format: The first level corresponds to schema names. All schemas that we want to partially or fully keep need to be there, otherwise, the whole contents of that schema's table is going to be purged. Under each schema name, at the second level of the YAML, there have to be the field names that we want to keep indefinitely. Each field name must have the label 'keep'. See example:
schemaName: fieldName1: keep fieldName2: keep
The white-list supports partially white-listing nested fields. Actually, the main event information is enclosed in the event nested object. Thus, the whitelist for EL schemas should look like this (note capsule fields are still outside the scope of event):
schemaName: event: fieldName1: keep fieldName2: keep capsuleFieldName1: keep capsuleFieldName2: keep
This feature can also be used for nested fields like userAgent or geocoded_data, but this will work for Hive only. MariaDB does not support nested structures except for event.
- For EventLogging, using the keep label for nested fields or for whole schemas is not allowed.
- The white-list is schema-centric (not table-centric), meaning it serves for all revisions of a given schema. This way, when a schema is altered, the white-list continues to work.
Warning: This black-list should not be confused with the auto-purging white-list we talk about in the rest of this page. Some EL schemas receive too many events for MariaDB to handle. Those are blacklisted through puppet (using Hiera)] and are not stored in MariaDB. They are still stored in the Hadoop cluster (Hive), though. If you plan on adding a schema that will produce more than 10 events per second (in average), please let the Analytics team know; it might be necessary to black-list it.
Which schemas and fields are being purged?
The single source of truth regarding the purging strategy of the schemas and the fields that will be kept in each case is the EventLogging sanitization white-list. For more details see its section above.
Is the information about purging that lives in the schema talk pages correct?
We can *not* ensure that the puring strategy that is mentioned in the schema talk pages is the actual one that is implemented in the white-list. Listing the purging strategy in the talk pages was a decision that came out to be non-practical, and in the end we decided that the white-list would be the place for that.
Will the schema talk pages ever have correct purging info?
There's a task in Analytics' backlog to write a script that automatically updates the talk pages with the changes to the white-list.
What is the default purging strategy for new schemas?
The default strategy for new schemas is full purge. This is a security measure to avoid loosing control of the sensitive data inside EventLogging databases. This means, if you create a new schema and don't take action to white-list its fields, the events produced to that schema are going to be purged after 90 days.
Should I white-list all fields of my schema every time I modify it?
No. The white-list is schema-centric, meaning it does not observe revisions. All the fields that are in the whitelist for previous revisions of your schema will also apply to the new revision.
If I add a new field to an existing schema, what will happen?
The new field won't be in the white-list, because it's new. So by default, it will be purged after 90 days. Note, that all other white-listed fields will still be kept. If you want to keep the new field, follow the steps to white-list it described below.
If I remove fields from my schema, should I remove them from the white-list?
Normally no. The older fields, will continue to white-list older revisions of your schema. If you do not need the data contained in older revisions of your schema, feel free to remove the fields from the white-list.
Best privacy practices when creating or modifying schemas
This practices assume you want to keep your data indefinitely. If you don't need the data produced by your schema for historical querying, consider sticking to the full-purging strategy (default).
- Use short/simple/single-purpose schemas as opposed to giant/all-aware/complex schemas. Simple schemas are more likely to be non-sensitive.
- Avoid using persistent tokens, reverse identifiers or any kind of fingerprint.
- Avoid using personal identifying fields, like: username, userId, etc. Unless the rest of the schema does not contain any browsing data or personal context, it's very likely that they need to be purged.
- Avoid fields that contain text inputed by the users. Those can eventually contain private information of the users which they inputed by mistake; for example, by copy-pasting their credit card number.
- Avoid using cross-schema tokens that can associate events from a schema with events from another schema. This can make 2 non-sensitive schemas into 1 combined sensitive data set.
- Bucketize potential identifiers. For example, instead of emitting the editCount (integer) of a user, emit a bucketized version ("0 edits"|"1-4 edits"|"5-99 edits"|"100-999 edits"|"1000+ edits"). This way the field becomes non-identifying and can be combined with other data safely.
- When logging the skin of the current user bear in mind that when combined with fields such as "wiki" or "webhost" it can be possible to identify people. For instance there may be a low number of users who may be using the mobile "Minerva" skin on desktop. Consider bucketing skins into popular skins and "other" field.
- When logging page title bear in mind that when combined with other fields you are potentially creating a list of a user's reading history. When logging page title, consider not logging additional data such as skin, webhost or user.
How to change the purging strategy of a schema
- Submit a Gerrit patch to the EL sanitization white-list in puppet where you add the schema and fields you want to keep indefinitely. Please, take the sanitization rationale into account when selecting them. Then add someone in the Analytics team to review the patch. We'll review and merge it, and that's it.
- Alternatively, create a Phabricator task named i.e. "Add <SchemaName_123> fields to EL purging white-list" and tag it with the "Analytics" project. In the task description, mention which field you'd like to keep. We Analytics will update the white-list and that's it. This option might take a bit longer, because it might take a couple days until we groom the task from our backlog.