Analytics/Systems/EventLogging/Data retention and auto-purging
- 1 Definitions
- 2 Rationale for purging
- 3 Purging Strategies
- 4 Implementation
- 5 F.A.Q.
- 5.1 Which schemas and fields are being purged?
- 5.2 Is the information about purging that lives in the schema talk pages correct?
- 5.3 Will the schema talk pages ever have correct purging info?
- 5.4 What is the default purging strategy for new schemas?
- 5.5 Should I white-list all fields of my schema every time I modify it?
- 5.6 If I add a new field to an existing schema, what will happen?
- 5.7 If I remove fields from my schema, should I remove them from the white-list?
- 6 Best privacy practices when creating or modifying schemas
- 7 How to change the purging strategy of a schema
Definitions
Identifying information
Identifying information is any field that uniquely identifies a physical person, or that can potentially be used to identify a person (or to narrow down the pool of possible persons enough) given a certain situation and context. The following subsections explain some types of identifying fields:
Personally identifying information (PII)
PII is any field that could be used to uniquely identify an individual user. Examples include name, email, phone number, credit card number, and government ID number. Note that EventLogging does not store any of these fields under any circumstance.
Other fields that are not strictly PII can still identify a user in certain circumstances. For example, the field 'editCount' looks OK at first sight, but a very prolific editor's number of edits, i.e. 37296 edits, may be unique in its context. They are most likely the only user with that exact number of edits, so the field 'editCount' can be identifying and should be treated as such. Other potentially identifying fields are: 'userAgent', 'country/region/city', 'ipAddress', etc.
Cross-schema tokens
Even when a token is non-persistent, it can still be dangerous if it is cross-schema. A cross-schema token corresponds to the same user in more than one table, thus linking events from different tables together. Two tables that are non-sensitive by themselves can become a sensitive dataset when linked together by a cross-schema token. Thus, cross-schema tokens are considered sensitive as well.
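The danger of cross-schema tokens can be illustrated with a small sketch. The table names, field names and token value below are made up for illustration; each "table" is just a list of dicts:

```python
# Hypothetical sketch: two individually non-sensitive event tables,
# linked through a shared (cross-schema) token.
search_events = [
    {"token": "abc123", "query_topic": "health"},
]
reading_events = [
    {"token": "abc123", "page": "Some_medical_article"},
]

# Joining the two tables on the token ties a user's search topics to
# their reading history, producing a sensitive combined record.
linked = [
    {**s, **r}
    for s in search_events
    for r in reading_events
    if s["token"] == r["token"]
]
print(linked)
```

Neither table alone reveals much, but the join does, which is why cross-schema tokens are treated as sensitive.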
Browsing information
Any field that contains personal information about topics such as racial or ethnic origin, sexual orientation, marital or familial status, religion, political affiliation, etc. is highly sensitive. In the context of EventLogging, this usually corresponds to browsing information: the pages visited by a user, the pages watched, the recommendations clicked, and so on. All of these can potentially reveal the state or personal preferences of a user.
Rationale for purging
The privacy threat exists in data sets that contain both identifying information and browsing information. In that case, the browsing information can be linked to a specific physical person, and the personal status or preferences of that user can be exposed. Data sets that contain browsing information but not identifying information can be non-sensitive, for example the Pageview API, which has pageview counts per wiki article. Similarly, data sets that contain identifying information but not browsing information can also be non-sensitive, like the Browser Statistics dashboard, which has usage stats broken down by OS and browser versions (data coming from the identifying user agent field). However, when both elements are combined in the same schema, the data set becomes highly sensitive.
Schema semantics revealing facts
Some schemas that contain identifying fields (but not browsing information) can still be sensitive if the semantics of the schema reveals aspects of the users. For example, imagine a schema called 'PageviewsByWomen' that stores information about all pageviews performed by women. Even if there's no browsing information in the schema, the sole schema name reveals the gender of its users, and renders the data set sensitive.
What do the data retention guidelines recommend?
So, given all those definitions and situations, the data retention guidelines recommend the following (please read the "How long do we retain non-public data" section of the Data Retention Guidelines as the true authority on those concepts: https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data.3F):
- Non-sensitive information: Keep it indefinitely.
- Sensitive information: After at most 90 days, delete, aggregate, or anonymize.
Purging Strategies
There are 3 purging strategies in EventLogging, ranging from strictest to most permissive.
Full purge
It permanently deletes whole event records from the database when they reach the age of 90 days. This strategy suits schemas that are sensitive (see the types of sensitive data sets above) or whose information doesn't need to be kept for a longer period of time. Note that this is the default strategy for new schemas and for new fields in existing schemas.
Partial purge
It permanently sets to NULL the subset of the event's fields that are sensitive when the event reaches the age of 90 days. The rest of the fields (the non-sensitive ones) are kept indefinitely. This strategy suits schemas that can easily be sanitized and whose information is of great value and needs to be kept for a longer period of time.
Minimal purge
It permanently sets to NULL the EventCapsule's userAgent field when the event reaches the age of 90 days. The EventCapsule is a wrapper schema common to all EventLogging schemas. All the other fields in the schema are kept indefinitely. This strategy suits totally non-sensitive schemas.
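The three strategies could be expressed as SQL along these lines. This is a simplified sketch: the table and field names are hypothetical, the real script works in batches, and the actual SQL it emits may differ:

```python
RETENTION_DAYS = 90

def full_purge_sql(table):
    # Full purge: delete entire event rows older than 90 days.
    return (f"DELETE FROM {table} "
            f"WHERE timestamp < NOW() - INTERVAL {RETENTION_DAYS} DAY;")

def partial_purge_sql(table, sensitive_fields):
    # Partial purge: NULL out only the sensitive fields; keep the rest.
    assignments = ", ".join(f"{f} = NULL" for f in sensitive_fields)
    return (f"UPDATE {table} SET {assignments} "
            f"WHERE timestamp < NOW() - INTERVAL {RETENTION_DAYS} DAY;")

def minimal_purge_sql(table):
    # Minimal purge: NULL out only the EventCapsule's userAgent field.
    return partial_purge_sql(table, ["userAgent"])

print(full_purge_sql("MySchema_12345"))
print(partial_purge_sql("MySchema_12345", ["userId", "userAgent"]))
print(minimal_purge_sql("MySchema_12345"))
```

Note how minimal purge is just a special case of partial purge where the only "sensitive" field is userAgent.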
Implementation
Purging in Hadoop
The Analytics' Hadoop cluster stores all EventLogging schemas, including those with a very high volume. However, it only stores the last 90 days of events for all of them, regardless of the agreed purging strategy. On a daily basis, a script deletes the partitions that are older than 90 days. If you want to access EL historical data (kept for longer than 90 days), you'll find it in the MariaDB hosts. Note that there are no Hive external tables created on top of these data sets; if you want to query the data in Hive, you'll have to create them (see Work in progress).
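The daily partition cleanup amounts to selecting every daily partition older than the retention window. A minimal sketch, assuming the partitions can be represented by their dates (the real job operates on HDFS paths, and the function name here is made up):

```python
from datetime import date, timedelta

RETENTION_DAYS = 90

def partitions_to_drop(partition_dates, today):
    """Return the daily partitions that fall outside the 90-day window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [d for d in partition_dates if d < cutoff]

# Hypothetical daily partitions that are 10, 50 and 100 days old:
today = date(2017, 6, 1)
parts = [today - timedelta(days=n) for n in (10, 50, 100)]
print(partitions_to_drop(parts, today))  # only the 100-day-old partition
```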
Work in progress
In the Analytics team, we are developing a way to automatically create Hive external tables on top of EL data files, so that Hadoop will also be an easy way to access EL data. When we do this, we'll also implement the partial and minimal purging for Hadoop so that historical data can be accessed from there.
Purging in MariaDB
The MariaDB auto-purging system is implemented by a script and a white-list, and runs on 4 hosts: 2 master hosts and 2 replicas.
Purging on master hosts
To guarantee the performance of the masters, they only hold events from the last 90 days. The purging script ensures, on a daily basis, that all events older than 90 days are deleted, regardless of what the white-list says. So the only hosts keeping the whole history of non-sensitive EL events are the MariaDB replicas. If you want to query the data, you should use one of the replicas.
Purging on replica hosts
This is where the actual white-list purging happens. Note: The replica hosts, at the moment, are the only stores that keep non-sensitive EL data for more than 90 days.
The purging script
It's a Python script that lives in puppet and is run daily by a cron job. The purging script reads the white-list, checks one by one the tables and fields that need to be deleted or set to NULL after 90 days, and does so. It deletes or updates MariaDB records in batches, so that the table is not blocked for a long time and user queries are not affected.
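The batching idea can be sketched as follows. This is not the actual purging script: it uses SQLite for a self-contained demo (the real script targets MariaDB), and the table name, batch size and schema are made up:

```python
import sqlite3

def purge_in_batches(conn, table, cutoff, batch_size=1000):
    """Delete events older than `cutoff` in small batches, so the
    table is never locked for long and concurrent queries can run."""
    total = 0
    while True:
        # Fetch at most one batch of expired row ids.
        rows = conn.execute(
            f"SELECT rowid FROM {table} WHERE timestamp < ? LIMIT ?",
            (cutoff, batch_size)).fetchall()
        if not rows:
            break
        placeholders = ",".join("?" * len(rows))
        conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ({placeholders})",
            [r[0] for r in rows])
        # Commit after every batch to release locks early.
        conn.commit()
        total += len(rows)
    return total

# Tiny demo: 5 events, purge the 3 with a timestamp older than 4.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MySchema_12345 (timestamp INTEGER)")
conn.executemany("INSERT INTO MySchema_12345 VALUES (?)",
                 [(t,) for t in (1, 2, 3, 4, 5)])
print(purge_in_batches(conn, "MySchema_12345", cutoff=4, batch_size=2))
```

Committing per batch is what keeps lock hold times short; a single big DELETE would block the table for the whole operation.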
The white-list
The EL purging white-list is a TSV file with 2 columns: schema and field. All schema-field pairs listed there will be kept indefinitely in the MariaDB replicas. Schema-field pairs not listed will be deleted after 90 days. Note that the white-list is schema-centric (not table-centric), meaning it applies to all revisions of a given schema. This way, when a schema is altered, the white-list continues to work.
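To make the format concrete, here is a sketch that parses a white-list excerpt and answers "is this schema-field pair kept?". The schema and field names in the excerpt are invented for illustration, not taken from the real white-list:

```python
import csv
import io

# Hypothetical excerpt of the TSV white-list: one schema-field pair
# per line, tab-separated.
WHITELIST_TSV = (
    "Edit\tplatform\n"
    "Edit\taction\n"
    "Search\tresultCount\n"
)

def load_whitelist(tsv_text):
    """Map each schema to the set of fields kept indefinitely."""
    kept = {}
    for schema, field in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        kept.setdefault(schema, set()).add(field)
    return kept

def is_kept(whitelist, schema, field):
    # Any schema-field pair not listed is purged after 90 days.
    return field in whitelist.get(schema, set())

wl = load_whitelist(WHITELIST_TSV)
print(is_kept(wl, "Edit", "platform"))   # listed -> kept indefinitely
print(is_kept(wl, "Edit", "userAgent"))  # not listed -> purged
```

Because the keys are schema names rather than table names, the same entries cover every revision of a schema, which matches the schema-centric behavior described above.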
The black-list
Warning: This black-list should not be confused with the auto-purging white-list discussed in the rest of this page. Some EL schemas receive too many events for MariaDB to handle. Those are black-listed through puppet (using Hiera) and are not stored in MariaDB. They are still stored in the Hadoop cluster, though. Note that schemas black-listed like this do not currently have the option of partial or minimal purging: they will be fully purged after 90 days. If you plan on adding a schema that will produce more than 10 events per second (on average), please let the Analytics team know; it might be necessary to black-list it.
F.A.Q.
Which schemas and fields are being purged?
The single source of truth regarding the purging strategy of the schemas and the fields that will be kept in each case is the EventLogging purging white-list. For more details see its section above.
Is the information about purging that lives in the schema talk pages correct?
We can *not* guarantee that the purging strategy mentioned in the schema talk pages is the one actually implemented in the white-list. Listing the purging strategy in the talk pages turned out to be impractical, and in the end we decided that the white-list would be the single place for that information.
Will the schema talk pages ever have correct purging info?
There's a task in Analytics' backlog to write a script that automatically updates the talk pages with the changes to the white-list.
What is the default purging strategy for new schemas?
The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside EventLogging databases. This means that if you create a new schema and don't take action to white-list its fields, the events produced to that schema will be purged after 90 days.
Should I white-list all fields of my schema every time I modify it?
No. The white-list is schema-centric, meaning it does not observe revisions. All the fields that are in the whitelist for previous revisions of your schema will also apply to the new revision.
If I add a new field to an existing schema, what will happen?
The new field won't be in the white-list, because it's new. So, by default, it will be purged after 90 days. Note that all other white-listed fields will still be kept. If you want to keep the new field, follow the steps to white-list it described below.
If I remove fields from my schema, should I remove them from the white-list?
Normally, no. The older fields will continue to white-list older revisions of your schema. If you do not need the data contained in older revisions of your schema, feel free to remove those fields from the white-list.
Best privacy practices when creating or modifying schemas
- If you don't need the data produced by your schema for historical querying, consider sticking to the full-purging strategy (default).
- Use short/simple/single-purpose schemas as opposed to giant/all-aware/complex schemas. Simple schemas are more likely to be non-sensitive.
- Do not use persistent tokens or any kind of fingerprint.
- Try to avoid personally identifying fields, like: username, userId, editCount, appInstallID, etc. Unless the rest of the schema is totally non-sensitive, it's very likely that they will need to be purged.
- Avoid fields that contain free text entered by users. Such fields can end up containing private information entered by mistake; for example, a copy-pasted credit card number.
- Do not use cross-schema tokens that can associate events from a schema with events from another schema. This can make 2 non-sensitive schemas into 1 combined sensitive data set.
- Bucketize potential identifiers. For example, instead of emitting the editCount (integer) of a user, emit a bucketized version ("0 edits"|"1-4 edits"|"5-99 edits"|"100-999 edits"|"1000+ edits"). This way the field becomes non-identifying and can be combined with other data safely.
- When logging the skin of the current user, bear in mind that when combined with fields such as "wiki" or "webhost" it can be possible to identify people. For instance, there may be a low number of users using the mobile "Minerva" skin on desktop. Consider bucketing skins into popular skins and an "other" bucket.
- When logging page title bear in mind that when combined with other fields you are potentially creating a list of a user's reading history. When logging page title, consider not logging additional data such as skin, webhost or user.
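The bucketing recommendation above can be sketched as a small function. The bucket boundaries follow the example given in the list; the function name is made up:

```python
def bucketize_edit_count(edit_count):
    """Map a raw edit count onto a coarse, non-identifying bucket label,
    following the buckets suggested above."""
    if edit_count == 0:
        return "0 edits"
    if edit_count <= 4:
        return "1-4 edits"
    if edit_count <= 99:
        return "5-99 edits"
    if edit_count <= 999:
        return "100-999 edits"
    return "1000+ edits"

# A uniquely identifying raw count collapses into a shared bucket:
print(bucketize_edit_count(37296))
```

An instrument would emit the bucket label instead of the raw integer, so even a prolific editor's events are indistinguishable from those of every other user in the same bucket.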
How to change the purging strategy of a schema
Assuming that you already:
- Created a new schema or modified an existing one.
- Instrumented your code to emit events to that new schema/revision and deployed it.
At this point the new events are already flowing into EventLogging. You have 90 days to alter the default purging strategy of the new schema/fields before they start getting dropped. To do so, choose one of the following:
- Submit a Gerrit patch to the EL purging white-list in puppet, adding the schema and fields you want to keep indefinitely. Please take the purging rationale into account when selecting them. Then add someone from the Analytics team to review the patch. We'll review and merge it, and that's it.
- Alternatively, create a Phabricator task named e.g. "Add <SchemaName_123> fields to EL purging white-list" and tag it with the "Analytics" project. In the task description, mention which fields you'd like to keep. The Analytics team will update the white-list. This option might take a bit longer, because it may take a couple of days until we groom the task from our backlog.