Analytics/Event Sanitization

From Wikitech

This page describes the event sanitization processes used with Event Platform data. Sanitization allows certain event data to be retained in Hive beyond the 90-day retention period set by WMF's Privacy Policy and Data Retention Guidelines.

Data retention

To learn more about data retention practices for events, see Event Data retention.

Hive

The Analytics Data Lake stores event streams as Hive tables, including those with a very high volume. It uses two databases: event and event_sanitized. The event database stores original (unsanitized) events, while the event_sanitized database stores sanitized events. Sanitization happens shortly after events are generated, with a lag of a couple of hours. As a result, unsanitized and sanitized events coexist in the two databases for 90 days. After that, unsanitized events are automatically deleted from the event database, and the only events that persist indefinitely are those in event_sanitized.
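The retention rule above can be sketched as a small function. This is only an illustration: the function name and the allowlisted flag are hypothetical, and the sketch ignores the couple of hours of sanitization lag.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # retention period stated on this page


def databases_holding(event_time, now, allowlisted):
    """Return the databases expected to still hold an event.

    `allowlisted` says whether the event's table appears in an allowlist;
    it is a hypothetical flag used only for this sketch.
    """
    held = []
    if now - event_time <= RETENTION:
        held.append("event")  # unsanitized copy, deleted after 90 days
    if allowlisted:
        held.append("event_sanitized")  # sanitized copy, kept indefinitely
    return held


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
recent = databases_holding(now - timedelta(days=10), now, allowlisted=True)
old = databases_holding(now - timedelta(days=120), now, allowlisted=True)
```

In this sketch, an event from an allowlisted table is in both databases while it is recent and only in event_sanitized once it is older than 90 days. An event from a table that is not allowlisted is gone entirely after 90 days.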

Hive event sanitization job

The job lives in analytics/refinery/source and is run hourly by cron. It reads the new unsanitized events from the event database, sanitizes them using the allowlist, and copies them over to the event_sanitized database. Only tables that are present in the allowlists are sanitized and copied over to the event_sanitized database.

A second sanitization job runs 45 days after data was received, just in case any changes were made to the allowlist in that time.


Allowlists

The allowlists are YAML files with the following format. The first level corresponds to table names. Every table that we want to keep, partially or fully, needs to be listed there; otherwise, the whole contents of that table will be purged. Under each table name, at the second level of the YAML, are the names of the fields that we want to keep indefinitely. Each field name must have the tag keep (retained as-is) or hash.

Analytics/instrumentation event tables must explicitly list all fields they want to keep or hash. Main production event tables are more permissive and may use the keep_all tag to keep all fields of the table. There are two separate allowlists for this purpose.

Example:

table_name:
    event:
        field_name1: keep
        field_name2: keep
        identifier1: hash

Because of rotating salts, hashed identifiers will be linkable within the same quarter, but not across quarters. In other words, you will not be able to group events by the identifier across quarters, only within one quarter.

The allowlist supports partially allowlisting nested fields.
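The allowlist semantics described above can be sketched in Python as follows. This is a minimal illustration and not the code of the actual job: the sanitize function, the sample record, and the placeholder hash_fn are hypothetical. The sketch simply omits fields that are not allowlisted; how removed fields are represented in event_sanitized is not specified on this page. It also does not handle keep_all.

```python
def sanitize(record, allowlist, hash_fn):
    """Return a copy of `record` containing only allowlisted fields.

    `allowlist` maps field names to "keep", "hash", or a nested mapping
    for partially allowlisted nested fields.
    """
    result = {}
    for field, rule in allowlist.items():
        if field not in record:
            continue
        value = record[field]
        if isinstance(rule, dict):
            # Nested field: recurse and keep only the listed sub-fields.
            if isinstance(value, dict):
                nested = sanitize(value, rule, hash_fn)
                if nested:
                    result[field] = nested
        elif rule == "keep":
            result[field] = value
        elif rule == "hash":
            result[field] = hash_fn(value)
        else:
            raise ValueError(f"unknown tag {rule!r} for field {field!r}")
    return result


# Fields listed under table_name in the example above, as a Python dict.
allowlist = {
    "event": {"field_name1": "keep", "field_name2": "keep", "identifier1": "hash"}
}
record = {
    "event": {
        "field_name1": "a",
        "field_name2": "b",
        "identifier1": "user-123",
        "free_text": "not allowlisted",
    },
    "ip": "198.51.100.7",
}
sanitized = sanitize(record, allowlist, hash_fn=lambda value: "<hashed>")
```

Here free_text and ip are absent from the result because they are not in the allowlist, while identifier1 is passed through the placeholder hash function.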

If you decide to hash (and salt) an identifier field, then all other identifiers of the same schema have to be hashed as well. This applies even to temporary identifiers like session tokens. Otherwise, those unhashed identifiers could be used to match hashed (and salted) fields across a salt rotation, which would invalidate the protection that salting and hashing offers.

Important notes:

  • For analytics/instrumentation events, using the keep label for nested fields or for whole schemas is not allowed.
  • The allowlist is table-centric, meaning it applies to all versions of a given event schema. This way, when an event schema is altered, the allowlist continues to work.
  • The event sanitization process can automatically hash privacy-sensitive string fields when they are copied over to event_sanitized. To do that, use the 'hash' label instead of 'keep' in the allowlist. All fields hashed this way are also salted (a cryptographic salt is appended to the value before the hash function is applied) to increase the security of the hash. The event sanitization salt is rotated (replaced by a new one) every 3 months, coinciding with the start of the quarter, and the old salt is thrown away.
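The effect of salting and quarterly salt rotation can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: SHA-256 is assumed as the hash function (the page does not name one), quarters are assumed to start in January, April, July, and October, and the RotatingSalts class is hypothetical.

```python
import hashlib
import secrets
from datetime import date


def quarter_of(day):
    """Return (year, quarter) for a date, assuming quarters start in Jan/Apr/Jul/Oct."""
    return (day.year, (day.month - 1) // 3 + 1)


class RotatingSalts:
    """Holds exactly one salt and replaces it when a new quarter starts."""

    def __init__(self):
        self._quarter = None
        self._salt = None

    def salt_for(self, day):
        quarter = quarter_of(day)
        if quarter != self._quarter:
            # New quarter: generate a fresh salt; the old one is discarded.
            self._quarter = quarter
            self._salt = secrets.token_hex(32)
        return self._salt


def salted_hash(value, salt):
    """Append the salt to the value, then hash (SHA-256 is an assumption)."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


salts = RotatingSalts()
in_january = salted_hash("user-123", salts.salt_for(date(2024, 1, 15)))
in_march = salted_hash("user-123", salts.salt_for(date(2024, 3, 31)))
in_april = salted_hash("user-123", salts.salt_for(date(2024, 4, 1)))
```

In this sketch the January and March hashes of the same identifier are equal, so events can be grouped within the quarter. The April hash differs, and because the old salt is gone, the two cannot be linked.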

Modifying the allowlist

  • Submit a Gerrit patch to an allowlist YAML file in which you add the table and fields you want to keep indefinitely. Please take the sanitization rationale into account when selecting them. Then add someone from the Data Engineering team as a reviewer, who will review and merge the patch.

  • Alternatively, create a Phabricator task named, for example, "Add table_name fields to event sanitization allowlist" and tag it with the "Data-Engineering" project. In the task description, mention which fields you'd like to keep. The Data Engineering team will update the allowlist for you. This option might take a bit longer, because it can take a couple of days until the task gets looked at, prioritized, and worked on from the backlog.
  • Allowlist updates are automatically deployed on the weekly train. If you need an update to be deployed sooner, you can ask Data Engineering to do a manual deploy.
  • The Data Engineering team will reach out to Legal if they have concerns about retaining any specific fields.
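Before submitting a patch, an allowlist entry could be checked locally with something like the following sketch. It is not part of the refinery tooling: the validate_allowlist function is hypothetical, it works on an allowlist already loaded into a Python dict, and the assumption that keep_all appears as a plain tag value is labeled as such in the code.

```python
ALLOWED_TAGS = {"keep", "hash"}


def validate_allowlist(allowlist, allow_keep_all=False):
    """Return a list of human-readable problems; an empty list means none found.

    Set `allow_keep_all=True` for the more permissive production allowlist.
    Assumption: keep_all is written as a plain tag value.
    """
    problems = []

    def check(node, path):
        if isinstance(node, dict):
            if not node:
                problems.append(f"{path}: no fields listed")
            for name, child in node.items():
                check(child, f"{path}.{name}")
        elif node == "keep_all":
            if not allow_keep_all:
                problems.append(f"{path}: keep_all is not permitted here")
        elif not isinstance(node, str) or node not in ALLOWED_TAGS:
            problems.append(f"{path}: unknown tag {node!r}")

    for table, fields in allowlist.items():
        check(fields, table)
    return problems


good = {"table_name": {"event": {"field_name1": "keep", "identifier1": "hash"}}}
bad = {"table_name": {"event": {"field_name1": "kept"}}, "other_table": "keep_all"}
good_problems = validate_allowlist(good)
bad_problems = validate_allowlist(bad)
```

The check only covers the tag vocabulary. It cannot tell whether all identifiers of a schema are hashed together, which still needs human review.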

F.A.Q.

What is the default purging strategy for new schemas?

The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside event databases. This means that if you create a new schema and don't take action to allowlist its fields, the events produced to that schema will be purged after 90 days.

Should I allowlist all fields of my schema every time I modify it?

No. The allowlist is table-centric, meaning it does not know about schema versions. All the fields that are in the allowlist for previous revisions of your schema will also apply to the new version.

When are allowlist changes effective?

After being merged, changes need to be deployed together with the analytics refinery code. This normally happens weekly, on Wednesdays. A deployment may be skipped in a given week if there are not enough changes, or if a significant part of the team is unavailable due to ops issues, holidays, or offsites.

If I add a new field to an existing schema, what will happen?

The new field won't be in the allowlist, because it's new. So by default, it will be purged after 90 days. Note that all other allowlisted fields will still be kept. If you want to keep the new field, follow the steps described in the "Modifying the allowlist" section above.

If I remove fields from my schema, should I remove them from the allowlist?

You cannot remove fields from your schema; this would be a backwards-incompatible change. Technically, it can be done, but it requires a lot of manual intervention and migration planning.

If this does happen for some reason, then no, you should not remove the fields from the allowlist. The older fields will continue to allowlist data from events created with older versions of your schema. If you do not need the data contained in older revisions of your schema, feel free to remove the fields from the allowlist.

More correctly: because the allowlist applies to the Hive table, NOT the source event schema, the allowlist should match the Hive table schema.

What happens when I rename fields in a schema?

Renaming fields is not possible for event schemas. Hive does accept such a change, but it does not actually rename the existing field: it treats an event schema rename as a deletion of the original field plus the creation of a new field. The resulting refined table will have both the old and the new names as columns. If you decide to rename a schema field anyway, please remember to update the sanitization allowlist accordingly; otherwise the newly named field will be purged.
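Because the allowlist applies to the Hive table, a rename or a new field can be spotted by comparing the table's columns with the allowlisted field names. The following is a minimal sketch of such a comparison; the function and the column names are hypothetical, and it only handles top-level fields.

```python
def allowlist_coverage(table_columns, allowlisted_fields):
    """Compare a Hive table's columns with the fields allowlisted for it."""
    columns = set(table_columns)
    allowed = set(allowlisted_fields)
    return {
        "retained": sorted(columns & allowed),   # kept or hashed in event_sanitized
        "purged": sorted(columns - allowed),     # gone after 90 days
        "unmatched": sorted(allowed - columns),  # allowlisted but not in the table
    }


# After a rename, the refined table has both the old and the new column,
# but only the old name is in the allowlist.
coverage = allowlist_coverage(
    table_columns=["action", "session_id", "session_token"],
    allowlisted_fields=["action", "session_id"],
)
```

In this example session_token, the newly named field, ends up under "purged" until the allowlist is updated to include it.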