Analytics/Systems/EventLogging/Data retention and auto-purging/AppInstallId

From Wikitech
Jump to navigation Jump to search

This page describes a proposal to anonymize the appInstallId field while keeping a good part of its value.

Current status

There are 17 EventLogging MobileWiki schemas that are using the appInstallId field. Their data is stored in 2 databases: MySQL (log) and Hive (event). MySQL stores EL data since the beginning of EventLogging, while Hive stores EL data since November 2017. Some of the schemas (MobileWikiAppiOS*) are only stored in Hive, because MySQL is not able to keep up with their high throughput.

Nature of appInstallId

The appInstallId is a random uuid that is generated in the user's device at their first run of the Wikipedia app. It's private and it's only stored in the user's device, so given an appInstallId, there's no practical way of determining who or where the corresponding device owner is.

Value of appInstallId

The appInstallId field has a significant value for EL data analysis, because it allows to group of events collected from the same device thoughout time. Also, it is a cross-schema identifier, so it allows to relate events collected from the same device that belong to different schemas.

Threat model

  • Suppose we have 3 data sets: S1, S2 and S3. S1 has only one partial identifier: country. S2 also has only one partial identifier: os_major. S3 has also only one partial identifier: edit_count. We assume each one of those data sets is non-privacy-sensitive by itself, because one single partial identifier field will not be a significant risk. However, if S1, S2 and S3 share the appInstallId field, then one could combine events from S1, S2 and S3 to relate country, os_major and edit_count which could be considered a significant risk. So, the appInstallId field, for being cross-schema, can make N non-privacy-sensitive data sets into one privacy-sensitive data set.
  • By nature, an appInstallId does not point to a device/user, but it points to a set of events collected from the same device. If a person is capable of accessing a user's device and also our data, that person could easily access all the events generated by that user's device.

Proposed solution

FOR HIVE: Currently there's a job in the Analytics Hadoop cluster that reads the events in the event (unsanitized) database, sanitizes them by applying the EL whitelist, and then writes the resulting anonymized events to the event_sanitized database. As part of that sanitization job, the appInstallId would be hashed with a salt that rotates every 90 days (or every quarter). The chosen algorithm would be An HMAC-SHA-256 using the salt as private key. The old rotated salts would be thrown away (deleted), so that the threat model could not be executed on top of events older than 90 days. The salts would be controlled (created, rotated, disposed) externally, maybe by puppet, or another cluster job. NOTE: We are still working on the script that will delete the data older than 90 days from the event (unsanitized) database, but we'll wait until end of Q1 FY2018-2019 to activate that script.

FOR MYSQL: Currently, there's a python script triggered by a cron job that updates rows older than 90 days in MySQL EL databases and sanitizes them according to the whitelist. As part of this sanitization script, the appInstallId would be hashed and salted in the same way as with Hive.

BOTH: Both databases would require some backwards sanitization of the appInstallId, to apply hashing and salting to historical data.

Effect of hasing + salting + rotating

  • Hashing appInstallIds by itself is not enough to invalidate the threat model. Given a user's appInstallId, one can hash it and then get all the user's data.
  • Hashing + salting appInstallIds by themselves is not enough (because we assume one has access to data - including the salt), given a user's appInstallId, one can hash it together with the salt and get all the user's data.
  • Hashing + salting + rotating (throwing away the old salts) makes it impossible to recompute hashes for events older than 90 days, because no-one has access to the salt that was used to generate those hashes.
  • The negative side effect on the data, from the perspective of data analysis, is that after hashing+salting+rotating, one will only be able to group events that were generated by the same user during the same quarter. Events generated by the same user during different quarters will not be groupable.

Other considerations

  • Even with hash+salt+rotate it's theoretically possible in some cases to determine which hashes from different quarters represent the same user. This could be done by extracting statistics on browsing patterns for all appInstallIds and matching different hashes that have similar patterns. This would me more likely to happen for high volume users, and would be made easier by partial identifiers like os_minor, country, etc.
  • One effective way to avoid this threat, would be to implement "rotating" sampling, which would re-raffle sampled devices every quarter, to be in synch with the salt rotation. This way, sampled users would be (usually) different each quarter.
  • Although backfilling would be possible within a 90-day window (the period where the raw unsanitized logs are still present in the event database), in some cases like at the beginning of a quarter, we might have already forgotten the salt that was used to generate the target hashes for backfilling. So we'll be able to backfill, but with a new salt, so for backfilled periods, appInstallId hashed won't match.

Timeline

We Analytics plan to productionize appInstallId hashing and salting, plus applying backwards historical sanitization to both Hive and MySQL by the end of this quarter Q1 FY2018-2019, also deleting non-sanitized data that is older than 90 days.