Data Engineering/Systems/Event Data retention/AppInstallId

This page describes a proposal to anonymize the appInstallId field while keeping a good part of its value.

Nature of appInstallId

The appInstallId is a random uuid that is generated in the user's device at their first run of the Wikipedia app. It's private and it's only stored in the user's device, so given an appInstallId, there's no practical way of determining who or where the corresponding device owner is.

Value of appInstallId

The appInstallId field has a significant value for EL data analysis, because it allows to group of events collected from the same device thoughout time. Also, it is a cross-schema identifier, so it allows to relate events collected from the same device that belong to different schemas.

Threat model

Suppose we have 3 data sets: S1, S2 and S3. S1 has only one partial identifier: country. S2 also has only one partial identifier: os_major. S3 has also only one partial identifier: edit_count. We assume each one of those data sets is non-privacy-sensitive by itself, because one single partial identifier field will not be a significant risk. However, if S1, S2 and S3 share the appInstallId field, then one could combine events from S1, S2 and S3 to relate country, os_major and edit_count which could be considered a significant risk. So, the appInstallId field, for being cross-schema, can make N non-privacy-sensitive data sets into one privacy-sensitive data set.
By nature, an appInstallId does not point to a device/user, but it points to a set of events collected from the same device. If a person is capable of accessing a user's device and also our data, that person could easily access all the events generated by that user's device.

Proposed solution

Currently there's a job in the Analytics Hadoop cluster that reads the events in the event (unsanitized) database, sanitizes them by applying the EL whitelist, and then writes the resulting anonymized events to the event_sanitized database. As part of that sanitization job, the appInstallId would be hashed with a salt that rotates every 90 days (or every quarter). The chosen algorithm would be An HMAC-SHA-256 using a salt to add entropy to increase the keyspace of the hash input. The old rotated salts would be thrown away (deleted), so that the threat model could not be executed on top of events older than 90 days. The salts are controlled (created, rotated, disposed) externally by puppet, or another cluster job.

Pseudonymization / obfuscation: effect of hasing + salting + rotating

Hashing appInstallIds by itself is not enough to invalidate the threat model. Given a user's appInstallId, one can hash it and then get all the user's data.
Hashing + salting appInstallIds by themselves is not enough (because we assume one has access to data - including the salt), given a user's appInstallId, one can hash it together with the salt and get all the user's data. A fixed salt does not increase the entropy space of the hashing function.
Hashing + salting + rotating (throwing away the old salts) makes it impossible to recompute hashes for events older than 90 days, because no-one has access to the salt that was used to generate those hashes.
The negative side effect on the data, from the perspective of data analysis, is that after hashing+salting+rotating, one will only be able to group events that were generated by the same user during the same quarter. Events generated by the same user during different quarters will not be groupable. Note that this strategy reduces the long-term retention risk by reducing the "likelihood" of identity reconstruction but it does not eliminate this risk.

Other considerations

Even with hash+salt+rotate it's theoretically possible in some cases to determine which hashes from different quarters represent the same user. This could be done by extracting statistics on browsing patterns for all appInstallIds and matching different hashes that have similar patterns. This would me more likely to happen for high volume users, and would be made easier by partial identifiers like os_minor, country, etc.
One effective way to avoid this threat, would be to implement "rotating" sampling, which would re-raffle sampled devices every quarter, to be in synch with the salt rotation. This way, sampled users would be (usually) different each quarter.
Although backfilling would be possible within a 90-day window (the period where the raw unsanitized logs are still present in the event database), in some cases like at the beginning of a quarter, we might have already forgotten the salt that was used to generate the target hashes for backfilling. So we'll be able to backfill, but with a new salt, so for backfilled periods, appInstallId hashed won't match.