Data Platform/Systems/EventLogging/Sanitization vs Aggregation

This page explains two strategies for sanitizing/aggregating EventLogging (EL) schema tables so that they comply with the Data Retention Guidelines. Specifically, it describes the sanitization/aggregation of the field editCount, a numerical field that many EL schemas share. Note that the two options are not mutually exclusive, and that this example can be extended to other fields that share the same properties as editCount.

1) Sanitization of editCount field

Consider the following EventLogging table:

editCount  field1  ...  fieldN
0          a       ...  10
4          b       ...  20
78         c       ...  30
435        d       ...  40
17840      e       ...  50

The idea is to add a new field named editCountBucket (or similar). Its value would be a bucketed version of the editCount field, an enum like: "0 edits", "1-4 edits", "5-99 edits", "100-999 edits", "1000+ edits".
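
For illustration, the mapping could be expressed as a SQL CASE (a sketch only: in this solution the bucket is actually computed by the instrumentation code, and the event_editCount column name is an assumption about the table layout):

-- Sketch of the bucketing logic; thresholds follow the enum above.
CASE
    WHEN event_editCount = 0    THEN '0 edits'
    WHEN event_editCount <= 4   THEN '1-4 edits'
    WHEN event_editCount <= 99  THEN '5-99 edits'
    WHEN event_editCount <= 999 THEN '100-999 edits'
    ELSE '1000+ edits'
END

The resulting table would look like this: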

editCount  field1  ...  fieldN  editCountBucket
0          a       ...  10      0 edits
4          b       ...  20      1-4 edits
78         c       ...  30      5-99 edits
435        d       ...  40      100-999 edits
17840      e       ...  50      1000+ edits

After 90 days, only the editCount would be deleted, leaving the table with the other original fields plus editCountBucket:

editCount  field1  ...  fieldN  editCountBucket
           a       ...  10      0 edits
           b       ...  20      1-4 edits
           c       ...  30      5-99 edits
           d       ...  40      100-999 edits
           e       ...  50      1000+ edits

This keeps the data set non-aggregated (all the non-sensitive data stays as-is), while still permitting queries on a safe simplification of the editCount field.
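
The purge itself could then be a simple nulling update run periodically, sketched here assuming a MariaDB table with the standard EventLogging varchar timestamp column (the table and column names are assumptions):

-- Sketch: null out editCount for events older than 90 days.
UPDATE <schema_table>
SET event_editCount = NULL
WHERE timestamp < DATE_FORMAT(NOW() - INTERVAL 90 DAY, '%Y%m%d%H%i%S');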

Work needed

Here's a list of tasks to implement this solution:

  1. [Analytics team] Create a new version of the schema with the field editCountBucket.
  2. [Product team] Modify the instrumentation to use the new schema and populate the editCountBucket field.
  3. [Analytics team] Create a SQL script to update all the tables for the old revisions of the schema, adding the new field.
  4. [DBA] Run/schedule the update script and activate auto-purging of the editCount field after 90 days.

If the schema is populated from a single-version system, there would be no problem. But if the schema is populated from different versions of a mobile app, there will always be events arriving in old-revision tables without the editCountBucket information. The update of the old schema tables (step 3) should therefore be executed periodically, at most every 90 days, to ensure that no editCount value is deleted before its corresponding editCountBucket field has been populated.
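
A minimal sketch of such an update script, assuming a hypothetical old-revision table named Schema_12345678:

-- One-time: add the new field to the old-revision table.
ALTER TABLE Schema_12345678
    ADD COLUMN event_editCountBucket VARCHAR(32) DEFAULT NULL;

-- Periodic: derive the bucket for rows that do not have it yet.
UPDATE Schema_12345678
SET event_editCountBucket = CASE
        WHEN event_editCount = 0    THEN '0 edits'
        WHEN event_editCount <= 4   THEN '1-4 edits'
        WHEN event_editCount <= 99  THEN '5-99 edits'
        WHEN event_editCount <= 999 THEN '100-999 edits'
        ELSE '1000+ edits'
    END
WHERE event_editCountBucket IS NULL
  AND event_editCount IS NOT NULL;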

2) Aggregation on the editCount field

The idea here is to use the EventLogging report schedulers (generate.py, reportupdater) to compute the desired custom SQL metrics daily and store them in report files. Provided they do not persist sensitive information, these reports can be kept indefinitely. Note that the query can read the entire non-sanitized data, because it executes within the 90-day retention window; after that period, the sensitive data in the tables should be purged. For example, given the same table as in the former example:

editCount  field1  ...  fieldN
0          a       ...  10
4          b       ...  20
78         c       ...  30
435        d       ...  40
17840      e       ...  50

You could write a SQL query like:

SELECT
    DATE(timestamp) AS day,
    SUM( IF( event_editCount = 0, 1, 0 ) ) AS "0 edits",
    SUM( IF( event_editCount > 0 AND event_editCount < 5, 1, 0 ) ) AS "1-4 edits",
    SUM( IF( event_editCount >= 5 AND event_editCount < 100, 1, 0 ) ) AS "5-99 edits",
    SUM( IF( event_editCount >= 100 AND event_editCount < 1000, 1, 0 ) ) AS "100-999 edits",
    SUM( IF( event_editCount >= 1000, 1, 0 ) ) AS "1000+ edits"
FROM <schema_table>
GROUP BY day
ORDER BY day
;

This query would then be scheduled via the EventLogging schedulers, which would create a CSV/TSV report file looking like this (the numbers are made up):

day, 0 edits, 1-4 edits, 5-99 edits, 100-999 edits, 1000+ edits
2015-01-01, 2846, 325, 27, 3, 1
2015-01-02, 2476, 292, 25, 4, 1
2015-01-03, 3012, 321, 19, 3, 2
...

These kinds of reports have the advantage that they can very easily be displayed in a dashboard using Dashiki; the disadvantage is that they can not be queried via SQL.

Work needed

Here's a list of tasks to implement this solution:

  1. [Product team] Write the SQL queries in the context of the EL schedulers (see the sketch below).
  2. [Analytics team] If the product team doesn't have an instance of the schedulers running yet, create a new repository for it and puppetize its execution.
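
To illustrate step 1: reportupdater runs plain .sql files on a schedule, substituting the report period's boundaries into timestamp placeholders. The query above could be adapted along these lines (the {from_timestamp}/{to_timestamp} placeholder names and the period filter are assumptions; check the reportupdater documentation):

-- Sketch of a reportupdater query file.
SELECT
    DATE(timestamp) AS day,
    SUM( IF( event_editCount = 0, 1, 0 ) ) AS "0 edits",
    -- ... same bucket sums as in the query above ...
    SUM( IF( event_editCount >= 1000, 1, 0 ) ) AS "1000+ edits"
FROM <schema_table>
-- Restrict to the report period so each run only reads fresh events.
WHERE timestamp >= '{from_timestamp}'
  AND timestamp < '{to_timestamp}'
GROUP BY day
ORDER BY day
;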