Analytics/Cluster/Data deletion and sanitization


Some of the data sets stored in the Analytics Hadoop cluster contain privacy-sensitive information. According to WMF's privacy policy and data retention guidelines, those data sets need to be deleted or sanitized at most 90 days after collection. The single source of truth for which data sets are deleted or sanitized is the file data_purge.pp in the operations/puppet repository. Other Hadoop data sets that are not privacy-sensitive but need deletion after a given period of time may also appear in that file.

Retention period and timer execution interval

Every deletion job that is set up in data_purge.pp needs a retention period and a timer execution interval. Depending on the way the data set is partitioned, we want to specify them differently.

  • For hourly data sets, we make the timer execute once an hour, or once a day if it is fine to keep the data for 1 extra day. The retention period can be 90 days, or 89 days if you want to execute the timer once a day but don't want to keep data for those extra 24 hours.
  • For daily data sets, we make the timer execute once a day. The retention period can be 90 days, or 89 days if you don't want to keep data for that extra day.
  • For monthly data sets, we also make the timer execute once a day. This way the script works equally well for long months (31 days) and for shorter ones (28, 29 or 30 days). Note that the deletion scripts never touch data that is too recent to be deleted: a whole month, e.g. May, is not deleted until the last day of May is older than the given threshold. So the retention period has to account for the extra month that is kept between deletions. For example, if we want to keep at most 90 days of data for a monthly data set, we can use a retention period of 60 days. That guarantees that at least 60 days are kept at all times, and that at most we store 60 days + 1 full month, meaning at most 91 days of data.
  • Snapshot data sets are a bit different. Their deletion is usually not related to privacy, because each snapshot spans the whole history since the beginning of wiki-time; the reason to delete them is storage space. The script that deletes them removes the oldest snapshots based on the number of snapshots, not on time. So there is no real retention period, and the execution interval is not very important.
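
The arithmetic for monthly data sets can be checked with a small simulation. The following is only a sketch, not the actual deletion script: the function names are hypothetical, and it assumes that a monthly partition becomes deletable once its last day is strictly older than today minus the retention period, with the timer running once a day.

```python
import calendar
from datetime import date, timedelta


def month_bounds(year, month):
    """Return the first and last day of a (year, month) partition."""
    last = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last)


def is_deletable(year, month, today, retention_days):
    """Assumption: a month is deletable once its LAST day is older than the threshold."""
    _, last_day = month_bounds(year, month)
    return last_day < today - timedelta(days=retention_days)


def oldest_kept_age(today, retention_days, lookback_months=12):
    """Age in days of the oldest data still kept after the timer runs on `today`."""
    year, month = today.year, today.month
    oldest_first_day = date(year, month, 1)
    for _ in range(lookback_months):
        if is_deletable(year, month, today, retention_days):
            break
        oldest_first_day = date(year, month, 1)
        # Step back one month.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return (today - oldest_first_day).days


def worst_case_age(retention_days, start, end):
    """Largest age of kept data over a range of daily timer runs."""
    day, worst = start, 0
    while day <= end:
        worst = max(worst, oldest_kept_age(day, retention_days))
        day += timedelta(days=1)
    return worst


if __name__ == "__main__":
    print(worst_case_age(60, date(2023, 1, 1), date(2023, 12, 31)))
```

Under these assumptions, a 60-day retention period never leaves data older than 90 days in place, which is 91 calendar days of data when both end days are counted. If the real script compares dates differently (for example with a non-strict comparison), the bound shifts by one day.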

Deletion scripts

The scripts that are used to delete data older than a given threshold are the following:


This is a generic script that can delete any Hadoop data set that is partitioned by time, e.g. (year, month, day, hour), (year, month), etc. It can delete both HDFS directories and Hive partitions. See:
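
To illustrate the idea of threshold-based deletion for time-partitioned data, here is a minimal sketch. It is not the script referred to above: the function names and example paths are hypothetical, and it assumes Hive-style partition directories such as year=2023/month=5/day=2/hour=23. A partition is selected only when the whole time span it covers lies at or before the threshold.

```python
import re
from datetime import datetime, timedelta

# Assumption: partitions are encoded Hive-style in the path, e.g. ".../year=2023/month=5".
PARTITION_RE = re.compile(r"(year|month|day|hour)=(\d+)")


def partition_end(path):
    """Return the exclusive end of the time span covered by a partition path."""
    parts = {key: int(value) for key, value in PARTITION_RE.findall(path)}
    year = parts["year"]
    month = parts.get("month", 1)
    start = datetime(year, month, parts.get("day", 1), parts.get("hour", 0))
    if "hour" in parts:
        return start + timedelta(hours=1)
    if "day" in parts:
        return start + timedelta(days=1)
    if "month" in parts:
        # First day of the following month (December rolls over to next year).
        return datetime(year + month // 12, month % 12 + 1, 1)
    return datetime(year + 1, 1, 1)


def partitions_to_drop(paths, now, retention_days):
    """Select partitions whose data is entirely older than the retention period."""
    threshold = now - timedelta(days=retention_days)
    return [path for path in paths if partition_end(path) <= threshold]


if __name__ == "__main__":
    example_paths = [
        "/example/dataset/year=2023/month=5/day=2/hour=23",
        "/example/dataset/year=2023/month=5/day=3/hour=0",
    ]
    print(partitions_to_drop(example_paths, datetime(2023, 8, 1), 90))
```

Comparing the end of the partition, not its start, is what keeps a partition in place while any part of it is still too recent to delete.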


This script is used to automatically delete old Druid deep-storage data from HDFS. See:


This script is designed to delete Druid data that is organized in snapshots. More specifically, it deletes old snapshots of a given data set that are no longer valid, and keeps a few recent snapshots in case the most recent data turns out to be defective and a revert is necessary. See:
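
Count-based snapshot deletion can be sketched as follows. This is an illustration only, not the script referred to above: the function name is hypothetical, and it assumes snapshot identifiers that sort chronologically as strings (for instance "2023-04").

```python
def snapshots_to_delete(snapshots, keep):
    """Return the snapshots to remove so that only the `keep` most recent remain.

    Assumption: snapshot names sort chronologically, e.g. "2023-03" < "2023-04".
    """
    if keep < 1:
        raise ValueError("keep must be at least 1, so a revert stays possible")
    ordered = sorted(snapshots)
    # Everything except the last `keep` entries is considered old.
    return ordered[:-keep]


if __name__ == "__main__":
    available = ["2023-03", "2023-01", "2023-04", "2023-02"]
    print(snapshots_to_delete(available, keep=2))
```

Because the selection depends only on how many snapshots exist, running it more or less often does not change which snapshots survive.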


This script is specifically designed to delete old snapshots of the mediawiki data set family. See: Note that the tables affected by this script are hardcoded in the script itself. Otherwise the script would need too many parameters, given the different partitioning of the data sets within the mediawiki group.

Sanitization scripts

The scripts that sanitize certain fields of a given data set (by nullifying, bucketing or hashing them, etc.) are the following:


This Spark/Scala job uses Refine to select the partitions of a given data set that need to be sanitized and applies a given white-list to the database in question, in this case the event database. See:
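
The general idea of white-list based sanitization can be illustrated on a single record. The sketch below is plain Python, not the Spark/Scala job described above: the function name, the "keep" and "hash" actions, the field names and the salt are all hypothetical. Fields absent from the white-list are nullified, which is the conservative default assumed here.

```python
import hashlib
import hmac


def sanitize(record, whitelist, salt):
    """Apply a field-level white-list to one record (hypothetical format).

    whitelist maps field name -> "keep" or "hash"; any other field is nullified.
    """
    sanitized = {}
    for field, value in record.items():
        action = whitelist.get(field)
        if action == "keep":
            sanitized[field] = value
        elif action == "hash" and value is not None:
            # Keyed hash, so equal values stay comparable without exposing them.
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            sanitized[field] = digest.hexdigest()
        else:
            sanitized[field] = None
    return sanitized


if __name__ == "__main__":
    event = {"wiki": "enwiki", "session_id": "abc123", "ip": "192.0.2.1"}
    rules = {"wiki": "keep", "session_id": "hash"}
    print(sanitize(event, rules, salt=b"example-salt"))
```

Nullifying by default means that a newly added field is not retained past the sanitization step until someone explicitly adds it to the white-list.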