Wiki Replica redaction

From Wikitech

This page documents how data is sanitized for the public Wiki Replica databases that Wikimedia Cloud Services provides.

Main admin docs for this are in Portal:Data Services/Admin/Wiki Replicas.

Step 1 Sanitarium

Each sanitarium host has a MariaDB instance for each db shard it replicates. Replication into the sanitarium host uses filters and triggers to remove sensitive databases, tables and columns in the simple case where no conditions apply (e.g. ensuring that user_password never reaches Cloud Services).

  • Tables that should not be replicated are excluded with the replicate-wild-ignore-table MySQL config option, which is set from the $private_tables puppet variable.
  • Databases that should not be replicated (private wikis) are excluded the same way: replicate-wild-ignore-table is set with the databases from the $private_wikis puppet variable. (Note that this list is separate from private.dblist.)
  • Columns that should be redacted are redacted by triggers, which are set up from the list of columns in modules/role/files/mariadb/filtered_tables.txt.
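
As a rough illustration of the three mechanisms above, the sketch below builds replicate-wild-ignore-table lines and redaction trigger statements from plain Python lists. It is not the puppet code or the actual trigger script: the function names, the trigger naming scheme, the replacement value and the example table and wiki names are all assumptions made for this example. Only the option name and the user_password column come from this page.

```python
def ignore_table_options(private_tables, private_wikis):
    """Build replicate-wild-ignore-table lines for a MariaDB config file."""
    lines = []
    for table in private_tables:
        # '%' matches any database name, so the table is skipped on every wiki.
        lines.append(f"replicate-wild-ignore-table = %.{table}")
    for wiki in private_wikis:
        # 'dbname.%' skips every table in a private wiki's database.
        lines.append(f"replicate-wild-ignore-table = {wiki}.%")
    return lines


def redaction_triggers(table, column, replacement="''"):
    """Return INSERT and UPDATE trigger statements that overwrite one column.

    The trigger names and the replacement value are hypothetical.
    """
    statements = []
    for event in ("INSERT", "UPDATE"):
        name = f"{table}_{event.lower()}_redact_{column}"
        statements.append(
            f"CREATE TRIGGER {name} BEFORE {event} ON {table} "
            f"FOR EACH ROW SET NEW.{column} = {replacement}"
        )
    return statements


if __name__ == "__main__":
    # Placeholder names, not the real contents of $private_tables or $private_wikis.
    for line in ignore_table_options(["example_private_table"], ["exampleprivatewiki"]):
        print(line)
    for statement in redaction_triggers("user", "user_password"):
        print(statement + ";")
```

Because the triggers fire before each replicated INSERT or UPDATE is applied, the original value is never stored on the sanitarium host in this sketch, which is the property the downstream hosts rely on.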

Data from this host is then replicated onto the labsdb hosts. Doing this redaction on a separate host outside of Cloud Services helps isolate the most sensitive data in the db, so that a privilege escalation via Cloud Services access does not compromise it.

There is also a check_private_data_report script that verifies redaction happened properly. It runs weekly via cron and emails the results to the DBAs when a mismatch is found.
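
The kind of check such an audit performs can be sketched as follows. This is not the real check_private_data_report: it assumes that a redacted column should be NULL or empty, uses an in-memory SQLite database in place of MariaDB so that it runs anywhere, and the function names and the example_private_col column are hypothetical.

```python
import sqlite3


def find_unredacted(conn, filtered_columns):
    """Return (table, column, count) for every column that still holds data."""
    mismatches = []
    for table, column in filtered_columns:
        # Identifiers come from a trusted list; they cannot be bound as parameters.
        query = (
            f"SELECT COUNT(*) FROM {table} "
            f"WHERE {column} IS NOT NULL AND {column} <> ''"
        )
        (count,) = conn.execute(query).fetchone()
        if count:
            mismatches.append((table, column, count))
    return mismatches


def demo():
    """Build a tiny table with one leaked value and audit it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user (user_id INTEGER, user_password TEXT, example_private_col TEXT)"
    )
    conn.executemany(
        "INSERT INTO user VALUES (?, ?, ?)",
        [(1, "", ""), (2, "", "leaked value")],
    )
    return find_unredacted(
        conn, [("user", "user_password"), ("user", "example_private_col")]
    )


if __name__ == "__main__":
    for table, column, count in demo():
        print(f"{table}.{column}: {count} unredacted row(s)")
```

An empty result means every listed column is clean; a report would only need to be sent when the list is non-empty, which matches the mismatch-only emails described above.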

The code related to sanitarium currently lives in operations/puppet.git's modules/role/files/mariadb directory.

  • modules/role/files/mariadb/ Add triggers to redact the appropriate columns
  • modules/role/files/mariadb/filtered_tables.txt What columns to filter
  • modules/role/files/mariadb/check_private_data_report and Audit to make sure no private data is there
  • $private_wikis and $private_tables in manifests/realm.pp

This code used to be part of operations/software/redactron.git, but that repo is no longer used.

Step 2 Wiki Replica views

In operations/puppet.git, modules/profile/templates/wmcs/db/wikireplicas/maintain-views.yaml defines the views that determine what is public. It contains conditional redactions that cannot be done at the sanitarium (e.g. revision deletion), and also serves as defense in depth in case one of the sanitarium redactions fails.
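
To show what a conditional redaction looks like, the sketch below builds a CREATE VIEW statement from a mapping of column names to optional SQL expressions. It does not reproduce the format of maintain-views.yaml or the script that applies it: the view_sql helper, the database names and the chosen columns are assumptions for illustration. The bit value 4 is MediaWiki's revision-deletion flag for a hidden user.

```python
def view_sql(view_db, source_db, table, columns, where=None):
    """Build a CREATE VIEW statement from a column -> expression mapping.

    A value of None exposes the column unchanged; a string is used as the
    SQL expression that replaces it in the view.
    """
    select_list = ", ".join(
        column if expr is None else f"{expr} AS {column}"
        for column, expr in columns.items()
    )
    sql = (
        f"CREATE OR REPLACE VIEW {view_db}.{table} AS "
        f"SELECT {select_list} FROM {source_db}.{table}"
    )
    if where:
        sql += f" WHERE {where}"
    return sql


if __name__ == "__main__":
    # Hypothetical database names; the actor is hidden when bit 4 of rev_deleted is set.
    print(
        view_sql(
            "examplewiki_p",
            "examplewiki",
            "revision",
            {
                "rev_id": None,
                "rev_actor": "IF(rev_deleted & 4, NULL, rev_actor)",
                "rev_deleted": None,
            },
        )
    )
```

A per-row condition like this depends on another column's value at query time, which is why it fits a view better than the unconditional filters and triggers used at the sanitarium.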

Document redaction decisions

TODO: include documentation/rationale for any information exposed publicly here that is not publicly exposed by MediaWiki.


Note: operations/software/redactron.git and operations/software/labsdb-auditor.git contain historical software which is no longer used.