This page is currently a draft.
This page documents how data is sanitized for the Wiki Replica public databases that Wikimedia Cloud Services provides.
Main admin docs for this are in Portal:Data Services/Admin/Wiki Replicas.
Step 1 Sanitarium
Each sanitarium host runs a MariaDB instance for each db shard it replicates. Replication into the sanitarium host uses filters and triggers to remove sensitive columns, tables and databases in the simple case where there are no conditions (e.g. ensuring that user_password does not go into Cloud Services).
- For tables that should not be replicated, the replicate-wild-ignore-table mysql config option is set with the $private_tables puppet variable
- For databases that should not be replicated (private wikis), replicate-wild-ignore-table is set with the databases from the $private_wikis puppet variable (note: this is separate from private.dblist)
- Columns that should be redacted are redacted via triggers, which are set based on the list of columns at modules/role/files/mariadb/filtered_tables.txt
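As a rough illustration of the first two bullets, the sketch below turns a list of private tables and a list of private wikis into replicate-wild-ignore-table lines. The function name, the sample list contents and the exact patterns (`%.table`, `wiki.%`) are assumptions made for this example; the lines puppet actually renders may differ.

```python
# Illustrative only: the arguments stand in for the $private_tables and
# $private_wikis puppet variables, and ignore_rules() is a hypothetical helper.

def ignore_rules(private_tables, private_wikis):
    """Return replicate-wild-ignore-table lines for a MariaDB config file."""
    rules = []
    # A table that must never be replicated is ignored in every database.
    for table in private_tables:
        rules.append(f"replicate-wild-ignore-table = %.{table}")
    # A private wiki is ignored wholesale: every table in its database.
    for wiki in private_wikis:
        rules.append(f"replicate-wild-ignore-table = {wiki}.%")
    return rules


if __name__ == "__main__":
    # Made-up names, not real entries from the puppet variables.
    for line in ignore_rules(["exampleprivatetable"], ["examplewiki"]):
        print(line)
```

In these patterns `%` matches any run of characters and `_` matches a single character, so a real configuration may need to escape underscores in table or database names.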
Data from this host is then replicated on to the labsdb hosts. Doing this redaction on a separate host outside of Cloud Services helps isolate the most sensitive data in the db, so that a privilege escalation via Cloud Services access does not compromise it.
There is also a check_private_data_report script to make sure redaction happened properly. This runs weekly via cron and emails the DBAs the results when a mismatch is found.
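The idea behind such an audit can be sketched as follows. This is not the real check_private_data.py: it uses an in-memory SQLite database as a stand-in for MariaDB, and the function name, table layout and sample rows are invented for the example. It simply counts rows where a column that should have been redacted still holds a non-empty value.

```python
import sqlite3


def find_unredacted(conn, filtered_columns):
    """Return (table, column, count) for filtered columns that still hold data."""
    problems = []
    for table, column in filtered_columns:
        # Identifiers cannot be bound as parameters, so they are interpolated;
        # this assumes the list comes from a trusted source.
        query = (
            f'SELECT COUNT(*) FROM "{table}" '
            f'WHERE "{column}" IS NOT NULL AND "{column}" != \'\''
        )
        (count,) = conn.execute(query).fetchone()
        if count:
            problems.append((table, column, count))
    return problems


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user (user_id INTEGER, user_name TEXT, user_password TEXT)"
    )
    conn.execute("INSERT INTO user VALUES (1, 'Alice', '')")             # redacted
    conn.execute("INSERT INTO user VALUES (2, 'Bob', 'leaked-hash')")    # not redacted
    print(find_unredacted(conn, [("user", "user_password")]))
```

A report would only need to be sent when the returned list is non-empty, which matches the "email on mismatch" behaviour described above.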
- modules/role/files/mariadb/redact_sanitarium.sh: adds triggers to redact the appropriate columns
- modules/role/files/mariadb/filtered_tables.txt: lists which columns to filter
- modules/role/files/mariadb/check_private_data_report and check_private_data.py: audit to make sure no private data is there
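To make the relationship between the column list and the triggers more concrete, here is a minimal sketch that reads a simplified list and emits MariaDB trigger statements that blank the listed columns. The `table,column` line format, the trigger names and the choice to replace values with an empty string are all assumptions for this example; the real filtered_tables.txt and redact_sanitarium.sh may work differently.

```python
def parse_filtered(text):
    """Parse lines of the assumed form 'table,column' into (table, column) pairs."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        table, column = (part.strip() for part in line.split(","))
        pairs.append((table, column))
    return pairs


def trigger_sql(table, columns):
    """Build BEFORE INSERT and BEFORE UPDATE triggers that blank the columns."""
    assignments = ", ".join(f"NEW.`{c}` = ''" for c in columns)
    return [
        f"CREATE TRIGGER `{table}_{event.lower()}` BEFORE {event} ON `{table}` "
        f"FOR EACH ROW SET {assignments};"
        for event in ("INSERT", "UPDATE")
    ]


def build_triggers(text):
    """Group columns by table and return all trigger statements."""
    by_table = {}
    for table, column in parse_filtered(text):
        by_table.setdefault(table, []).append(column)
    statements = []
    for table, columns in by_table.items():
        statements.extend(trigger_sql(table, columns))
    return statements


if __name__ == "__main__":
    # user_password comes from the text above; example_column is made up.
    sample = "user,user_password\nuser,example_column\n"
    for statement in build_triggers(sample):
        print(statement)
```

Because the triggers fire on the sanitarium host as replicated rows are written, the sensitive values are overwritten before the data is replicated any further.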
This used to be part of operations/software/redactron.git, but that repo is no longer used.
Step 2 Labsdb views
In operations/puppet.git, modules/role/templates/labs/db/views/maintain-views.yaml defines the views that determine what is public. It contains conditional redactions that cannot be done at sanitarium (e.g. revision delete), and also serves as defense in depth in case one of the sanitarium redactions fails.
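A conditional redaction of this kind can be sketched with a view that hides a column only when a per-row flag is set. The example below uses an in-memory SQLite database as a stand-in for MariaDB; the table name, the column names and the bit value 4 for "user hidden" are assumptions for illustration and are not taken from maintain-views.yaml.

```python
import sqlite3

# Assumed bit meaning "the user of this revision is hidden"; illustrative only.
DELETED_USER_BIT = 4


def make_demo():
    """Create a raw table plus a view that conditionally hides rev_user_text."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        f"""
        CREATE TABLE revision_raw (
            rev_id INTEGER, rev_user_text TEXT, rev_deleted INTEGER
        );
        INSERT INTO revision_raw VALUES (1, 'Alice', 0);
        INSERT INTO revision_raw VALUES (2, 'Bob', {DELETED_USER_BIT});

        -- Users of the public view never see the raw table, only this view.
        CREATE VIEW revision AS
            SELECT rev_id,
                   CASE WHEN rev_deleted & {DELETED_USER_BIT}
                        THEN NULL ELSE rev_user_text END AS rev_user_text,
                   rev_deleted
            FROM revision_raw;
        """
    )
    return conn


if __name__ == "__main__":
    conn = make_demo()
    for row in conn.execute("SELECT rev_id, rev_user_text FROM revision ORDER BY rev_id"):
        print(row)
```

A trigger cannot express this well because the decision depends on a flag that can change after the row was written, whereas a view re-evaluates the condition on every read.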
Document redaction decisions
TODO: include documentation/rationale on any info publicly exposed that is not publicly exposed by MW.
Note: operations/software/redactron.git and operations/software/labsdb-auditor.git contain historical software which is no longer used.