Wiki Replica redaction

From Wikitech

This page documents how data is sanitized for the public Wiki Replicas databases that Wikimedia Cloud Services provides.

The main admin documentation for this is in Portal:Data Services/Admin/Wiki Replicas.

Sanitariums

Replication into the Sanitarium hosts uses triggers and filters to remove sensitive columns, tables, and databases in the simple case where the redaction is unconditional (for example, this ensures that user_password never ends up in the Wiki Replicas).

More technical details can be found at Portal:Data_Services/Admin/Wiki_Replicas#Step_1:_sanitization.
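
As a rough illustration of the trigger idea above, the following minimal sketch blanks a sensitive column as rows are written. It uses an in-memory SQLite database rather than the real Sanitarium setup, and the function name `build_sanitized_replica` and the trigger names are hypothetical; the actual triggers and replication filters are defined elsewhere and may look quite different.

```python
import sqlite3


def build_sanitized_replica() -> sqlite3.Connection:
    """Create a toy 'user' table whose password column is always blanked.

    Illustration only: SQLite stands in for the real database, and the
    trigger names are made up for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE "user" (
            user_id       INTEGER PRIMARY KEY,
            user_name     TEXT NOT NULL,
            user_password TEXT NOT NULL DEFAULT ''
        );

        -- Blank the sensitive column for every newly inserted row.
        CREATE TRIGGER user_insert_sanitize AFTER INSERT ON "user"
        BEGIN
            UPDATE "user" SET user_password = '' WHERE user_id = NEW.user_id;
        END;

        -- Blank it again if an update tries to set a non-empty value.
        CREATE TRIGGER user_update_sanitize AFTER UPDATE OF user_password ON "user"
        WHEN NEW.user_password <> ''
        BEGIN
            UPDATE "user" SET user_password = '' WHERE user_id = NEW.user_id;
        END;
    """)
    return conn
```

With this in place, inserting a row that carries a password hash leaves only an empty string in the stored row, so nothing downstream of the table can read the original value.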

There is also a check_private_data_report script that verifies that redaction was applied correctly. It runs weekly via cron and emails the results to the DBAs when a mismatch is found.

More technical details can be found at Portal:Data_Services/Admin/Wiki_Replicas#Step_2:_evaluation.
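
The kind of check described above could be sketched as follows. This is not the check_private_data_report script itself: the `SENSITIVE_COLUMNS` list and the `find_unredacted` function are assumptions made for illustration, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical list of (table, column) pairs expected to be empty after
# sanitization; the checks in the real script are not reproduced here.
SENSITIVE_COLUMNS = [("user", "user_password"), ("user", "user_email")]


def find_unredacted(conn, columns=SENSITIVE_COLUMNS):
    """Return (table, column, row_count) for every column that still has data.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted configuration, never from user input.
    """
    mismatches = []
    for table, column in columns:
        query = (
            f'SELECT COUNT(*) FROM "{table}" '
            f'WHERE "{column}" IS NOT NULL AND "{column}" <> \'\''
        )
        (count,) = conn.execute(query).fetchone()
        if count:
            mismatches.append((table, column, count))
    return mismatches
```

An empty result means nothing needs reporting; a non-empty one is the mismatch case in which a report would be sent.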

Wiki Replica views

In operations/puppet.git, modules/profile/templates/wmcs/db/wikireplicas/maintain-views.yaml defines the views that determine what is public. These views apply conditional redactions that cannot be done at the Sanitarium level (e.g. revision deletion), and also serve as defense in depth in case one of the Sanitarium redactions fails.

More technical details can be found at Portal:Data_Services/Admin/Wiki_Replicas#Step_6:_setting_views.
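
A conditional redaction of the kind mentioned above might look like the following sketch, which hides the editor of a revision only when the corresponding revision-deletion bit is set. The view name `revision_public` and the function `build_revision_view` are hypothetical, the table is reduced to three columns, and SQLite is used for illustration; the real definitions live in maintain-views.yaml and may differ.

```python
import sqlite3

# MediaWiki revision-deletion bit that hides the editor of a revision.
DELETED_USER = 4


def build_revision_view() -> sqlite3.Connection:
    """Create a toy revision table plus a view that redacts per row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(f"""
        CREATE TABLE revision (
            rev_id      INTEGER PRIMARY KEY,
            rev_actor   INTEGER NOT NULL,
            rev_deleted INTEGER NOT NULL DEFAULT 0
        );

        -- Expose rev_actor only when the DELETED_USER bit is not set.
        CREATE VIEW revision_public AS
        SELECT rev_id,
               CASE WHEN rev_deleted & {DELETED_USER} THEN NULL
                    ELSE rev_actor END AS rev_actor,
               rev_deleted
        FROM revision;
    """)
    return conn
```

Because the decision depends on each row's rev_deleted value, this cannot be expressed as an unconditional column filter, which is why it belongs in the view layer.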

Document redaction decisions

TODO: include documentation and rationale for any information exposed publicly here that is not publicly exposed by MediaWiki.

Other

Note: operations/software/redactron.git and operations/software/labsdb-auditor.git contain historical software which is no longer used.