MariaDB/PII
This page describes the procedure for removing Personally Identifiable Information (PII) from Wiki Replicas.
This procedure must be followed every time a new wiki is added to the database. The full procedure for adding a new wiki is described at Add a wiki.
Sanitize the Wiki Data
Note: As of June 2024, the default section for new wikis is s5, but this may change in the future. The commands below use the s5 socket; adjust the socket path if the wiki is in a different section.
Go to the sanitarium hosts in both data centers and run the following command to clean up the data:
redact_sanitarium.sh -d $NEW_WIKI_NAME -S /run/mysqld/mysqld.s5.sock | mysql -S /run/mysqld/mysqld.s5.sock
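Because the default section may change, it can help to derive the socket path from the section name instead of hard-coding s5. The following is a minimal sketch: the helper name socket_for_section is hypothetical, and it assumes every section's socket follows the same /run/mysqld/mysqld.<section>.sock pattern as the s5 example, which should be confirmed on the host.

```shell
# Hypothetical helper: build the MariaDB socket path for a section.
# Assumption: all sections use the /run/mysqld/mysqld.<section>.sock
# pattern seen in the s5 example; confirm this on the host first.
socket_for_section() {
    # Refuse anything that does not look like a section name (e.g. s5).
    case "$1" in
        ''|*[!a-z0-9]*) echo "invalid section: $1" >&2; return 1 ;;
    esac
    printf '/run/mysqld/mysqld.%s.sock\n' "$1"
}

# Example usage (commented out so nothing runs by accident):
#   SOCK="$(socket_for_section s5)"
#   redact_sanitarium.sh -d "$NEW_WIKI_NAME" -S "$SOCK" | mysql -S "$SOCK"
```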
Run the Private Data Check Script
Execute the check_private_data.py script to identify which tables and columns need to be dropped:
check_private_data.py -S /run/mysqld/mysqld.s5.sock
Review the output. If it lists only the tables and columns you expect to drop, you can pipe it directly to MySQL to drop them:
check_private_data.py -S /run/mysqld/mysqld.s5.sock | mysql -S /run/mysqld/mysqld.s5.sock
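One way to make the review step explicit is to write the script's output to a file, read it, and only then feed that same file to MySQL. This is a sketch rather than part of the documented procedure: the function name and the CHECK_CMD override are hypothetical, and only the -S flag shown above is used.

```shell
# Hypothetical helper: save the checker's output to a file for review.
# CHECK_CMD can be overridden for a dry run; it defaults to the script
# named on this page. Only the -S flag from the page is used.
save_check_output() {
    # $1 = socket path, $2 = file to write the checker's output to
    "${CHECK_CMD:-check_private_data.py}" -S "$1" > "$2"
}

# Example usage (commented out so nothing runs by accident):
#   save_check_output /run/mysqld/mysqld.s5.sock /tmp/private_data.sql
#   less /tmp/private_data.sql        # read every statement first
#   mysql -S /run/mysqld/mysqld.s5.sock < /tmp/private_data.sql
```

Applying the reviewed file, instead of running the script a second time, ensures that what reaches MySQL is exactly what was read.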
Verify changes on each wikireplicas host
The wikireplicas hosts are clouddb10[13-20] (owned by the Cloud Services team), plus an-redacteddb1001 (owned by the Data Platform SRE team).
Ensure that you run check_private_data.py on each of these hosts after the whole process on your side is done, to avoid leaking Personally Identifiable Information:
check_private_data.py -S /path/to/socket
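To keep track of which hosts still need checking, the host range above can be expanded into individual names. A minimal sketch: the function name is hypothetical, only the short host names from this page are printed, and how you reach each host (as well as the socket path, left as /path/to/socket above) depends on your environment.

```shell
# Hypothetical helper: print one wikireplicas host per line.
# clouddb10[13-20] expands to clouddb1013 ... clouddb1020.
wikireplica_hosts() {
    n=1013
    while [ "$n" -le 1020 ]; do
        printf 'clouddb%s\n' "$n"
        n=$((n + 1))
    done
    printf 'an-redacteddb1001\n'
}

# Example: print a checklist of hosts to verify (commented out):
#   wikireplica_hosts | while read -r host; do
#       echo "[ ] $host: check_private_data.py -S /path/to/socket"
#   done
```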
Grant Permissions for SQL Views
Identify the wikireplicas hosts that belong to the relevant database section (e.g., s5). Grant the labsdbuser role the required privileges by running:
GRANT SELECT, SHOW VIEW ON `NEW_WIKI_NAME_p`.* TO `labsdbuser`;
This needs to be done on all wikireplicas hosts that have the relevant database section.
Create the View Database
Create the view database on all the wikireplicas hosts that belong to the relevant database section:
CREATE DATABASE NEW_WIKI_NAME_p;
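The CREATE DATABASE and GRANT statements above differ only in the wiki name, so they can be generated together to avoid typos when repeating them on several hosts. A minimal sketch with a hypothetical function name; the name check is an added safety measure and assumes wiki database names contain only lowercase letters, digits, and underscores.

```shell
# Hypothetical helper: print the two statements from this page for a wiki.
# Assumption: wiki database names match [a-z0-9_]+; anything else is rejected
# so that stray characters never end up inside the SQL.
view_db_sql() {
    case "$1" in
        ''|*[!a-z0-9_]*) echo "invalid wiki name: $1" >&2; return 1 ;;
    esac
    printf 'CREATE DATABASE %s_p;\n' "$1"
    printf 'GRANT SELECT, SHOW VIEW ON `%s_p`.* TO `labsdbuser`;\n' "$1"
}

# Example usage (commented out; examplewiki is a placeholder name):
#   view_db_sql examplewiki                          # inspect the statements
#   view_db_sql examplewiki | mysql -S /path/to/socket
```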
Notify Relevant Teams
Once the database is sanitized and the view database is created, the work managed by DBAs is complete.
The next step is to create the views, and that is managed by the Wikimedia Cloud Services team.
Until we define a better process, assign the task to fnegri.