Portal:Data Services/Admin/From production to Wiki Replicas
This page contained a detailed explanation of how data flows from the production databases to Wiki Replicas and how data redaction is performed.
This page is mostly targeted at Wiki Replicas admins. For a shorter high-level overview targeted at end users, read Wiki Replica redaction.
(The editable file for this diagram is here)
Step 0: databases in production
The starting situation is that there are databases in production for wiki projects (like wikipedia, wikidata, wikictionary, and friends). We would like to provide this same databases for WMCS users. Due to privacy reasons, some data needs to be redacted or deleted. That's why users can't directly access this database.
So, we choose what databases to copy to wiki-replicas, which are all of them.
Every time a new database is created in production (for example, a new language for a wiki) we are in this step 0.
The identification of new database candidates for migrating to wiki-replicas is done under request by someone. Right now there aren't any mechanisms to notify pending migrations or the like.
Step 1: sanitization
Sanitariums are special database hosts managed by the Data Persistence team that contain a copy of production databases but without any private information (e.g. emails, passwords, etc.)
There are 2 Sanitarium hosts in Eqiad and 2 in Codfw. Each Sanitarium host runs 4 MariaDB services on different ports (one service per db section). For example, db1154
contains the sections s1
, s3
, s5
and s8
.
In Sanitariums, MariaDB is set to replicate from a production database using filters to avoid replicating private databases and tables, and triggers to redact the values of private columns during INSERTs and UPDATEs:
- Databases that should not be replicated (private wikis) are added to
replicate-wild-ignore-table
using the$private_wikis
puppet variable. (Note: this is separate fromprivate.dblist
) - Tables that should not be replicated are added to
replicate-wild-ignore-table
using the$private_tables
puppet variable. - Columns that contain private information are redacted via triggers that are set based on the list of columns at filtered_tables.txt, using the redact_sanitarium script.
Having this redaction done on a separate host outside of Cloud Services helps isolate the security of the data and ensure a privilege escalation via the Cloud Services access does not compromise the most sensitive data in the production databases.
Please note that some additional filtering of private information is done in the downstream Wiki Replica hosts (see Step 6 below).
Step 2: evaluation
Once data is in sanitarium boxes, some cron jobs and manual scripts check whether data is actually redacted. For example, check that a given column is NULL.
Involved code:
- modules/role/files/mariadb/check_private_data.py - main checker script
- modules/role/files/mariadb/check_private_data_report - wrapper to report and send notification emails
- modules/profile/manifests/mariadb/check_private_data.pp - cron / systemd job to run wrapper daily
This evaluation also happens in the wiki-replica servers, and alerts in case some private data is detected.
The main production team, DBAs, are in charge of this step.
Step 3: filling up wiki-replicas
Data is finally copied to wiki-replica servers. The DB servers are currently using row-based replication from sanitization boxes to wiki-replicas.
This is done in real time by the DB daemons, so there is always a data flow, which should be sanitized due to previous steps.
Step 4: setting up GRANTs
At this point, database GRANTS are created by main operations team DBAs.
This is done by hand, no automation using puppet, but there is a file where GRANTS are being tracked: modules/role/templates/mariadb/grants/wiki-replicas.sql.
The content of the file is something like the following:
GRANTS tracking file |
---|
-- Initial grants and grants added at T178128 GRANT labsdbuser TO 'labsdbadmin'@'10.64.37.19' WITH ADMIN OPTION GRANT SELECT, INSERT, UPDATE ON `mysql`.* TO 'labsdbadmin'@'10.64.37.19' GRANT SELECT, SHOW VIEW ON `%\\_p`.* TO 'labsdbadmin'@'10.64.37.19' WITH GRANT OPTION GRANT SELECT, SHOW VIEW ON `%wik%`.* TO 'labsdbadmin'@'10.64.37.19' GRANT labsdbuser TO 'labsdbadmin'@'10.64.37.20' WITH ADMIN OPTION GRANT SELECT, INSERT, UPDATE ON `mysql`.* TO 'labsdbadmin'@'10.64.37.20' GRANT SELECT, SHOW VIEW ON `%wik%`.* TO 'labsdbadmin'@'10.64.37.20' GRANT SELECT, SHOW VIEW ON `%\\_p`.* TO 'labsdbadmin'@'10.64.37.20' WITH GRANT OPTION -- Labsdbuser is a role with privileges for all views like -- GRANT SELECT, SHOW VIEW ON `rowikiquote\_p`.* TO 'labsdbuser' GRANT USAGE ON *.* TO 'labsdbuser' -- maintainviews user used by cloud services team GRANT ALL PRIVILEGES ON `heartbeat\\_p`.* TO 'maintainviews'@'localhost' GRANT ALL PRIVILEGES ON `meta\\_p`.* TO 'maintainviews'@'localhost' GRANT ALL PRIVILEGES ON `centralauth\\_p`.* TO 'maintainviews'@'localhost' GRANT SELECT ON `centralauth`.* TO 'maintainviews'@'localhost' GRANT SELECT ON `heartbeat`.* TO 'maintainviews'@'localhost' GRANT ALL PRIVILEGES ON `%wik%\\_p`.* TO 'maintainviews'@'localhost' GRANT ALL PRIVILEGES ON `%\\_p`.* TO 'maintainviews'@'localhost' GRANT SELECT, DROP, CREATE VIEW ON `%wik%`.* TO 'maintainviews'@'localhost' GRANT SELECT (user, host) ON `mysql`.`user` TO 'maintainviews'@'localhost' -- viewmater user GRANT SELECT ON *.* TO 'viewmaster'@'%' -- quarry user granted 48 connections #T180141 GRANT USAGE ON *.* TO 's52788'@'%' WITH MAX_USER_CONNECTIONS 48; |
Related tickets:
- T178128 - Access to raw database tables on labsdb* for wmcs-admin users
- T166310 - Grant root access for Bryan Davis on labstore* and admin for maintain scripts for labsdb*
Step 5: setting indexes
Create indexes that only exist on the wiki replicas via the maintain-replica-indexes script. This script manages the indexes in an idempotent fashion (if you run it multiple times it only changes what doesn't match the configuration file here index-conf.yaml). This could use the ability to handle only a single database, but it skips tables it finds comply with the definitions in the config, so it is fairly quick to run, despite that it runs across the whole set.
On each wiki-replica server, run:
% sudo /usr/local/sbin/maintain-replica-indexes
Step 6: setting views
Create _p (public) views, which are intermediate views which leave out private data.
This is done by means of the maintain-views.py script.
In each wiki-replica server, it's executed like this:
% sudo maintain-views --databases $wiki --debug
Running the maintain-views script/cookbook is handled by the WMCS team, with Data Engineering providing input on what should and should not be exposed.
TODO: elaborate info on what is this doing
Step 7: setting up metadata
Insert a new row in meta_p.wiki for the new wiki by running the maintain-meta_p script. This is only relevant for S7 databases.
The execution is like this:
% sudo /usr/local/sbin/maintain-meta_p --databases $wiki.
If the meta_p doesn't exist, you need to run it with --boostrap option.
For example, in case of a full bootstrap:
user@cumin1001:~$ sudo cumin "P{R:Profile::Mariadb::Section = 's7'} and P{P:wmcs::db::wikireplicas::mariadb_multiinstance}" "/usr/local/sbin/maintain-meta_p --all-databases --bootstrap"
3 hosts will be targeted:
clouddb[1014,1018,1021].eqiad.wmnet
[..]
This step is handled by the WMCS team.
TODO: elaborate info on what is this doing
Step 8: setting up DNS
From an openstack control server (for example cloudcontrol1004.wikimedia.org), run:
$ sudo /usr/local/sbin/wikireplica_dns --aliases --shard <sN>
To update the service IP used for the analytics or web servers:
- Edit modules/openstack/files/util/wikireplica_dns.yaml to change the service IP
- Run /usr/local/sbin/wikireplica_dns
root@cloudcontrol1007:~# /usr/local/sbin/wikireplica_dns -v --zone web.db.svc.eqiad.wmflabs. 2018-03-16T14:02:41Z mwopenstackclients.DnsManager INFO : Updating s8.web.db.svc.eqiad.wmflabs. 2018-03-16T14:02:42Z mwopenstackclients.DnsManager INFO : Updating s3.web.db.svc.eqiad.wmflabs. 2018-03-16T14:07:17Z mwopenstackclients.DnsManager INFO : Updating s2.web.db.svc.eqiad.wmflabs. 2018-03-16T14:07:23Z mwopenstackclients.DnsManager INFO : Updating s1.web.db.svc.eqiad.wmflabs. 2018-03-16T14:07:24Z mwopenstackclients.DnsManager INFO : Updating s7.web.db.svc.eqiad.wmflabs. 2018-03-16T14:07:28Z mwopenstackclients.DnsManager INFO : Updating s6.web.db.svc.eqiad.wmflabs. 2018-03-16T14:07:29Z mwopenstackclients.DnsManager INFO : Updating s5.web.db.svc.eqiad.wmflabs. 2018-03-16T14:07:30Z mwopenstackclients.DnsManager INFO : Updating s4.web.db.svc.eqiad.wmflabs.
This step is done by the WMCS team.
TODO: elaborate info on what is this doing
Step 9: all is done
All is done, wiki-replica contains a mirror of the production database. Finally WCMS users/tools/projects are able to query the database/tables.
This is usually done by using the sql wrapper script.
TODO: the benefits of using the sql script.