MariaDB/Sanitarium and clouddb instances

This page contains historical information and needs to be updated.

Sanitarium is the name used for the MySQL/MariaDB instances and configuration that redact private information from the production wiki databases before it is replicated to the Wiki Replicas instances, which are accessible from the Cloud Services network. The main article is Portal:Data Services/Admin/Wiki Replicas.

Sanitarium

role::mariadb::sanitarium_multiinstance

As of Dec 2023 we have 2 sanitarium hosts per DC (db1154 and db1155 in eqiad, db2186 and db2187 in codfw), each running 4 MariaDB instances, each on its own port, so the sections are split across both hosts.

The running sections are listed in the MOTD of each host.

As of 2022-12-14:

DB section s1 (alias: mysql.s1)
DB section s3 (alias: mysql.s3)
DB section s5 (alias: mysql.s5)
DB section s8 (alias: mysql.s8)

The datadir of each instance is /srv/sqldata.s[1358] and its tmpdir is /srv/tmp.s[1358].
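
To confirm which section instance a client session is connected to, the port and directories can be checked from SQL (a minimal sketch):

-- Confirm which section instance this session belongs to:
-- the port, datadir and tmpdir are set per instance.
SELECT @@port, @@datadir, @@tmpdir;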

All but certain filtered tables are replicated.
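
The exact set of filtered tables is defined in the replication configuration of each instance; assuming the standard replicate_* filter options are used, the active filters can be inspected from SQL (a sketch):

-- Show the replication filter settings of this instance,
-- e.g. replicate_wild_ignore_table entries for the fully private tables.
SHOW GLOBAL VARIABLES LIKE 'replicate%';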

Why separate instances

Before having multi-instance sanitarium hosts, we had multi-source hosts (one mariadb process with multiple replication threads in it), but we migrated to multi-instance for the following reasons:

  • Easier to handle per-section maintenance
  • Corruption on one instance doesn't mean corruption on all of them
  • Cloning a section doesn't mean bringing up all the sections
  • Possibility to enable GTID (GTID didn't work with multi-source replication, see https://jira.mariadb.org/browse/MDEV-12012); a sketch of enabling it per instance follows this list
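
With one replication stream per instance, GTID can be enabled in the normal single-source way; a minimal sketch (assumes replication on that instance is stopped first):

-- On a single sanitarium instance: switch its replication stream to GTID.
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;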

Triggers

The filters (triggers) used to redact the data are created by Redactatron (old repo, unmaintained), specifically redact_standard_output.sh. The following triggers are defined for each wiki's schema:

root@db1154.eqiad.wmnet[enwiki]> select TRIGGER_NAME,EVENT_OBJECT_TABLE,ACTION_TIMING from information_schema.triggers;
+----------------------------+--------------------+---------------+
| TRIGGER_NAME               | EVENT_OBJECT_TABLE | ACTION_TIMING |
+----------------------------+--------------------+---------------+
| sys_config_insert_set_user | sys_config         | BEFORE        |
| sys_config_update_set_user | sys_config         | BEFORE        |
| abuse_filter_log_insert    | abuse_filter_log   | BEFORE        |
| abuse_filter_log_update    | abuse_filter_log   | BEFORE        |
| recentchanges_insert       | recentchanges      | BEFORE        |
| recentchanges_update       | recentchanges      | BEFORE        |
| user_insert                | user               | BEFORE        |
| user_update                | user               | BEFORE        |
| archive_insert             | archive            | BEFORE        |
| archive_update             | archive            | BEFORE        |
+----------------------------+--------------------+---------------+
10 rows in set (0.005 sec)

Each trigger performs several assignments, e.g. SET NEW.user_password = ''.
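
As an illustration, a redaction trigger could look roughly like the hypothetical sketch below; the real triggers are generated by redact_standard_output.sh and blank more columns than shown here.

-- Hypothetical sketch of a BEFORE INSERT redaction trigger on the user table.
CREATE TRIGGER user_insert BEFORE INSERT ON user
FOR EACH ROW
SET NEW.user_password = '',
    NEW.user_email = '',
    NEW.user_token = '';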

  • Monitor that all schemas have their filters in place and raise an Icinga error otherwise (some wikis were found without the filter). A query to find schemas with no triggers:
 SELECT SCHEMA_NAME FROM information_schema.SCHEMATA WHERE SCHEMA_NAME NOT IN (SELECT EVENT_OBJECT_SCHEMA FROM information_schema.TRIGGERS);
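
Note that system schemas without triggers would also be returned; a variant that skips them (a sketch, the exact exclusion list used by the production check may differ):

 SELECT SCHEMA_NAME FROM information_schema.SCHEMATA
 WHERE SCHEMA_NAME NOT IN (SELECT EVENT_OBJECT_SCHEMA FROM information_schema.TRIGGERS)
   AND SCHEMA_NAME NOT IN ('information_schema', 'performance_schema', 'mysql', 'sys');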

Private data checks

They are run daily by check_private_data.py. See check_private_data.pp for the deployment details.
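
Conceptually, the checks verify that no private data is present on the sanitized instances, e.g. that columns blanked by the triggers are indeed empty. A minimal sketch of such a spot check (the real script covers many more tables and columns):

-- Hypothetical spot check: any non-empty user_password on a sanitarium
-- instance points to a missing or broken redaction trigger.
SELECT COUNT(*) FROM enwiki.user WHERE user_password <> '';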

Wiki Replica views

The Wiki Replica hosts filter again by using custom views, which set certain fields to NULL and prevent normal users from seeing the underlying tables directly. For example:

customviews:
  abuse_filter:
    source: abuse_filter
    view: >
      select af_id, if(af_hidden,null,af_pattern) as af_pattern,
      af_user, af_user_text, af_timestamp, af_enabled,
      if(af_hidden,null,af_comments) as af_comments, af_public_comments,
      af_hidden, af_hit_count, af_throttled, af_deleted, af_actions, af_global, af_group
The resulting view definition on the replica is:

CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `abuse_filter` AS
select `enwiki`.`abuse_filter`.`af_id` AS `af_id`,
       if(`enwiki`.`abuse_filter`.`af_hidden`,NULL,`enwiki`.`abuse_filter`.`af_pattern`) AS `af_pattern`,
       `enwiki`.`abuse_filter`.`af_user` AS `af_user`,
       `enwiki`.`abuse_filter`.`af_user_text` AS `af_user_text`,
       `enwiki`.`abuse_filter`.`af_timestamp` AS `af_timestamp`,
       `enwiki`.`abuse_filter`.`af_enabled` AS `af_enabled`,
       if(`enwiki`.`abuse_filter`.`af_hidden`,NULL,`enwiki`.`abuse_filter`.`af_comments`) AS `af_comments`,
       `enwiki`.`abuse_filter`.`af_public_comments` AS `af_public_comments`,
       `enwiki`.`abuse_filter`.`af_hidden` AS `af_hidden`,
       `enwiki`.`abuse_filter`.`af_hit_count` AS `af_hit_count`,
       `enwiki`.`abuse_filter`.`af_throttled` AS `af_throttled`,
       `enwiki`.`abuse_filter`.`af_deleted` AS `af_deleted`,
       `enwiki`.`abuse_filter`.`af_actions` AS `af_actions`,
       `enwiki`.`abuse_filter`.`af_global` AS `af_global`,
       `enwiki`.`abuse_filter`.`af_group` AS `af_group`
from `enwiki`.`abuse_filter`
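
The practical effect for replica users is that rows of hidden filters come back with the redacted columns as NULL; for example (a sketch, assuming the views live in the usual <wiki>_p database such as enwiki_p):

-- Hidden filters return NULL for af_pattern and af_comments through the view.
SELECT af_id, af_hidden, af_pattern, af_comments
FROM abuse_filter
WHERE af_hidden <> 0;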

When a clouddb host has to be rebuilt from scratch, or when a new wiki is created, maintain-views is used to create or update the views.
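
After maintain-views has run, the presence of the views can be verified from information_schema (a sketch, again assuming the <wiki>_p naming, e.g. enwiki_p):

-- List the views defined for one wiki on a clouddb host.
SELECT TABLE_NAME FROM information_schema.VIEWS WHERE TABLE_SCHEMA = 'enwiki_p';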

Sanitarium's master failover

This is an example of failing over a server that acts as a Sanitarium master for s1.

Current master: db1065

Future master: db1106

Sanitarium server: db1155 (s1 uses port 3311)

  • Downtime db1065 and db1106
  • Depool db1065 and db1106 via dbctl
  • Log the maintenance on #wikimedia-operations
    !log Stop db1065 and db1106 in sync - this will generate lag on labs
    
  • Stop db1065 and db1106 in sync:
    ./repl.pl --stop-siblings-in-sync --host1=db1065.eqiad.wmnet:3306 --host2=db1106.eqiad.wmnet:3306
    
  • Verify they stopped at the same position:
    mysql --skip-ssl -hdb1106 -e "show slave status\G" | grep Exec_Master ; mysql --skip-ssl -hdb1065 -e "show slave status\G" | grep Exec_Master
    
  • Check db1106 master log: 
    mysql --skip-ssl -hdb1106 -e "show master status\G"
    
  • Connect to db1155 on the needed port (in this example 3311) and point it at the new master:
    stop slave; reset slave all;
    change master to master_host='db1106.eqiad.wmnet', master_user='repl', master_password='xx', master_port=3306, master_log_file='db1106-bin.000009', master_log_pos=474802220, master_ssl=1;
  • Check the filters on the new s1 thread (a sketch for this check follows this list)
  • Start slave s1 on db1155
  • Start slave on db1106
  • Start slave on db1065
  • Slowly repool db1106 via dbctl
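
For the "check filters" step above, a quick way to confirm that the redaction triggers are still in place on the reconnected s1 instance is to count them per schema (a sketch; the expected counts depend on the wiki):

-- On db1155:3311 after the CHANGE MASTER: make sure no wiki schema lost its triggers.
SELECT TRIGGER_SCHEMA, COUNT(*) AS triggers
FROM information_schema.TRIGGERS
GROUP BY TRIGGER_SCHEMA
ORDER BY triggers;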