Portal:Data Services/Admin/Wiki Replicas

This page holds all the knowledge we have regarding Wiki Replicas for admins.

Service architecture by layers

Description of how the service is currently deployed.

Here is a general diagram of how things are:


Physical layer

This service is build on top of several physical servers:

All servers are the same (except labsdb1012, which is dedicated to the Analytics team):

  • HP ProLiant DL380 Gen9
  • 16 CPU Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
  • 500GB RAM
  • 1 hard-disc in /dev/sda with 11.7 TiB (probaby RAID, using hpsa kernel driver)
  • 4 x 1GB ethernet interfaces (tg3)

All monitoring, including RAID status, is done by icinga.

labsdb1012 is a HP ProLiant DL380 Gen10 (2U) and is using Intel(R) Xeon(R) Silver 4112 CPUs instead.

Storage layer

The service has a concrete storage configuration.

As seen by fdisk:

As seen by the LVM stack:

As seen by df:

All 3 servers should have more or less the same storage configuration.

The definition for this storage layout is done at install time:

DB layer

The databases in this service have a concrete layout/configuration.

TODO: fill info

Higher layers

Such as applications running on top, proxies, caches, et al.


main article Help:Toolforge/Database#Naming conventions

The DNS names of the wiki-replicas databases as seen from the SQL client point of view are things like:

  • ${PROJECT}.{analytics,web}.db.svc.eqiad.wmflabs. Example: enwiki.web.db.svc.eqiad.wmflabs
  • s${SHARD_NUMBER}.{analytics,web}.db.svc.eqiad.wmflabs. Example: s1.analytics.db.svc.eqiad.wmflabs

Puppet deployment

The cluster is currently deployed using operations/puppet.git.

All the servers are given the labs::db::replica role, which currently includes:

  • standard configs
  • mariadb packages + configs
  • ferm firewall + rules to allow the cluster to inter-comm
  • deployment of scripts, like maintain-views and friends

From production to wiki-replicas

Wiki databases from production are copied and sanitized to serve wiki-replicas.

Step 0: databases in production

The starting situation is that there are databases in production for wiki projects (like wikipedia, wikidata, wikictionary, and friends). We would like to provide this same databases for WMCS users. Due to privacy reasons, some data needs to be redacted or deleted. That's why users can't directly access this database.

So, we choose what databases to copy to wiki-replicas, which are all of them.

Every time a new database is created in production (for example, a new language for a wiki) we are in this step 0.

The identification of new database candidates for migrating to wiki-replicas is done under request by someone. Right now there aren't any mechanisms to notify pending migrations or the like.

Step 1: sanitization

Main article: MariaDB/Sanitarium and Labsdbs

The production database is copied to sanitarium boxes by means of MariaDB replication mechanisms (TODO: is this true? give more info if possible).

Each sanitarium host has a MariaDB instance to replicate each db shard. The replication into the sanitarium host uses triggers and filters to remove sensitive columns, tables and databases in the simple case where there are no conditions (e.g. Ensures user_password does not go into Cloud Services).

Having this redaction done on a separate host outside of Cloud Services helps isolate the security of the data and ensure a privilege escalation via the Cloud Services access does not compromise the most sensitive data in the db.

Some triggers are added by means of:

This step is handled by main operations teams, DBAs.

Step 2: evaluation

Main article: Labsdb redaction

Once data is in sanitarium boxes, some cron jobs and manual scripts check whether data is actually redacted. For example, check that a given column is NULL.

Involved code:

This evaluation also happens in the wiki-replica servers, and alerts in case some private data is detected.

The main production team, DBAs, are in charge of this step.

TODO: where are the cron jobs?

Step 3: filling up wiki-replicas

Data is finally copied to wiki-replica servers. The DB servers are currently using row-based replication from sanitization boxes to wiki-replicas.

This is done in real time by the DB daemons, so there is always a data flow, which should be sanitized due to previous steps.

Step 4: setting up GRANTs

At this point, database GRANTS are created by main operations team DBAs.

This is done by hand, no automation using puppet, but there is a file where GRANTS are being tracked: modules/role/templates/mariadb/grants/wiki-replicas.sql.

The content of the file is something like the following:

Related tickets:

Step 5: setting indexes

Create indexes that only exist on the wiki replicas via the maintain-replica-indexes script. This script manages the indexes in an idempotent fashion (if you run it multiple times it only changes what doesn't match the configuration file here index-conf.yaml). This could use the ability to handle only a single database, but it skips tables it finds comply with the definitions in the config, so it is fairly quick to run, despite that it runs across the whole set.

On each wiki-replica server, run:

% sudo /usr/local/sbin/maintain-replica-indexes

Step 6: setting views

Create _p views, which are intermediate views which left behind private data.

This is done by means of the maintain-views.py script.

In each wiki-replica server, it's executed like this:

% sudo maintain-views --databases $wiki --debug

This step is handled by the WMCS team.

TODO: elaborate info on what is this doing

Step 7: setting up metadata

Insert a new row in meta_p.wiki for the new wiki by running the maintain-meta_p script.

The execution is like this:

% sudo /usr/local/sbin/maintain-meta_p --databases $wiki. 

This step is handled by the WMCS team.

TODO: elaborate info on what is this doing

Step 8: setting up DNS

From an openstack control server (for example cloudcontrol1004.wikimedia.org), run:

$ sudo /usr/local/sbin/wikireplica_dns --aliases --shard <sN>

To update the service IP used for the analytics or web servers:

  • Edit modules/openstack/files/util/wikireplica_dns.yaml to change the service IP
  • Run /usr/local/sbin/wikireplica_dns
root@labcontrol1001:~# /usr/local/sbin/wikireplica_dns -v --zone web.db.svc.eqiad.wmflabs.
2018-03-16T14:02:41Z mwopenstackclients.DnsManager INFO    : Updating s8.web.db.svc.eqiad.wmflabs.
2018-03-16T14:02:42Z mwopenstackclients.DnsManager INFO    : Updating s3.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:17Z mwopenstackclients.DnsManager INFO    : Updating s2.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:23Z mwopenstackclients.DnsManager INFO    : Updating s1.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:24Z mwopenstackclients.DnsManager INFO    : Updating s7.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:28Z mwopenstackclients.DnsManager INFO    : Updating s6.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:29Z mwopenstackclients.DnsManager INFO    : Updating s5.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:30Z mwopenstackclients.DnsManager INFO    : Updating s4.web.db.svc.eqiad.wmflabs.

This step is done by the WMCS team.

TODO: elaborate info on what is this doing

Step 9: all is done

All is done, wiki-replica contains a mirror of the production database. Finally WCMS users/tools/projects are able to query the database/tables.

This is usually done by using the sql wrapper script.

TODO: the benefits of using the sql script.

Admin guide

Docs to perform common tasks related to this service. As detailed as possible.

Who admins what

  • main productions cluster: main operations team, DBAs
  • sanitarium cluster: main operations team, DBAs
  • wiki-replicas cluster: WMCS, with some support from DBAs
  • wiki-replicas DNS: WMCS

Adding a wiki

User connection limits

There are two important aspects to this.

  1. The review process, which is that requests for expanding the number of connections per tool be discussed at the weekly WMCS planning meeting for approval with a requirement of a +1 from a member of the DBA team after WMCS approval. This should be a restrictive process that errs on the side of denial. Only tools with a pretty good reason and significant user base should use additional connections beyond the default limit of 10.
  2. Keeping a record of the change in modules/role/templates/mariadb/grants/wiki-replicas.sql. In the file as of this writing, line 38 provides a model for a comment (including the Phabricator task number and user) as well as the exact command that is run on the replicas. For consistency, the command should be run on all four replica hosts.

The process after approval is:

  • Determine the mysql username of the tool, which should be in any approved request. This will be something similar to s52788 and not something like tools.flubber. It can be found in the tool's $HOME/replica.my.cnf.
  • The clinic-duty person logs into the root mysql shell for each replica and runs GRANT USAGE ON *.* TO '$mysql_username'@'%' WITH MAX_USER_CONNECTIONS $number; Please note that you can always view grants on the server you are on with SHOW GRANTS FOR '$mysql_username';
  • Record the change in modules/role/templates/mariadb/grants/wiki-replicas.sql
  • Communicate with the user that they should be all set!

De-pooling Replica Servers

  • This is accomplished on the proxy level (currently dbproxy1018 for analytics and dbproxy1019 for web)
  • To actually depool each server:
  • The actual merge and reload of the proxy should be coordinated with the DBAs. They may have more information than we do about other maintenance that is already in progress which needs to be considered before taking new actions. It is usually enough to check with them via irc to see if there is any blocker to activating the change.
  • To repool, simply revert the patch, merge, and reload. Communication with the DBAs at this point is advised as well so that they can track the global state of things more easily.

Making significant changes to the views and the indexes

View and index scripts perform DDL commands on the databases, which can cause lock contention issues both for the scripts and the users of the replicas. Problems with the views can even cause data that the view system is supposed to keep out to be exposed. These are some guidelines to prevent problems.

  • De-pool a host before doing index changes and view changes that are not simply adding a new wiki, table or database. This reduces production impact as well as fallout from mistakes.
  • If there is a concern that the data may not be in a consistent or safely-scrubbed state, find a user who is trusted (as in NDA-signed) to take a look if you aren't able to come up with some good examples yourself to go after.

Updating views

  1. Confirm with DBAs that they don't have work in progress that might interfere
  2. Merge associated gerrit change, and force puppet runs on all affected replica servers
  3. On each replica server:
    1. If modifying an existing view, these steps on each replica server
      1. Depool
        1. Write, merge, apply hiera change
        2. systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
        3. wait for depool to take effect (check with echo "show stat" | socat /run/haproxy/haproxy.sock stdio)
      2. Kill running user processes on the labsbd* server you depooled if necessary. Processes holding locks on tables you are changing must finish or be killed to proceed or the maintain-views script will fail with a traceback about timing out waiting for a lock.
        1. Look at the processes running with a root mysql shell (sudo -i mysql --skip-ssl) by running show processlist;. If you need to be able to see all of them, and they won't show on the screen, you can use pager less to set the shell to show output in a pager.
        2. Only usernames like u##### (a user) and s##### (a tool) should be killed. Other usernames are admin or system processes. Please don't kill those. You probably only need to kill queries against tables you are changing things about, and often you can just give it an hour after depooling to let people's work finish. Sometimes you need to kill more because of subqueries or DB connection pools holding table locks with connections that don't even seem to be doing anything. Never feel bad about killing an old, inactive connection because we don't allow that anyway.
        3. If you have identified a process to kill, copy the process ID and run kill $pid;. It might not die right away, but typically, unless things are in bad shape on that replica, it will die without too much delay after closing up tables.
    2. If it is just one new table, update with: sudo maintain-views --table <affected table> --all-databases, otherwise, drop the --table argument to update all views. If this is for existing tables/views, you will want to use the --replace-all argument to avoid typing "Y" 800 times.
    3. Verify changes via the mysql commandline: sudo -i mysql --skip-ssl enwiki_p
      1. Repool
        1. Revert depooling hiera change
        2. systemctl reload haproxy
        3. Check that the original state is restored with echo "show stat" | socat /run/haproxy/haproxy.sock stdio

Update wikis location

Although it is a rare operation, sometimes production wikis get moved from one section to a different one. For example, if two wikis get moved from s3 to s5, the following steps need to taken on the wikireplicas:

  • Update modules/profile/files/openstack/base/pdns/recursor/labsdb.zone, follow this example: https://gerrit.wikimedia.org/r/619627
  • Run /usr/local/sbin/maintain-meta_p on each wikireplica host
    • Double check that the slices look good and the new wikis are on the correct slice, in this case, s5:
sql meta_p
select dbname, slice from wiki where dbname in ('muswiki', 'mhwiktionary');
  • From a cloudcontrol node (like cloudcontrol1004.wikimedia.org) run:
wmcs-wikireplica-dns --aliases --shard s3
wmcs-wikireplica-dns --aliases --shard s5


Before this new cluster, we had an old cluster composed of physical servers: labsdb1001, labsdb1002 and labsdb1003.

At the time of this writing, the last standing server is labsdb1003, which is going down in December 2017.

See also