Portal:Data Services/Admin/Wiki Replicas

This page holds most of the knowledge we have regarding Wiki Replicas for admins.

Service architecture by layers

From production to Wiki Replicas

Data from production databases is replicated to Wiki Replicas through intermediate "Sanitarium" hosts that sanitize the data removing sensitive information. The data flow is described in detail in the page From production to Wiki Replicas.

Physical database server layer

There are currently 8 servers in this cluster, each hosting two database sections. Since there are 8 sections in the scope of the Wiki Replicas service, the servers can function in sets of four that act as redundant standbys for one another. These are all Dell PowerEdge R440s, built to Data Persistence specs.

Web (realtime workloads)

Analytics (long-processing workloads)

DB layer

All database servers run MariaDB.

The multi-instance servers have a somewhat complex layout that is documented in each database server's host hiera file (e.g. clouddb1013.yaml). They are multi-instance in the sense that each server runs two MariaDB instances at the same time on non-standard ports (see: profile::mariadb::section_ports). Each instance is named for and hosts one database section.
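
To reach a specific section's instance you connect to its non-standard TCP port. A hedged example, assuming the s7 instance listens on port 3317 as mapped in profile::mariadb::section_ports, and using clouddbXXXX as a placeholder for a host carrying s7:

$ mysql -h clouddbXXXX.eqiad.wmnet -P 3317 -u $USER -p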

Proxy layer

Access to the wiki replicas happens via the cloudlb load balancer. Each section/type pair gets assigned its own service VIP in the cloud-private service VIPs range.

Pooling status of the backend database servers is managed via conftool and has a dedicated runbook available.
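
For illustration, from one of the cluster management hosts you can inspect a backend's pooling state with confctl. This is only a sketch; the exact conftool object attributes and selectors for the Wiki Replicas backends are documented in the runbook:

$ sudo confctl select 'name=clouddb1013.eqiad.wmnet' get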

DNS

DNS is an important component in helping users find the right server in the multi-instance setup. Each wiki database has a CNAME record in the form of:

${WIKI-DATABASE}.{analytics,web}.db.svc.wikimedia.cloud.

The CNAME points to the appropriate section's DNS, in the form of:

s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud.

The section's DNS is an A record which points to a distinct IP address on the proxy layer (cloudlbXXXX).

Example:

$ dig +short eswiki.analytics.db.svc.wikimedia.cloud
s7.analytics.db.svc.wikimedia.cloud.
172.20.255.8

Updating the DNS configuration

If you need to update the DNS configuration for Wiki Replicas, e.g. to add a new wikidb alias, refer to the page Wiki Replica DNS.

Account management (maintain-dbusers)

The maintain-dbusers script runs on the cloudcontrols (only one acting as primary at a time) and maintains the list of database accounts. It uses LDAP to get the list of accounts to create, and writes out a $HOME/replica.my.cnf file containing MariaDB connection credentials to each Toolforge and PAWS user home and each Toolforge tool home, using an API that each project's NFS server runs.

The credential files are created with the immutable bit set with chattr to prevent deletion by the Tool account.
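
If a credential file ever needs to be fixed or regenerated by an admin, the immutable flag has to be cleared first and re-set afterwards. A minimal sketch, with $TOOL_HOME standing in for the tool's home directory on the NFS server:

$ sudo lsattr $TOOL_HOME/replica.my.cnf
$ sudo chattr -i $TOOL_HOME/replica.my.cnf
$ # ...edit or replace the file...
$ sudo chattr +i $TOOL_HOME/replica.my.cnf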

The code pattern here is that you have a central data store (the db), which is then read and written by various independent functions. These functions are not 'pure' - they could even be separate scripts. They mutate the DB in some way. They are also supposed to be idempotent - if they have nothing to do, they should not do anything.

Most of these functions should be run in a continuous loop, maintaining mysql accounts for new tool/user accounts as they appear.

Some stats can be found in grafana.
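
A hedged way to check that the loop is running on the active cloudcontrol, assuming the script is deployed as a systemd unit named maintain-dbusers (verify the actual unit name on the host):

$ sudo systemctl status maintain-dbusers
$ sudo journalctl -u maintain-dbusers --since today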

populate_new_accounts

  • Find list of tools/users (From LDAP) that aren't in the `accounts` table
  • Create a replica.my.cnf for each of these tools/users
  • Make an entry in the `accounts` table for each of these tools/users
  • Make entries in `account_host` for each of these tools/users, marking them as absent

create_accounts

  • Look through `account_host` table for accounts that are marked as absent
  • Create those accounts, and mark them as present.

If we need to add a new labsdb, we can do so the following way:

  • Add it to the config file
  • Insert entries into `account_host` for each tool/user with the new host.
  • Run `create_accounts`

In normal usage, just a continuous process running `populate_new_accounts` and `create_accounts` in a loop will suffice.

TODO:

 - Support for maintaining per-tool restrictions (number of connections + time)

Who admins what

Wiki Replicas are jointly managed by the following teams: WMCS, Data Persistence, Data Engineering, and Data Platform SRE.

The following table describes the current split of responsibilities.

Area | Ownership | Notes
Communication with customers | WMCS |
Hardware spec definition / review | Data Persistence |
Hardware provisioning | WMCS | Includes DC Ops interactions
Hardware maintenance | WMCS | Includes DC Ops interactions
Hardware decommissioning | WMCS | Includes DC Ops interactions
Monitoring - new hosts | Data Persistence | Zarcillo, Tendril. When Orchestrator is in place this will be automated.
Monitoring - DB generic | Data Persistence |
Monitoring - CloudDB-specific | WMCS |
OS install verification against spec | WMCS |
Puppet updates - DB generic | Data Persistence |
Puppet updates - CloudDB-specific | WMCS |
Data migration | Data Persistence | Populating new hosts with data
Replication set up | Data Persistence |
Alarm triage | WMCS |
Alarm adjustments | WMCS |
OS-level troubleshooting | WMCS |
DB-level troubleshooting | WMCS, escalate to Data Persistence if needed |
Schema updates | Data Persistence | Through replication
View creation / updates | Data Engineering | Data Engineering assisting with defining views and what data should be redacted
Applying view changes | Data Platform SRE | Apply changes to views to the various wiki replicas
Performance tuning | WMCS, escalate to Data Persistence if needed |
Backup / Restore | Data Persistence | Restore from prod via Sanitarium
Removing PII data | Data Persistence | WMCS cc'd on the alert
OS updates (to level tested in production) | WMCS, coordinated with Data Persistence | Coordination to ensure we're running compatible versions
DB version updates (to level tested in production) | Data Persistence, coordinated with WMCS | Coordination to ensure we're running compatible versions


Admin guide

Documentation for performing common tasks related to this service, in as much detail as possible.

Adding a wiki

When a new wiki is added to the production database (see Add a wiki), there are some additional steps to make it available in Wiki Replicas too.

Run the cookbook sre.wikireplicas.add-wiki on any of the cluster management hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet). The following example is for smnwiki and its task. Replace those values with the wiki database name you are adding and the correct task ID:

$ sudo secure-cookbook sre.wikireplicas.add-wiki --task-id T264900 smnwiki
START - Cookbook sre.wikireplicas.add-wiki
Generating views...
Adding DNS
Finalizing meta_p
Added views for new wiki: smnwiki T264900
END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)
The "Adding DNS" step takes quite a while to complete. Do not panic! If you are adding more than one wiki, you only need to run the DNS step once. After you've added the first wiki, you can save yourself lots of time by adding the --skip-dns option to the cookbook:
$ sudo secure-cookbook sre.wikireplicas.add-wiki --task-id Txxxxxx wiki1
$ sudo secure-cookbook sre.wikireplicas.add-wiki --task-id Tyyyyyy --skip-dns wiki2
$ sudo secure-cookbook sre.wikireplicas.add-wiki --task-id Tzzzzzz --skip-dns wiki3

Note: If in doubt, you can always use the --dry-run option on the secure-cookbook command so that you just see what it would do. For example:

$ sudo secure-cookbook --dry-run sre.wikireplicas.add-wiki --task-id T260551 thankyouwiki
DRY-RUN: Executing cookbook sre.wikireplicas.add-wiki with args: ['--task-id', 'T260551', 'thankyouwiki']
DRY-RUN: START - Cookbook sre.wikireplicas.add-wiki
DRY-RUN: Generating views...
DRY-RUN: Executing commands ['/usr/local/sbin/maintain-replica-indexes --database thankyouwiki', '/usr/local/sbin/maintain-views --databases thankyouwiki'] on 4 hosts: labsdb[1009-1012].eqiad.wmnet
DRY-RUN: Adding DNS
DRY-RUN: Executing commands ['source /root/novaenv.sh; wmcs-wikireplica-dns --aliases'] on 1 hosts: cloudcontrol1003.wikimedia.org
DRY-RUN: Finalizing meta_p
DRY-RUN: Executing commands ['/usr/local/sbin/maintain-meta_p --databases thankyouwiki'] on 4 hosts: labsdb[1009-1012].eqiad.wmnet
DRY-RUN: Added views for new wiki: thankyouwiki T260551
DRY-RUN: END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)

Manual method in case the cookbook fails

Finish up for either method

Before resolving the ticket, log into a Toolforge bastion as yourself and run:

localhost:~$ ssh login.toolforge.org
tools-sgebastion-07:~$ sql $wiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 470351380
Server version: 10.1.43-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [$wiki_p]> select * from page limit 2;

If you get an error from the select * from page limit 2; statement, you may have missed a step or need to do some troubleshooting. Otherwise you should simply get a couple of records back.

User connection limits

There are two important aspects to this.

  1. The review process: requests for expanding the number of connections per tool are discussed at the weekly WMCS planning meeting for approval, and require a +1 from a member of the DBA team after WMCS approval. This should be a restrictive process that errs on the side of denial. Only tools with a pretty good reason and a significant user base should use additional connections beyond the default limit of 10.
  2. Keeping a record of the change in modules/role/templates/mariadb/grants/wiki-replicas.sql. In the file as of this writing, line 38 provides a model for a comment (including the Phabricator task number and user) as well as the exact command that is run on the replicas. For consistency, the command should be run on all four replica hosts.

The process after approval is:

  • Determine the mysql username of the tool, which should be in any approved request. This will be something similar to s52788 and not something like tools.flubber. It can be found in the tool's $HOME/replica.my.cnf.
  • The clinic-duty person logs into the root mysql shell on each replica and runs GRANT USAGE ON *.* TO '$mysql_username'@'%' WITH MAX_USER_CONNECTIONS $number; (see the example after this list). Note that you can always view grants on the server you are on with SHOW GRANTS FOR '$mysql_username';
  • Record the change in modules/role/templates/mariadb/grants/wiki-replicas.sql
  • To persist the change on rebuilds, add the new value to hieradata/role/common/wmcs/nfs/primary.yaml
  • Communicate with the user that they should be all set!
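
For illustration, assuming the tool's MariaDB user is s52788 and the approved limit is 20 connections (both values are hypothetical here), the session on each replica would look roughly like this:

MariaDB [(none)]> GRANT USAGE ON *.* TO 's52788'@'%' WITH MAX_USER_CONNECTIONS 20;
MariaDB [(none)]> SHOW GRANTS FOR 's52788'@'%';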

De-pooling Replica Servers

See Portal:Data Services/Admin/Runbooks/Depool wikireplicas

Updating views

See Portal:Data Services/Admin/Runbooks/Deploy wiki replicas view change

Update a wiki's section

Although it is a rare operation, sometimes production wikis get moved from one section to a different one. For example, if two wikis get moved from s3 to s5, the following steps need to be taken on the wikireplicas:

  • Update modules/profile/files/openstack/base/pdns/recursor/labsdb.zone, follow this example: https://gerrit.wikimedia.org/r/619627
  • Run /usr/local/sbin/maintain-meta_p on each wikireplica host
    • Double check that the slices look good and the new wikis are on the correct slice, in this case, s5:
sql meta_p
select dbname, slice from wiki where dbname in ('muswiki', 'mhwiktionary');
  • From a cloudcontrol node (like cloudcontrol1004.wikimedia.org) run:
wmcs-wikireplica-dns --aliases --shard s3
wmcs-wikireplica-dns --aliases --shard s5
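
Once the aliases have been regenerated, the move can be verified from any cloud VM with dig. For example, assuming muswiki is one of the wikis moved to s5, its alias should now resolve through the s5 name:

$ dig +short muswiki.analytics.db.svc.wikimedia.cloud
s5.analytics.db.svc.wikimedia.cloud.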

History

Before September 2017 we had a cluster composed of three physical servers: labsdb1001, labsdb1002 and labsdb1003. Each host contained a copy of all database sections combined into a single database service which also allowed users to add their own custom tables to each wiki's namespace. DNS names for connecting to these servers used a ${wikidb}.labsdb naming convention. These legacy service names survive as CNAME pointers to the appropriate current (2020) s${SECTION_NUMBER}.analytics.db.svc.wikimedia.cloud name. Records are only maintained for wikis which existed at the time of the migration to the 2017 system.

In 2017 we deployed a new cluster of three physical servers with faster hardware, improved replication, and new service names. Each host still contained a merged view of all database sections under a single MariaDB process. User created tables were not allowed in this redesign to improve the ability to shift request traffic from one instance to another. This redesign also introduced the "web" vs "analytics" split for performance tuning. Read more details in the New Wiki Replica servers ready for use blog post from the launch announcement.

In 2021 we deployed the "2020" cluster of 8 physical servers with a new "multiinstance" design. Multiinstance separates database sections into distinct MariaDB instances with two instances hosted on each physical server. See News/2020 Wiki Replicas Redesign for more information about the user facing changes this required.

In 2023-2024 we updated the load balancing layer to the cloudlb setup; previously there were many more layers and manual steps involved.

See also