Portal:Data Services/Admin/Wiki Replicas

This page holds most of the knowledge we have regarding Wiki Replicas for admins.

Service architecture by layers

From production to Wiki Replicas

Data from production databases is replicated to Wiki Replicas through intermediate "Sanitarium" hosts that sanitize the data removing sensitive information. The data flow is described in detail in the page From production to Wiki Replicas.

Physical database server layer

There are currently 8 servers in this cluster. Each of the nodes hosts two database sections. Since there are 8 sections in the scope of the Wikireplicas service, that means that they are able to function as sets of four that can act as redundant standby for one another. These are all Dell PowerEdge R440s, built to Data Persistence specs.

Web (realtime workloads)

Analytics (long-processing workloads)

Despite using the same clouddb* prefix, clouddb2002-dev.codfw.wmnet is not a Wiki Replica host. It is used by labtestwikitech and it should be eventually decommissioned, tracked in phab:T369308

DB layer

All database servers run MariaDB.

The multi-instance servers have a somewhat complex layout that is documented in each database server's host hiera file (eg. clouddb1013.yaml). They are multi-instance in the sense that each server has two mariadb database instances running at the same time on nonstandard ports (see: profile::mariadb::section_ports). Each instance is named for and hosts one database section.

Proxy layer

Access to the wiki replicas happens via the cloudlb load balancer. Each section/type pair gets assigned its own service VIP in the cloud-private service VIPs range.

Pooling status of the backend database servers is managed via conftool and has a dedicated runbook available.

DNS

DNS is an important component of helping our users find their server in multi-instance. Each wiki database has a CNAME record in the form of:

${WIKI-DATABASE}.{analytics,web}.db.svc.wikimedia.cloud.

The CNAME points to the appropriate section's DNS, in the form of:

s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud.

The section's DNS is an A record which points to a distinct IP address on the proxy layer (cloudlbXXXX).

Example:

$ dig +short eswiki.analytics.db.svc.wikimedia.cloud
s7.analytics.db.svc.wikimedia.cloud.
172.20.255.8

Updating the DNS configuration

If you need to update the DNS configuration for Wiki Replicas, e.g. to add a new wikidb alias, refer to the page Wiki Replica DNS.

Account management (maintain-dbusers)

Debugging info at: Portal:Toolforge/Admin#Regenerate_replica.my.cnf

We maintain the list of accounts to access on the cloudcontrols (only one acting as primary at a time). The script writes out a $HOME/replica.my.cnf file to each Toolforge and PAWS user and Toolforge tool home containing MariaDB connection credentials by using an API that each project NFS server is running. This uses LDAP to get a list of accounts to create.

The credential files are created with the immutable bit set with chattr to prevent deletion by the Tool account.

The code pattern here is that you have a central data store (the db), that is then read/written to by various independent functions. These functions are not 'pure' - they could even be separate scripts. They mutate the DB in some way. They are also supposed to be idempotent - if they have nothing to do, they should not do anything.

Most of these functions should be run in a continuous loop, maintaining mysql accounts for new tool/user accounts as they appear.

Some stats can be found in grafana.

populate_new_accounts

Find list of tools/users (From LDAP) that aren't in the `accounts` table
Create a replica.my.cnf for each of these tools/users
Make an entry in the `accounts` table for each of these tools/users
Make entries in `account_host` for each of these tools/users, marking them as absent

create_accounts

Look through `account_host` table for accounts that are marked as absent
Create those accounts, and mark them as present.

If we need to add a new labsdb, we can do so the following way:

Add it to the config file
Insert entries into `account_host` for each tool/user with the new host.
Run `create_accounts`

In normal usage, just a continuous process running `populate_new_accounts` and `create_accounts` in a loop will suffice.

TODO:

 - Support for maintaining per-tool restrictions (number of connections + time)

Who admins what

Wiki Replicas are jointly managed by the following teams: WMCS, Data persistence and Data Engineering.

The following table describes the current split of responsibilities.


Area	Ownership	Notes
Communication with customers	WMCS
Hardware spec definition / review	Data Persistence
Hardware provisioning	WMCS	Includes DC Ops interactions
Hardware maintenance	WMCS	Includes DC Ops interactions
Hardware decommissioning	WMCS	Includes DC Ops interactions
Monitoring - new hosts	Data Persistence	Zarcillo, Tendril. When Orchestrator is in place this will be automated.
Monitoring - DB generic	Data Persistence
Monitoring - CloudDB-specific	WMCS
OS install verification against spec	WMCS
Puppet updates - DB generic	Data Persistence
Puppet updates - CloudDB-specific	WMCS
Data migration	Data Persistence	Populating new hosts with data
Replication set up	Data Persistence
Alarm triage	WMCS
Alarm adjustments	WMCS
OS-level troubleshooting	WMCS
DB-level troubleshooting	WMCS, escalate to Data Persistence if needed
Schema updates	Data Persistence	Through replication
View creation / updates	Data Engineering	Data Engineering assisting with defining views and what data should be redacted
Applying view changes	Data Platforms SRE	Apply changes to views to the various wiki replicas
Performance tuning	WMCS, escalate to Data Persistence if needed
Backup / Restore	Data Persistence	Restore from prod via Sanitarium
Removing PII data	Data Persistence	WMCS cc'd on the alert
OS updates (to level tested in production)	WMCS, coordinated with Data Persistence	Coordination to ensure we're running compatible versions
DB version updates (to level tested in production)	Data Persistence, coordinated with WMCS	Coordination to ensure we're running compatible versions

Admin guide

Docs to perform common tasks related to this service. As detailed as possible.

Adding a wiki

Documented at Add_a_wiki#Cloud_Services

User connection limits

There are two important aspects to this.

The review process, which is that requests for expanding the number of connections per tool be discussed at the weekly WMCS planning meeting for approval with a requirement of a +1 from a member of the DBA team after WMCS approval. This should be a restrictive process that errs on the side of denial. Only tools with a pretty good reason and significant user base should use additional connections beyond the default limit of 10.
Keeping a record of the change in modules/role/templates/mariadb/grants/wiki-replicas.sql. In the file as of this writing, line 38 provides a model for a comment (including the Phabricator task number and user) as well as the exact command that is run on the replicas. For consistency, the command should be run on all four replica hosts.

The process after approval is:

Determine the mysql username of the tool, which should be in any approved request. This will be something similar to s52788 and not something like tools.flubber. It can be found in the tool's $HOME/replica.my.cnf.
The clinic-duty person logs into the root mysql shell for each replica and runs GRANT USAGE ON *.* TO '$mysql_username'@'%' WITH MAX_USER_CONNECTIONS $number; Please note that you can always view grants on the server you are on with SHOW GRANTS FOR '$mysql_username';
Record the change in modules/role/templates/mariadb/grants/wiki-replicas.sql
To persist the change on rebuilds, add the new value to hieradata/role/common/wmcs/nfs/primary.yaml
Communicate with the user that they should be all set!

De-pooling Replica Servers

See Portal:Data Services/Admin/Runbooks/Depool wikireplicas

Updating views

See Portal:Data Services/Admin/Runbooks/Deploy wiki replicas view change

Update a wiki's section

Although it is a rare operation, sometimes production wikis get moved from one section to a different one. For example, if two wikis get moved from s3 to s5, the following steps need to taken on the wikireplicas:

Update modules/profile/files/openstack/base/pdns/recursor/labsdb.zone, follow this example: https://gerrit.wikimedia.org/r/619627
Run /usr/local/sbin/maintain-meta_p on each wikireplica host
- Double check that the slices look good and the new wikis are on the correct slice, in this case, s5:

sql meta_p
select dbname, slice from wiki where dbname in ('muswiki', 'mhwiktionary');

From a cloudcontrol node (like cloudcontrol1004.wikimedia.org) run:

wmcs-wikireplica-dns --aliases --shard s3
wmcs-wikireplica-dns --aliases --shard s5

Double check that the wikis appear on their new section using https://replag.toolforge.org/

History

Before September 2017 we had a cluster composed of three physical servers: labsdb1001, labsdb1002 and labsdb1003. Each host contained a copy of all database sections combined into a single database service which also allowed users to add their own custom tables to each wiki's namespace. DNS names for connecting to these servers used a ${wikidb}.labsdb naming convention. These legacy service names survive as CNAME pointers to the appropriate current (2020) s${SECTION_NUMBER}.analytics.db.svc.wikimedia.cloud name. Records are only maintained for wikis which existed at the time of the migration to the 2017 system.

In 2017 we deployed a new cluster of three physical servers with faster hardware, improved replication, and new service names. Each host still contained a merged view of all database sections under a single MariaDB process. User created tables were not allowed in this redesign to improve the ability to shift request traffic from one instance to another. This redesign also introduced the "web" vs "analytics" split for performance tuning. Read more details in the New Wiki Replica servers ready for use blog post from the launch announcement.

In 2021 we deployed the "2020" cluster of 8 physical servers with a new "multiinstance" design. Multiinstance separates database sections into distinct MariaDB instances with two instances hosted on each physical server. See News/Wiki Replicas 2020 Redesign for more information about the user facing changes this required.

In 2023-2024 we updated the load balancing layer to the cloudlb setup, previously there were lots of more layers and manual steps involved.