News/Wiki Replicas 2020 Redesign

From Wikitech

In the interest of making and keeping Wiki Replicas a stable and performant service, a new backend architecture is needed. This change has some impact on features and usage patterns.

What is changing?

Unfortunately, we need to make this architectural change in order to continue to provide a stable and performant service. The main changes are that each database instance will now host only a single section, which means cross database JOINs will no longer work as written and you can only query a database you are explicitly connected to. See What are the primary changes with the new architecture? below for details.

Timeline

  • October 2020
    • Hardware purchased; prepping for installation
  • October 2020 - January 2021
    • New hardware is brought online, critical services updated
  • February 2021
    • Both clusters operational on February 1st
    • Testing with early adopters
    • Begin user migration
  • March 2021
    • Announce new cluster domains for all users - March 12
    • Migrate Quarry to use the new cluster - March 23
  • April 2021
    • PAWS migration - Estimated first week of April
    • Migrate the old cluster to utilize new replication hosts. Replication may stop. - April 15
    • Redirect old hostnames to the new cluster - April 28
    • Old cluster shutdown - Some time after the redirect


NOTE: While no specific date is yet given, the old cluster will be shut down as soon as possible after the new cluster is made the default. Even while the old cluster remains online, replication is not guaranteed, and it is possible the old cluster stops receiving updates. Users are encouraged to migrate to the new cluster, and to raise any issues found in doing so, as soon as possible.

What should I do?

Update existing tools to ensure queries are executed against the proper database connection

Because you can only query a database if you are explicitly connected to it, you should use database hostnames of the form <db_name>.<web or analytics>.db.svc.wikimedia.cloud.

Don't connect to enwiki and then run a query against frwiki, for example. Instead, connect to frwiki.web.db.svc.wikimedia.cloud and then execute the query.
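As a sketch of the per-wiki hostname pattern described above, a small helper can build the right host for each database (the `replica_host` helper name is ours, for illustration; the commented-out `pymysql` call assumes the usual `replica.my.cnf` credentials file):

```python
def replica_host(db_name: str, service: str = "web") -> str:
    """Build the Wiki Replicas hostname for a given wiki database.

    service is "web" for interactive traffic or "analytics" for
    longer-running queries.
    """
    return f"{db_name}.{service}.db.svc.wikimedia.cloud"

# Connect to each wiki through its own host, e.g. with pymysql:
#   conn = pymysql.connect(host=replica_host("frwiki"),
#                          read_default_file="replica.my.cnf",
#                          database="frwiki_p")

print(replica_host("frwiki"))  # frwiki.web.db.svc.wikimedia.cloud
print(replica_host("enwiki", "analytics"))
```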

Check your existing tools and services queries for cross database JOINs

Because cross database JOINs will no longer work as written, check your services and tools; if you find cross database JOINs, you will have to decide what to replace them with. The answer is usually very context-dependent, so if in doubt, ask in the chat or on the mailing list.

In general, the idea is to separate the queries and use one connection per database. Then, once you have the results, do the required data processing or filtering in code.

If the datasets and queries are so big that the previous approach doesn't work, you can paginate a query and run subqueries on the results. For example, query DB 1 and iterate over the results, paginating however you want. Then, for every N items from query 1, query DB 2 for those N items based on the join conditions (if you expect many results, paginate this subquery too as needed).
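The batching idea above can be sketched as follows. The helper names (`batched`, `in_clause`) are ours, and the `pymysql` calls are shown only as comments since they need replica credentials; the runnable part is the plain batching logic:

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence, size: int) -> Iterator[List]:
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def in_clause(n: int) -> str:
    """Build the %s placeholder list for a parameterized IN (...) query."""
    return ", ".join(["%s"] * n)


# Sketch of the cross-wiki lookup, one connection per database:
#   with conn_frwiki.cursor() as cur:
#       cur.execute("SELECT page_title FROM page WHERE ...")
#       titles = [row[0] for row in cur.fetchall()]
#   for chunk in batched(titles, 1000):
#       with conn_enwiki.cursor() as cur:
#           cur.execute(
#               "SELECT page_id, page_title FROM page "
#               f"WHERE page_title IN ({in_clause(len(chunk))})",
#               chunk)
#           ...  # process this batch of matches in code

print(list(batched([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
print(in_clause(3))                        # %s, %s, %s
```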

There are other ways to tackle this; unfortunately, all of them are more complex than a straightforward SQL query. You can see examples in the Solutions to common problems section, and please reach out if you need help.

What are the primary changes with the new architecture?

Given the large change to the underlying architecture, existing tools and queries will be impacted. In particular, each instance will now contain only a subset of the databases (a single section). So:

Cross database JOINs will no longer work as written

Unfortunately, this means cross database JOINs from, for instance, enwiki to commonswiki or wikidatawiki are no longer possible.

There are different ways to accomplish the same results given the limitations, but they vary based on the query and where it was being run from. See Solutions to common problems for migration suggestions.

Can only query a database if you are connected to it explicitly

For example, connecting to enwiki to query against avwiki will no longer work.

You will need to make a specific connection to each database you want to query against. See Update existing tools to ensure queries are executed against the proper database connection and Solutions to common problems for examples.

New host names

For the new replicas, new hostnames have been added. They are region-independent and allow early testers to access the new replicas while the existing cluster is running.

Old → New:

  • ${PROJECT}.{analytics,web}.db.svc.eqiad.wmflabs → ${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud
    (for example, eswiki.web.db.svc.eqiad.wmflabs → eswiki.web.db.svc.wikimedia.cloud)
  • s${SECTION_NUMBER}.{analytics,web}.db.svc.eqiad.wmflabs → s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud
    (for example, s7.web.db.svc.eqiad.wmflabs → s7.web.db.svc.wikimedia.cloud)

The *.db.svc.eqiad.wmflabs names will continue working and will be redirected to the new ones at the time mentioned in the timeline.

Solutions to common problems

Help us by contributing code snippets or describing how you migrated your code

If you migrated your tools or code and had to make changes to adapt to the new Wiki Replicas architecture, specific examples are very helpful. Feel free to add them above this heading, or post them on the talk page or the mailing list.

How can I test the new replicas before the switchover?

Basically everything is the same, except for the host names. See #New host names.

Here is an example when logged in on Toolforge:

$ ssh login.toolforge.org

$ mysql --defaults-file=replica.my.cnf -h eswiki.web.db.svc.wikimedia.cloud enwiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
ERROR 1049 (42000): Unknown database 'enwiki_p'

The error above is expected: each host only serves the databases in its own section, so enwiki_p must be queried through its own hostname.

$ mysql --defaults-file=replica.my.cnf -h eswiki.web.db.svc.wikimedia.cloud eswiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
+----------+
| count(*) |
+----------+
|     1207 |
+----------+

$ mysql --defaults-file=replica.my.cnf -h eswiki.analytics.db.svc.wikimedia.cloud eswiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
+----------+
| count(*) |
+----------+
|     1207 |
+----------+

How do I cross reference data between wikis like I do with cross joins today?

All existing data can still be queried, but you will have to use specific connections to each database to do so.

The resulting datasets can then be processed in code as desired as part of a tool or service. The key will be to move the filtering from the SQL query into the codebase instead.
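As a minimal sketch of moving the join into code: once you have fetched the rows from each wiki over its own connection, combine them in Python. The `inner_join` helper below is illustrative, not part of any library, and the sample rows are made up:

```python
def inner_join(rows_a, rows_b, key):
    """Join two lists of dicts on a shared key, like a SQL INNER JOIN."""
    # Index the second result set by the join key for O(1) lookups
    index = {}
    for row in rows_b:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in rows_a:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# e.g. rows previously fetched from enwiki_p and commonswiki_p:
enwiki_rows = [{"page_title": "Alicante", "page_id": 1}]
commons_rows = [{"page_title": "Alicante", "img_count": 42},
                {"page_title": "Elche", "img_count": 7}]
print(inner_join(enwiki_rows, commons_rows, "page_title"))
```

For large result sets this can be combined with the batched-subquery approach described in Solutions to common problems, so that only matching rows are fetched at all.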

Unfortunately, the diversity of tools, services, and queries means there isn't a direct answer that can be provided. Browse around the Solutions to common problems to see if you can find an example that helps you, or otherwise reach out for support.

I have a tool that queries all of the wikis and I use only one connection

In production, creating new connections on the fly is the way to go. For local development, SSH tunnels are suggested. If you want to test locally against every DB, opening that many SSH tunnels may not be practical. Another option is to open one SSH tunnel per section, and look up the mappings at runtime to know which section host to use for a given DB. More details can be read in this email thread.

How will the wikis map to physical nodes/sections?

Because you connect directly to the database you are querying, this implementation detail shouldn't be something you need to worry about. However, the current mappings are listed in noc's db.php. These mappings are not expected to change radically as part of this process.
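If a tool does need the mapping at runtime, one way is to query the `meta_p.wiki` table on the replicas, assuming it still exposes a `slice` column with values like `s1.labsdb` (worth verifying against the current schema), and derive the section host from it. The `section_host` helper name is ours:

```python
def section_host(slice_value: str, service: str = "web") -> str:
    """Turn a meta_p.wiki slice value (e.g. "s1.labsdb") into the
    new-cluster section hostname."""
    section = slice_value.split(".")[0]  # "s1.labsdb" -> "s1"
    return f"{section}.{service}.db.svc.wikimedia.cloud"


# Sketch: fetch the dbname -> section host mapping once, then reuse
# it to pick the right connection per wiki (assumes the meta_p schema):
#   cur.execute("SELECT dbname, slice FROM meta_p.wiki")
#   sections = {dbname: section_host(s) for dbname, s in cur.fetchall()}

print(section_host("s7.labsdb"))  # s7.web.db.svc.wikimedia.cloud
```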

How should I connect to a database in Quarry?

Since databases will now live on different hosts, you will have to select the database you want to use before writing a query in Quarry, so that Quarry knows where to route your query.

There will be a new input field above the query text area where you enter the name of the DB you want to use (like eswiki_p), instead of writing use statements in the SQL (like use eswiki_p;).

I use Quarry to perform cross join queries, what do I do?

Since Quarry only offers SQL as its interface, the easiest migration path is to move your query to a PAWS notebook, which will involve some coding in Python. If you need help adapting your queries to a PAWS notebook, please reach out for support and share your queries and attempts in a notebook.

How should I connect to databases in PAWS?

Because the databases are now split across separate sections, it is no longer possible to use the proxy that PAWS exposed through the MYSQL_HOST environment variable. Instead, connect to the hostnames the same way as from Toolforge.

There is a credentials file in $HOME/.my.cnf that you can use when connecting, instead of environment variables.


Here is a short Python example with how to connect:

import os

import pymysql

conn = pymysql.connect(
    # Host names are like {wiki}.{analytics,web}.db.svc.wikimedia.cloud
    host="eswiki.analytics.db.svc.wikimedia.cloud",
    # Credentials file provided in the PAWS home directory
    read_default_file=os.path.expanduser("~/.my.cnf"),
    database="eswiki_p",
)

try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT page_title FROM page WHERE page_title LIKE %s LIMIT 5;",
            args=("%Alicante%",),
        )
        for row in cur.fetchall():
            # Results come back as bytes; decode to text
            print(str(row[0], encoding="utf-8"))
finally:
    conn.close()

We will be updating the documentation through April with the new method.

Every existing PAWS notebook has its output cached, so they are still viewable without having to change anything. When you want to re-run them, you will have to update the DB connection method as mentioned above.

Why are we doing this?

The existing Wiki Replicas setup relies on a single MariaDB instance for all Wikimedia data. As of October 2020, this includes over 900 wikis, comprising 6,000 tables and totalling over 7 TB after full compression.

This setup limits our ability to scale and maintain a performant and stable service. Based on community feedback and our own observations running the service:

  1. Given the large amount of data, the service struggles to keep nodes in sync, which results in poor performance
    • Recovering from data corruption is a concern, given the current performance
    • Even maintaining current capacity within the cluster has become challenging and more hardware is needed
  2. Usage and query complexity continue to increase, resulting in reduced performance
  3. Our current architectural design limits our ability to optimize for complex queries, such as those used to do analytics and OLAP

New architecture

Given the challenges outlined above, we plan to change the architectural model from "multi-source" (lots of sections smashed into one MariaDB instance) to "multi-instance" (multiple MariaDB instances on a single physical node, with each instance handling only a single section).

The new physical layout will be 8 sections * 2 types (web + analytics) * 2 instances each == 32 database instances. Deployed as multi-instance with 4 instances per physical node == 8 nodes.

Impact on other services

Services like Quarry and PAWS will need to be updated to understand where to route a query as there will now be multiple MariaDB instances.

How can I help?

Mitigation for cross database JOINs will require changes to existing tools and services. Providing help and guidance to developers and other community members is crucial to ensuring existing workflows can continue uninterrupted. Consider:

  • Sharing examples / descriptions of how a tool or service was updated
  • Writing a common solution or some example code others can utilize and reference
  • Helping others on IRC and the mailing lists
