News/Wiki Replicas 2020 Redesign

From Wikitech
Jump to navigation Jump to search

In the interest of making and keeping Wiki Replicas a stable and performant service, a new backend architecture is needed. There is some impact in the features and usage patterns.

What is changing?

Unfortunately, we need to make this architectural change in order to continue to provide a stable and performant service. The main changes are:


  • October 2020
    • Hardware purchased; prepping for installation
  • October 2020 - January 2021
    • New hardware is brought online, critical services updated
  • February 2021
    • Both clusters operational on February 1st
    • Testing with early adopters
    • Begin user migration
  • March 2021
    • Transition DNS to point to new cluster as default
    • Plan old cluster shutdown

NOTE: While no specific date is yet given, the old cluster will be shut down as soon as possible after the new cluster is made the default. Even while the old cluster remains online, replication is not guaranteed. It is possible the old cluster stops receiving updates. It is encouraged to migrate to the new cluster, and raise potential issues founds in doing so, as soon as possible.

What should I do?

Update existing tools to ensure queries are executed against the proper database connection

Because you Can only query a database if you are connected to it explicitly, you should utilize the database hostnames in the form of <db_name>.db.svc.eqiad.wmflabs.

Don't connect to enwiki and then query against frwiki for example. Instead connect to frwiki.db.svc.eqiad.wmflabs, then execute the query.

Check your existing tools and services queries for cross database JOINs

Because Cross database JOINs will no longer work as written, check your services/tools and if you find cross JOINs you will have to decide what to substitute them with. The answer is usually very context-dependent, so if you have doubts you can try asking in chat or mailing list.

In general, the idea is to separate the queries and use a connection for each database. Then once you have the results, do the required data processing or filtering with code.

If the datasets and queries are so big that the previous approach doesn't work, you can try paginating a query, and doing subqueries with the results. For example, query DB 1, iterate over the local data, paginating any way you want. Then, for every N items on query 1, query the DB 2 for those N items based on the join conditions (if you expect many results, paginate this subquery too as needed).

There are other options for how to tackle this, unfortunately all of them will be more complex than an straightforward SQL query. You can see examples in the Solutions to common problems section, and please reach out if you need help.

What are the primary changes with the new architecture?

Given the large change to the underlying architecture, our existing tools and queries will be impacted. In particular, each instance will now contain only a single database. So:

Cross database JOINs will no longer work as written

Unfortunately, this means cross joins from, for instance, enwiki to commonswiki or wikidatawiki are no longer possible.

There are different ways to accomplish the same results given the limitations, but they vary based on the query and where it was being run from. See Solutions to common problems for migration suggestions.

Can only query a database if you are connected to it explicitly

For example, connecting to enwiki to query against avwiki will no longer work.

You will need to make a specific connection to each database you want to query against. See Update existing tools to ensure queries are executed against the proper database connection and Solutions to common problems for examples.

Solutions to common problems

How do I cross reference data between wikis like I do with cross joins today?

All existing data can still be queried, but you will have to use specific connections to each database to do so.

The resulting datasets can then be processed in code as desired as part of a tool or service. The key will be to migrate the filtering in the SQL query to the codebase instead.

Unfortunately, the diversity of tools, services, and queries means there isn't a direct answer that can be provided. Browse around the Solutions to common problems to see if you can find an example that helps you, or otherwise reach out for support.

I use Quarry to perform cross join queries, what do I do?

Since Quarry only provides SQL as the interface, to migrate your code the easiest path is to move it to a PAWS notebook, which will involve some coding in Python. If you need help adapting your queries to a PAWS notebook, please reach out for support, share your queries and attempts on a notebook.

I have a tool that queries all of the wikis and I use only one connection

In production, creating new connections on-the-fly is the way to go. For local development, ssh tunnels are suggested. If you want to test locally against every DB, opening that many ssh tunnels may not be practical. Another option can be using a ssh tunnel per section, and looking at the mappings at runtime to know which section host connection to use to query a certain DB. More details can be read in this email thread.

How will the wikis map to physical nodes/sections?

By connecting directly to the database for a query, this implementation detail shouldn't be something to worry about. However, the current mappings are here. These mappings are not expected to radically change as part of this process.

Help us contributing code snippets or how you migrated your code

If you migrate your tools or code and you had to do some changes to adapt to the new Wiki Replicas architecture, specific examples are very helpful. Feel free to edit them in above this heading, or post them in the talk page or mailing list.

Why are we doing this?

Wiki Replicas existing setup relies on a single MariaDB instance for all Wikimedia data. As of October 2020, this includes over 900 wikis, comprised of 6,000 tables, and totalling over 7 TB’s after full compression.

This setup limits our ability to scale and maintain a performant and stable service. Based on community feedback and our own observations running the service:

  1. Given the large amount of data, the service struggles to keep nodes in sync. This results in poor performance
    • Recovering from data corruption is a concern, given the current performance
    • Even maintaining current capacity within the cluster has become challenging and more hardware is needed
  2. Usage and query complexity continue to increase, resulting in reduced performance
  3. Our current architectural design limits our ability to optimize for complex queries, such as those used to do analytics and OLAP

New architecture

Given the challenges outlined above, we plan to change the architectural model from "Multi-source" (lots of sections smashed into one MariaDB instance) to "Multi-instance" (multiple MariaDB instances on a single physical node with each instance only handling a single sections).

The new physical layout will be 8 sections * 2 types (web + analytics) * 2 instances each == 32 database instances. Deployed as multi-instance with 4 instances per physical node == 8 nodes.

Impact on other services

Services like Quarry and PAWS will need to be updated to understand where to route a query as there will now be multiple MariaDB instances.

How can I help?

Mitigation for cross database JOIN's will require changes to existing tools and services. Providing help and guidance to developers and other community members is crucial to ensuring existing workflows can continue uninterrupted. Consider:

  • Sharing examples / descriptions of how a tool or service was updated
  • Writing a common solution or some example code others can utilize and reference
  • Helping others on IRC and the mailing lists

Related Tickets