News/Wikireplicas 2020 Redesign


Problem

The existing Wiki Replicas setup relies on a single MariaDB instance for all Wikimedia data. As of October 2020, this includes over 900 wikis, comprising 6,000 tables and totaling over 7 TB after full compression. This imposes limits on our ability to scale and to maintain a performant and stable service. In particular, based on the community's feedback and our own observations, two primary problems have arisen.

  1. Given the large amount of data, our ability to keep nodes in sync has suffered, resulting in poor performance. This becomes even more of a concern should we have to recover from data corruption. Indeed, even maintaining capacity within the cluster has become challenging, and more hardware is needed.
  2. Usage and query complexity continue to increase, reducing performance. Our current architectural design limits our ability to optimize for these complex queries, such as those used for analytics and OLAP workloads.

In short, our existing design causes performance issues and limits our ability to scale.

New architecture

Given the challenges outlined above, we plan to change the architectural model from "Multi-source" (lots of slices combined into one MariaDB instance) to "Multi-instance" (multiple MariaDB instances on a single physical node, each handling only a single slice). Unfortunately, this architectural change is necessary in order to continue providing a stable and performant service.

The new physical layout will be 8 slices × 2 types (web + analytics) × 2 instances each, for a total of 32 database instances. These will be deployed multi-instance, with 4 instances per physical node, across 8 physical nodes.
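To make that arithmetic concrete, here is a small illustrative Python sketch. The instance and node names are hypothetical, and the grouping of instances onto nodes is for illustration only; it just enumerates the 32 instances and packs them 4 per node:

  from itertools import product

  SLICES = [f"s{i}" for i in range(1, 9)]  # 8 slices, s1..s8
  TYPES = ["web", "analytics"]             # the two instance types
  REPLICAS = [1, 2]                        # 2 instances per slice/type pair

  # Every (slice, type, replica) combination is one MariaDB instance.
  instances = [f"{s}-{t}-{r}" for s, t, r in product(SLICES, TYPES, REPLICAS)]
  assert len(instances) == 32

  # 4 instances per physical node -> 8 nodes. Real placement would spread
  # the two replicas of a slice across different nodes; this chunking is
  # purely to show the counts.
  nodes = [instances[i:i + 4] for i in range(0, len(instances), 4)]
  assert len(nodes) == 8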

Impacts

Given the large change to the underlying architecture, existing tools and queries will unfortunately be impacted. In particular, each instance will now contain only a single slice. Therefore:

  • For those who use cross-slice JOINs, current queries will likely no longer work as written. Unfortunately, this means cross-database joins from, for instance, enwiki to commonswiki or wikidatawiki are no longer possible.
  • Queries against a database need to connect to that database. In other words, connecting to enwiki to query avwiki will no longer work; make a specific connection to each database you query (see the connection sketch after this list).
  • Services like Quarry and PAWS will need to be updated to understand where to route each query, as there will now be multiple MariaDB instances.
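As a minimal sketch of what a per-database connection can look like, assuming a tool running on Toolforge with credentials in ~/replica.my.cnf and the pymysql driver (both assumptions; substitute whatever your tool actually uses), each wiki gets its own connection built from its own hostname:

  import pymysql

  def connect(db_name):
      """Connect to the instance that actually hosts db_name.

      The <db_name>.db.svc.eqiad.wmflabs hostname routes the connection to
      the correct instance; reusing another wiki's connection no longer works.
      """
      return pymysql.connect(
          host=f"{db_name}.db.svc.eqiad.wmflabs",
          read_default_file="~/replica.my.cnf",  # Toolforge credentials file
          database=f"{db_name}_p",               # replica views use a _p suffix
          charset="utf8mb4",
      )

  # avwiki is queried through its own connection, not enwiki's.
  conn = connect("avwiki")
  try:
      with conn.cursor() as cur:
          cur.execute("SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 5")
          print(cur.fetchall())
  finally:
      conn.close()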

What do I need to do?

  1. Update existing tools to ensure queries are executed against the proper database connection. Specifically, use database hostnames of the form <db_name>.db.svc.eqiad.wmflabs. Don't connect to enwiki and then query frwiki, for example; instead, connect to frwiki.db.svc.eqiad.wmflabs and execute the query there.
  2. Check existing tools' and services' queries for cross-slice JOINs. If any are found, separate them into per-database queries and do the required data processing or filtering within the tool or service, as in the sketch below.
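For example, a hypothetical query that joined page titles across enwiki and commonswiki can be split into one query per wiki, with the JOIN replaced by a set intersection in the tool. A sketch reusing the connect() helper from the Impacts section above:

  # Before: one cross-slice query (no longer possible):
  #   SELECT p1.page_title
  #   FROM enwiki_p.page p1
  #   JOIN commonswiki_p.page p2 ON p1.page_title = p2.page_title
  #   WHERE p1.page_namespace = 0 AND p2.page_namespace = 0;

  QUERY = "SELECT page_title FROM page WHERE page_namespace = %s"

  def titles(db_name, namespace):
      """Fetch page titles from a single wiki over its own connection."""
      conn = connect(db_name)  # helper from the earlier sketch
      try:
          with conn.cursor() as cur:
              cur.execute(QUERY, (namespace,))
              # page_title is binary in MediaWiki; bytes still compare fine.
              return {row[0] for row in cur.fetchall()}
      finally:
          conn.close()

  # After: one query per wiki, with the JOIN done in the tool.
  shared = titles("enwiki", 0) & titles("commonswiki", 0)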

How can I help?

Mitigating cross-slice JOINs will require changes to existing tools and services. Providing help and guidance to developers and other community members is crucial to ensuring existing workflows can continue uninterrupted. Consider:

  • Sharing examples / descriptions of how a tool or service was updated
  • Writing a common solution or some example code others can utilize
  • Helping others

FAQ

  • How will the wikis map to physical nodes / slices?
    • As long as you connect to the proper database for each query, this implementation detail shouldn't be something you need to worry about. However, the current mappings are here. These mappings are not expected to change radically as part of this process.
  • I need to cross-reference data between wikis. How can I do this if I'm unable to run cross joins as I do today?
    • All existing data can still be queried (be sure to use a specific connection to each database to do so), and the resulting datasets can then be processed as needed within the tool or service. Unfortunately, the diversity of tools, services, and queries means there isn't a single answer; the key is to replicate the filtering the SQL query used to perform inside the codebase instead.

Timeline

  • October 2020
    • Hardware purchased; prepping for installation
  • October - December 2020
    • New hardware is brought online, critical services updated
    • Testing with early adopters
  • January 2021
    • At this point, both clusters should be operational, allowing for a transition period
    • Begin user migration
  • February 2021
    • Old hardware is decommissioned. All tools and services have been transitioned.

Related Tickets

https://phabricator.wikimedia.org/T265135

https://phabricator.wikimedia.org/T260389

https://phabricator.wikimedia.org/T253134

https://phabricator.wikimedia.org/T260843

https://phabricator.wikimedia.org/T264254