News/2020 Wiki Replicas Redesign
In the interest of making and keeping Wiki Replicas a stable and performant service, a new backend architecture is needed. There is some impact in the features and usage patterns.
What is changing?
Unfortunately, we need to make this architectural change in order to continue to provide a stable and performant service. The main changes are:
- Can only query a database if you are connected to it explicitly
- Cross database JOINs will no longer work as written
Timeline
- October 2020
- Hardware purchased; prepping for installation
- October 2020 - January 2021
- New hardware is brought online, critical services updated
- February 2021
- Both clusters operational on February 1st
- Testing with early adopters
- Begin user migration
- March 2021
- Announce new cluster domains for all users - March 12
- Migrate Quarry to use the new cluster - March 23
- April 2021
- PAWS migration - Estimated first week of April
- Migrate the old cluster to utilize new replication hosts. Replication may stop. - April 15
- Redirect old hostnames to the new cluster - April 28
- Old cluster shutdown - Some time after the redirect
NOTE: While no specific date is yet given, the old cluster will be shut down as soon as possible after the new cluster is made the default. Even while the old cluster remains online, replication is not guaranteed. It is possible the old cluster stops receiving updates. It is encouraged to migrate to the new cluster, and raise potential issues founds in doing so, as soon as possible.
What should I do?
Update existing tools to ensure queries are executed against the proper database connection
Because you Can only query a database if you are connected to it explicitly, you should utilize the database hostnames in the form of <db_name>.<web or analytics>.db.svc.wikimedia.cloud
.
Don't connect to enwiki
and then query against frwiki
for example. Instead connect to frwiki.web.db.svc.wikimedia.cloud
, then execute the query.
Check your existing tools and services queries for cross database JOIN
s
Because Cross database JOINs will no longer work as written, check your services/tools and if you find cross JOINs you will have to decide what to substitute them with. The answer is usually very context-dependent, so if you have doubts you can try asking in chat or mailing list.
In general, the idea is to separate the queries and use a connection for each database. Then once you have the results, do the required data processing or filtering with code.
If the datasets and queries are so big that the previous approach doesn't work, you can try paginating a query, and doing subqueries with the results. For example, query DB 1, iterate over the local data, paginating any way you want. Then, for every N items on query 1, query the DB 2 for those N items based on the join conditions (if you expect many results, paginate this subquery too as needed).
There are other options for how to tackle this, unfortunately all of them will be more complex than an straightforward SQL query. You can see examples in the Solutions to common problems section, and please reach out if you need help.
What are the primary changes with the new architecture?
Given the large change to the underlying architecture, our existing tools and queries will be impacted. In particular, each instance will now contain only a single database. So:
Cross database JOIN
s will no longer work as written
Unfortunately, this means cross joins from, for instance, enwiki
to commonswiki
or wikidatawiki
are no longer possible.
There are different ways to accomplish the same results given the limitations, but they vary based on the query and where it was being run from. See Solutions to common problems for migration suggestions.
Can only query a database if you are connected to it explicitly
For example, connecting to enwiki
to query against avwiki
will no longer work.
You will need to make a specific connection to each database you want to query against. See Update existing tools to ensure queries are executed against the proper database connection and Solutions to common problems for examples.
New host names
For the new replicas, new hostnames have been added. They are region-independent and allow early testers to access the new replicas while the existing cluster is running.
Old | New |
---|---|
${PROJECT}.{analytics,web}.db.svc.eqiad.wmflabs | ${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud |
eswiki.web.db.svc.eqiad.wmflabs | eswiki.web.db.svc.wikimedia.cloud |
s${SECTION_NUMBER}.{analytics,web}.db.svc.eqiad.wmflabs | s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud |
s7.web.db.svc.eqiad.wmflabs | s7.web.db.svc.wikimedia.cloud |
The *.db.svc.eqiad.wmflabs
names will continue working and will be redirected to the new ones at the time mentioned in the timeline.
Solutions to common problems
Help us contributing code snippets or how you migrated your code
If you migrate your tools or code and you had to do some changes to adapt to the new Wiki Replicas architecture, specific examples are very helpful. Feel free to edit them in above this heading, or post them in the talk page or mailing list.
How can I test the new replicas before the switchover?
Basically everything is the same, except for the host names. See #New host names.
Here is an example when logged in on Toolforge:
$ ssh login.toolforge.org
$ mysql --defaults-file=replica.my.cnf -h eswiki.web.db.svc.wikimedia.cloud enwiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
ERROR 1049 (42000): Unknown database 'enwiki_p'
$ mysql --defaults-file=replica.my.cnf -h eswiki.web.db.svc.wikimedia.cloud eswiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
+----------+
| count(*) |
+----------+
| 1207 |
+----------+
$ mysql --defaults-file=replica.my.cnf -h eswiki.analytics.db.svc.wikimedia.cloud eswiki_p -e "select count(*) from page where page_title like \"%Alicante%\";"
+----------+
| count(*) |
+----------+
| 1207 |
+----------+
How do I cross reference data between wikis like I do with cross joins today?
All existing data can still be queried, but you will have to use specific connections to each database to do so.
The resulting datasets can then be processed in code as desired as part of a tool or service. The key will be to migrate the filtering in the SQL query to the codebase instead.
Unfortunately, the diversity of tools, services, and queries means there isn't a direct answer that can be provided. Browse around the Solutions to common problems to see if you can find an example that helps you, or otherwise reach out for support.
I have a tool that queries all of the wikis and I use only one connection
In production, creating new connections on-the-fly is the way to go. For local development, ssh tunnels are suggested. If you want to test locally against every DB, opening that many ssh tunnels may not be practical. Another option can be using a ssh tunnel per section, and looking at the mappings at runtime to know which section host connection to use to query a certain DB. More details can be read in this email thread.
How will the wikis map to physical nodes/sections?
By connecting directly to the database for a query, this implementation detail shouldn't be something to worry about. However, the current mappings are at noc's db.php. These mappings are not expected to radically change as part of this process.
How should I connect to a database in Quarry?
Given databases will now be in different hosts, you will have to select the database you want to use before writing a query in Quarry so that it knows where to route your query.
There will be a new input field above the query text area to write the DB name you want to use (like eswiki_p
), instead of writing use
statements in the SQL (like use eswiki_p;
).
I use Quarry to perform cross join queries, what do I do?
Since Quarry only provides SQL as the interface, to migrate your code the easiest path is to move it to a PAWS notebook, which will involve some coding in Python. If you need help adapting your queries to a PAWS notebook, please reach out for support, share your queries and attempts on a notebook.
How should I connect to databases in PAWS?
Due to the databases being placed in separate sections, it is not possible to use the proxy that PAWS used with the environment variable MYSQL_HOST
, so instead we need to connect to the host URLs the same way it is done from Toolforge.
There is a credentials file in $HOME/.my.cnf
that you can use when connecting, instead of environment variables.
Here are two notebooks with comparison and examples:
- Accessing the new replicas, changes from the previous cluster
- Using Wikireplicas from PAWS with Python
Here is a short Python example with how to connect:
import pymysql
conn = pymysql.connect(
# Host urls are like {wiki}.{analytics,web}.db.svc.wikimedia.cloud
host="eswiki.analytics.db.svc.wikimedia.cloud",
read_default_file=".my.cnf",
database="eswiki_p"
)
with conn.cursor() as cur:
cur.execute(
"SELECT page_title FROM page WHERE page_title LIKE %s LIMIT 5;",
args=('%Alicante%',)
)
for row in cur.fetchall():
print(str(row[0], encoding='utf-8'))
We will be updating the documentation through April with the new method.
Every existing PAWS notebook has its output cached, so they are still viewable without having to change anything. When you want to re-run them, you will have to update the DB connection method as mentioned above.
Why are we doing this?
Wiki Replicas existing setup relies on a single MariaDB instance for all Wikimedia data. As of October 2020, this includes over 900 wikis, comprised of 6,000 tables, and totalling over 7 TB’s after full compression.
This setup limits our ability to scale and maintain a performant and stable service. Based on community feedback and our own observations running the service:
- Given the large amount of data, the service struggles to keep nodes in sync. This results in poor performance
- Recovering from data corruption is a concern, given the current performance
- Even maintaining current capacity within the cluster has become challenging and more hardware is needed
- Usage and query complexity continue to increase, resulting in reduced performance
- Our current architectural design limits our ability to optimize for complex queries, such as those used to do analytics and OLAP
New architecture
Given the challenges outlined above, we plan to change the architectural model from "Multi-source" (lots of sections smashed into one MariaDB instance) to "Multi-instance" (multiple MariaDB instances on a single physical node with each instance only handling a single sections).
The new physical layout will be 8 sections * 2 types (web + analytics) * 2 instances each == 32 database instances. Deployed as multi-instance with 4 instances per physical node == 8 nodes.
Impact on other services
Services like Quarry and PAWS will need to be updated to understand where to route a query as there will now be multiple MariaDB instances.
How can I help?
Mitigation for cross database JOIN's will require changes to existing tools and services. Providing help and guidance to developers and other community members is crucial to ensuring existing workflows can continue uninterrupted. Consider:
- Sharing examples / descriptions of how a tool or service was updated
- Writing a common solution or some example code others can utilize and reference
- Helping others on IRC and the mailing lists
Related Tickets
- T260389 Redesign and rebuild the wikireplicas service using a multi-instance architecture
- T265135 wikireplicas: Define MW sections per host
- T253134 Find an alternative solution for the mysql-proxy in PAWS
- T260843 Set up roles for new wiki replicas layout
- T264254 Prepare Quarry for multiinstance wiki replicas
- T272523 Early testing of the new Wiki Replicas multi-instance architecture
- T276284 Establish a working setup for PAWS with multi-instance wikireplicas
- T282557 Global user contributions (GUC) shows no entries