Performance/Multi-DC MediaWiki

Multi-DC MediaWiki (also known as Active-active MediaWiki) is a cross-cutting project driven by the Performance Team to give MediaWiki the ability to serve read requests from multiple datacenters.

Historically, the Wikimedia Foundation served MediaWiki requests from its primary datacenter only, with the secondary datacenter acting as a cold standby reserved for disaster recovery. By deploying and actively serving MediaWiki from two (or more) datacenters during normal operations, we achieve higher resilience against network loss and datacenter failure, remove most switchover costs and complexity, and ease regular maintenance.

Throughout the project, Multi-DC initiatives have brought about performance, availability, and resiliency improvements across the MediaWiki codebase and at every level of our infrastructure. These gains were effective and advantageous even for the then single-DC operations, and were often the result of restructuring business logic and adopting asynchronous, eventually consistent solutions.

The project's final switch (T279664) improved page load speed by reducing latency for geographies west of Texas, USA (where the Codfw data center is located), including Asia and the US West coast. This benefited all logged-in traffic as well as cache misses from logged-out traffic. The project also opens the door to further performance gains by serving more traffic from nearby datacenters.

Initial work (2014-2016)

RFC

The project was formalised via the Multi-DC strategy RFC in 2015, led by Aaron Schulz (Performance Team).

JobQueue

Changes led by Aaron in collaboration with Ori (Performance Team), including:

  • Develop JobQueueAggregator in MediaWiki, using Memcached and Redis for coordination.
  • Develop new "jobrunner" microservice in 2014.
  • Develop the "jobchron" service in 2015.
  • Infrastructure changes to replicate the queue cross-dc by default, through Redis.
  • New interfaces in MediaWiki to support independent queuing of local jobs in either DC, which eventually replicate to and execute in the primary DC (sketched below).
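
As a rough illustration of the last bullet, here is a conceptual Python sketch of local enqueueing with a later relay to the primary DC. The names are invented for illustration and do not reflect MediaWiki's actual JobQueue classes.

  from collections import deque

  # Each DC enqueues jobs locally, so web requests never block on a
  # cross-DC call; a relay later pushes those jobs to the primary DC,
  # where the job runners execute them.
  local_queues = {"eqiad": deque(), "codfw": deque()}
  primary_queue = deque()

  def enqueue(dc, job):
      local_queues[dc].append(job)        # fast, local-only write

  def relay_to_primary(dc):
      while local_queues[dc]:
          primary_queue.append(local_queues[dc].popleft())

  enqueue("codfw", {"type": "refreshLinks", "page": "Main_Page"})
  relay_to_primary("codfw")
  print(primary_queue)  # the job is now queued in the primary DC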

See also History of job queue runners at WMF.

WANObjectCache

Designed and implemented by Aaron. Read more about this in:

The relay interface exposed by WANObjectCache was fulfilled in 2016 by Mcrouter, which replaced WMF's prior Twemproxy (Nutcracker) infrastructure. Packaging, configuration, and deployment were led by Giuseppe Lavagetto (SRE Service Ops). T132317
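
In rough terms, the relay pattern works as sketched below: values are written only to the local DC's cache, while purges are broadcast to every DC as short-lived tombstones so stale values cannot be re-cached immediately after a change. The Python sketch is purely illustrative; the class and method names are not WANObjectCache's or Mcrouter's real interfaces.

  import time

  class WanCacheSketch:
      def __init__(self, local_dc, dc_caches):
          self.local_dc = local_dc
          self.dc_caches = dc_caches  # DC name -> dict standing in for memcached

      def set(self, key, value):
          self.dc_caches[self.local_dc][key] = value  # values are written locally only

      def get(self, key):
          return self.dc_caches[self.local_dc].get(key)

      def purge(self, key):
          # The relay (Mcrouter in production) fans the purge out to all DCs,
          # where it is kept briefly as a tombstone.
          for cache in self.dc_caches.values():
              cache[key] = ("TOMBSTONE", time.time())

  caches = {"eqiad": {}, "codfw": {}}
  wan = WanCacheSketch("eqiad", caches)
  wan.set("page:123", "rendered html")
  wan.purge("page:123")
  print(caches["codfw"]["page:123"][0])  # the purge reached the remote DC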

MariaDB lag indication

The default mechanism in MySQL for letting web applications measure database replication lag is to fetch the Seconds_Behind_Master value using "SHOW SLAVE STATUS" queries. This works well for small sites, and is what MediaWiki does by default as well. At our scale, however, this can be inaccurate or unreliable. For example, DBAs tend to prefer a replication topology using a chain rather than having all replicas replicate directly from a single primary. This eases cross-DC switchovers and day-to-day intra-DC maintenance. A number of other issues and use cases are detailed at T111266.

To mitigate this, we adopted the pt-heartbeat service. Deployment was led by Jaime Crespo (SRE Data Persistence), with the MediaWiki client (within the Rdbms library) written by Aaron.
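
The mechanism can be summarised with the sketch below: a daemon on the primary updates a timestamp row roughly once per second, the row replicates like any other write, and a replica's lag is simply the current time minus the most recently replicated heartbeat. The Python is illustrative only; the table layout and function names are assumptions, not the actual pt-heartbeat schema or the Rdbms client.

  import time

  heartbeat_table = {}  # stands in for the replicated heartbeat table

  def write_heartbeat(server_id):
      """Runs on the primary, normally done by the pt-heartbeat daemon."""
      heartbeat_table[server_id] = time.time()

  def replication_lag(server_id):
      """Runs against a replica's copy of the table."""
      return time.time() - heartbeat_table[server_id]

  write_heartbeat(171)   # the primary ticks
  time.sleep(0.5)        # pretend replication is half a second behind
  print(f"lag is roughly {replication_lag(171):.1f}s")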

Media storage

See also Media storage#History. MediaWiki's FileBackend abstraction layer was developed in 2010-2012 by Aaron to facilitate WMF's transition from NFS to Swift, and was extended and exercised further during the migration from the Pmtpa data center to Eqiad, our present-day primary DC.

In 2015, further work took place on SwiftFileBackend and FileBackendMultiWrite in order to not rely on periodic out-of-band replication, but rather have MediaWiki directly write to both data centers (T91869, T112708, T89184).
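
Conceptually, a multi-write backend applies each file operation to the Swift cluster in every datacenter, as in the illustrative Python sketch below. The class names are invented and do not mirror MediaWiki's FileBackend API; real code also has to handle partial failures and consistency checks.

  class SwiftClusterStub:
      """Stands in for one DC's Swift cluster."""
      def __init__(self):
          self.objects = {}

      def put(self, path, data):
          self.objects[path] = data

  class MultiWriteBackend:
      def __init__(self, backends):
          self.backends = backends

      def put(self, path, data):
          for backend in self.backends:  # write synchronously to every DC
              backend.put(path, data)

  eqiad, codfw = SwiftClusterStub(), SwiftClusterStub()
  MultiWriteBackend([eqiad, codfw]).put("thumb/a.png", b"...")
  assert eqiad.objects == codfw.objects  # both DCs hold the same copy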

In 2016, the encryption of cross-DC traffic between the two Swift clusters was finalized by SRE, with Nginx acting as a TLS proxy. T127455

Search

A strategy was devised for near-realtime Elasticsearch replication so that search suggestions and search queries work independently in each datacenter. Led by Erik Bernhardson (Search Platform). T91870

Incremental progress

DC independence

Throughout the codebase, individual components had to be improved, refactored, or even rewritten entirely to be independent of a (slow and possibly overloaded) connection to the primary DC. This largely took place between 2014 and 2017, led by Aaron Schulz and tracked under T88445 and T92357. Changes included:

  • Introduce a TransactionProfiler in the Rdbms library, which automatically brings performance issues to the attention of developers, in particular unexpected primary DB writes and reads during "GET" requests (a conceptual sketch follows this list). T137326
  • Reduce write() calls in MediaWiki's SessionStore.
  • Introduce "sticky-DC" cookie, set during requests that save or change data (e.g. edits, changing user preference, etc.)
  • Convert the localisation cache and MessageCache to WANObjectCache. (T99208)
  • Refactor Wikibase extension to avoid eager connections to the primary DB.
  • Refactor Flow extension to reduce database queries during GET requests.
  • Revise the Flow caching strategy to cache data from DB replicas instead of primary data. More about this in Perf Matters 2016 § One step closer to Multi DC.
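
As a conceptual sketch of the TransactionProfiler idea from the first bullet, the Python below funnels queries through one chokepoint and flags writes that happen during requests expected to be read-only. The names are invented for illustration and do not match the Rdbms library's real interface.

  import logging

  logging.basicConfig(level=logging.WARNING)

  class TransactionProfilerSketch:
      def __init__(self, http_method):
          # Only non-GET/HEAD requests are expected to write to the primary DB.
          self.expect_writes = http_method not in ("GET", "HEAD")

      def record_query(self, sql, server_role):
          verb = sql.lstrip().split()[0].upper()
          is_write = verb in ("INSERT", "UPDATE", "DELETE", "REPLACE")
          if is_write and not self.expect_writes:
              logging.warning("Unexpected %s write during a read request: %s", server_role, sql)

  profiler = TransactionProfilerSketch("GET")
  profiler.record_query("SELECT * FROM page WHERE page_id = 1", "replica")  # fine
  profiler.record_query("UPDATE user SET user_touched = NOW()", "primary")  # flagged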

CDN purging

A long-standing problem has been the reliability of CDN purges. The Foundation's CDN has, by its very nature, always operated from multiple DCs, so this is not a problem newly introduced by the Multi-DC initiative around backend services. Incremental improvements to multicast HTCP purging took place mostly in 2017-2018, led by Emanuele Rocca and Brandon Black (SRE Traffic). Details at T133821. The Purged service (based on Kafka) was developed and deployed in 2020 to solve the reliability issue long-term.
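
In outline, a Kafka-based approach has the application publish one purge event per URL, while a small consumer in each datacenter replays those events as HTTP PURGE requests against its local cache nodes; unlike fire-and-forget multicast, events can be retried and replayed. The Python sketch below is illustrative only, assumes a locally reachable Kafka broker and an invented topic name, and is not the real Purged daemon.

  import json
  import requests
  from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

  # Producer side (e.g. triggered on edit): publish one event per URL to purge.
  producer = KafkaProducer(bootstrap_servers="localhost:9092",
                           value_serializer=lambda v: json.dumps(v).encode())
  producer.send("cdn-url-purges", {"url": "https://en.wikipedia.org/wiki/Main_Page"})
  producer.flush()

  # Consumer side: one such loop runs in each DC, next to its cache servers.
  consumer = KafkaConsumer("cdn-url-purges", bootstrap_servers="localhost:9092",
                           value_deserializer=lambda v: json.loads(v.decode()))
  for event in consumer:  # runs indefinitely
      requests.request("PURGE", event.value["url"])  # purge the local cache copy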

Thumbnail serving

Serve thumbnails via Thumbor and store them in Swift in both DCs. Carried out in 2018, with traffic routing implemented by Filippo (SRE Infrastructure Foundations) and application-layer work by Gilles (Performance Team). T201858

Media originals serving

In 2019, we began to serve originals from upload.wikimedia.org from multiple data centers, implemented by Alexandros Kosiaris and Filippo (SRE). T204245

Remaining work (2020-2022)

Aaron Schulz has driven the effort of improving, upgrading, and porting the various production systems around MediaWiki to work in an active-active context with multiple datacenters serving MediaWiki web requests. You can see the history of subtasks on Phabricator.

This document focuses on remaining work left as of December 2020 – the major blockers left before enabling the active-active serving of MediaWiki.

ChronologyProtector

ChronologyProtector is the system that ensures editors see the result of their own actions in subsequent interactions, even when those later requests are served by lagged database replicas.

The remaining work is to decide where and how to store the data going forward, to deploy any infrastructure and software changes as needed, and to enable them.
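
The core idea is sketched below in Python: after a request that writes, the primary's replication position is saved under a per-client key in a shared store; the client's next request waits until the local replica has reached at least that position before reading. The names and storage here are simplified assumptions, not MediaWiki's actual implementation.

  import time

  position_store = {}  # stands in for the shared, cross-DC store

  def after_write(client_key, primary_position):
      position_store[client_key] = primary_position

  def before_read(client_key, get_replica_position, timeout=5.0):
      target = position_store.get(client_key)
      if target is None:
          return  # this client has not written recently; nothing to wait for
      deadline = time.time() + timeout
      while get_replica_position() < target and time.time() < deadline:
          time.sleep(0.05)  # poll until the local replica catches up (or give up)

  after_write("client:abc", primary_position=1042)
  before_read("client:abc", get_replica_position=lambda: 1042)  # returns immediately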

Updates:

  • September 2020: an architectural solution has been decided on and the Performance Team, in collaboration with SRE ServiceOps, will migrate ChronologyProtector to a new data storage (either Memcached or Redis), during Oct-Dec 2020 (FY 2020-2021 Q2).
  • February 2021: code simplification and backend configuration for Multi-DC ChronologyProtector have been implemented and deployed to production for all wikis.
  • March 2021: Documented CP store requirements for third-parties.
  • March 2021: Task closed.
  • Done

Session storage

The session store holds temporary data required for authentication and authorization procedures such as logging in, creating accounts, and security checks before actions such as editing pages.

The older data storage system has various shortcomings beyond mere incompatibility with multi-DC operation. Even in our current single-DC deployment, the annual switchovers are cumbersome, and a replacement has been underway for some time.

The remaining work is to finish the data storage migration from Redis (non-replicated) to Kask (Cassandra-based).
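
For orientation, a Kask-style store exposes sessions as opaque values addressed by key over HTTP, with the service handling Cassandra replication across datacenters. The Python sketch below is illustrative only; the host name, path, and payload format are assumptions rather than Kask's documented API.

  import requests

  BASE = "https://sessionstore.example.internal/sessions/v1"  # placeholder endpoint

  def set_session(key, value):
      requests.post(f"{BASE}/{key}", data=value).raise_for_status()

  def get_session(key):
      resp = requests.get(f"{BASE}/{key}")
      return resp.content if resp.ok else None

  def delete_session(key):
      requests.delete(f"{BASE}/{key}")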

Updates:

  • 2018-2020 (T206016): Develop and deploy Kask, gradually roll out to all Beta and production wikis.
  • Dec 2020: Performance Team realizes that requirements appear unmet, citing multiple unresolved "TODOs" in the code for primary requirements and internally inconsistent claims about the service interface. T270225
  • Jan 2021: CPT triages task from Inbox.
  • Feb 2021: CPT moves task to "Platform Engineering Roadmap > Later".
  • March 2021: Future optimisation identified by Tim (two-level session storage) - T277834.
  • July 2022: Performance helps with T270225 within the limited scope of completing Multi-DC needs.
  • July 2022: Fulfilled unresolved TODOs at T270225.
  • July 2022: Straighten out internally inconsistent interface guarantees at T270225.
  • August 2022: Task closed. Done.

CentralAuth storage

A special kind of session storage for the central login system and the cross-wiki "auto login" and "stay logged in" mechanisms.

The last part of that work, migrating CentralAuth sessions, is currently scheduled for completion in Oct-Dec 2020 (2020-2021 Q2).

Updates:

  • Nov 2020: Initial assessment done by CPT.
  • Jan 2021: Assessment concluded.
  • Feb 2021: Assessment re-opened.
  • Jul 2022: Decided on Kask-sessions (Cassandra) as backend. Should not be separate from core sessions. TTL mismatch considered a bug and also fixed by Tim. T313496
  • Jul 2022: Done

Main Stash store

The Redis cluster previously used for session storage also hosts other miscellaneous application data through the Main Stash interface. Its needs differ from those of session storage and become more prominent in a multi-DC deployment, which makes it unsuitable for Cassandra/Kask.

The remaining work is to survey the consumers and needs of Main Stash, decide how to accommodate them going forward (e.g. would it help to migrate some of its consumers elsewhere and provide a simpler replacement for the rest?), and carry out any software and infrastructure changes as needed.
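
To illustrate the eventual direction (a small replicated MariaDB cluster behind SqlBagOStuff, see the updates below), the Python sketch stores values by key in a SQL table and builds "global" keys that embed the wiki name, which is roughly what makeGlobalKey is for. The schema and key format are invented, and SQLite is used only so the example runs stand-alone.

  import sqlite3

  db = sqlite3.connect(":memory:")  # stands in for the dedicated MariaDB cluster
  db.execute("CREATE TABLE objectstash (keyname TEXT PRIMARY KEY, value BLOB)")

  def make_global_key(wiki, *parts):
      # Global keys embed the wiki so one shared cluster can serve all wikis.
      return ":".join(["global", wiki, *parts])

  def stash_set(key, value):
      db.execute("REPLACE INTO objectstash (keyname, value) VALUES (?, ?)", (key, value))

  def stash_get(key):
      row = db.execute("SELECT value FROM objectstash WHERE keyname = ?", (key,)).fetchone()
      return row[0] if row else None

  key = make_global_key("enwiki", "example", "12345")
  stash_set(key, b"1")
  print(stash_get(key))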

Updates:

  • Sept 2019: Audit all current MainStash usage (Google Doc).
  • June 2020: The plan is to move this data to a new small MariaDB cluster. This project requires fixing "makeGlobalKey" in SqlBagOStuff, and new hardware. This is being procured and set up in Q2 2020-2021 by the Data Persistence Team. The Performance Team will take care of migrating the Main Stash as soon as the new database cluster is available, i.e. between Oct 2020 and Mar 2021 (FY 2020-2021 Q2 or Q3).
  • July 2020: SqlBagOStuff now supports makeGlobalKey and can work with separate DB connections outside the local wiki. - T229062
  • Sep 2020: Hardware procurement submitted. Oct 2020: Procurement approved as part of larger order. Dec 2020: Hardware arrived. - T264584
  • Jan 2021: Hardware racked and being provisioned. - T269324
  • Feb 2021: MySQL service online and replication configured. - T269324
  • June 2022: Test config in production.
  • June 2022: Enable on all wikis.
  • Done

MariaDB cross-datacenter secure writes

Even with MediaWiki active-active, writes still go only to the primary datacenter; however, a fallback is required for edge cases where a write is attempted from a secondary datacenter. In order to preserve our users' privacy, such writes need to be sent encrypted across datacenters. Multiple solutions are being considered, but a decision has yet to be made on which one will be implemented. This work will be a collaboration between the SRE Data Persistence Team and the Performance Team. We hope for it to happen during fiscal year 2020-2021.
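
The approach eventually chosen (see the updates below) was a direct TLS connection from MediaWiki without extra proxies or tunnels. For illustration only, the Python sketch below opens a TLS-encrypted MySQL/MariaDB connection and checks that a cipher is in use; the host name, credentials, and certificate path are placeholders.

  import pymysql  # pip install pymysql

  conn = pymysql.connect(
      host="db-primary.example.internal",  # the primary DC, reached from the secondary DC
      user="wikiuser",
      password="secret",
      database="enwiki",
      ssl={"ca": "/etc/ssl/certs/ca.pem"},  # verify the server certificate
  )
  with conn.cursor() as cur:
      cur.execute("SHOW SESSION STATUS LIKE 'Ssl_cipher'")
      print(cur.fetchone())  # a non-empty cipher means the link is encrypted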

Updates:

  • July 2020: Potential solutions suggested so far: connect to MariaDB with TLS from PHP directly, ProxySQL, a dumb TCP tunnel, Envoy as a TCP tunnel(?), HAProxy in TCP mode.
  • Dec 2020: Leaning toward a tunnel approach; ProxySQL would take too long to set up and test from scratch.
  • May 2022: Decision is reached, led by Tim. TLS connection to be established directly from MediaWiki without additional proxies or tunnels.
  • May 2022: Configuration written.
  • June 2022: MariaDB-TLS tested and enabled for all wikis.
  • Done

ResourceLoader file dependency store

Currently this is written to a core wiki table using a primary DB connection; it must be structured such that writes can be made within a secondary DC and then replicated. The plan is to migrate it to the Main Stash instead.
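
Roughly speaking, the store maps each module/variant to the list of files it depended on during its last build, and keeping that mapping in a replicated key-value store allows writes from either datacenter. The Python sketch below is illustrative; the key format and function names are invented and do not mirror the real dependency store interface.

  stash = {}  # stands in for the replicated Main Stash

  def store_dependencies(module, variant, paths):
      stash[f"resourceloader-deps:{module}:{variant}"] = sorted(set(paths))

  def get_dependencies(module, variant):
      return stash.get(f"resourceloader-deps:{module}:{variant}", [])

  store_dependencies("skins.example.styles", "en", ["resources/a.less", "resources/b.less"])
  print(get_dependencies("skins.example.styles", "en"))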

  • Lead: Performance Team (Timo, Aaron).
  • Task: T113916

Updates:

  • July 2019: Implement the DepStore abstraction, decoupled from the primary DB, now including a KeyValue implementation that supports the Main Stash.
  • May 2020: Rolled out to Beta Cluster and mediawiki.org.
  • June 2022: MainStashDB went live. Roll out to group0 wikis.
  • July 2022: Gradually rolled out to all wikis (details on task).
  • Done

CDN routing

The remaining work is to agree on the MediaWiki requirements, and then write, test, and deploy the traffic routing configuration.

Updates:

  • May 2020: Aaron and Timo thought through all relevant scenarios and drafted the requirements at T91820 (a conceptual sketch of the resulting routing rule follows this list).
  • June 2020: Audit confirms that relevant routing cookies and headers are in place on the MW side.
  • May 2022: Traffic routing logic being developed by Tim.
  • June 2022: ATS routing logic deployed to Beta and prod, no-op but enabled.
  • Sept 2022: Gradually roll out based on a percentage of cache-miss traffic between ATS and appserver-ro (MediaWiki servers).
  • Done
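
The routing rule can be summarised by the illustrative Python sketch below: idempotent cache-miss reads may go to the nearest MediaWiki datacenter, while writes, and requests made shortly after a write (marked by the sticky-DC cookie), are pinned to the primary. The cookie name and backend labels are simplified assumptions; the real logic lives in the ATS configuration.

  def choose_backend(method, cookies, nearest_dc, primary_dc):
      if method not in ("GET", "HEAD", "OPTIONS"):
          return primary_dc                    # writes always go to the primary
      if cookies.get("UseDC") == "master":
          return primary_dc                    # sticky cookie set after a recent write
      return nearest_dc                        # plain reads can be served nearby

  print(choose_backend("GET", {}, "codfw", "eqiad"))                   # codfw
  print(choose_backend("POST", {}, "codfw", "eqiad"))                  # eqiad
  print(choose_backend("GET", {"UseDC": "master"}, "codfw", "eqiad"))  # eqiad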

History