MediaWiki Engineering/Guides/Backend performance practices

These are the performance guidelines for MediaWiki backend development intended for deployment to Wikimedia Foundation wikis. Use these guidelines together with the Architecture guidelines and Security guidelines.

Getting started

These guidelines focus both on individually perceived performance (your code runs relatively fast) and scalability (your code can run fast, even on large wikis and when run many times concurrently).

General principles

  1. Identify hot spots. Which parts of your code take the most time to execute? These may deserve extra care during code review in the future. The team should maintain awareness of these hot spots by regularly (e.g. monthly) verifying and measuring your code's performance in production. When making choices in the interest of performance, measure instead of relying on intuition. Results can be surprising and may change over time. See also: How to measure backend performance.
  2. Identify relevant latency metrics, and take responsibility for them. As developers you know best how your code is exercised and which high-level actions are meaningful to your product. Monitoring their latency will let you know when an experience becomes slower and/or when there are capacity problems. Consider documenting your latency metrics, and review them on a monthly or quarterly basis to prioritise maintenance as needed to uphold the quality of service. Example: Save Timing documentation.
  3. Treat the infrastructure as one that we share with others. Your code can become a high-traffic workload in Wikimedia production! Each web request has a strict HTTP timeout, and a SQL query may be automatically stopped after 30 seconds. Perform long-running tasks on a dedicated server instead, e.g. via the JobQueue. See also #Long-running queries.
  4. Tune your database queries. Wikimedia heavily depends on its caching layers to survive. However, our caching exists primarily to increase capacity, not speed. Hitting Memcached first means we need fewer database servers. Remember that a cached result eventually has to be refreshed. A slow database query behind Memcached is still a slow query! See also #Design for cache miss.
  5. Choose the right persistence layer for your needs. Only cache data if your code can also performantly respond when the cached data is missing; otherwise, use stronger persistence for your data. See also #Persistence layer. When designing tables, or adding new database queries, consider that every query must use an index (including write queries!). See also: #Index.
  6. The cache hit ratio should be as high as possible. When you introduce new batching calls or combined requests, the underlying requests should retain effective cache use. Highly variable URLs, or data cached under many permutations of cache keys, tend to worsen the cache-hit ratio. See #Leverage the platform! for platform capabilities that also enable high cache-hit ratios.
  7. Subpar performance can be indicative of a deeper issue in how the code is solving a particular need. Think about the root cause, and whether certain costs can be structurally avoided or deferred. See also: #You are not alone.
  8. Avoid cookies whenever possible. Cookies create significant risks for CDN stability, and add unnecessary data transfers for clients. Consider using sessionStorage or localStorage (via mw.storage) as an alternative to cookies. Before adding cookies, consult with the MediaWiki Platform Team or SRE Traffic. See also #Cookies.

Ballpark numbers

  • When accessing information (e.g. a view or API request over GET), aim for your backend to respond within 50ms at the median and within 200ms at the p99. In other words, common requests to popular data that benefit from internally warm caches respond within 50ms (e.g. database server cache, or Memcached/APCu), and requests that encounter internal cache misses still gather and render all data within 200ms.
  • When performing write actions, respond within 500ms at the p99 (e.g. a POST request). Make sure that the amount of work a request performs is naturally and deterministically limited (e.g. do not rely on server-level memory limits or timeouts). It is encouraged to deny or disincentivize usage patterns that we cannot sustain at scale. Remember that you can schedule tasks via the Job queue and Deferred updates, which let you run code asynchronously either on the same server after the response has been sent ("post-send deferred update"), or a few seconds later in a separate cluster ("jobqueue job").
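
For illustration, here is a minimal sketch of a post-send deferred update, assuming a hypothetical statistics helper; DeferredUpdates is the MediaWiki core class that implements this mechanism:

  // Runs on the same server, after the HTTP response has been flushed
  // to the client, so it does not delay the user-visible response.
  DeferredUpdates::addCallableUpdate( static function () use ( $pageId ) {
      MyExtensionStats::recordView( $pageId ); // hypothetical helper
  }, DeferredUpdates::POSTSEND );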

Percentiles

For backend code, look at high percentiles (such as the 95th and 99th percentile) instead of mean averages or medians (learn why: [1][2][3]). Backend latencies tend to vary based on factors like cache warmth and server load (rather than on what a person is doing), which means even an issue with 1% of requests may affect everyone on a regular basis (though perhaps not at the same time). Monitoring the average would systematically ignore well over half the audience. Monitoring the 99th percentile tells you how your code behaves when it matters most.

Performance data often hides two different stories: one for users accessing the application with a warm cache, and another for those with a cold cache. Calculating averages on such a dataset is deceptive. Benchmark with at least 10,000 iterations to calculate a 50th and 90th percentile. If these numbers differ greatly, that indicates a performance problem. For example, if the code requires network fetches and you have many resources to fetch, there may be a group that leverages cached resources (thus avoiding network roundtrips) and a group that does not. A rule of thumb about statistical significance in a performance context is that one needs 10,000 data points to calculate a 90th percentile, 100,000 for a 99th percentile, and 1 million for a 99.9th.
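
As a rough sketch of such a benchmark in PHP (the code under test and the iteration count are placeholders):

  $timings = [];
  for ( $i = 0; $i < 10000; $i++ ) {
      $start = hrtime( true );
      doTheThingUnderTest(); // hypothetical code under test
      $timings[] = ( hrtime( true ) - $start ) / 1e6; // milliseconds
  }
  sort( $timings );
  $p50 = $timings[ (int)floor( 0.50 * ( count( $timings ) - 1 ) ) ];
  $p90 = $timings[ (int)floor( 0.90 * ( count( $timings ) - 1 ) ) ];
  printf( "p50 = %.2f ms, p90 = %.2f ms\n", $p50, $p90 );

If p50 and p90 diverge strongly, look for a bimodal cause such as a cache that only some iterations hit.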

Latency

Aim for your software to provide a reasonably fast experience, regardless of network latency. Client-side latency depends on numerous factors, including: your backend response time, the CDN response time, the round-trip time from the client device to our servers and back (RTT), and the transfer rate (bandwidth) that the client's network is capable of.

The RTT and bandwidth are not always related. For example, a Gigabit connection with an RTT of 2 seconds will not transfer anything in less than 2 seconds, regardless of payload size. You can think of bandwidth as the size of a truck on a highway (or the number of highway lanes), and RTT as how fast it may travel. If 1000 kg are transferred by the truck in 1 hour, this does not mean that 1 kg will arrive in 3 seconds. We recommend the free ebook High Performance Browser Networking by Ilya Grigorik. Note in particular that mobile devices take more time to reactivate their network connection after a period of inactivity.

Strategies for reducing or masking latency:

  • Responses to unregistered users should be cacheable by the CDN (i.e. requests from browsers without an active login or edit session).
    • For page views, page actions, and special pages, this is controlled by OutputPage.
    • A good example of a cacheable Action API module is ApiOpenSearch.
    • A good example of a cacheable REST API route is Rest\Handler\SearchHandler.
  • Split high-level operations into smaller re-usable methods. This allows different parts of your code to only ask for and compute what is needed, instead of accumulating ever more computations that "always" happen whenever anything interacts with your feature (which would mean no high-level actions are optimal, not even the high-traffic ones).
  • Set a tight timeout. Once you've set a p99 latency objective for your backend responses, think about what might cause a response to fall in that last 1%. For example, if your feature includes internal requests to other services, what timeouts do they have? What retries do they allow? Consider what happens if suddenly a majority of web traffic starts to exercise your 1% scenario. Do we start to exhaust all backend capacity? Or do we fail quickly and free up resources so that other parts of the site stay up? Once you've set and achieved a latency objective, put a limit in place to shift outliers from "slow" toward "error". This way callers can quickly re-try, which some services may even do automatically.
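
For example, here is a sketch of an internal service call with an explicit, tight timeout via MediaWiki's HttpRequestFactory (the URL and the exact budgets are hypothetical):

  use MediaWiki\MediaWikiServices;

  $http = MediaWikiServices::getInstance()->getHttpRequestFactory();
  $body = $http->get( 'https://internal-service.example/api/info', [
      'timeout' => 2,        // total budget for the request, in seconds
      'connectTimeout' => 1, // fail fast if the service is unreachable
  ], __METHOD__ );
  if ( $body === null ) {
      // Degrade gracefully or return a quick error, instead of holding
      // a worker thread while the remote service struggles.
  }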

See also Page load performance#Latency for advice on preloading and stale revalidation.

Leverage the platform

Benefit by building atop prior work for scaling MediaWiki to our performance needs.

How often will my code run?

Think about how often the server will execute your code. We tend to categorise server-side concerns in these buckets:

  1. Always.
    • Critical code running unconditionally on every backend request. This should be kept minimal with clearly low cost. This includes early Setup hooks, extension.json callbacks, and service wiring.
    • The typical budget for new needs from core or extension hooks during the Setup phase is 0.1ms.
  2. On page views.
    • Code running on all HTML web responses. This is almost every request, but excludes action=raw requests and HTTP 304 "Not Modified" responses to a page view URL.
    • The typical budget for new logic on a page view is 1ms. We aim to respond within 50ms to a majority of requests (#Ballpark numbers).
  3. When parsing page content.
    • Most backend requests for page views fetch page content from the ParserCache and render a skinned page view. Page content is parsed only on a relatively small portion of page views (e.g. after a cache miss), as well as when responding to the saving of an edit. For that reason, it is acceptable to perform a limited number of more expensive operations during the parser run, e.g. from a parser hook. These hooks generally do not execute while someone is waiting during a page view, as the result of these operations is retained in the parser cache.
    • The typical budget for components that render content is 10ms (for most pages using your feature), or up to 100ms for pages that use your feature in an unusually complex manner.
  4. When performing a write action.
    • Editing is the most common non-read action. The latency of processing an edit is actively monitored as Save Timing.
      • The typical budget for components that parse content or otherwise synchronously hook into edit submissions is 10ms (for most pages), or up to 100ms for unusually complex pages. We aim to respond within 1 second for write actions, and there are many extensions participating in that shared budget (#Ballpark numbers).

Caching layers

Caching layers to be aware of:

  1. Browser caches. In addition to the browser's HTTP cache, we also cache JavaScript and CSS modules in HTML5 LocalStorage which acts as a defragmented browser cache that significantly reduces network requests (details at ResourceLoader/Architecture#Store).
  2. CDN cache (aka "Edge caching" or "Varnish frontend"). The Varnish caches store entire HTTP responses, including thumbnails of images, frequently-requested article content, ResourceLoader modules, and almost anything else that can be retrieved by URL. Wikimedia operates CDN front-ends in multiple data centers around the world to reduce latency between browser and server. See MediaWiki at WMF for some details.
  3. Object cache. Our object caching layers scale from small ephemeral memory on individual web servers (e.g. PHP-APCu, which is uncoordinated and unreplicated), to large Memcached clusters, and even multi-DC replicated clusters backed by SQL databases (e.g. ParserCache). To learn which object cache to use, refer to Object cache on mediawiki.org. To learn more about the principles and expectations of Memcached, refer to Memcached at WMF.
  4. MySQL. Database servers have their own internal query caches and buffer pools. This works transparently and is not directly controllable. It means that repeat queries for the same data are generally faster even with no other caching logic written by you.

Think about how you will invalidate or expire content from the various caching layers. Is it by purging a specific cache key? Or by changing part of a URL or cache key such that the old cache is naturally not used? Your application needs will determine your cache purging strategy.

Since the CDN cache serves content by URL, URLs ought to be deterministic; that is, they should not serve different content from the same URL. Different content belongs at a different URL. This should be true especially for anonymous users.

Cookies

Avoid cookies whenever possible. Cookies create significant risks for CDN stability (per #Caching layers above), and increase the payload of all web requests creating unnecessary data transfers for clients. Consider using sessionStorage or localStorage (via mw.storage) as an alternative to cookies. Before adding cookies, consult with the MediaWiki Platform Team or SRE Traffic.

Design for cache miss

Cache-on-demand: MediaWiki generally promotes a cache-on-demand strategy, formally known as a "cache aside" strategy. When your code creates or modifies something after a user hits "save" or "submit", then along with saving the new data to your primary database, you must also invalidate the relevant caches. (See "#Caching layers" for more.)

If there is a many-to-one relationship between a piece of data and the cache keys that may contain a copy of that data, or if the cache keys contain variables that you cannot easily know during a change event, then you can leverage WANObjectCache getWithSetCallback with a shared "check" key passed in. The check key can be "bumped" via touchCheckKey to perform invalidations instead of using delete(). This also avoids cache stampedes for hot keys: when a specific key has a related check key that is newer than the stored data, the stored data is still there, which allows WANObjectCache to automatically serve the stale value for a short period of time if the site is under heavy load.
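
As a sketch of this pattern (key names, the TTL, and the loadThreadSummary() callback are hypothetical):

  use MediaWiki\MediaWikiServices;

  $cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

  $summary = $cache->getWithSetCallback(
      $cache->makeKey( 'myext-thread-summary', $threadId ),
      $cache::TTL_DAY,
      static function () use ( $threadId ) {
          return loadThreadSummary( $threadId ); // recompute on cache miss
      },
      [
          // Every cache key for this thread shares one "check" key:
          'checkKeys' => [ $cache->makeKey( 'myext-thread', $threadId ) ],
      ]
  );

  // On the write path, one bump invalidates all dependent keys:
  $cache->touchCheckKey( $cache->makeKey( 'myext-thread', $threadId ) );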

Cache misses are normal: Avoid writing code that, on cache miss, is very slow. For instance, a slow COUNT(*) database query cached behind a fast Memcached key is still the same slow database query; cache misses and timeouts eat a lot of resources. Our web servers and database servers explicitly stop (most kinds of) requests if they exceed a 60-second timeout. There is no exemption for presumed-rare requests where your code is filling its caches with slow queries.

Write your queries such that an uncached computation will still meet your latency budget (see #Ballpark numbers).

If you can't make it fast, see if you can do it in the background. For example, MediaWiki core's ParserCache is for the most part not cache-on-demand, but rather pre-populated during edits. It also uses a database as its storage backend rather than Memcached. This has the benefit of almost never exposing a reader to a cache miss, but has the downside of placing a significant responsibility on application code to handle stale values. In the case of ParserCache, it inspects and "verifies" the ParserOutput values after retrieval from cache against the current revision ID, thus not relying on purges or replication, and thus being immune to race conditions between web requests writing to the same cache key.

Another example is the "Statistics" special page, which runs expensive queries offline via a maintenance script and stores its results in a database table, which acts as a non-expiring cache.

Watch out for cached HTML: HTML output may be cached for a long time and still needs to be supported by the CSS and JS. Problems where old JS/CSS hang around are in some ways easier to test, but stale HTML can be insidious!

Cache-on-save: Prior to 2014, it was not uncommon for certain MediaWiki features to try to keep the cache in sync during writes to the database, by proactively recomputing new cache values. This practice is no longer supported as it is unreliable at our scale for numerous reasons, including:

  • It wrongly assumes database writes never fail, time out, or get rolled back later in the same request. It is harmless to do $db->insert() followed by $cache->purge(), because even if your database transaction later fails to commit or times out, the cache will simply repopulate. However, writing to the cache at this point would potentially cause unsaved data to stay in your cache for a long time, confusing subsequent requests.
  • It wrongly assumes there are no concurrent web requests that change the same database rows or cache keys. When two requests come in that edit the same information, MySQL allows them to work in parallel and later ensures the transactions are committed in a deterministic order. However, there is no guarantee that both requests take up the same amount of time, or that the "first" one to win the database write also reaches the cache first. Thus you may end up writing "edit 1" then "edit 2" to the database, but write "edit 2" and overwrite with "edit 1" in the cache, and it stays that way.
  • It wrongly assumes there is no replication lag. Chances are that if your code relies heavily on cache data reflecting the primary database state instantly, it probably does not handle the case well where there is a cache miss and you instead read data from a replica database. Of course, nothing happens "instantly" unless one were to lock all requests together.
  • It wrongly assumes caches are always up and writable. For primary databases, we trade speed for high availability and replication of data (each added host replicates the same data) and thus we accept the (small) risk of failing a web request if the DBs are down. For caches, we focus almost exclusively on speed and maximising potential capacity (each added host is a shard with its own portion of the data, not stored elsewhere). We cannot take the site down every time a cache is slow, unresponsive, or undergoing maintenance; thus cache writes can and will fail in ways we ignore. The only guarantee is that cache attempts are fast (EmptyBagOStuff is a valid implementation of BagOStuff!).
  • It wrongly assumes the application runs from a single region or data center, with cache writes eventually "seen" by all subsequent requests. It is vital that each data center is (mostly) independent, which a requirement to write data to another DC's cache would violate. Each data center is responsible for lazily populating its own caches. Only purges are broadcasted. See also Multi-DC MediaWiki, Perf Matters 2015: Hello WANObjectCache, and Perf Matters 2016: One step closer.

Practical examples about caching layers

  • The WikidataClient extension was fetching a large object from Memcached containing the complete metadata for all wikis, when it only needed information about a single wiki. This led to significant network congestion. This was first reduced by moving the cache from Memcached to php-apcu. Change 93773 later solved it by instead storing the precomputed information as a configuration variable, thus removing the need for any computations at runtime. Background at T58602.
  • The GuidedTour extension was performing uncached template parses on every page view. This led to significant database load due to not taking advantage of ParserCache. (change 67230)
  • The GettingStarted extension introduced a new cookie containing the word Token, which matches a regular expression that Varnish CDN recognises as indicating logged-in page views to never cache. See the code, an initial revert, another early fix, another revert commit, the Varnish layer workaround, the followup fix, the GettingStarted fix part 1 and part 2, and the regex fix.
  • The TwnMainPage extension offloads expensive statistics to the JobQueue. The extension periodically queues a job to recalculate statistics shortly before the cache would expire. In the event that the cache data disappears, it temporarily shows no statistics until the job is completed. This is a good compromise that keeps pages fast and avoids slow queries during a cache miss. See CachedStat.php. It sets a limit of 1 second for calculating stats in SpecialTwnMainPage.php.
  • A change to the RecentChanges feed in MediaWiki core accidentally removed Cache-Control. The feed was previously cached for only 10 seconds by the CDN, but that was enough to absorb the vast majority of feed crawling. The cache was promptly restored (T65249).

Leverage the platform!

  • Move work to post-send Deferred updates or Jobs if it doesn't have to happen in the critical path of a web response.
  • When interacting with Memcached, use the getWithSet idiom and use WANObjectCache. Avoid calling Memcached directly. WANCache automatically takes care of numerous "at scale" needs such as stampede protection, purging, mutex locks, and warms your caches by automatically regenerating values before they expire. This reduces the chances of hot keys ever getting a cache miss. WANCache also integrates with dev tooling such as rate and latency stats on Grafana: WANObjectCache.
  • Use core service classes whenever possible and look for batch methods that can process several of your requests in parallel. Our database abstraction layer, cache interfaces, and HTTP clients all support batching. For example: IDatabase::select, IDatabase::selectFieldValues, BagOStuff::getMulti, HttpRequestFactory::createMultiClient, and FileBackend::doOperations.
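
For instance, a sketch of batched cache access with BagOStuff::getMulti (the key names and the database fallback are hypothetical; $cache is assumed to be a BagOStuff or WANObjectCache instance):

  $keys = [];
  foreach ( $userIds as $id ) {
      $keys[$id] = $cache->makeKey( 'myext-user-badge', $id );
  }
  // One round trip to Memcached for all keys, instead of one per user.
  $found = $cache->getMulti( array_values( $keys ) );

  $badges = [];
  foreach ( $keys as $id => $key ) {
      $badges[$id] = $found[$key] ?? loadBadgeFromReplicaDb( $id ); // fallback on miss
  }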

You are not alone

If your code is amplifying a pre-existing performance issue in another component or service, identify these and ensure relevant teams are made aware. The MediaWiki Platform Team can help you in finding and/or advocating for these cross-component issues.

Work with the team to understand general performance targets before you start architecting your system. For example, a user-facing application might have an acceptable latency of 200ms while a database might have something like 20ms or less, especially if further access is decided based on the results of previous queries. You don't want to prematurely optimise, but you need to understand if your targets are physically possible.

You might not need to design your own backend; consider using an existing service, or having someone design an interface for you. Consider modularization. Low-level performance is hard, avoid reinventing the wheel!

Persistence layer

Choose the right persistence layer for your needs. In some cases, a cache can be used instead of a persistence layer.

MediaWiki is configured in production to leverage local services that include Memcached, JobQueue, ExternalStore and MainStash (MariaDB), and Swift. These reside on a low-latency network in the same data center.

The most appropriate layer is the lightest store that meets your requirements. Avoid incurring overhead that serves unneeded guarantees. Here's when we recommend each:

  1. Cache - short-term storage of ephemeral unstructured data of small or medium size (up to 800KB).
  2. MainStash - replicated storage of ephemeral unstructured data and small to large blobs (avoid larger than 2MB), backed by MySQL/MariaDB.
  3. JobQueue - reliable short-term persistence of a job description. Once the job is completed, the job parameter data is automatically pruned.
  4. Database - long-term storage of structured data and blobs, backed by MySQL/MariaDB via dedicated table schemas.
  5. FileRepo - long-term storage of binary media files, backed by Swift. This is used by user file uploads, but can also be used for generated files like the Score extension (for MIDI files) and ConfirmEdit extension (for captchas). See wikitech:Media storage for details.

Most caches are backed by Memcached; see Object cache on mediawiki.org for guidance on when and how to use WANObjectCache and the other cache interfaces. Use a cache for data that may persist between web requests, but that is okay to quietly fail and be pruned at any time. Use Memcached to store objects if the database could recreate them but doing so would be computationally expensive; you don't want to recreate them too often. You can imagine a spectrum between caches and stores, varying in how long developers expect objects to live in the service before getting evicted; refer to #Caching layers for more details.

Schema and API design

  • Store files and metadata under stable identifiers. For example, revision IDs, page IDs, and user IDs won't change, whereas page titles and user names do. Nearly all tables should have an immutable "ID" column, especially if there is also a "name" column. ID columns should be unsigned integers with values assigned by the database (e.g. AUTO_INCREMENT). In rare cases, a UUID column can be used instead. ID columns are essential for efficient indexes, low-maintenance foreign keys, and simpler schema migration scripts.
  • Large immutable blobs typically go in ExternalStore. We must avoid large and numerous blob inserts on the main database clusters (e.g. s1-s8) as this can lead to contention and poor performance for many reasons (e.g. replication lag, poor use of buffer pool, poor use of disk space). This is why we store revision content (e.g. wikitext) in ExternalStore, which we use as an immutable, append-only key-value store. If the blobs are ephemeral, use a cache layer instead, such as Memcached or MainStashDB.
  • Try to design your schema and API around single-record access for create/update/delete and multi-record access for fetching/querying. If the internal classes or web API needs to support multi-record write operations, then a loop will often suffice. If hundreds of records are often involved or if the entry point becomes high traffic, then batched write query optimisations become justified. Most batching optimisations should be focused on APIs that can fetch multiple records by ID/name or page through records by ID/name/timestamp (possibly involving a JOIN with other records).
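
A sketch of that shape, with hypothetical table and column names:

  use Wikimedia\Rdbms\IDatabase;

  // Single-record write: insert one row, return its new ID.
  function createWidget( IDatabase $dbw, array $fields ): int {
      $dbw->insert( 'myext_widget', $fields, __METHOD__ );
      return $dbw->insertId();
  }

  // Multi-record read: fetch many rows by ID in one indexed query.
  function getWidgetsByIds( IDatabase $dbr, array $ids ): array {
      $res = $dbr->newSelectQueryBuilder()
          ->select( [ 'wgt_id', 'wgt_name', 'wgt_value' ] )
          ->from( 'myext_widget' )
          ->where( [ 'wgt_id' => $ids ] ) // expands to wgt_id IN ( ... )
          ->caller( __METHOD__ )
          ->fetchResultSet();
      return iterator_to_array( $res, false );
  }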

Practical examples of persistence

Example: Permanent names

MediaWiki stores metadata about uploaded files in the database, using the display title as identifier. This requires a slow database operation when someone renames an uploaded file: updates to several database tables, and updates to a potentially large number of rows (every matching and associated table row). With a stable numerical identifier, title changes can be applied to the original row only. All other data remains associated by the stable ID.[4]

Example: Large object size

Avoid storing multi-megabyte data in Memcached; it is discarded automatically. Verify whether caching is effective in terms of hit ratio, time saved, and backend load saved. For example, prior to 2015, ResourceLoader stored minified JavaScript code in Memcached. While the cache had a high hit-ratio and saved redundant minification computations, we proved that it was faster to run the minifier than to fetch and wait for cached data over the network from a Memcached server. It was cheaper and faster to recalculate than to retrieve, and doing so freed up web server load, Memcached load, and network traffic. Another example is the Watchlist data, which performs a lookup like "is this page watched by the current user". We choose not to cache this, because the lookup is indexed and just as fast to query from a database as from Memcached. More importantly, caching it would make things worse: the wide range of page-user combinations means it is almost never a cache hit, so you would wait for Memcached first and then still wait for a database server afterwards. WANObjectCache in Grafana provides statistics about how effective your cache keys are in cache-hit ratio, fetch time, and computation time.

Example: Job queue use cases

  • The TimedMediaHandler extension uses jobs to transcode large video uploads into variants for specific resolutions and video codecs.
  • The UploadWizard extension uses an Action API to upload large files in chunks. The API queues a job to take care of reassembling the chunks into a single file for downloading and streaming. The uploader can meanwhile start writing the file description, metadata, etc., while the UploadWizard frontend uploads the file one chunk at a time in the background.
  • The core HTMLCacheUpdate job purges articles from the CDN (Varnish) after a given template was edited. It also queues a job to invalidate the page_touched field of matching rows in the MySQL database, which instructs ParserCache to re-parse the article next time someone views it.
  • Other extensions that use the job queue include RenameUser, TranslationNotification, Translate, GWToolset, and MassMessage.

Example: Send a notification

Suppose we need to store a new notification. Good solution: run the actual notification action (sending an email) via the jobqueue. Bad solution: Send the email during the web request, thus delaying the browser response, and holding open the database transaction until the email is sent (and with no automatic re-try).
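
A sketch of the good solution, assuming a hypothetical 'sendNotificationEmail' job type registered in extension.json under "JobClasses":

  use MediaWiki\MediaWikiServices;

  // In the web request: queue the work instead of doing it inline.
  MediaWikiServices::getInstance()->getJobQueueGroup()->lazyPush(
      new JobSpecification( 'sendNotificationEmail', [
          'recipientId' => $recipient->getId(),
          'notificationType' => 'mention',
      ] )
  );

  // Executed later by a JobQueue runner (in the primary data center):
  class SendNotificationEmailJob extends Job {
      public function __construct( array $params ) {
          parent::__construct( 'sendNotificationEmail', $params );
      }

      public function run() {
          // Look up the recipient and send the email here. Failures can be
          // retried by the JobQueue without affecting the original request.
          return true;
      }
  }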

Example: Count active users

The Beta features extension lets a user opt in to a "Beta feature" and displays, to the user, how many users have opted in. The preference data is stored in the user_properties table. Counting the number of opted-in users every time the count is displayed would not have acceptable performance. Thus, the BetaFeatures extension for MediaWiki stores the count in a dedicated betafeatures_user_counts database table. It's important for a change in preference to immediately apply to a user's own experience on subsequent page views, but it's not important to immediately increase or decrease the count. Therefore, BetaFeatures uses the job queue to recompute the count after every preference change. Specifically, the extension queues a job that runs a SELECT query. This query can take a long time on large wikis - up to several minutes! Once done, the next time someone views the Beta features preferences page, the new count is fetched from the "counts" table where the job stored it. Code: UpdateBetaFeatureUserCountsJob.php and BetaFeaturesHooks.php.

Indexing

When designing new tables, or adding new database queries, consider that all queries must use an index (including write queries!). Use EXPLAIN on your queries and create new indices where required.

Unless you're dealing with a tiny table, you need to index writes (similarly to reads). Watch out for deadlocks and for lock-wait timeouts. Try to do updates and deletes by primary key, rather than some secondary key. Try to avoid UPDATE/DELETE queries on rows that do not exist. Make sure join conditions are cleanly indexed.

You cannot index blobs, but you can index blob prefixes (the substring comprising the first several characters of the blob).

Compound keys - namespace-title pairs are all over the database. You need to order your query by asking for namespace first, then title!
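
For example, a sketch of paging through main-namespace pages in an order that matches the (page_namespace, page_title) index; the surrounding variables are hypothetical:

  $res = $dbr->newSelectQueryBuilder()
      ->select( [ 'page_id', 'page_title' ] )
      ->from( 'page' )
      ->where( [ 'page_namespace' => NS_MAIN ] ) // equality on the first key column
      ->andWhere( 'page_title > ' . $dbr->addQuotes( $continueFrom ) ) // range on the second
      ->orderBy( [ 'page_namespace', 'page_title' ] ) // same order as the index
      ->limit( 50 )
      ->caller( __METHOD__ )
      ->fetchResultSet();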

Use EXPLAIN and MySQL DESCRIBE to find out which indexes are used by a specific query. If it says "Using temporary table" or "Using filesort" in the Extra column, that's often bad! If "possible_keys" is NULL, that's often bad (small sorts and temporary tables are tolerable though). An "obvious" index may not actually be used due to poor "selectivity". See the Measure backend performance in production guide, and for more details, see Roan Kattouw's 2010 talk on security, scalability and performance for extension developers, Roan's MySQL optimization tutorial from 2012 (slides), and Tim Starling's 2013 performance talk.

Indexing is not a silver bullet; more isn't always better. Once an index gets big enough that it doesn't fit into RAM anymore, it slows down dramatically. Additionally, an index can make reads faster, but writes slower.

Good example: See the block and block_target (formerly known as ipblocks) and page_props tables. One of them also offers a reverse index, which gives you a cheap alternative to ORDER BY.

Bad example: See this changeset (a fix). As the note states, "needs to be id/type, not type/id, according to the definition of the relevant index in wikibase.sql: wb_entity_per_page (epp_entity_id, epp_entity_type)". Rather than using the index that was built on the id-and-type combination, the previous code (that this is fixing) attempted to specify an index that was "type-and-id", which did not exist. Thus, MariaDB did not use the index, and thus instead tried to order the table without using the index, which caused the database to try to sort 20 million rows with no index.

Multiple datacenters

MediaWiki at WMF is actively responding to requests from multiple data centers.

MediaWiki requests should only contact other services within the local datacenter, e.g. reading data from local database replicas.

During HTTP GET/HEAD/OPTIONS requests to MediaWiki, avoid writing to MySQL (aside from the MainStash), FileRepo/FileBackend (Swift), and LockManager (redis). Exceptional MySQL/Swift writes can be done via the JobQueue. We have configured the JobQueue such that it can independently accept new jobs locally in each data center. Behind the scenes, asynchronous workers will eventually relay and execute the job in the primary DC. That is, the logic of Job::run() executes in the primary DC only, where you can safely perform writes to primary database tables.

Other applications should be designed to handle GET/HEAD/OPTIONS requests in any data center, and other HTTP requests (POST/PUT) in at least the primary datacenter (if not any datacenter). Any writes should be synchronously committed within the primary datacenter, with any replication to remote data centers occurring asynchronously. An independent non-MediaWiki API service might be able to run write APIs correctly in multiple data centers at once if it has limited semantics and has no relational integrity dependencies on other source data persistence layers. For example, if the service simply takes end-user input and stores blobs keyed under new UUIDs, there is no way for writes to conflict. If updates or deletions are later added as a feature, then Last-Write-Wins might be considered a "correct" approach to handling write conflicts between data centers (e.g. if only one user has permission to change any given blob, then all conflicts are self-inflicted). If write conflicts are not manageable, then your API requests need to be routed by our CDN to the primary datacenter only.

Shared resources

Be mindful that your code is running in an environment with finite resources used to service numerous APIs, some of which must handle high traffic with low latency. Each application server and intermediary proxy layer involved in handling a request has limited CPU, disk space, disk bandwidth, network bandwidth, and worker threads. There is a finite pool of application and proxy servers. The same holds true of persistence layers such as databases, object stores, queues, and cache servers.

We recommend thinking of locks and semaphores as an intentionally limited resource as well. When code sits idly waiting for a lock or semaphore during a web response, that code still occupies one of the workers on the backend servers, as well as a worker thread on each intermediary layer that is waiting to proxy the response back to the browser (e.g. Apache, Varnish, Envoy, etc.). If too many workers are waiting for locks, this can cause overloads or even outages.

Transactions

Every web request and every database operation, in general, should occur within a transaction. MediaWiki automatically creates a transaction for you during web requests when interacting with the main MySQL databases (we use the default "REPEATABLE-READ" isolation level). This offers you a simpler developer experience, where your code is atomic and consistent by default. For example, when you select an ID from one table, you can rely on being able to find the associated row in another table. This does mean MySQL must hold on to any locks until the web request (and thus the transaction) is completed, which makes us more prone to lock contention.

MySQL uses "record locks" to lock individual rows (or gaps) that prevent other requests from creating new rows that would be similar or conflict with one your database writes. Gap locks can be confusing to understand and hence cause surprising cases of contention. Refer to On MySQL locks (Kumar, 2013) to learn about gap locks.

Be careful when mixing the operations of a database transaction with significant operations to other services (e.g. those that take over 10ms). Also, think about the order in which rows are locked due to your writes (e.g. UPDATE and INSERT queries). When you update, delete, or insert rows in the database ask yourself:

  • What table index can mysql use to guarantee that your SELECT query is fast?
  • What row(s) and row gaps will this cause mysql to lock?
  • Are there other functions or classes that write to the same table?
  • What else will happen between the write query (and thus the lock) and the end of the web response? (when the transaction is committed).

You can choose to instruct MySQL to explicitly lock a row or gap during a SELECT and return the latest committed data (plus any changes pending in the current transaction as always). This can be useful when reading data that will be used to determine the next update queries. This can be done via "SELECT ... FOR UPDATE". Note that this can lock the same way that an UPDATE query with the same WHERE clause would. Be particularly wary of using FOR UPDATE in queries that may not match any records. In that scenario, MySQL will issue a next-key insert intention lock that is effectively a gap lock. If the table has few records or large gaps, the gap lock might effectively prevent all concurrent inserts;[5][6] for an example of this happening, see T381622. The use of LOCK IN SHARE MODE is strongly discouraged as it can lead to deadlocks. The use of LOCK TABLES is considered problematic and not supported on our platform.
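
A sketch of the read-then-update pattern with FOR UPDATE (the table and columns are hypothetical; note the gap-lock caveat above if no row matches):

  $dbw->startAtomic( __METHOD__ );

  $row = $dbw->newSelectQueryBuilder()
      ->select( [ 'thread_id', 'thread_reply_count' ] )
      ->from( 'myext_thread' )
      ->where( [ 'thread_id' => $threadId ] )
      ->forUpdate() // locks like an UPDATE with the same WHERE clause
      ->caller( __METHOD__ )
      ->fetchRow();

  if ( $row ) {
      $dbw->update(
          'myext_thread',
          [ 'thread_reply_count' => $row->thread_reply_count + 1 ],
          [ 'thread_id' => $threadId ],
          __METHOD__
      );
  }

  $dbw->endAtomic( __METHOD__ );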

Each individual write query should normally take less than 5ms to complete, even on the largest production databases. This may require designing and indexing for that during the planning phase, which DBAs can help you with.

Each overall transaction (all writes from one web request) should normally stay open for less than 250ms in total. This is important to reduce contention and thus how long other requests may be required to wait. Try to move updates made to highly contentious tables to the end of the request (e.g. via DeferredUpdates). For example, for an API request that stores an uploaded file in Swift (which may take a while if the file is large) and also updates a row in the database, we recommend performing the file operation first and then the database write. This reduces the time between the start of the database transaction and the end of the request.

Be mindful of hooks that often run at various points through the web response, which extensions can use to implement additional logic and that may run extra database queries as part of that hook. To minimise the risk of timeouts, deadlocks, and non-atomic updates, aim for speed and simplicity for any database writes during the main transaction round. Updates that take non-trivial time or are complex should use post-send DeferredUpdates or the JobQueue when possible, to better isolate different features from one another. Use purges to invalidate caches rather than in-place cache updates.

Example: Here's a common mistake that can cause inappropriate locking. The user_properties table follows the three-column "Entity-attribute-value" pattern:

  1. user ID (entity).
  2. user preference name (attribute).
  3. user preference value.
  4. Primary key: user ID + preference name.

When saving a change, it may be tempting to delete all existing rows for the user ID, and insert everything you know about the new state. But this would cause database contention. What MediaWiki does instead is to only make changes to the table by the unique primary key for each row. First, we select the existing rows; then we use "upsert" for the rows that have changed (which adds missing rows and replaces existing ones); and lastly we "delete" the rows that are no longer needed. This is an important example where more code and more queries produce a significantly faster and more efficient outcome at scale, despite perhaps feeling slower when measured in isolation as a single action in an empty world.
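
A simplified sketch of that pattern follows; the actual implementation in MediaWiki core differs in detail, and buildExcludedValue() requires a recent MediaWiki version:

  // 1. Read the current rows for this user.
  $existing = [];
  $res = $dbw->newSelectQueryBuilder()
      ->select( [ 'up_property', 'up_value' ] )
      ->from( 'user_properties' )
      ->where( [ 'up_user' => $userId ] )
      ->caller( __METHOD__ )
      ->fetchResultSet();
  foreach ( $res as $row ) {
      $existing[$row->up_property] = $row->up_value;
  }

  // 2. Upsert only the rows that are new or changed, keyed by the primary key.
  $changed = [];
  foreach ( $newPrefs as $name => $value ) {
      if ( !array_key_exists( $name, $existing ) || $existing[$name] !== $value ) {
          $changed[] = [ 'up_user' => $userId, 'up_property' => $name, 'up_value' => $value ];
      }
  }
  if ( $changed ) {
      $dbw->upsert( 'user_properties', $changed,
          [ [ 'up_user', 'up_property' ] ],
          [ 'up_value = ' . $dbw->buildExcludedValue( 'up_value' ) ],
          __METHOD__
      );
  }

  // 3. Delete only the rows that are no longer needed, by primary key.
  $removed = array_keys( array_diff_key( $existing, $newPrefs ) );
  if ( $removed ) {
      $dbw->delete( 'user_properties',
          [ 'up_user' => $userId, 'up_property' => $removed ],
          __METHOD__
      );
  }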

Transaction guidelines:

  • Wrap write operations that are meant to be atomic in IDatabase::startAtomic() and IDatabase::endAtomic() for each affected database. These methods can nest, so atomic method A can invoke atomic methods B, C, and D. Declaring an atomic section is most needed when the code runs in CLI mode (where automatic transactions are disabled). Only use IDatabase::ATOMIC_CANCELABLE if you need to call IDatabase::cancelAtomic() in some cases, as it incurs extra round trips.
  • When reading data via SELECT to determine what write query to issue, you should normally use "SELECT ... FOR UPDATE". See above for potential complications. Avoid using "LOCK IN SHARE MODE".
  • APIs for making multiple changes at once should take stable identifiers as input whenever possible. IDs are preferred over names, because operating on multiple (mutable) names is prone to race conditions and deadlocks. If deadlocks or lock wait timeouts become a problem, consider using one of the following strategies (in the order listed).
    1. Consider moving high-contention row writes to the end of the transaction via IDatabase::onTransactionPreCommitOrIdle() (see the sketch after this list).
    2. Consider moving slow non-database operations to before high-contention database queries. If it is OK for the database operation to sometimes succeed while the non-database operation failed, consider using a DeferredUpdate.
    3. Consider moving slow database operations to a DeferredUpdate if they do not need to be highly atomic with respect to the main changes.
    4. If the records must be implicitly specified by a hierarchical relationship (e.g. "all comment rows for this thread row"), consider first locking the parent record via "SELECT ... FOR UPDATE" at the start of the operation. This can reduce deadlocks caused by concurrent operations to similar parts of the underlying B*-Trees in MySQL.
    5. If the records must be implicitly specified based on a broader conditional (e.g. inequalities, JOINs, sub-queries), consider selecting the ID of the matching rows first, without "FOR UPDATE" and then changing the rows with those IDs that still match the condition (e.g. restate any WHERE clause). This can reduce contention and deadlocks caused by MySQL "gap locking". Note other concurrent transactions can insert rows that would have matched the condition and commit before the transaction ends. Any associated risk to the API semantics should be considered. Sometimes, leaving out the restated WHERE in an UPDATE/DELETE query can reduce contention (e.g. if query planner index use is poor), but this increases the risk of anomalies (e.g. changing rows that no longer matched due to concurrent transaction writes).
    6. For tables with a parent/child relationship (e.g. page/revision or thread/comment), where the parent table has a "touched"/"CAS" column, a hybrid of the two above approaches can be used. If all operations on a child table row first do a "SELECT ... FOR UPDATE" of the corresponding parent table row, then check that a SELECT without "FOR UPDATE" yields the same value for the "touched"/"CAS" column, and update that column by the end of the operation, then the child table writes can first SELECT the IDs of the matching rows without FOR UPDATE, and then change them in a second query.
    7. A variation of the above pattern uses ILoadBalancer::CONN_TRX_AUTOCOMMIT database handles and replaces FOR UPDATE with IDatabase::getScopedLock() calls on keys named after each specified parent row ID (e.g. "page-<ID>").
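
To illustrate strategy 1 above, here is a sketch that defers a high-contention counter update until just before the transaction commits; the table and column names are hypothetical:

  $dbw->onTransactionPreCommitOrIdle( static function () use ( $dbw, $threadId ) {
      // The row lock on this hot counter row is now held only for the
      // brief window between this write and the commit.
      $dbw->update(
          'myext_thread_stats',
          [ 'ts_reply_count = ts_reply_count + 1' ],
          [ 'ts_thread' => $threadId ],
          __METHOD__
      );
  }, __METHOD__ );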

Long-running queries

We generally develop service classes and database queries with the expectation that they will run in response to a web request. Other contexts where your code may get called are JobQueue jobs and command-line Maintenance scripts.

MySQL uses snapshots for SELECT queries, and the snapshotting persists until the end of the database connection or transaction. Snapshots implement "REPEATABLE-READ" semantics, which ensures that within your query session, you see the database as it existed at a single point in time (the time of the first SELECT). Keeping one connection open for more than a few seconds is generally problematic on regular database replicas. A long-lived snapshot forces MySQL to retain old versions of any rows that change in the meantime, because your connection might still query them later.

Queries that select or read data and take a long time (5 seconds or more) must be run offline, via database hosts dedicated for that purpose.

Note that when assessing whether your queries will take "a long time" or cause contention, measure them. These numbers will always be relative to the performance of the server in general, and to how often it will be run. You can also search the DBPerformance channel on logstash for the URLs or table names handled by your code.

Good examples:

  • Special pages that display data based on complex queries are generated periodically by a maintenance script, run via a cron job. The queries in question use the "vslow" query group, which directs the connection to a live database replica in production set aside for slow queries. See also: updateSpecialPages.php, QueryPage::isExpensive, and $wgMiserMode.
  • Analytics reports are standalone and separate from any deployed MediaWiki core. These are generated via periodic cron jobs on analytics clients (aka "stat" machine) that query the Analytics DB Replicas, which contain a full un-redacted near-realtime mirror of the production MediaWiki databases.

Advisory locks

Beware of common pitfalls around simple advisory locks (e.g. from LockManager, PoolCounter, BagOStuff::getScopedLock, or Database::getScopedLock).

Make your lock key as granular and narrow as possible to allow high concurrency of other similar operations that don't conflict.

Likewise, aim to hold the lock for a short duration by releasing it as soon as the critical operation is done. This allows high throughput of operations that modify the same data. When a burst of requests all need to modify the same data (e.g. traffic spike about the same subject, or from the same bot account), each of those will be waiting in their respective "lock" function call for the previous one. The sooner you can release a lock the better!

Blocking locks are prone to deadlocks, which is a race condition where two requests are both waiting and can't continue (e.g. request 1 waits for lock B held by request 2, and request 2 waits for lock A held by request 1). This is more likely to happen if you have multiple different locks throughout your code execution. If possible, organise the code such that its locks can be acquired up front and then later released all at once. This way, once you have the locks, your code can't get stuck half-way.

Blocking locks tend to be used when code is written such that each conflicting operation is performed one after the other (additive/incremental).

Whenever possible, prefer non-blocking locks. This is possible when the code is designed such that operations only need to be performed by a single server at any given time, and other servers fall back to doing nothing or showing an error message. For example, can the code work such that multiple refreshes of the same data are redundant? Or such that changes which overwrite data would have one party lose either way?
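
A sketch of a narrow, non-blocking lock with BagOStuff::getScopedLock (the key name and the work being guarded are hypothetical):

  use MediaWiki\MediaWikiServices;

  $stash = MediaWikiServices::getInstance()->getMainObjectStash();
  $key = $stash->makeKey( 'myext-regen-report', $reportId );

  // timeout = 0: do not wait if another request already holds the lock.
  $lock = $stash->getScopedLock( $key, 0, 60 );
  if ( !$lock ) {
      return; // someone else is already regenerating this report
  }
  regenerateReport( $reportId ); // hypothetical expensive operation
  // The lock is released as soon as $lock (a ScopedCallback) goes out of
  // scope, so keep the critical section as small as possible.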

Rate limiting

If your product exposes new user actions that make database modifications beyond the standard page creation / page editing mechanism, then firstly consider whether this is appropriate and scalable. You're going to have a lot less maintenance overhead and operational risk if you adopt "Everything is a wiki page". See also Choose boring technology by Dan McKinley.

If you do have to expose new "write" actions, make sure a rate limit is applied.

Example:

  • UrlShortener exposes an API to create new short URLs, which needs a rate limit. This is typically powered by User::pingLimiter. T133109
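
A sketch of applying such a limit inside a (hypothetical) Action API module; the 'myext-createshorturl' limit name would be configured in $wgRateLimits:

  if ( $this->getUser()->pingLimiter( 'myext-createshorturl' ) ) {
      // Counts this attempt against the limit and rejects the request
      // once the configured threshold is exceeded.
      $this->dieWithError( 'apierror-ratelimited' );
  }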

For slow or expensive computations that involve no database writes, consider implementing a throttle based on PoolCounter to limit overall server load.

Example:

  • Special:Contributions exposes a database read query that can be slow. This is rate limited by PoolCounter. See also T234450 and T160985.
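
A sketch of wrapping an expensive read in PoolCounter; the 'MyExtensionExpensiveReport' pool type would need an entry in $wgPoolCounterConf, and the key and query callback are hypothetical:

  $work = new PoolCounterWorkViaCallback(
      'MyExtensionExpensiveReport',
      'myext-report:' . $targetId, // limits concurrency per target
      [
          'doWork' => static function () use ( $targetId ) {
              return runExpensiveReportQuery( $targetId );
          },
          'error' => static function ( $status ) {
              return false; // too many concurrent requests; show a cheap fallback
          },
      ]
  );
  $result = $work->execute();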

Further reading

Articles

Talks

ResourceLoader

For frontend performance guidelines, see MediaWiki Engineering/Guides/Frontend performance practices.

Meta

Sources that helped influence these guidelines, and future drafts and ideas:

References

  1. Measuring Wikipedia page load times (2018), Timo Tijhof.
  2. HPBN Book, Chapter: Primer on Web Performance (2013), Ilya Grigorik.
  3. “How Not To Measure Latency” (2016), a tech talk by Gil Tene.
  4. For more information about the sad state of media storage, refer to task T28741 and the various parent tasks and "Mentions" tasks detailing its problems.
  5. "Select for update on non-existent rows" (2019), https://mysqlquicksand.wordpress.com/2019/12/20/select-for-update-on-non-existent-rows/
  6. MySQL bug #25847: Add predicate locking to avoid deadlocks due to locking non-existent rows