MediaWiki Engineering/Guides/Measure backend performance

This page provides an entry point for measuring the performance of existing code.

If you are starting a new project or otherwise have not yet set performance objectives, first read our guidance on Backend performance.

Production impact

Once code is in production, the following methods are good starting points for assessing overall performance: for example, to maintain awareness of your code's "hot" spots, to investigate during an incident, or to verify the impact after deploying a major change to a large wiki.

New code is deployed once a week by default (as part of our Train deployment process). The train rolls out over the course of the week, starting with smaller wikis and moving to higher-traffic sites like Wikipedia by Thursday. This naturally ramps up and load-tests changes to all code. See wikitech:Deployments/One week for more information.

Flame graphs

We continuously sample live traffic on MediaWiki backends in production, and generate aggregate flame graphs every few minutes.

https://performance.wikimedia.org/php-profiling/

The flamegraphs are organised by date and entrypoint. For example, if you work on the CirrusSearch extension and want to analyze the latency of the ApiOpenSearch route (api.php?action=opensearch), you'd navigate to a recent daily flamegraph for the "api" entrypoint, and use Ctrl-F to search for "ApiOpenSearch". You can then click on the pink highlighted area to zoom in.

Similarly, you can analyze the latency of index.php routes (like ViewAction, or SpecialWatchlist), rest.php route handlers, load.php module classes, etc.

Each flame graph displays the relevant deployment branch ("train week") at the bottom of the call stack as part of the file path. This indicates which version of the code was measured, and allows for week-over-week comparison by opening two flame graphs in separate tabs. When comparing week over week, pick a day where the majority of the flame graph comes from a single deployment branch; if the graph is clearly split between two major versions, pick an earlier day instead. We recommend comparing weekend days or Mondays.

WANObjectCache

Grafana: WANObjectCache by keygroup

We encourage all Memcached usage to go through the WANObjectCache interface, specifically the getWithSetCallback idiom.

Among the many operational benefits it transparently ensures, it also automatically provides rich real-time telemetry on how each cache key group is behaving. Open the "Key group" drop-down and select one of your cache keys. The important metrics to maintain general awareness of are the "Cache-hit rate" and the "Regeneration callback time". These measure the outcome of getWithSetCallback and the time taken by your data computation.
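
For illustration, here is a minimal sketch of the getWithSetCallback idiom. The key group name "mykeygroup" and the computeExpensiveThing() helper are hypothetical; the metrics above are reported per key group automatically.

use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

$value = $cache->getWithSetCallback(
	// The key group ("mykeygroup") is what appears in the Grafana "Key group" drop-down
	$cache->makeKey( 'mykeygroup', $id ),
	$cache::TTL_HOUR,
	function ( $oldValue, &$ttl, array &$setOpts ) use ( $id ) {
		// Regeneration callback: only runs on a cache miss.
		// Its duration is reported as "Regeneration callback time".
		return computeExpensiveThing( $id );
	}
);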

Server monitoring

Database queries

  • Logstash: mediawiki-rdbms-performance (restricted), warnings about database queries from any MediaWiki core component or extension, where a performance threshold is exceeded (such as unexpected primary DB or cross-dc connections, and slow queries).
  • Logstash: slow queries (restricted), dedicated breakdown specifically for "slow" query warnings.

You can filter these to your area of code by searching for exception.trace:MyExtension or exception.trace:CoreComponent. Note that such filters work on any MediaWiki-related dashboard in Logstash, not just those about performance. We recommend that engineering teams maintain their own Logstash dashboard where a filter like this is applied to all warnings, so that issues are automatically brought to your attention regardless of which component they come from. Learn more: OpenSearch Dashboards. The home page of Logstash links to examples from e.g. the Growth, Performance, and Language teams.

Debug

In addition to the above continuous monitoring, you can also use WikimediaDebug to capture your own ad-hoc performance profile while performing a user action or pageview.

Local development

You can capture detailed trace logs, timing measures, and flame graphs from your local MediaWiki install.

If you use MediaWiki-Docker, the packages needed are already installed and you can follow the MediaWiki-Docker/Configuration recipes/Profiling page instead. Otherwise, refer to Manual:Profiling on mediawiki.org for how to install the relevant packages in your local development environment.

It is recommended that you include the DevelopmentSettings.php preset in your LocalSettings.php file. This is done for you by default in MediaWiki-Docker. Among other things, this enables a reasonable baseline of debug modes. There is an additional section of commented-out "ad-hoc debugging" settings that you can copy to LocalSettings.php as well, and enable as and when you need them (such as $wgDebugToolbar and $wgDebugDumpSql, referred to below).
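
For example, a LocalSettings.php excerpt along these lines (a sketch; the require path is as shipped with MediaWiki core, and the two debug flags are the ones mentioned above):

// Enable a reasonable baseline of debug modes for local development
require_once "$IP/includes/DevelopmentSettings.php";

// Ad-hoc debugging: enable while investigating, disable afterwards
// $wgDebugToolbar = true;  // on-page overview of database queries and debug log entries
// $wgDebugDumpSql = true;  // write all SQL queries to the debug log file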

Database queries

As you develop your database queries, use EXPLAIN and MySQL's DESCRIBE statements to find which indexes are involved in a particular query.

You can find out which exact queries are coming from your code by enabling $wgDebugToolbar in LocalSettings (see also Manual:How to debug). This provides an overview of all queries issued by a given page. For API or other web requests, you can consult the debug log file, which records all SQL queries when $wgDebugDumpSql is enabled.

When adding a new query to your code (e.g. via the Database::select() helper from our Rdbms library), try to run a version of that query at least once with the EXPLAIN statement, and make sure that it effectively uses indexes. While a select query without an index may run fast for you locally, it will perform very differently when there are several billion objects in the database.
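
As a rough sketch of how to do this from eval.php or a maintenance script (assuming a MySQL backend; the table, fields, and conditions here are hypothetical), you can ask the Rdbms library for the SQL text it would generate and run EXPLAIN on it:

use MediaWiki\MediaWikiServices;

$dbr = MediaWikiServices::getInstance()->getDBLoadBalancer()->getConnection( DB_REPLICA );

// Build the SQL text without executing the query
$sql = $dbr->selectSQLText(
	'page',
	[ 'page_id', 'page_title' ],
	[ 'page_namespace' => 0 ],
	__METHOD__
);

// Ask MySQL how it plans to execute the query, then check the "key" column
// of each row to confirm that an index is actually used.
foreach ( $dbr->query( 'EXPLAIN ' . $sql, __METHOD__ ) as $row ) {
	print_r( $row );
}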

With the Debug toolbar enabled, look out for:

  • Repeat queries: data should be queried authoritatively once by a service class and then re-used or passed around as needed. If two unrelated callers of a service class regularly need the same data, consider an in-class cache, and limit the size of this cache to avoid uncontrolled growth (e.g. in API batch requests, jobs, or CLI scripts). Even if you don't have a UI for batch operations, a higher-level feature may still cause your code to be called in a loop. MediaWiki core provides MapCacheLRU and HashBagOStuff to make it easy to keep a limited number of key-value pairs in a class instance; see the sketch after this list.
  • Generated queries: if you see many similar queries that differ only in one variable, they may be coming from a loop that should instead query the data in a batch upfront.
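
A minimal sketch of such an in-class cache using MapCacheLRU (the service class, method, and key names here are hypothetical):

class PageSummaryLookup {
	/** @var MapCacheLRU Bounded in-class cache, here capped at 100 entries */
	private $cache;

	public function __construct() {
		$this->cache = new MapCacheLRU( 100 );
	}

	public function getSummary( int $pageId ): string {
		$key = (string)$pageId;
		if ( !$this->cache->has( $key ) ) {
			// Query the data authoritatively once per page ID, even if this
			// method is called repeatedly during the same request.
			$this->cache->set( $key, $this->fetchSummaryFromDatabase( $pageId ) );
		}
		return $this->cache->get( $key );
	}

	private function fetchSummaryFromDatabase( int $pageId ): string {
		// ... the actual database query would go here ...
		return '';
	}
}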

For more details, see also: Roan Kattouw's 2010 talk on security, scalability and performance for extension developers, Roan's MySQL optimization tutorial from 2012 (slides), and Tim Starling's 2013 performance talk.

  • You must consider the cache characteristics of your underlying systems and modify your testing methodology accordingly. For example, if your database has a 4 GB cache, make sure that cache is cold, as otherwise your data is likely still cached from previous queries.
  • Particularly with databases, but in general, performance depends heavily on the size of the data you are storing (as well as caching). Make sure you do your testing with realistic data sizes.
  • Spinning disks are really slow; use caches or solid-state storage whenever you can. However, as the data size grows, the advantage of solid state (avoiding seek times) is reduced.

Benchmarking

Quantify a proposed performance improvement by measuring it.

In PHP, you can take ad-hoc measurements using microtime(), for example:

$t = microtime( true );
 
$instance = createMyInstance();
$instance->myMethod();
$instance->myOtherMethod();
 
print __METHOD__ . ':' . ( ( microtime( true ) - $t ) * 1000 );
 
//> 13.321235179901 milliseconds

Or from maintenance/eval.php:

> $t = microtime(true); for( $i = 0; $i < 100000000; $i++ ) { md5('testing'); } print microtime(true)-$t;
13.321235179901

MediaWiki has benchmarking scripts in maintenance/benchmarks, including the generic utility benchmarkEval.php:

php benchmarkEval.php --code="md5('testing')" --inner=1000000 --count=100

Results will vary substantially between timing runs. To minimise the impact of this:

  • Use a large loop count to benchmark for a long time — at least 10 seconds.
  • Avoid any other system activity while the benchmark runs. If you are using your laptop, kill your browser and anything else that might wake up periodically.
  • Don't use a VM if there is any other activity on the same hardware.
  • Avoid unnecessary I/O within the benchmark. For example, disable logging.
  • Benchmark a small amount of code in a tight loop, so that the relative effect of the intervention will be larger.

Extremely accurate performance measurements can be done using hardware performance counters. On Linux you can use perf.

For example, running a benchmark under `perf stat -e instructions` will give a metric which is not affected by background activity on the same host. It tells you how much machine code is executed, which may be a decent model for cost depending on what you're measuring.

Benchmarking and load testing in production

You can use ApacheBench to load-test an application server and output summary statistics. Use -l to accept a variable Content-Length, as most responses vary slightly due to datetime and request ID metadata.

# Target local Apache directly over plain HTTP
ab -n 1000 -c 24 -l -X mw2377.codfw.wmnet:80 -H 'X-Forwarded-Proto: https' http://test2.wikipedia.org/wiki/Special:BlankPage

# == Output ==
#
# Document Path:          /wiki/Special:BlankPage
# Concurrency Level:      24
# Time taken for tests:   3.709 seconds
# Complete requests:      1000
# Failed requests:        0
# […]
# Percentage of the requests served within a certain time (ms)
#  50%     86
#  75%     91
#  99%    115
# 100%    274 (longest request)

# Target via local gateway, including HTTPS overhead
ab -n 1000 -c 24 -l -X mw2377.codfw.wmnet:443 https://test2.wikipedia.org/wiki/Foobar

For practical examples, see T323553#8475315 (Dec 2022) and T279664#8122195 (Aug 2022).

Or use perf to benchmark a command-line entrypoint:

$ perf stat -r100 sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php /srv/mediawiki/php-1.39.0-wmf.22/maintenance/getText.php --wiki=test2wiki Foobar >/dev/null

As usual, exercise care when subjecting production servers to synthetic load.

Beta Cluster

The Beta Cluster is hosted in Wikimedia Cloud. This is a good place to detect functional problems, but it may not be a representative environment for performance measurements, as it runs in a virtualised multi-tenant environment. That is, the machines are less powerful than production and are often under heavy load. See also T67394.

See also

Credits

Portions of this page were copied from "Performance profiling for Wikimedia" on mediawiki.org as written by Sharihareswara (WMF) in 2014.