Michael Jackson effect

Jackson performing in June 1988.

The Michael Jackson effect (also Michael Jackson problem[1]) is a technical term used in the Wikimedia movement to refer to a cache stampede: the failure mode that occurs when there is high demand for an expensive-to-compute object that is not currently cached, causing many servers to recompute the same object in parallel.

Event

The term was coined [when?][by whom?] after the death of Michael Jackson on 25 June 2009, which resulted in an unprecedented number of page views of, and edits to, his Wikipedia article.

The article received 1.2 million visits on the day of Jackson's death (25 June), which, combined with the edit traffic, caused several server overloads that made Wikipedia intermittently unavailable to the public.[2]

The following day (26 June), the article received a record-breaking 5.9 million visits, one million of which occurred during a single hour.[2]

Technical impact

Background

When an edit is saved in MediaWiki, it is allocated a revision ID, and the page record is updated to point to this as the "current" revision of that page. During the edit save, the submitted wikitext is parsed. For large and complex biographies, parsing used to be a costly and time-intensive operation, involving a large amount of CPU work on the web server to process text markup with many (at the time) unoptimised templates and citation references. In 2009, it was not uncommon for a large article like this to take over 30 seconds to parse. After the save operation and wikitext parsing are completed, the page's entry in the ParserCache is overwritten with the new article HTML and associated metadata about the revision for which it was computed. After the ParserCache is written to, we purge the article URL from the edge cache (Wikimedia CDN).
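The save path described above can be sketched roughly as follows. The code is an illustration only, using simple in-memory stand-ins (plain dictionaries and a dummy parser) rather than MediaWiki's real classes:

  # Illustrative stand-ins for the database, ParserCache and CDN; not MediaWiki code.
  current_revision = {}   # page title -> current revision ID
  revision_text = {}      # revision ID -> wikitext
  parser_cache = {}       # page title -> {"revision_id": ..., "html": ...}
  cdn_cache = {}          # URL -> cached HTML

  def parse_wikitext(wikitext):
      # Stand-in for the expensive parse (templates, citations, and so on).
      return "<html>" + wikitext + "</html>"

  def save_edit(title, new_revision_id, wikitext):
      # Record the new revision and point the page at it as "current".
      revision_text[new_revision_id] = wikitext
      current_revision[title] = new_revision_id

      # Parse during the save, then overwrite the ParserCache entry,
      # noting which revision the HTML was computed for.
      parser_cache[title] = {"revision_id": new_revision_id,
                             "html": parse_wikitext(wikitext)}

      # Only after the ParserCache is written is the URL purged from the CDN.
      cdn_cache.pop("/wiki/" + title, None)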

When a page is viewed by URL and there is no entry in the CDN, the request reaches MediaWiki, which queries the database for the now-current revision ID of the page and then looks in the ParserCache for the requested article. If the ParserCache does not contain an entry for this article, or if the entry is not for the expected revision ID, this is considered a "cache miss", at which point the wikitext is fetched from the database and parsed on demand, similar to what would happen during a save operation.
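Continuing the same illustrative stand-ins, the page-view path might look like this:

  def view_page(title):
      url = "/wiki/" + title

      # Served straight from the CDN when an entry exists.
      if url in cdn_cache:
          return cdn_cache[url]

      # CDN miss: ask the database for the now-current revision ID...
      revision_id = current_revision[title]

      # ...and check the ParserCache for HTML computed for that exact revision.
      entry = parser_cache.get(title)
      if entry is None or entry["revision_id"] != revision_id:
          # Cache miss: fetch the wikitext and parse on demand,
          # much as a save operation would.
          entry = {"revision_id": revision_id,
                   "html": parse_wikitext(revision_text[revision_id])}
          parser_cache[title] = entry

      cdn_cache[url] = entry["html"]
      return entry["html"]

  save_edit("Michael Jackson", 1001, "'''Michael Jackson''' was an American singer.")
  print(view_page("Michael Jackson"))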

Incident

Any given edit request (correctly) only purged the URL from the CDN after it had successfully committed the database metadata and saved a ParserCache entry.

Wikimedia load spike on June 25, 2009.

However, the rapid editing of the article resulted in repeated purges of the article URL from the CDN, which sent a large amount of page-view traffic to the MediaWiki servers asking for the "current" version of the article.

The definition of "current" of course kept changing, creating race conditions in which servers processing a page view could see the database referring to a "current" revision ID that was either outdated (the ParserCache entry had meanwhile been overwritten with a newer revision by another edit) or too new (the ParserCache entry still pointed to a previous revision). This resulted in the MediaWiki web servers essentially all being busy doing the exact same thing: parsing the wikitext content of the Michael Jackson article, often even the exact same revision.

This overload exceeded the combined CPU capacity of the web servers and resulted in reduced availability of Wikipedia overall.
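The failure mode can be reproduced in miniature: without any coordination between requests, every request that sees a cache miss redoes the same expensive work. The following toy simulation (names and numbers are made up for illustration) shows the effect:

  import threading
  import time

  cache = {}                        # article title -> parsed HTML
  parse_count = 0
  counter_lock = threading.Lock()   # protects only the counter, not the work

  def expensive_parse(title):
      global parse_count
      with counter_lock:
          parse_count += 1
      time.sleep(0.1)               # stand-in for 30+ seconds of template parsing
      return "<html>" + title + "</html>"

  def handle_view(title):
      if title not in cache:                     # every concurrent request sees a miss...
          cache[title] = expensive_parse(title)  # ...and parses redundantly

  threads = [threading.Thread(target=handle_view, args=("Michael Jackson",))
             for _ in range(50)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()

  print(parse_count)                # typically close to 50, not 1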

Solution

Shortly after this incident, Tim Starling developed the PoolCounter extension (together with its associated MediaWiki core interface and a server daemon written in C). It is designed to protect Wikimedia Foundation servers against massive spikes in views like this, and to avoid massive waste of CPU capacity due to parallel parsing and cache computation of the same value after it is invalidated.

PoolCounter provides mutex-like functionality that MediaWiki uses to request a lock before it attempts to parse an article.

If the server is the first and only one in line to parse this article, PoolCounter responds immediately with a success message ("LOCKED"), and the server goes ahead, parses the article, and releases the lock once the result is saved to the ParserCache.

If another server is already busy doing this, PoolCounter holds the new request on the line for a while to allow the first server to complete its work. If that server finishes and releases the lock within the timeout threshold, PoolCounter responds to the waiting client with "DONE", indicating that the work is now completed and its result can be found in the ParserCache.

If there are too many servers waiting in line ("QUEUE_FULL"), or if it takes too long for the lock-holding server to complete its work ("TIMEOUT"), then MediaWiki will not permit itself to parse the article and will instead return a known-stale version of it. If there is nothing in the ParserCache at all, it will display an error message asking the user to try again later.

This describes PoolCounter as it was around 2010. See PoolCounter for current API documentation.
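The client-side decision logic described above can be sketched as follows. Every name here is an illustrative stand-in, not PoolCounter's or MediaWiki's real API, and the stub lock service always answers "LOCKED" so the example runs on its own:

  parser_cache = {}   # article title -> {"revision_id": ..., "html": ...}

  def poolcounter_acquire(key, timeout):
      # A real PoolCounter server may answer LOCKED, DONE, QUEUE_FULL or TIMEOUT.
      return "LOCKED"

  def poolcounter_release(key):
      pass

  def expensive_parse(title, revision_id):
      return "<html>%s, revision %d</html>" % (title, revision_id)

  def render_article(title, revision_id):
      status = poolcounter_acquire("article:" + title, timeout=15)

      if status == "LOCKED":
          # First (and only) one in line: parse, save to ParserCache, release.
          html = expensive_parse(title, revision_id)
          parser_cache[title] = {"revision_id": revision_id, "html": html}
          poolcounter_release("article:" + title)
          return html

      if status == "DONE":
          # Another server finished the same work while we waited;
          # its output should now be in the ParserCache.
          return parser_cache[title]["html"]

      # QUEUE_FULL or TIMEOUT: refuse to add to the pile-up. Serve a
      # known-stale rendering if one exists, otherwise show an error.
      stale = parser_cache.get(title)
      if stale is not None:
          return stale["html"]
      return "Error: please try again later."

  print(render_article("Michael Jackson", 123))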

References

  1. PoolCounter documentation, first revision (March 2011), Tim Starling, mediawiki.org.
  2. The King of Wikipedia Traffic, The Wikipedian, 2009.