Michael Jackson effect
The Michael Jackson effect (also Michael Jackson problem[1]) is a technical term used in the Wikimedia movement to refer to a cache stampede. A cache stampede is the system failure that results when there is high demand for a cached object whose value is presently uncomputed, so that many clients try to recompute it at the same time.
Event
The term was coined after the death of Michael Jackson on 25 June 2009, which resulted in an unprecedented number of page views combined with heavy edit traffic.
The article received a record-breaking 5.9 million visits on a single day (26 June), of which one million were during a single hour.[2]
The article received 1.2 million visits on the day of Jackson's death (25 June). Together with the edit traffic, this caused several server overloads that made Wikipedia intermittently unavailable to the public.[2]
Technical impact
Background
When an edit is saved in MediaWiki, it is allocated a revision ID, and the page record is updated to point to this as the "current" revision of that page. During the edit save, the submitted wikitext is parsed. For large and complex biographies, parsing used to be a costly and time-intensive operation, involving a large amount of CPU work on the web server to process text markup with many (at the time) unoptimised templates and citation references. In 2009, it was not uncommon for a large article like this one to take over 30 seconds to parse. After the save operation and wikitext conversion are completed, the page's entry in the ParserCache is overwritten with the new article HTML and associated metadata about the revision for which it was computed. After the ParserCache has been written to, we purge the article URL from the edge cache (Wikimedia CDN).
When a page is viewed by URL and there is no entry in the CDN, the request reaches MediaWiki. We query the database for the now-current revision ID of the page and look in the ParserCache for the requested article. If the ParserCache does not contain an entry for this article, or if the entry is not for the expected revision ID, we consider this a "cache miss". The wikitext is then fetched from the database and parsed on demand, similar to what would happen during a save operation.
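The save and view paths described above can be sketched as a small in-memory model. This is an illustrative toy, not MediaWiki code: the class `ToyWiki`, its dictionaries, and the method names are all hypothetical, and it assumes that an on-demand parse writes its result back to the ParserCache and that the CDN stores whatever was last served.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWiki:
    """Toy model of the save and view paths (hypothetical names throughout)."""
    current_rev: dict = field(default_factory=dict)   # page -> current revision ID
    wikitext: dict = field(default_factory=dict)      # revision ID -> wikitext
    parser_cache: dict = field(default_factory=dict)  # page -> (revision ID, html)
    cdn: dict = field(default_factory=dict)           # page -> html served at the edge
    next_rev: int = 1
    parse_count: int = 0                               # how often the costly parse ran

    def parse(self, text):
        # Stand-in for the expensive wikitext-to-HTML conversion.
        self.parse_count += 1
        return "<p>" + text + "</p>"

    def save_edit(self, page, text):
        rev_id = self.next_rev
        self.next_rev += 1
        self.wikitext[rev_id] = text
        self.current_rev[page] = rev_id                  # page now points at the new revision
        self.parser_cache[page] = (rev_id, self.parse(text))
        self.cdn.pop(page, None)                         # purge the edge cache last
        return rev_id

    def view(self, page):
        if page in self.cdn:                             # edge cache hit
            return self.cdn[page]
        rev_id = self.current_rev[page]
        entry = self.parser_cache.get(page)
        if entry is None or entry[0] != rev_id:          # "cache miss"
            entry = (rev_id, self.parse(self.wikitext[rev_id]))
            self.parser_cache[page] = entry              # assumption: result is written back
        self.cdn[page] = entry[1]
        return entry[1]
```

In this model a view that follows a save is served from the ParserCache without another parse. A second parse happens only when the ParserCache entry is missing or belongs to a different revision.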
Incident
Any given edit request (correctly) purged the URL from the CDN only after it had successfully committed the database metadata and saved a ParserCache entry.
However, the rapid editing of the article resulted in repeated purging of the article URL from the CDN, which in turn sent a lot of traffic to the MediaWiki servers asking for the "current" version of the article.
The definition of "current" kept changing, creating race conditions in which a server processing a page view could read a "current" revision ID from the database that was already outdated (another edit had meanwhile overwritten the ParserCache with a newer revision) or too new (the ParserCache entry still pointed to a previous revision). Either way, the lookup counted as a cache miss. This resulted in the MediaWiki web servers essentially all being busy doing the exact same thing: parsing the wikitext content of the Michael Jackson article, often even the same exact revision.
This overload exceeded the combined CPU capacity of the web servers and resulted in reduced availability of Wikipedia overall.
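The core of the stampede, many requests all seeing a miss before any of them has stored a result, can be illustrated with a short sketch. The function `stampede` and its variables are hypothetical. A barrier is used here only to force the worst case, in which every viewer checks the cache before the first parse has finished.

```python
import threading


def stampede(viewers):
    """Return how many parses run when `viewers` requests miss the cache together."""
    cache = {}
    parses = []
    barrier = threading.Barrier(viewers)

    def view():
        hit = "page" in cache
        barrier.wait()              # every viewer has checked the cache by now
        if not hit:
            parses.append(1)        # stand-in for one expensive parse
            cache["page"] = "html"

    threads = [threading.Thread(target=view) for _ in range(viewers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(parses)
```

Without any coordination, the number of parses equals the number of concurrent viewers, even though a single parse would have been enough to serve all of them.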
Solution
Shortly after this incident, Tim Starling developed the PoolCounter extension (together with its associated MediaWiki core interface and a server daemon written in C). It is designed to protect Wikimedia Foundation servers against massive spikes in views like this one, and to avoid massive waste of CPU capacity from parallel parsing and cache computation of the same value after it is invalidated.
PoolCounter provides mutex-like functionality that MediaWiki uses to request a lock before it attempts to parse an article.
If the server is among the first few in line to parse a given article, PoolCounter responds immediately with a success message ("LOCKED"). The server then goes ahead and parses the article, and releases the lock once the result is saved to the ParserCache.
If other servers are already busy doing this, PoolCounter holds the server on the line for a while to allow the first one to complete its work. If that server completes and releases the lock within the timeout threshold, PoolCounter responds to the held client with "DONE", indicating that the work from the other server is now completed and its result can be found in the ParserCache.
If there are too many servers waiting in line ("QUEUE_FULL") or if it took too long for the lock-holding server to complete its work ("TIMEOUT"), then MediaWiki will not permit itself to parse the article and will instead return a known-stale version of the article. If there is nothing in the ParserCache at all, it will display an error message to the user, asking them to try again later.
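The four responses described above can be sketched with an in-process lock. This is a minimal illustration under stated assumptions, not the PoolCounter daemon or its wire protocol: the class `ToyPoolCounter`, the helper `render`, and the parameters `workers`, `max_queue` and `timeout` are hypothetical names chosen for this example. Only the status strings are taken from the description above.

```python
import threading


class ToyPoolCounter:
    """In-process stand-in for the locking behaviour (hypothetical API)."""

    def __init__(self, workers=1, max_queue=2, timeout=1.0):
        self.workers = workers      # how many servers may parse one key at once
        self.max_queue = max_queue  # how many servers may wait per key
        self.timeout = timeout      # seconds a waiter is held on the line
        self._cond = threading.Condition()
        self._active = {}           # key -> current lock holders
        self._waiting = {}          # key -> current waiters
        self._done = {}             # key -> number of completed releases

    def acquire(self, key):
        with self._cond:
            if self._active.get(key, 0) < self.workers:
                self._active[key] = self._active.get(key, 0) + 1
                return "LOCKED"
            if self._waiting.get(key, 0) >= self.max_queue:
                return "QUEUE_FULL"
            self._waiting[key] = self._waiting.get(key, 0) + 1
            seen = self._done.get(key, 0)
            # Wait until some lock holder releases, or until the timeout expires.
            finished = self._cond.wait_for(
                lambda: self._done.get(key, 0) > seen, timeout=self.timeout)
            self._waiting[key] -= 1
            return "DONE" if finished else "TIMEOUT"

    def release(self, key):
        with self._cond:
            self._active[key] -= 1
            self._done[key] = self._done.get(key, 0) + 1
            self._cond.notify_all()


def render(pool, key, parser_cache, parse):
    """Parse only when holding the lock; otherwise serve what the cache has."""
    if pool.acquire(key) == "LOCKED":
        try:
            html = parse()
            parser_cache[key] = html
            return html
        finally:
            pool.release(key)
    # DONE: another server just stored the result.
    # QUEUE_FULL or TIMEOUT: fall back to a possibly stale entry.
    if key in parser_cache:
        return parser_cache[key]
    raise RuntimeError("Page is being rendered, please try again later")
```

The design choice this illustrates is that a server which fails to get the lock never parses. It either reuses the result another server produced, serves an older entry, or reports an error, so the CPU cost of a popular page stays bounded by `workers` rather than growing with the number of viewers.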
Further reading
- Current events (2009, Brion Vibber), reporting the traffic spike and technical impact.
- Embarrasment (2009, Domas Mituzas), technical details and stop-gap solution.
- Server Admin Log: 25 June 2009 (Brion Vibber), description of the stop-gap solution.
- Wikipedia major traffic (2009), public communications from WMF drawing attention to the event.
- The King of Wikipedia Traffic (2009, The Wikipedian), detailed analytics and traffic statistics.
- The King of Pop vs Wikipedia (2009, The Signpost), summary of the event and impact overall.
- Wikipedia May Have Set A Record (2009, NY Times), less technical summary with new info on the Jackson article and its edit history.
- Wikimedia engineering March 2011 report, PoolCounter was re-deployed after a multi-month takedown for maintenance.
- The impact of Prince's death on Wikipedia (April 2016), blog.wikimedia.org.
- PoolCounter by Tim Starling, the system developed in 2009 to mitigate the "Michael Jackson" effect.
- Wikipedia:Article traffic jumps, a maintained list of the largest traffic jumps on Wikipedia.
References
- [1] PoolCounter documentation: first revision from March 2011, Tim Starling, mediawiki.org.
- [2] The King of Wikipedia Traffic (2009, The Wikipedian).