Incident documentation/20130315-AFTv5

From Wikitech

Outage Summary

Wikimedia sites experienced an outage on 15 March 2013 from about 16:14 to about 16:23 PDT (23:14 to 23:23 UTC).

  • Duration: From about 23:14 UTC to 23:23 UTC; approximately 9 minutes
  • Impact: Wikimedia sites were not available for edits; most users were still able to read the sites.
  • Cause: Database contention issues caused by the new version of AFTv5 that was deployed
  • Resolution: Reverted the deployed extension

Detail

(per Asher's email)

We experienced a site outage this afternoon across all projects. Service was degraded for non-edge-cached requests between approximately 4:13 and 4:23pm PDT.
Initial root cause appears to be database contention issues caused by the new version of AFTv5 that was deployed today. After identifying that as the likely issue, I asked Ryan to disable the extension. The cluster returned to normal as soon as that deploy went out.
I actually thought that today's AFTv5 deploy was backed out, so this was a bit of a surprise. The new version of AFTv5 uses its very own database shard that currently does nothing else. The entire dataset for enwiki is only around 350MB and 580k rows. Yet most apache children were tied up waiting on queries to the slave in that shard (db1030).
I saved the output from "show full processlist" at fenari:/home/asher/db/aftv5-outage-db1030-proclist. It's 22MB. Kind of amazing! I checked explains for a few of the queries, and those that I picked weren't covered by any index (e.g. searching for feedback by IP address on aft_user_text, which is unindexed) and all included an ORDER BY + LIMIT. Not bad if that's covered by an index, not good if the query can't use one.
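
Illustration (not part of the email): a minimal sketch of why ORDER BY + LIMIT is cheap when an index covers the filter and sort columns, and expensive when it does not. It uses Python's built-in sqlite3 module instead of the MySQL servers involved in the outage, so the plan text differs from MySQL's EXPLAIN output. Only the column name aft_user_text comes from the email; the table name feedback, the columns aft_id and aft_timestamp, the index name and the sample data are assumptions made for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' strings reported by SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Hypothetical schema: only aft_user_text is named in the incident report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feedback ("
    " aft_id INTEGER PRIMARY KEY,"
    " aft_user_text TEXT,"
    " aft_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO feedback (aft_user_text, aft_timestamp) VALUES (?, ?)",
    [
        ("192.0.2.%d" % (i % 50), "2013-03-15T%02d:%02d:00" % (i // 60 % 24, i % 60))
        for i in range(1000)
    ],
)

# Look up feedback by IP address, newest first, first page only.
SQL = (
    "SELECT aft_id FROM feedback"
    " WHERE aft_user_text = ?"
    " ORDER BY aft_timestamp DESC LIMIT 50"
)

# Without an index: every row is scanned, then the matches are sorted.
plan_before = query_plan(conn, SQL, ("192.0.2.7",))

# A composite index on (filter column, sort column) lets the engine seek
# to the matching rows and read them already in sorted order.
conn.execute("CREATE INDEX idx_user_ts ON feedback (aft_user_text, aft_timestamp)")
plan_after = query_plan(conn, SQL, ("192.0.2.7",))

print("before:", plan_before)
print("after: ", plan_after)
```

On a table this small the difference is invisible, but with hundreds of thousands of rows and many concurrent requests, the unindexed form does a full scan plus sort per request, which matches the kind of pile-up described in the email.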

Response from Matthias

(copied from his email)

The culprit was a query hooking into "my contributions", one I had completely overlooked. If you were to attempt to construct a query avoiding all possible indexed columns, this would have to be the leading example.