Incidents/20140919-s1
Summary
At approximately 2014-09-19 08:50:00, enwiki experienced a site outage. The apparent order of events was:
1. Three edits made to a template:
https://en.wikipedia.org/w/index.php?title=Template:Redirect_template&action=history
2. Jobrunner write activity[1] from the wikiadmin user on the enwiki master increased substantially, with cirrus the most prominent:
https://logstash.wikimedia.org/#/dashboard/elasticsearch/LinksUpdate%20issues
The binlog showed a substantial number of LinksUpdate writes. LinksUpdate is often in the top 10, but its counts normally blend in with those of other traffic:
905809 LinksUpdate::incrTableUpdate
359301 LinksUpdate::updateLinksTimestamp
9372 Title::invalidateCache
8728 FlaggableWikiPage::clearStableVersion
8099 User::invalidateCache
5753 Revision::insertOn
4692 ArticleCompileProcessor::save
4607 `heartbeat`.`heartbeat`
4404 FlaggedRevs::clearStableOnlyDeps
3261 CheckUserHooks::updateCheckUserData
3257 SiteStatsUpdate::tryDBUpdateInternal
3185 RecentChange::save
The job activity occurred in waves with periods of very heavy writes, then minutes of nothing.
3. enwiki slaves started to experience intermittent replication lag. The main offender was:
DELETE /* LinksUpdate::incrTableUpdate
4. Surges of wikiuser DB connections to the slaves began appearing after each write surge described in step 2. These hit max_connections on all slaves simultaneously, and the apaches went critical. Note that no slow queries were involved; there were simply an order of magnitude more connections and queries than normal.
5. Significant numbers of wikiadmin connections sat in SELECT MASTER_POS_WAIT because of the replication lag in step 3, which reduced the connections available for the wikiuser traffic in step 4.
6. We killed masses of wikiadmin and wikiuser sleeping connections to make way for new ones.
7. We stopped jobrunners.
8. Things recovered.
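A per-caller tally like the one in step 2 can be produced by counting the caller comment that each statement carries (as in the `DELETE /* LinksUpdate::incrTableUpdate` statement in step 3). The following is a minimal Python sketch, assuming the binlog has already been decoded to text with one statement per line. The function names, the regular expression, and the sample statements are hypothetical illustrations, not the tooling used during the incident.

```python
import re
from collections import Counter

# Assumed comment shape: "/* Class::method ..." near the start of a statement.
# Entries without such a comment (e.g. heartbeat writes) are simply skipped.
CALLER_RE = re.compile(r"/\*\s*([A-Za-z_][\w:]*)")


def tally_callers(lines):
    """Count statements per caller comment found in decoded binlog lines."""
    counts = Counter()
    for line in lines:
        match = CALLER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def format_tally(counts, top=10):
    """Render the most frequent callers as 'count name' lines."""
    return "\n".join(f"{n} {name}" for name, n in counts.most_common(top))


if __name__ == "__main__":
    # Hypothetical sample statements, for illustration only.
    sample = [
        "DELETE /* LinksUpdate::incrTableUpdate */ FROM pagelinks WHERE pl_from = 1",
        "DELETE /* LinksUpdate::incrTableUpdate */ FROM pagelinks WHERE pl_from = 2",
        "UPDATE /* Title::invalidateCache */ page SET page_touched = 'x'",
    ]
    print(format_tally(tally_callers(sample)))
```

Sorting by count makes a single dominant caller stand out, which is how the LinksUpdate writes showed up in this incident.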
Observations and questions:
1. Batching the LinksUpdate DELETE and UPDATE queries would help with replag.
2. The storm of wikiuser traffic after the jobs was due to cache invalidation, and presumably involved a lot of duplicated effort. Could that be mitigated in a layer above the DBs?
3. Can we throttle jobrunners more, or make them smarter in these situations?
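Observations 1 and 3 can be illustrated together: issue one DELETE per batch of rows instead of one per row, and pause between batches while replication lag is high. The following is a minimal Python sketch using an in-memory SQLite database so it runs standalone. The `links` table, its columns, the `get_lag` callback, and all thresholds are assumptions made for illustration; this is not the MediaWiki implementation or the change merged in the actionables.

```python
import sqlite3
import time


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_links_batched(conn, page_id, titles, batch_size=100,
                         get_lag=lambda: 0.0, max_lag=5.0, pause=1.0,
                         sleep=time.sleep):
    """Delete rows for one page using one DELETE per batch of titles.

    `get_lag` is a hypothetical callback returning replication lag in
    seconds; the loop waits (without a timeout, in this sketch) until the
    lag is at or below `max_lag` before sending the next batch.
    Returns the number of DELETE statements issued.
    """
    statements = 0
    for batch in chunks(list(titles), batch_size):
        while get_lag() > max_lag:
            sleep(pause)
        placeholders = ",".join("?" * len(batch))
        conn.execute(
            f"DELETE FROM links WHERE from_id = ? AND title IN ({placeholders})",
            [page_id, *batch],
        )
        conn.commit()  # one short transaction per batch
        statements += 1
    return statements


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (from_id INTEGER, title TEXT)")
    titles = [f"Page_{i}" for i in range(250)]
    conn.executemany("INSERT INTO links VALUES (1, ?)", [(t,) for t in titles])
    issued = delete_links_batched(conn, 1, titles, batch_size=100)
    print(issued, "statements issued")
```

Committing after each batch keeps every transaction short, so replicas can apply it quickly, and the lag check gives the job a point at which to back off instead of writing in one continuous surge.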
Actionables
- Status: Done https://gerrit.wikimedia.org/r/161473
- Status: Done https://gerrit.wikimedia.org/r/161500
- Status: Done https://gerrit.wikimedia.org/r/161577
- Status: Done https://gerrit.wikimedia.org/r/161603
- Status: Done https://gerrit.wikimedia.org/r/161618
- Status: Done https://gerrit.wikimedia.org/r/161749