Incident documentation/20160713-ContentTranslation

From Wikitech

Summary

The ContentTranslation extension used up all available SQL connections, causing issues for other extensions that use the shared database (at least Flow and Echo). This was caused by thousands of hits to the translation draft saving API in a short time by one user. What triggered those requests is still unknown.

Timeline

See also Server admin log entries during this time.

  • 2016-07-12 [15:48:53] (UTC) icinga-wm> PROBLEM - MariaDB Slave SQL: x1 on db1031 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
  • [15:50] jynus notices x1 master is down
  • [15:52] jynus identified ContentTranslation\Translation::update / cx_translations as the cause
  • [15:59] https://phabricator.wikimedia.org/T140123 is created by jynus
  • [16:05] Language team developers were pinged in IRC
  • [16:10] Nikerabbit notices the ping and starts checking what is going on
  • [16:12] Patch to disable CX is deployed
  • [16:19] Nikerabbit checked the code and believes the problem can only be caused by a high number of API hits (there are no loops etc. in the relevant code)
  • [16:28] Nikerabbit is able to confirm this via https://graphite.wikimedia.org/render/?width=586&height=308&target=MediaWiki.api.cxsave.executeTiming.count (static copy). Based on other data, we assume these requests all come from one or at most a few users.
  • [16:30-16:35] Brainstorming for solutions: blocking the user, poolcounter, ping limiter
  • [16:40] Nikerabbit thinks the ping limiter is the best option and starts working on the patches; ostriches indicates he can help with review and deployment
  • [16:54] Nikerabbit has finished the patches: https://gerrit.wikimedia.org/r/#/q/topic:cxsave,n,z
  • [16:55-17:45] Patches are being reviewed and deployed. Tests for the CX extension patch take 10+ minutes and need to run multiple times.
  • [17:48-18:08] Patch to turn CX back on is created and deployed
  • [18:09-18:13] Initially CX does not work. Special pages only display the user toolbar, with no JavaScript errors. After a few minutes it starts to work, indicating the problem was caused by the 5-minute caching.

Conclusions

  • It was surprising that no general request limiting is applied automatically; code changes were required to add it.
  • Language team has no monitoring in place that would have detected this issue.

Actionables

  • Status: Done. The ping limiter which was added can stay in place. The values should be verified. T140615
  • Status: Done. Audit the front-end code for possible causes of these requests. T140622
  • Status: Done. Try to contact the user for more information and advice. T140619
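
The ping limiter chosen during the incident caps how often a single user can hit a given action. The following is only a minimal sketch of that idea, written in Python rather than MediaWiki's PHP, and it is not the code from the deployed patches. It assumes a simple fixed-window counter per (action, user) pair; the class name PingLimiter, the clock parameter, and the limit values are all hypothetical and chosen for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class PingLimiter:
    """Fixed-window hit counter keyed by (action, user). Illustrative only."""

    def __init__(
        self,
        limits: Dict[str, Tuple[int, int]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # limits maps an action name to (max_hits, window_seconds).
        self._limits = limits
        self._clock = clock
        # (action, user) -> [window_start, hits_in_window]
        self._counters: Dict[Tuple[str, str], List[float]] = {}

    def ping(self, action: str, user: str) -> bool:
        """Record one hit; return True if this hit exceeds the limit."""
        if action not in self._limits:
            return False  # no limit configured for this action
        max_hits, window = self._limits[action]
        now = self._clock()
        key = (action, user)
        entry = self._counters.get(key)
        # Start a fresh window on first use or once the old one has expired.
        if entry is None or now - entry[0] >= window:
            entry = [now, 0]
            self._counters[key] = entry
        entry[1] += 1
        return entry[1] > max_hits


# Hypothetical values: at most 10 draft saves per user per 30 seconds.
limiter = PingLimiter({"cxsave": (10, 30)})
if limiter.ping("cxsave", "ExampleUser"):
    print("throttled")
```

With a limit like this in front of the save API, a burst of thousands of requests from one user would be rejected after the first few hits in each window instead of each one opening a database connection. Whether the chosen values are strict enough without affecting normal draft saving is what the first actionable asks to verify.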