Incident documentation/20170917-MediaWiki-API

From Wikitech
Jump to: navigation, search

Summary

The mediawiki cluster dedicated to serving the API had a sudden rise in cpu usage on Saturday September 16th; it was noticed from the restbase latency report and solved by a rolling restart of the HHVM service. The reason for the excessive CPU usage was due to multiple HHVM threads running at 100% CPU usage while not responding to clients. A quick analysis hinted this to be a deadlock between multiple threads in accessing/modify the Translation Cache.

The effect for user was a higher latencies when performing requests to the action API [1]

Timeline

  • September 15th 16:20 (approx): the load begins to rise on the API cluster [2]
  • September 16th 07:45: The CPU usage reaches its maximum across the cluster, peaking at about 87%, more than tripled with respect of the normal operating usage of about 25%. The latency of the mediawiki API begins being almost constantly very high, according to telemetry from restbase [3]. It is interesting to note that in all this time, no alert from pybal for depooling of API appservers happen.
  • September 17th 11:59: mobrovac notices the high latencies for the 99th percentile of requests for restbase, and notifies others on IRC.
  • September 17th 12:34: _joe_ acknowledges the problem, starts investigating
  • September 17th 12:40: Entry in SAL _joe_: restarting hhvm on some api appservers, to ease the cpu overlock
  • September 17th 13:00 _joe_ analyzing a backtrace notices a lot of references to HPHP::jit::enterTCImpl in the blocked threads, but decides to fight the fire at this point
  • September 17th 13:10: After assessing the restarts are effective, _joe_ rolling restarts the rest of the cluster, and by this time the restbase alerts have recovered
  • September 17th 13:25 After all appservers have been restarted the CPU load is back within normal parameters


Conclusions

While this seems like some serious issue with HHVM, such crises can and will happen from time to time: what didn't work at all is our alerting - the occurrence during the weekend delayed our response time and exposed the inadequancy of our monitoring pretty clearly. While the restbase telemetry was a useful and pretty obvious to a human eye, it is hardly a good metric to alarm as it can vary greatly. The metrics collected about API performance are stable enough to be acted upon, but they only show part of the API requests. We do need better metrics, and to be able to act upon them.

Actionables

  • Measure the overall performance of the Mediawiki application at the appserver level T131894
  • Create meaningful, actionable, non-flapping alerts based on backend timing information (see T123854, but the suggestions there don't really cut it for the kind of reliable alerting we want)