Incident documentation/20160503-Wikidata-Query-Service

Summary

Starting on 2016-05-03 at around 07:30 UTC, WDQS intermittently showed increased response times, leading to HTTP 502 errors from Varnish. At that time, WDQS was running on a single server because a reinstall and data reload were in progress. Restarting Blazegraph restored the service. Multiple restarts were performed over the following days.

The issue was traced to two causes: a known deadlock bug in the version of Blazegraph in use and a file descriptor leak related to Jolokia (the monitoring agent).

Timeline

  • 2016-05-01T19:37 enabled wdqs1002, put wdqs1001 in maintenance mode for reload
  • 2016-05-03T11:08 issue reported in https://phabricator.wikimedia.org/T134238 and IRC
  • 2016-05-03T12:28 wdqs-updater killed as it seems to leak pipes
  • 2016-05-03T13:01 restarting wdqs-updater and keeping it under close scrutiny
  • 2016-05-03T17:18 restarting wdqs1002
  • 2016-05-03T21:11 restarting wdqs1002
  • 2016-05-04T23:26 deployed additional Icinga check to increase visibility on this issue
  • 2016-05-05T08:12 restarting wdqs1002
  • 2016-05-05T08:57 restarting wdqs1002
  • 2016-05-05T11:32 restarting wdqs1001
  • 2016-05-05T21:00 deploying fix to Jolokia
  • 2016-05-07T12:32 restarting wdqs1002
  • 2016-05-07T20:13 restarting wdqs1001 and wdqs1002
  • 2016-05-07T20:28 deploying updated Blazegraph version for WDQS to mitigate deadlock issue

Conclusions

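  • Running on 2 servers is not enough when maintenance tasks (data reload) can take multiple days, because the service is left on a single server for that whole period.
  • We were alerted by users rather than by our own alerting; our monitoring is not sufficient.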
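
To illustrate the kind of monitoring that could surface a file descriptor leak before users notice, here is a minimal sketch of a Nagios/Icinga-style check. It is not the Icinga check that was deployed during this incident, whose contents are not described here. The function names and thresholds are hypothetical, and it assumes a Linux host where /proc/<pid>/fd is readable by the user running the check.

```python
import os

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def evaluate(fd_count, warn, crit):
    """Map a descriptor count to an (exit_code, message) pair."""
    if fd_count >= crit:
        return CRITICAL, f"CRITICAL - {fd_count} open file descriptors"
    if fd_count >= warn:
        return WARNING, f"WARNING - {fd_count} open file descriptors"
    return OK, f"OK - {fd_count} open file descriptors"


def run_check(pid, warn, crit):
    """Run the check for one process; report UNKNOWN if /proc is unreadable."""
    try:
        return evaluate(count_open_fds(pid), warn, crit)
    except OSError as exc:
        # Covers a missing process and insufficient permissions.
        return UNKNOWN, f"UNKNOWN - {exc}"
```

A wrapper script would print the message and exit with the returned code, which is how Nagios-compatible plugins report state. The thresholds would need to be chosen from the normal descriptor count of the monitored process.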

Actionables