Incidents/20160503-Wikidata-Query-Service

Summary

Starting on May 3, 2016 around 07:30 UTC, WDQS intermittently showed increased response times, which led to HTTP 502 errors from Varnish. At that time, WDQS was running on a single server because a reinstall and data reload were in progress. Restarting Blazegraph restored the service. Multiple restarts were performed over the following days.

The issue was traced to two causes: a known bug in the version of Blazegraph that we use, and a file descriptor leak related to Jolokia (the monitoring agent).
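A leak of this kind can usually be observed on Linux by watching how many file descriptors a process holds and how many of them are pipes. The following is only a minimal sketch of that idea, not the tooling used during the incident. It assumes a Linux host with /proc mounted; the helper names `count_fds` and `fd_dir_for` are hypothetical.

```python
import os
import sys


def count_fds(fd_dir):
    """Return (total, pipes) for the descriptor symlinks found in fd_dir.

    On Linux, /proc/<pid>/fd holds one symlink per open descriptor, and
    pipes show up with a target of the form "pipe:[inode]".
    """
    total = 0
    pipes = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and reading; skip it.
            continue
        total += 1
        if target.startswith("pipe:"):
            pipes += 1
    return total, pipes


def fd_dir_for(pid):
    """Path of the per-process descriptor directory (Linux-specific assumption)."""
    return "/proc/%d/fd" % pid


if __name__ == "__main__":
    # Usage: python count_fds.py <pid>; defaults to this process.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    total, pipes = count_fds(fd_dir_for(pid))
    print("pid %d: %d open descriptors, %d pipes" % (pid, total, pipes))
```

Sampling these counts periodically for a suspect process and seeing them grow without bound would be consistent with a leak; a stable count would point elsewhere.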

Timeline

2016-05-01T19:37 enabled wdqs1002, put wdqs1001 in maintenance mode for reload
2016-05-03T11:08 issue reported in https://phabricator.wikimedia.org/T134238 and IRC
2016-05-03T12:28 wdqs-updater killed as it seems to leak pipes
2016-05-03T13:01 restarting wdqs-updater and keeping it under close scrutiny
2016-05-03T17:18 restarting wdqs1002
2016-05-03T21:11 restarting wdqs1002
2016-05-04T23:26 deployed additional Icinga check to increase visibility on this issue
2016-05-05T08:12 restarting wdqs1002
2016-05-05T08:57 restarting wdqs1002
2016-05-05T11:32 restarting wdqs1001
2016-05-05T21:00 deploying fix to Jolokia
2016-05-07T12:32 restarting wdqs1002
2016-05-07T20:13 restarting wdqs1001 and wdqs1002
2016-05-07T20:28 deploying updated Blazegraph version for WDQS to mitigate deadlock issue

Conclusions

Two servers are not enough when maintenance tasks (such as a data reload) can take multiple days, because the service is left running on a single server for that time.
We were alerted by users; our monitoring is not sufficient.

Actionables

Done: run Jolokia as a Java agent instead of attaching and detaching it at each run
Done: add response time check to WDQS
Done: deploy new Blazegraph version to fix BLZG-1884
Tasks opened: Adjust balance of WDQS nodes / Deploy WDQS node on codfw
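
The response time check listed above could look roughly like the sketch below. This is an illustration under stated assumptions, not the check that was deployed: the thresholds, the probe query, the User-Agent string and the function names are all hypothetical, and the endpoint URL is assumed to be the public WDQS SPARQL endpoint. The exit codes follow the usual Nagios/Icinga plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import sys
import time
import urllib.parse
import urllib.request

# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = ("OK", "WARNING", "CRITICAL")


def classify(elapsed, warn, crit):
    """Map an elapsed time in seconds to a plugin state."""
    if elapsed >= crit:
        return CRITICAL
    if elapsed >= warn:
        return WARNING
    return OK


def timed_check(fetch, warn=2.0, crit=10.0, clock=time.monotonic):
    """Run fetch(), time it, and return (state, message).

    Any exception raised by fetch (timeout, HTTP 502, ...) is reported
    as CRITICAL. The thresholds are illustrative assumptions.
    """
    start = clock()
    try:
        fetch()
    except Exception as exc:
        return CRITICAL, "CRITICAL: request failed: %s" % exc
    elapsed = clock() - start
    state = classify(elapsed, warn, crit)
    return state, "%s: response in %.2fs" % (LABELS[state], elapsed)


def fetch_wdqs(endpoint="https://query.wikidata.org/sparql", timeout=30):
    """Send a trivial SPARQL query; endpoint and query are assumptions."""
    params = urllib.parse.urlencode({"query": "ASK { ?s ?p ?o }", "format": "json"})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"User-Agent": "wdqs-response-time-sketch/0.1"},
    )
    # urlopen raises HTTPError on 5xx responses, which timed_check catches.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


if __name__ == "__main__":
    state, message = timed_check(fetch_wdqs)
    print(message)
    sys.exit(state)
```

Passing the fetch function and the clock in as parameters keeps the timing logic testable without network access; a check like this would have flagged the slow responses before users reported them.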