Starting on 2016-05-03 at around 07:30 UTC, WDQS showed intermittently increased response times, leading to HTTP 502 errors from Varnish. At that time, WDQS was running on a single server because a reinstall and data reload were in progress on the other. Restarting Blazegraph restored the service. Multiple restarts were performed over the following days.
The issue was traced to two causes: a known deadlock bug (BLZG-1884) in the version of Blazegraph in use, and a file descriptor leak related to Jolokia (the monitoring agent).
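A descriptor leak of this kind can be spotted by sampling a process's open descriptors over time. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/fd` contains one symlink per open descriptor. The function names `count_fds_by_kind` and `is_leaking` and the growth threshold are hypothetical and are not part of the tooling used during this incident.

```python
import os
from collections import Counter


def count_fds_by_kind(fd_dir):
    """Count the entries of a /proc/<pid>/fd-style directory by kind.

    On Linux each entry is a symlink whose target looks like
    'pipe:[12345]', 'socket:[12345]' or a file path.
    """
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("pipe:"):
            counts["pipe"] += 1
        elif target.startswith("socket:"):
            counts["socket"] += 1
        else:
            counts["other"] += 1
    return counts


def is_leaking(samples, min_growth):
    """Heuristic: the count never drops across the samples and the total
    growth reaches min_growth (a hypothetical, site-specific threshold)."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth
```

For a running process, `count_fds_by_kind("/proc/%d/fd" % pid)["pipe"]` could be sampled periodically and the resulting series passed to `is_leaking`. Reading another user's `/proc/<pid>/fd` usually requires matching or elevated privileges.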
- 2016-05-01T19:37 enabled wdqs1002, put wdqs1001 in maintenance mode for reload
- 2016-05-03T11:08 issue reported in https://phabricator.wikimedia.org/T134238 and IRC
- 2016-05-03T12:28 wdqs-updater killed as it seems to leak pipes
- 2016-05-03T13:01 restarting wdqs-updater and keeping it under close scrutiny
- 2016-05-03T17:18 restarting wdqs1002
- 2016-05-03T21:11 restarting wdqs1002
- 2016-05-04T23:26 deployed an additional Icinga check to increase visibility on this issue
- 2016-05-05T08:12 restarting wdqs1002
- 2016-05-05T08:57 restarting wdqs1002
- 2016-05-05T11:32 restarting wdqs1001
- 2016-05-05T21:00 deploying fix to Jolokia
- 2016-05-07T12:32 restarting wdqs1002
- 2016-05-07T20:13 restarting wdqs1001 and wdqs1002
- 2016-05-07T20:28 deploying updated Blazegraph version for WDQS to mitigate deadlock issue
- Two servers are not enough when maintenance tasks (such as a data reload) can take multiple days.
- We were alerted by users rather than by our monitoring, which shows that the monitoring is not sufficient.
- Done: run Jolokia as a Java agent, not attaching and detaching it at each run
- Done: add response time check to WDQS
- Done: deploy new Blazegraph version to fix BLZG-1884
- Tasks opened: Adjust balance of WDQS nodes / Deploy WDQS node on codfw
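
The response time check listed above is not described in detail here. As an illustration only, a check of that kind could look like the following minimal sketch, which follows the usual Nagios/Icinga plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL). The function names, the URL passed to `measure` and the thresholds are assumptions, not the check that was actually deployed.

```python
import time
import urllib.error
import urllib.request

# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(elapsed, warn, crit):
    """Map a response time in seconds to a plugin exit code.

    elapsed is None when the request failed or timed out, which is
    treated as CRITICAL. warn and crit are hypothetical thresholds.
    """
    if elapsed is None or elapsed >= crit:
        return CRITICAL
    if elapsed >= warn:
        return WARNING
    return OK


def measure(url, timeout):
    """Return the seconds taken to fetch url, or None on error or timeout."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 502, refused connections and timeouts.
        return None
    return time.monotonic() - start
```

A wrapper script would call `sys.exit(classify(measure(url, timeout=10), warn=2, crit=10))` with the URL of a cheap test query. Because HTTP errors map to CRITICAL, the 502 responses seen in this incident would have raised an alert before users reported them.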