Incidents/20151217-cxserver
Summary
The service-runner migration of cxserver (https://phabricator.wikimedia.org/T101272), along with its Puppet config, was planned for this day, and the cxserver service was interrupted during the deployment. See Timeline for full details and Conclusions for lessons learned.
Timeline
- 08:00Z Language team daily meeting. Service-runner migration is expected to happen today, but time is not known as it was not on Deployments. Runa was assigned to check with Kartik.
- 12:22Z Kartik announced in Language team chat that deployment is in progress and asks Santhosh to be around. These messages were apparently not delivered on time: Santhosh does not respond, and Niklas asks at 12:40 whether deployment is in progress after seeing notices on #mediawiki-i18n.
- Kartik, Alex and Marko scheduled the service-runner migration of cxserver for Thursday 5:30 PM IST (https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0December.C2.A017)
- Alex depooled sca1001 to make sure cxserver is uninterrupted. sca1002 continued to serve traffic.
- Alex disabled puppet and salt on sca1002 so that no changes happen to it at all and it continues to serve traffic.
- Alex merged https://gerrit.wikimedia.org/r/#q,250910,n,z and deployed it on sca1001 (service-runner migration for cxserver).
- Kartik deployed https://gerrit.wikimedia.org/r/#q,258435,n,z (Update mediawiki-contenttranslation to 1eed8b4).
- Kartik tested that cxserver is OK at API end-point.
- At this point Alex and Kartik thought cxserver on sca1001 was OK. Monitoring reported an OK state. Unfortunately testing/monitoring was not complete enough and hence several problems detailed below surfaced.
- Alex pooled sca1001. However, it was not used because it failed the pybal check. This was not immediately obvious: monitoring kept reporting OK, but pybal uses a different check, and that check failed. Effectively sca1001 remained unpooled.
- Alex depooled sca1002, ran puppet on it and started salt. However, pybal did not depool it, as it was the only backend left.
- Kartik deployed cxserver on sca1002.
- cxserver on sca1002 no longer works.
- Pages for LVS cxserver service are sent to ops.
- Language pairs (Registry) were empty ({}) on cxserver.
- 12:36Z #mediawiki-i18n starts sending notices that Apertium in WMF is down.
- The service disruption is announced on Twitter: https://twitter.com/WhatToTranslate/status/677486473055105024
- 12:44Z-12:45Z Alex notices that / returns 404 and fixes the pybal configuration to use /_info https://gerrit.wikimedia.org/r/#/c/259669/ for monitoring.
- 12:47Z Alex restarts pybal to load the new config on lvs1003 first and a couple of minutes later on lvs1006, as is the process.
- 12:50Z Kartik asks in Language team chat whether everything is OK.
- 12:53Z Alex also fixes the paging config: https://gerrit.wikimedia.org/r/#/c/259672/
- 12:54Z Niklas replies that MT/dictionaries are not working.
- 12:56Z _joe_ notes that there are lots of stacktraces: https://phabricator.wikimedia.org/P2434
- 13:00Z We believe based on the stacktraces that there is something wrong with the config, Niklas is trying to check what it could be.
- 13:20Z It was determined that defaults-merging no longer happens and that we need to set everything in puppet.
- 13:31Z OK pages are sent to ops since the /_info endpoint works fine, but other endpoints do not.
- Kartik, with the help of Alex and Niklas, updated the registry in hieradata since it was not being picked up from cxserver's config.yaml.
- 14:00Z Santhosh joins team chat to help debugging, Niklas goes back to non-WMF things.
- Alex deployed multiple hieradata patches: https://gerrit.wikimedia.org/r/#/c/259680/, https://gerrit.wikimedia.org/r/#/c/259695/, ..
- Language pairs (Registry) were OK.
- Niklas, Marko and Santhosh found that the proxy was probably preventing cxserver from connecting to apertium and restbase.
- 14:57Z Alex deployed fix for Proxy: https://gerrit.wikimedia.org/r/#/c/259700/
- 16:00Z Niklas returns. Status is that MT and page loading APIs are not working and that we are having proxy issues.
- 16:10Z-16:20Z Niklas points out that Yandex is using the global proxy, not the Yandex-specific proxy as intended, and that we could fix it.
- 16:20Z-16:26Z Santhosh makes two fixes in cxserver: https://gerrit.wikimedia.org/r/#/c/259731 and https://gerrit.wikimedia.org/r/#/c/259733
- 16:26Z Niklas notifies Alex about the above fixes, Alex asks to wait as he has found some clues related to why no_proxy_list doesn't work.
- 16:36Z Alex says he got it working and starts preparing patches.
- 16:50Z-17:05Z https://gerrit.wikimedia.org/r/259741 and https://gerrit.wikimedia.org/r/259743
- The above bug took quite a while to fix unfortunately. There were no logs and only an unhelpful 403 HTTP error reported by cxserver.
- 17:02Z-17:05Z Status check: Yandex was still down, page loading and Apertium were back.
- 17:07Z Santhosh has identified a bug in service-runner migration that causes MT routes to fail
- 17:10Z-17:16Z Patch in and merged https://gerrit.wikimedia.org/r/#/c/259746/
- We are reverting the two cxserver patches made earlier so as not to break the current proxy config.
- 17:35Z Kartik deployed cxserver update and asks for testing: https://gerrit.wikimedia.org/r/259752
- 17:36Z Niklas finds that it is the CERT_UNTRUSTED error again.
- 17:40Z Niklas suggests using previous strictSSL = false fix until we figure this out. It was clarified that we did not actually use ca-fix before service-runner.
- 17:46Z Santhosh provides a testing script to debug the issue. Five minutes later Niklas figures out how to actually run it.
- 18:02Z Debugging with the script goes on without providing clear clues. We confirm that ca file is read correctly and rule that out.
- 18:13Z Eureka: Niklas figures out that we pass ca: file_path instead of agentOptions { ca: file_path }; with the latter, it works.
- 18:18Z-18:24Z Patch by Santhosh submitted to gerrit and merged.
- 18:38Z Kartik deployed cxserver: https://gerrit.wikimedia.org/r/259772
- 18:00Z Service restoration is announced on Twitter: https://twitter.com/WhatToTranslate/status/677563660055748609
- 18:40Z MT confirmed working.
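The empty registry seen during the outage appears consistent with the 13:20Z finding: once defaults are no longer merged in, any key missing from the deployed configuration is simply absent. A minimal sketch of the difference, written in Python for brevity (cxserver itself runs on Node.js). The key names and values are hypothetical and are not taken from cxserver's config.yaml or from service-runner:

```python
from copy import deepcopy


def merge_defaults(defaults, overrides):
    """Recursively overlay overrides on a copy of defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults shipped with the service.
DEFAULTS = {"port": 8080, "registry": {"es": ["ca", "pt"]}}
# Hypothetical deployed configuration that only overrides the port.
DEPLOYED = {"port": 9090}

with_merge = merge_defaults(DEFAULTS, DEPLOYED)
without_merge = DEPLOYED  # no merging: only what was deployed is visible

print(with_merge.get("registry", {}))     # registry inherited from defaults
print(without_merge.get("registry", {}))  # empty: must be set explicitly
```

Without the merge step, every setting the service relies on has to be present in the deployed configuration itself, which is what the hieradata patches in the timeline addressed.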
Conclusions
- Better monitoring for all endpoints in cxserver.
- Test and check config in Beta. Currently, there is no way to do this until the 'cxserver/deploy' patch is merged, so this should have been done earlier rather than on deployment day.
- Schedule early and make sure relevant people are around to debug issues and help with testing.
- cxserver needs more error-path testing, as well as defensive coding for bad configurations and environments.
- Migration to different endpoints needs changes in LVS/pybal monitoring/configuration.
- (Language team) hangout chats are not reliable. We should use another communication channel during outage investigations, where communication and cooperation are important.
- Some of the problems could have been avoided with a more recent nodejs version. Language team had already planned to expedite that and filed a request.
- In the absence of monitoring, have a test checklist to confirm that important things work before claiming that they do.
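As an illustration of the checklist idea, the sketch below runs a list of endpoint checks and reports every failure rather than stopping at the first healthy response. It is only a sketch under stated assumptions: apart from /_info, the paths are hypothetical placeholders rather than real cxserver routes, and the backend is simulated so the example runs without network access.

```python
from typing import Callable, Dict, List, Tuple

# (path, expected HTTP status). Only /_info comes from the incident;
# the other paths are hypothetical placeholders.
Check = Tuple[str, int]


def run_checklist(fetch_status: Callable[[str], int],
                  checks: List[Check]) -> Dict[str, str]:
    """Return failed path -> reason; an empty dict means all checks passed."""
    failures: Dict[str, str] = {}
    for path, expected in checks:
        try:
            actual = fetch_status(path)
        except Exception as exc:  # a connection error counts as a failure
            failures[path] = f"error: {exc}"
            continue
        if actual != expected:
            failures[path] = f"expected {expected}, got {actual}"
    return failures


def fake_backend(path: str) -> int:
    """Simulated service where only /_info is healthy (status 500 is illustrative)."""
    return 200 if path == "/_info" else 500


CHECKS = [("/_info", 200), ("/example/mt", 200), ("/example/page", 200)]
failures = run_checklist(fake_backend, CHECKS)
print(failures or "all checks passed")
```

In real use, fetch_status would wrap an actual HTTP request against the host under test. Injecting it as a parameter keeps the checklist logic testable on its own.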
Actionables
- Status: Done Add monitoring for cxserver (bug T121776)
- Status: Done Switch cxserver to use Node.js 4.2 (bug T121072)