Incidents/2019-05-21 RESTBase
Summary
The deployment of faulty code to 2 RESTBase nodes caused empty responses to be returned to clients for some requests.
Impact
The impact was mostly limited to our monitoring infrastructure; users saw very little to no impact.
Detection
- (15:36:34) icinga-wm: PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get s
- (15:36:34) icinga-wm: age returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200): /{domain
- (15:36:34) icinga-wm: ay/{type}/{month}/{day} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
etc...
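Each alert above follows the same pattern: a test requests an endpoint and compares the returned HTTP status with the expected one. A minimal Python sketch of that comparison follows. The names `CheckResult` and `evaluate_check` are hypothetical and chosen for illustration; this is not the checker that produced the messages above.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    message: str


def evaluate_check(description: str, actual_status: int,
                   expected_status: int = 200) -> CheckResult:
    """Compare the status an endpoint returned with the expected status."""
    if actual_status == expected_status:
        return CheckResult(True, f"Test {description} is OK")
    # Mirror the wording seen in the alert text quoted above.
    return CheckResult(
        False,
        f"Test {description} returned the unexpected status "
        f"{actual_status} (expecting: {expected_status})",
    )


# Example: the "In the News" check that timed out with a 504.
result = evaluate_check("get In the News content", 504)
print(result.message)
```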
Timeline
All times UTC.
- 15:30 mobrovac@deploy1001: Started deploy [restbase/deploy@022cb98]: Temporarily copy from old tables to new ones if the data is not found - T215956
- 15:36 errors start appearing
- 15:39 mobrovac@deploy1001: Started deploy [restbase/deploy@cf00120]: Revert Temporarily copy from old tables to new ones if the data is not found
- Pchelolo and mobrovac investigate and conclude that the issue is faulty writes to Cassandra made by RESTBase that cannot be decoded back
- 16:10 mobrovac starts truncating the tables with faulty data
- 16:26 some checks recover, but not all
- 16:35 mobrovac re-issues the truncation commands, everything is back to normal
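The deployed change copied data from old tables to new ones when a lookup missed, and the writes it produced could not be decoded back. The sketch below illustrates that general read-with-fallback pattern, with a round-trip check before the copy is persisted. It is a simplified Python illustration under stated assumptions, not RESTBase's code: the tables are plain dictionaries, the encoding is JSON, and `encode`, `decode` and `read_with_fallback` are hypothetical names. The source does not describe the actual encoding fault.

```python
import json
from typing import Optional


def encode(value: dict) -> bytes:
    """Hypothetical storage encoding (JSON, for illustration only)."""
    return json.dumps(value, sort_keys=True).encode("utf-8")


def decode(blob: bytes) -> dict:
    """Inverse of encode()."""
    return json.loads(blob.decode("utf-8"))


def read_with_fallback(key: str, new_table: dict, old_table: dict) -> Optional[dict]:
    """Read from the new table; on a miss, copy the value over from the old one."""
    blob = new_table.get(key)
    if blob is not None:
        return decode(blob)
    old_blob = old_table.get(key)
    if old_blob is None:
        return None
    value = decode(old_blob)
    candidate = encode(value)
    # Round-trip guard: persist the copy only if it decodes back to the same value.
    if decode(candidate) == value:
        new_table[key] = candidate
    return value


old_table = {"Video": encode({"title": "Video"})}
new_table = {}
first_read = read_with_fallback("Video", new_table, old_table)
```

After the first read, `new_table` holds a copy of the entry, so later reads no longer touch the old table. A guard of this kind only catches encodings that fail to round-trip within the same process.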
Conclusions
What weaknesses did we learn about and how can we address them?
What went well?
- Automated monitoring alerted us to problems when only the first two nodes (one in eqiad, one in codfw) were running the new code
- Because of the above, only a very small portion of the pages that needed to be re-rendered was corrupted
What went poorly?
- The same deployment on beta did not reveal any problems, so the fault was not caught before it reached production
Where did we get lucky?
- Automated monitoring checks detected the problem straight away because some of the pages used for testing were corrupted in that brief time span
Actionables
Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.
- Investigate why mobrovac did not receive any email alerts from Icinga about this incident