Incident documentation/20140622-imagescaler

From Wikitech
Jump to: navigation, search

Summary

Starting on June 22th UTC morning the imagescaler fleet ran out of available workers due to increased load. Users requesting thumbnails that needed scaling might have incurred in timeouts/errors.

Timeline

  • 2014-06-22 08:58 operations start getting alarms for rendering.svc.eqiad not responding
  • 2014-06-22 09:30 initial investigation by Filippo (godog) Ariel (apergos) Alex (akosiaris), increased load on imagescalers and more objects making their way to swift. Looks like genuine load.
  • 2014-06-22 10:30 a bulk image upload to commonswiki is suspected as the root cause and evidence found and the user is asked to stop
  • 2014-06-22 10:30 a bulk image upload to commonswiki is suspected as the root cause and evidence found
  • 2014-06-22 17:40 a new alarm for rendering.svc.eqiad is raised and investigation continues, no abnormal upload counts are seen
  • 2014-06-22 21:00 Mark (mark) points out that thumbnails are being generated with unusual sizes and a user-agent is suspected (WikiLinks)
  • 2014-06-23 10:00 Faidon (paravoid) notes that WikiLinks has been offered free the previous day, above normal load on the image scalers is seen but not at critical levels

Conclusions

The imagescaler cluster can sustain normal churn of thumbnails but not spikes while generating uncommon thumbnail sizes. Swift frontend bandwidth was nearly at capacity due to the increased load.

Actionables

  • Status:    Done - Upgrade swift frontend bandwidth to 10G
  • Status:    Not done - Number of uploads should be a metric in graphite
  • Status:    Done - Increase the number of imagescaler workers once swift bandwith can withstand it