Incident documentation/20160628-wdqs


Switching WDQS to Scap3 deployment caused a problem that led to downtime of the service.

Summary

On Tuesday, June 28, around 19:30 UTC, we migrated WDQS deployment to Scap3. A Scap3 deployment replaces the application directory with a symlink, so files in that directory that are not managed by Scap disappear. This led to WDQS failing for about 20 minutes, between 19:53 UTC and 20:11 UTC.
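The failure mode described above can be reproduced in a throwaway directory. The following is a minimal sketch, not Scap3's actual code: the summary does not say how the swap is performed, so moving the old directory aside before creating the symlink is an assumption, and the names `app`, `rev1`, and `service.jar` are hypothetical. Only `wikidata.jnl` comes from this report.

```python
import os
import tempfile


def unmanaged_entries(app_dir, deployed_dir):
    """Names present in the live directory but absent from the tree being deployed."""
    return sorted(set(os.listdir(app_dir)) - set(os.listdir(deployed_dir)))


def replace_dir_with_symlink(app_dir, deployed_dir):
    """Assumed swap: move the old directory aside, then point a symlink at the new tree."""
    backup = app_dir + ".bak"
    os.rename(app_dir, backup)
    os.symlink(deployed_dir, app_dir)
    return backup


def demo():
    """Build a fake layout, perform the swap, and report what was lost."""
    root = tempfile.mkdtemp()
    app = os.path.join(root, "app")    # hypothetical live application directory
    rev = os.path.join(root, "rev1")   # hypothetical freshly deployed revision
    data = os.path.join(root, "data")  # hypothetical data directory
    for d in (app, rev, data):
        os.makedirs(d)
    for path in (os.path.join(data, "wikidata.jnl"),
                 os.path.join(app, "service.jar"),
                 os.path.join(rev, "service.jar")):
        open(path, "w").close()
    # The data file is reachable from the application directory only via a symlink
    # that the deployed tree knows nothing about.
    os.symlink(os.path.join(data, "wikidata.jnl"), os.path.join(app, "wikidata.jnl"))

    at_risk = unmanaged_entries(app, rev)
    replace_dir_with_symlink(app, rev)
    survived = os.path.lexists(os.path.join(app, "wikidata.jnl"))
    return at_risk, survived
```

Running `unmanaged_entries` against the live directory before the first deployment would have listed `wikidata.jnl` as an entry that the swap was going to drop.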

Timeline

  • 19:53: WDQS deployed and restarted by Scap3, service failing
  • 19:58: issue tracked to missing symlink to wikidata.jnl (Blazegraph data file)
  • 19:59: ran puppet to restore the missing symlink -> did not work (a dependency failed because the application folder was now managed by Scap3)
  • 20:08: stopped puppet while restoring service manually
  • 20:11: Service restored by manually restoring symlink
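The manual fix at 20:11 amounts to recreating one symlink. A small idempotent helper along the following lines could make that step quicker and safer to repeat. This is only a sketch: the function name `ensure_symlink` and the `/hypothetical/...` target paths are made up for illustration, since the real paths are not given in this report.

```python
import os
import tempfile


def ensure_symlink(link_path, target):
    """Create or repair link_path so that it points at target; report what was done."""
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return "ok"
        # A symlink exists but points elsewhere: replace it.
        os.remove(link_path)
        os.symlink(target, link_path)
        return "repaired"
    if os.path.lexists(link_path):
        # Refuse to overwrite a real file or directory.
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path)
    return "created"


def demo():
    """Exercise the three outcomes in a temporary directory."""
    root = tempfile.mkdtemp()
    link = os.path.join(root, "wikidata.jnl")
    first = ensure_symlink(link, "/hypothetical/data/wikidata.jnl")
    second = ensure_symlink(link, "/hypothetical/data/wikidata.jnl")
    third = ensure_symlink(link, "/hypothetical/other/wikidata.jnl")
    return [first, second, third]
```

The helper refuses to touch anything that is not a symlink, which matters here because the data file itself must never be overwritten.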

Notes

  • Migrating WDQS to scap looked like a simple operation. Plenty of services have already done this migration without issue.
  • puppet agent --noop is a great tool, especially in non-standard situations (Gehel did not use it, but should have).
  • Restoring service took longer than expected. Gehel was not familiar enough with the details of how WDQS works and took some time double-checking before making changes.
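For reference, a dry run as mentioned in the notes can be wrapped as follows. This is a sketch under assumptions: the helper names are hypothetical, and combining `--noop` with `--test` (a one-off foreground run) is one common way to invoke it, not necessarily what was intended here.

```python
import shutil
import subprocess


def noop_command():
    """Command line for a one-off Puppet run that reports changes without applying them."""
    return ["puppet", "agent", "--test", "--noop"]


def run_noop():
    """Run the dry run if puppet is installed; return None otherwise."""
    if shutil.which("puppet") is None:
        return None
    return subprocess.run(noop_command(), capture_output=True, text=True)
```

Reviewing the captured output before a real run shows which resources Puppet would change, and which ones would fail, without touching the host.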

Conclusions

  • Scap3 supports canaries; they should *always* be used.
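The idea behind a canary deployment can be sketched in a few lines. This is a generic illustration, not Scap3's implementation or configuration: `rollout`, `deploy`, and `healthy` are hypothetical names, and the host names in the usage note are invented.

```python
def rollout(canaries, others, deploy, healthy):
    """Deploy to canary hosts first and stop before the rest if any canary is unhealthy.

    Returns (hosts_deployed, completed).
    """
    deployed = []
    for host in canaries:
        deploy(host)
        deployed.append(host)
        if not healthy(host):
            # Stop here: the remaining hosts keep serving the old version.
            return deployed, False
    for host in others:
        deploy(host)
        deployed.append(host)
    return deployed, True
```

With a single canary and a health check that fails, `rollout(["canary1"], ["host2", "host3"], deploy, healthy)` touches only `canary1`. In this incident, that would have limited the breakage to one host while the others kept answering queries.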

Actionables