Incident documentation/20160924-ORES

From Wikitech

Summary

The ORES review tool (the ORES extension) could not score edits made during the roughly 14 hours between the rollout of wmf.20 (2016-09-22, 20:00) and the fix deployed on 2016-09-23 (09:58).

Timeline

This is a step-by-step outline of what happened to cause the incident and how it was remedied.

  • (2016-09-22) SAL: 20:00 thcipriani: rolling out wmf.20 to all wikis
  • (2016-09-23) 9:44 The phab task is created
  • 9:45 The gerrit patch is made to fix it in master
  • 9:47 The patch is merged.
  • 9:48 The backport to wmf.20 is made.
  • 9:51 The backport is merged
  • SAL: 09:58 logmsgbot: hashar@tin Synchronized php-1.28.0-wmf.20/extensions/ORES/includes/Cache.php: No int typehinting (causes jobs to crash) T146461 (duration: 00m 42s)
  • SAL: 10:00 Amir1: ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki
  • SAL: 10:05 Amir1: ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=wikidatawiki (T146461) and for 'trwiki', 'plwiki', 'fawiki', 'nlwiki', 'ruwiki', 'ptwiki'

Conclusions

  • There should be an alarm that fires when jobs such as ORESFetchScoreJob are not triggered for more than an hour.
  • The lapse was easy to notice; the ORES extension should have extensive CI tests.
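
The alarm proposed in the first conclusion could be sketched roughly as below. This is only an illustration of the "not triggered for more than an hour" rule, not a description of any existing Wikimedia monitoring setup. The names `job_is_stale`, `check_jobs` and `MAX_SILENCE` are hypothetical, and how the last-run timestamp of a job would be obtained is left open, since this report does not specify it.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Threshold taken from the conclusion above ("more than an hour").
# The constant name is an assumption made for this sketch.
MAX_SILENCE = timedelta(hours=1)


def job_is_stale(last_run: Optional[datetime], now: datetime,
                 max_silence: timedelta = MAX_SILENCE) -> bool:
    """Return True when a job has not run within max_silence.

    A job with no recorded run at all (last_run is None) is treated as stale.
    """
    if last_run is None:
        return True
    return now - last_run > max_silence


def check_jobs(last_runs: Dict[str, Optional[datetime]],
               now: datetime) -> List[str]:
    """Return the sorted names of jobs that should raise an alarm."""
    return sorted(name for name, last in last_runs.items()
                  if job_is_stale(last, now))


if __name__ == "__main__":
    # Illustrative input only: it assumes the last successful run happened
    # at the wmf.20 rollout and that the check runs 90 minutes later.
    last_runs = {"ORESFetchScoreJob": datetime(2016, 9, 22, 20, 0)}
    for name in check_jobs(last_runs, now=datetime(2016, 9, 22, 21, 30)):
        print(f"ALARM: {name} has not run for more than an hour")
```

With a check of this kind running periodically, a job that silently stops being triggered would be flagged after about an hour rather than being discovered the next morning.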

Actionables

  • Extensive CI tests for ORES extension (Task T146560)
  • High failure rate of account creation should trigger an alarm / page people (Task T146090)