Incident documentation/20121211-TranslationNotifications

From Wikitech
Jump to: navigation, search

Report written by Niklas Laxström, 2012-12-07.


Disclaimer: This is my take on the event, which is incomplete because I am not aware of all the private channel conversations. The reports is also likely to have multiple errors in it, but those are not intentional. The point is not to blame anyone, but to explain how it happened, how it was dealt with and how we can improve to prevent similar issues from happening again.

TL;DR: Changes in MW core exposed a bug in TranslationNotifications. The same bug is also present in other extensions. Many people participated in preventing further damage and fixing the issue. More work remains to be done.

Timeline

  • 2012-04-06: TranslationNotifications job committed.[1]
  • 2012-05-08: TN deployed.
  • 2012-10-16: Introduction of new job system.[2]
  • 2012-11-14: metawiki switched to 1.21wmf4.
  • 2012-11-14: Retrying of failed jobs merged.
  • 2012-11-27 18:05: TN sent without known problems.
  • 2012-11-28 ?/22:33/22:41: metawiki to 1.21wmf5, then wmf4 and wmf5 again.
  • 2012-11-28 23:16: Aaron enables job retrying of jobs older than 1h.
  • 2012-12-01 06:35: A TN is sent by Tbayer to about 500 recipients.
  • 2012-12-04 19:19: Triple notifications for it in the last hour are reported, Tbayer answers but doesn't notify anyone.[3]
  • 2012-12-04 23:56: Poongothai video TN submitted by Seddon to about 1000 recipients.[4]
  • 2012-12-05 07:30: Poongothai video TN delivered second time.[5]
  • 2012-12-05 07:35: First report on Seddon's talk page.[6]
  • 2012-12-05 09:30: Report by Seddon on #wikimedia-tech.
  • 2012-12-05 09:55: E-mail to translators-l by Seddon.
  • 2012-12-05 10:10: Report via bugzilla.[7]
  • 2012-12-05 10:30/10:40: Bug 42715 linked from bug 42614, RobLa et al. added to cc etc.
  • 2012-12-06 00:14: Poongothai video TN delivered third time.[5]
  • 2012-12-06 09:14: TNB blocked on nl.wiki.[8]
  • 2012-12-06 09:53: Siebrand solicits action on #wikimedia operations.
  • 2012-12-06 10:22: Seddon blocks self on Meta [9] and sends another update to translators-l.
  • 2012-12-06 10:53: TNB account locked, warning added to special page.[10]
  • 2012-12-06 noon: Hashar debugs, probably removes broken jobs [11] and adds a debugging fix.[12]
  • 2012-12-06 15:40: Bug 42715 assigned to Aaron
  • 2012-12-07 04:40/05:25/06:27: Workaround by Aaron submitted/deployed/merged. [13]

[1] https://gerrit.wikimedia.org/r/#/c/4394

[2] https://gerrit.wikimedia.org/r/#/c/13194/

[3] https://meta.wikimedia.org/?diff=4750437&oldid=4748972

[4] https://meta.wikimedia.org/w/index.php?title=Special:Log&dir=prev&offset=20121201063500&limit=1&type=notifytranslators

[5] https://meta.wikimedia.org/w/index.php?title=User_talk:Vladimir_Penov&offset=20121207073617&limit=3&action=history

[6] https://meta.wikimedia.org/w/index.php?title=User_talk:Jseddon_%28WMF%29&diff=4755087&oldid=4742714 -> about the issue

[7] https://bugzilla.wikimedia.org/42715

[8] https://nl.wikipedia.org/w/index.php?title=Overleg_gebruiker:Translation_Notification_Bot&oldid=33905273

[9] https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=block&user=Jseddon+%28WMF%29

[10] https://meta.wikimedia.org/?diff=4766049&oldid=4765994 https://meta.wikimedia.org/?diff=prev&oldid=4765934

[11] "2213 translationNotificationJobs pending" http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20121206.txt

[12] https://gerrit.wikimedia.org/r/#/c/37211/

[13] http://wikitech.wikimedia.org/view/Server_admin_log

Analysis

We can assume that breakage has happened between Nov 27 and Dec 1, since no problems were reported for the notification sent on Nov 27. The timeline seems to confirm this, on Nov 28 Aaron enabled job retrying.

There is some uncertainty on what code JobQueue runners are using, since they are long-living processes. Tim Starling had to kill some LST jobs on the 4th, and then it was talked that they had been using same code for three days. https://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=54182&oldid=54181&diffonly=yes

Assuming that the root of the problem was introduced already in 1.21wmf3 we should have seen an increasing amount of failed jobs kept around in the JobQueue before the retrying started happening. I don't know whether there are graphs for this. If there are, is anyone looking at them, and if not, are those going to be added now?

Also, why did we not see those very old jobs retried? Was there some other change which started keeping them around, or perhaps they were still in there, but most recent jobs were retried first?


Immediately after the second broken TN, the problem is reported to Seddon and on bugzilla. Then there is only some chatter until hashar starts investigating in the morning of the next day, probably because Siebrand solicited help. In the meanwhile Seddon takes measures to prevent further damage. Hashar finds some clues of the reason and removes broken jobs from the jobqueue. On the same morning USA time Aaron gets assigned to that bug, and by evening his time a workaround is deployed. He closes the bug pointing out how to fix TN, but does not mention the deployment of the workaround.

In total it took about 3 days from the very first report to "solve" the issue, though cleanup still remains. It took about 24 hours for the bug report to get any action, and that was only because Siebrand got hashar to investigate it immediately. Platform Engineering was on CC immediately after the bug was reported, but did not take any visible action until 30 hours later (assigned to Aaron) and the fix came about 12h after that.

The problem was put under control in 24 hours after hashar started working on it. We also lost 14 hours in the beginning because of the first reports by users were not reported to bugzilla nor to the i18n team.


The problem was not caught in code review, because nobody was aware that this even could be an issue: due to incorrect and lacking documentation, the wrong code got in in the first place. It is a mystery whether the committed workaround is there to stay (meaning that the interface should be changed since it is now redundant), or whether we are supposed to fix all the jobs (there are many similar ones, see below). I'm surprised that nobody hasn't yet done a scan of all extensions to see how many have this problem.

This issue(*) was only exposed now because the changes in the JobQueue started depending on it, and only when a certain configuration was in use. Other than the commit notices, I don't remember any mentions of the JobQueue refactoring work, so I am not surprised if this came as surprise to most people, who were not expecting to see any problems here.

(*) It is both a bug in TN and a regression in JobQueue. Breaking existing code is a regression even if the existing code is wrong.

Job retrying is not enabled by default, so most likely we would not have detected this on our local machines, even if we had tests. There is a safety measure that every job is retried at most 3 times (meaning delivering 4 copies of the message in the worst case).

Recommendations for immediate actions

In any case, I recommend fixing all the broken jobs we can find and beg for proper documentation of JobQueue system, since the latest developments have increased the technical debt in the organization by introducing new undocumented systems that only one or two developers understand properly. Since this code is used relatively often by extension developers, code example and good documentation are needed for that audience too.

The only other documentation that I was able to found quickly, other than existing code, is https://www.mediawiki.org/wiki/Job_queue where you find link to https://www.mediawiki.org/wiki/Manual:Job_queue/For_developers which has a *broken* example. This should be fixed ASAP.

I don't see any easy way to prevent problems like these happening in the future, since we have so much to catch up on good test coverage, dissemination of skills and comprehensive developer documentation.

One particular issue is the detection and escalation of potential problems. Incoming (bug) reports are not always triaged fast and deep enough to catch problems before they go live on production. Once they are considered critical, they are fixed quickly, but that doesn't seem to lead to any changes in the process itself.

Snapshot of jobs (not all extensions included)

Ok 18 – Broken 5

twn:/www/dev.translatewiki.net/w (master)$ ack -G '.*Job.php' '__construct'
includes/job/Job.php
137:    public function __construct( $command, $title, $params =
false, $id = 0 ) {
 
includes/job/jobs/DuplicateJob.php
37:     function __construct( $title, $params, $id = 0 ) {
38:             parent::__construct( 'duplicate', $title, $params, $id );
 
includes/job/jobs/EmaillingJob.php
31:     function __construct( $title, $params, $id = 0 ) {
32:             parent::__construct( 'sendMail', Title::newMainPage(),
$params, $id );
 
includes/job/jobs/EnotifNotifyJob.php
31:     function __construct( $title, $params, $id = 0 ) {
32:             parent::__construct( 'enotifNotify', $title, $params, $id );
 
includes/job/jobs/NullJob.php
37:     function __construct( $title, $params, $id = 0 ) {
38:             parent::__construct( 'null', $title, $params, $id );
 
includes/job/jobs/RefreshLinksJob.php
30:     function __construct( $title, $params = '', $id = 0 ) {
31:             parent::__construct( 'refreshLinks', $title, $params, $id );
115:    function __construct( $title, $params, $id = 0 ) {
116:            parent::__construct( 'refreshLinks2', $title, $params, $id );
 
includes/job/jobs/UploadFromUrlJob.php
46:     public function __construct( $title, $params, $id = 0 ) {
47:             parent::__construct( 'uploadFromUrl', $title, $params, $id );
 
includes/job/jobs/HTMLCacheUpdateJob.php
58:     function __construct( $title, $params, $id = 0 ) {
61:             parent::__construct( 'htmlCacheUpdate', $title, $params, $id );
 
includes/job/jobs/DoubleRedirectJob.php
76:     function __construct( $title, $params = false, $id = 0 ) {
77:             parent::__construct( 'fixDoubleRedirect', $title,
$params, $id );
 
extensions/SemanticForms/includes/SF_CreatePageJob.php
17:     function __construct( $title, $params = '', $id = 0 ) {
18:             parent::__construct( 'createPage', $title, $params, $id );
 
extensions/SemanticMediaWiki/includes/jobs/SMW_RefreshJob.php
35:     function __construct( $title, $params = array( 'spos' => 1,
'prog' => 0, 'rc' => 1 ) ) {
36:             parent::__construct( 'SMWRefreshJob', $title, $params );
 
extensions/SemanticMediaWiki/includes/jobs/SMW_UpdateJob.php
26:     function __construct( Title $title ) {
27:             parent::__construct( 'SMWUpdateJob', $title );
 
extensions/Translate/tag/RenderJob.php
31:     function __construct( $title, $params = array(), $id = 0 ) {
32:             parent::__construct( __CLASS__, $title, $params, $id );
 
extensions/Translate/tag/DeleteJob.php
36:     function __construct( $title, $params = array(), $id = 0 ) {
37:             parent::__construct( __CLASS__, $title, $params, $id );
 
extensions/Translate/tag/MoveJob.php
38:     function __construct( $title, $params = array(), $id = 0 ) {
39:             parent::__construct( __CLASS__, $title, $params, $id );
 
extensions/Translate/utils/MessageUpdateJob.php
26:     function __construct( $title, $params = array(), $id = 0 ) {
27:             parent::__construct( __CLASS__, $title, $params, $id );
 
extensions/Translate/utils/MessageGroupStatesUpdaterJob.php
37:     public function __construct( $title, $params = array(), $id = 0 ) {
38:             parent::__construct( __CLASS__, $title, $params, $id );
 
extensions/Translate/utils/MessageIndexRebuildJob.php
26:     function __construct( $title, $params = array(), $id = 0 ) {
27:             parent::__construct( __CLASS__, $title, $params, $id );
 
extensions/TranslationNotifications/TranslationNotificationJob.php
8:      public function __construct( $title, $params ) {
9:              parent::__construct( 'translationNotificationJob',
$title, $params );
 
extensions/LiquidThreads/classes/SynchroniseThreadArticleDataJob.php
3:      public function __construct( $title, $params ) {
4:              parent::__construct( 'synchroniseThreadArticleData',
$title, $params );
 
extensions/ReplaceText/ReplaceTextJob.php
11:     function __construct( $title, $params = '', $id = 0 ) {
12:             parent::__construct( 'replaceText', $title, $params, $id );
 
extensions/Renameuser/RenameUserJob.php
13:     public function __construct( $title, $params ) {
14:             parent::__construct( 'renameUser', $title, $params );

How to fix

+       public function __construct( $title, $params ) {
+               parent::__construct( 'translationNotificationJob',
$title, $params );
+       }

This should have been:

+       public function __construct( $title, $params, $id = 0 ) {
+               parent::__construct( 'translationNotificationJob',
$title, $params, $id );
+       }