Incident documentation/20150409-EventLogging

From Wikitech

Summary

EventLogging events were not written to the database during several intervals, the first of them on March 22, 2015. The data loss affected all schema tables. Related Phabricator task: https://phabricator.wikimedia.org/T96082

The affected intervals (UTC) are:

  • 2015-03-22 18:50:00 to 2015-03-22 20:20:00*
  • 2015-03-24 00:30:00 to 2015-03-24 01:50:00*
  • 2015-03-24 20:50:00 to 2015-03-24 21:30:00*
  • 2015-03-29 17:40:00 to 2015-03-29 19:00:00*
  • 2015-04-01 21:10:00 to 2015-04-01 21:20:00*
  • 2015-04-06 09:38:03 to 2015-04-06 11:31:23
  • 2015-04-06 14:08:10 to 2015-04-06 14:54:57
  • 2015-04-06 15:36:38 to 2015-04-06 15:52:16
  • 2015-04-06 23:00:38 to 2015-04-07 01:23:56
  • 2015-04-08 18:45:18 to 2015-04-08 21:20:36
  • 2015-04-09 17:12:48 to 2015-04-09 18:49:48
  • 2015-04-11 03:30:25 to 2015-04-11 05:20:30
  • 2015-04-11 13:59:42 to 2015-04-11 15:32:42
  • 2015-04-11 18:53:40 to 2015-04-11 20:11:02
  • 2015-04-12 14:25:16 to 2015-04-12 15:52:38
  • 2015-04-13 11:31:46 to 2015-04-13 12:41:49
  • 2015-04-13 16:23:46 to 2015-04-13 17:49:34
  • 2015-04-14 14:15:46 to 2015-04-14 16:16:06
  • 2015-04-14 19:22:31 to 2015-04-14 21:27:15
  • 2015-04-15 16:12:54 to 2015-04-15 16:43:47
  • 2015-04-19 10:07:21 to 2015-04-19 12:36:40
  • 2015-04-20 07:00:48 to 2015-04-20 10:00:39
  • 2015-04-20 17:40:00 to 2015-04-20 20:19:00
  • 2015-04-21 14:30:00 to 2015-04-21 16:59:00
  • 2015-04-22 01:50:00 to 2015-04-22 03:49:00
  • 2015-04-22 12:30:00 to 2015-04-22 13:49:00
  • 2015-04-22 15:30:00 to 2015-04-22 16:39:00
  • 2015-04-22 18:00:00 to 2015-04-22 18:59:00
  • 2015-04-22 20:20:00 to 2015-04-22 21:19:00


(*) Approximate

The problem persists as of April 20, 2015. Backfilling will be carried out during the week of April 20 to 27, 2015, at least for the intervals not marked with (*).
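When preparing a backfill, the interval list can be checked mechanically before use. The following is a minimal sketch, not the procedure the team actually used: it parses a few of the intervals above, rejects any whose end is not after its start, and totals the duration of the intervals not marked as approximate. The names `parse_interval` and `exact_seconds` are hypothetical.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_interval(line):
    """Parse 'START to END'; a trailing '*' marks an approximate interval."""
    text = line.strip()
    approximate = text.endswith("*")
    start_text, end_text = text.rstrip("*").split(" to ")
    start = datetime.strptime(start_text.strip(), FORMAT)
    end = datetime.strptime(end_text.strip(), FORMAT)
    if end <= start:
        # Catches transcription errors such as an end date before the start date.
        raise ValueError("interval does not end after it starts: " + text)
    return start, end, approximate

# A small sample copied from the list above (UTC).
intervals = [
    "2015-04-01 21:10:00 to 2015-04-01 21:20:00*",
    "2015-04-06 09:38:03 to 2015-04-06 11:31:23",
    "2015-04-06 14:08:10 to 2015-04-06 14:54:57",
]

parsed = [parse_interval(line) for line in intervals]

# Total duration of the exactly known intervals, in seconds.
exact_seconds = sum(
    (end - start).total_seconds()
    for start, end, approximate in parsed
    if not approximate
)
print(exact_seconds)
```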

Timeline

Thursday Apr 9, 2015

The problem is observed, studied and presented to the Analytics team.

Friday Apr 10 - Friday Apr 17, 2015

The issue is researched and scoped. Root cause is found.

Monday Apr 20, 2015

Backfilling starts.

Conclusions

Events are inserted into the database at a slower rate than they arrive at the EventLogging server. The resulting backlog accumulates in a buffer inside the EL consumer; the buffer eventually uses too much memory, and the system kills the process. Events that were in the buffer when the process was killed are never inserted into the database.
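The failure mode described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the EventLogging consumer's actual code: the function `simulate_consumer` and all of its parameters (arrival rate, insert rate, and a buffer limit expressed in events rather than bytes) are hypothetical.

```python
def simulate_consumer(arrival_rate, insert_rate, buffer_limit, seconds):
    """Simulate a consumer whose in-memory buffer has no backpressure.

    arrival_rate: events arriving per second (hypothetical value)
    insert_rate:  events written to the database per second (hypothetical)
    buffer_limit: buffered events at which the process is assumed to be killed
    Returns (inserted, lost, killed_at_second); killed_at_second is None
    if the process survives the whole run.
    """
    buffered = 0
    inserted = 0
    for second in range(1, seconds + 1):
        buffered += arrival_rate
        written = min(buffered, insert_rate)
        buffered -= written
        inserted += written
        if buffered > buffer_limit:
            # The process is killed; everything still buffered is lost.
            return inserted, buffered, second
    return inserted, 0, None

# Arrivals outpace inserts by 20 events/s, so the backlog grows until the kill.
print(simulate_consumer(100, 80, 1000, 3600))
# Inserts keep up with arrivals, so the buffer never grows.
print(simulate_consumer(80, 100, 1000, 60))
```

In the first call the backlog grows by a fixed amount every second, so the kill is only a matter of time; in the second the buffer is drained every second and nothing is lost.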

Some weeks ago, the same problem occurred in the DB layer, and changes that optimized event insertion resolved it. In recent weeks, however, EL throughput has increased significantly, which made the problem resurface, this time in the EL consumer layer.

Actionables

  • Status: Done. Backfilling the data.
  • Status: In progress. Implementing a quick solution for the problem to cease.
  • Status: In progress. Implementing a true solution to the EL scaling problem.