Incident documentation/20151229-SiliconOutage

From Wikitech

On December 29, 2015, at about 21:57 UTC, Silicon (the fundraising ActiveMQ server) stopped accepting connections. FR tech received a flood of email alerts from Barium, the queue consumer. Elliott investigated immediately and paged Jeff.

Sequence of Events

  • 22:00 email alerts begin
  • 22:05 (correct time?) Peter takes banners down, pages Jeff
  • 22:40 Jeff finds CPUs pinned at 100% on Silicon and ActiveMQ unresponsive, and restarts ActiveMQ
  • 22:41 queue and payments processing return to normal function
  • 22:45 Peter puts banners back up

Impact

While the queue was locked up, no donations could be processed, and donors may have seen error messages or timeouts. Any data irregularities should sort themselves out via the audit file download and process jobs.

Root cause

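ActiveMQ appears to have crashed or hung: CPUs on Silicon were pinned at 100% and the broker was unresponsive until it was restarted. The underlying trigger has not been identified.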

Recommendations

Investigate potential ActiveMQ bugs that could explain the lockup.
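
As an aid to that investigation, the symptom reported here (the broker no longer accepting connections) can be probed with a plain TCP connect. The following is a minimal sketch, not the check FR tech actually runs: the function name is hypothetical, and the host and port to probe are assumptions, since this report does not state which port Silicon listens on (61613 is ActiveMQ's default STOMP port, 61616 its default OpenWire port).

```python
import socket


def broker_accepts_connections(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests TCP reachability.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers refused connections, timeouts, and name-resolution failures.
        return False
```

A call such as broker_accepts_connections("silicon", 61613) would return False while the port refuses or times out. Note that a successful TCP connect only shows that the port is open; a broker that is hung at the application level may still complete the handshake, so a protocol-level check (for example, sending and consuming a test message) would be a stronger signal.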