Incidents/20151229-SiliconOutage
On December 29, 2015, at about 21:57 UTC, Silicon (the fundraising ActiveMQ server) stopped accepting connections. FR tech received a flood of email alerts from Barium, the queue consumer. Elliott investigated immediately and paged Jeff.
Sequence of Events
- 22:00 email alerts begin
- 22:05 (correct time?) Peter takes banners down, pages Jeff
- 22:40 Jeff finds CPUs pinned at 100% on Silicon and ActiveMQ unresponsive, and restarts ActiveMQ
- 22:41 queue and payments processing return to normal function
- 22:45 Peter puts banners back up
Impact
While the queue was locked up, no donations could be processed, and donors may have seen error messages or timeouts. Any data irregularities should sort themselves out via the audit file download and process jobs.
Root cause
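ActiveMQ appears to have crashed or locked up: CPUs on Silicon were pinned at 100% and the broker was unresponsive until it was restarted. The underlying cause has not been identified.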
Recommendations
Further investigation re. potential ActiveMQ bugs.
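As a possible aid to that investigation, the sketch below shows one way to record whether the broker is still accepting TCP connections, which is the symptom seen in this incident. It is an illustration only, not an existing FR tech tool: the function name `broker_accepts_connections` is hypothetical, and the host name and port in the usage note are assumptions (61613 is ActiveMQ's default STOMP port; the port actually used on Silicon is not stated here). A TCP-level check only catches refused or timed-out connections; a broker that is hung but still listening could pass it, so a protocol-level check would be needed to cover that case.

```python
import socket


def broker_accepts_connections(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only tests the TCP handshake. It does not speak STOMP or OpenWire,
    so it cannot tell whether the broker is actually processing messages.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # connection-refused errors) if the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `broker_accepts_connections("silicon.example", 61613)` (hypothetical host name) could be run on a schedule from a separate machine and its result logged with a timestamp, which would help pin down when a future lockup begins.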