Incidents/20151203-SiliconOutage

On December 3, 2015, at about 14:00 UTC, Silicon (the fundraising ActiveMQ server) stopped accepting connections. FR tech then received a flood of email alerts from Barium, the queue consumer. Elliott, Peter, and Jeff investigated immediately.


Sequence of Events

  • 14:00 email alerts begin
  • 14:15 Peter takes banners down
  • 14:30 Jeff finds CPUs pinned at 100% on Silicon and restarts ActiveMQ with an increased heap size
  • 15:00 queues start processing and successful test donations are made
  • 15:10 Peter puts banners back up

Impact

While the queue was locked up, no donations could be processed, and donors may have seen error messages or timeouts. Any data irregularities should resolve themselves through the audit file download and processing jobs.

Root cause

The JVM ran out of heap space, probably due to a tuning problem:

2015-12-03 14:30:31,496 | WARN  | Transport Connection to: tcp://10.64.40.109:59670 failed: java.io.IOException: Unexpected error occured: java.lang.OutOfMemoryError: Java heap space | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///10.64.40.109:59670@61613
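
If this recurs, it may help to see when the out-of-memory warnings started relative to the first alerts. The following is a minimal sketch in Python, assuming the ActiveMQ log keeps the pipe-delimited layout of the entry above. The function name `count_oom_per_minute` is hypothetical and not part of any existing FR tech tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the single log entry quoted above:
# "<YYYY-MM-DD HH:MM:SS,mmm> | <LEVEL> | <message> | <logger> | <thread>"
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}"
    r"\s*\|\s*(?P<level>\w+)\s*\|\s*(?P<rest>.*)$"
)

OOM_MARKER = "java.lang.OutOfMemoryError"


def count_oom_per_minute(lines):
    """Count log lines mentioning OutOfMemoryError, grouped by minute.

    Returns a Counter keyed by 'YYYY-MM-DD HH:MM'. Lines that do not
    match the assumed layout are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and OOM_MARKER in match.group("rest"):
            counts[match.group("minute")] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed directly, and the result can be sorted by key to see the minute in which the errors first appeared.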

Recommendations

The problem has not recurred since the heap size was increased. However, Ganglia reporting did not show the queue to be abnormally large when the JVM ran out of memory, so if this happens again, further investigation into the JVM tuning on Silicon may be warranted.