Incident documentation/20140714-Etherpad

From Wikitech
Jump to: navigation, search

Summary

Someone ran some sort of scanner script against this service, causing the backend daemon to crash twice. Service recovered on its own, total downtime ~10 mins.

Timeline

  • 19:48 Apparent vulnerability scan from 83.206.173.212 (212-173.206-83.static-ip.oleane.fr.) begins, about 2000 404s seen in zirconium:/var/log/apache/error.log:
 [Mon Jul 14 19:59:42 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webavis
 [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webmail
 [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webnews
 [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webroot
 [Mon Jul 14 19:59:44 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webyep-system
  • 19:50 zirconium:/var/log/etherpad-lite/etherpad-lite.log starts saying strange things, daemon restarts twice:
 [2014-07-14 19:50:31.068] [ERROR] console - Error: Failed assertion: Can't have op.lines of 2 in attribution lines at Object.error
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:39:11)
 [2014-07-14 19:50:31.068] [INFO] console - graceful shutdown...
 RESTART!
 start...
 [...]
 [2014-07-14 19:53:43.458] [INFO] console - graceful shutdown...
 RESTART!
 start...
  • 19:53 Etherpad-lite daemon at zirconium:9001 stops accepting connections.
 Apache error log:
 [Mon Jul 14 19:53:46 2014] [error] [client 76.254.63.162] (20014)Internal
 error: proxy: error reading status line from remote server 127.0.0.1:9001
 [Mon Jul 14 19:53:46 2014] [error] [client 76.254.63.162] proxy: Error
 reading from remote server returned by /p/SULfinalisationengineering
 [Mon Jul 14 19:53:46 2014] [error] [client 140.221.131.106] (20014)Internal
 error: proxy: error reading status line from remote server 127.0.0.1:9001,
 referer: https://etherpad.wikimedia.org/p/Reproducibility_Workshop
  • 19:54: Greg-g reports on IRC that etherpad.wikimedia.org is down, giving a "Service Temporarily Unavailable" message.
  • 19:59: Scanning ends, according to logs
  • 20:01: Service is available again

Conclusions

  • Service recovered without administrative action
  • Icinga didn't detect this outage because it only checks for a running process rather than an HTTP 200 response from port 9001. Plus since Etherpad auto-restarted, Icinga wouldn't have caught it anyway unless itchecked for a PID change.
  • Ganglia shows CPU & loadavg spikes (small), no significant change in net traffic
  • Etherpad is backed by db1001, no significant impact observed in Ganglia
  • Unclear what HTTP response code was sent during the outage
  • We are already running the newest release of Etherpad-lite, 1.4.0. No mention of security concerns from upstream https://github.com/ether/etherpad-lite/ (though they do have 255 issues open!), no relevant entries in CVE DB.
  • No suggestion of security vulnerability at this time

Actionables

  • Configure etherpad-lite.log rotation, currently: 786MB since 2013-11-23

The above has been done. The new package 1.4.0-2 now has logrotate support and now etherpad-lite finally no longer uses console logging but logs directly to /var/log. Relevant changes are: https://gerrit.wikimedia.org/r/#/c/151115/ and https://gerrit.wikimedia.org/r/#/c/151106/ . Due to the migration to upstart that was adder in the latter, downtime due to the software malfunctioning will be shorter (faster restarts by upstart)

Etherpad-lite stack trace

 [2014-07-14 19:53:43.458] [ERROR] console - Error: Failed assertion: Can't
 have op.lines of 2 in attribution lines
     at Object.error
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:39:11)
     at Object.assert
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:53:13)
     at outputMutOp
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:1167:15)
     at Object.exports.mutateAttributionLines
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:1213:9)
     at async.series.callback.forwardsChangesets
 (/usr/share/etherpad-lite/src/node/handler/PadMessageHandler.js:1392:19)
     at async.series.results
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:486:21)
     at _asyncMap
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:185:13)
     at async.forEachSeries.iterate
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:108:13)
     at async.forEachSeries.iterate
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:119:25)
     at _asyncMap
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:187:17)
 [2014-07-14 19:53:43.458] [INFO] console - graceful shutdown...
 RESTART!

The above has been reported in https://github.com/ether/etherpad-lite/issues/2215 . No reaction after 2 days