Incident documentation/20140714-Etherpad

From Wikitech
Jump to navigation Jump to search

Summary

Someone ran some sort of scanner script against this service, causing the backend daemon to crash twice. Service recovered on its own, total downtime ~10 mins.

Timeline

  • 19:48 Apparent vulnerability scan from 83.206.173.212 (212-173.206-83.static-ip.oleane.fr.) begins, about 2000 404s seen in zirconium:/var/log/apache/error.log:
 [Mon Jul 14 19:59:42 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webavis
 [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webmail
 [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webnews
 [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webroot
 [Mon Jul 14 19:59:44 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webyep-system
  • 19:50 zirconium:/var/log/etherpad-lite/etherpad-lite.log starts saying strange things, daemon restarts twice:
 [2014-07-14 19:50:31.068] [ERROR] console - Error: Failed assertion: Can't have op.lines of 2 in attribution lines at Object.error
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:39:11)
 [2014-07-14 19:50:31.068] [INFO] console - graceful shutdown...
 RESTART!
 start...
 [...]
 [2014-07-14 19:53:43.458] [INFO] console - graceful shutdown...
 RESTART!
 start...
  • 19:53 Etherpad-lite daemon at zirconium:9001 stops accepting connections.
 Apache error log:
 [Mon Jul 14 19:53:46 2014] [error] [client 76.254.63.162] (20014)Internal
 error: proxy: error reading status line from remote server 127.0.0.1:9001
 [Mon Jul 14 19:53:46 2014] [error] [client 76.254.63.162] proxy: Error
 reading from remote server returned by /p/SULfinalisationengineering
 [Mon Jul 14 19:53:46 2014] [error] [client 140.221.131.106] (20014)Internal
 error: proxy: error reading status line from remote server 127.0.0.1:9001,
 referer: https://etherpad.wikimedia.org/p/Reproducibility_Workshop
  • 19:54: Greg-g reports on IRC that etherpad.wikimedia.org is down, giving a "Service Temporarily Unavailable" message.
  • 19:59: Scanning ends, according to logs
  • 20:01: Service is available again

Conclusions

  • Service recovered without administrative action
  • Icinga didn't detect this outage because it only checks for a running process rather than an HTTP 200 response from port 9001. Plus since Etherpad auto-restarted, Icinga wouldn't have caught it anyway unless itchecked for a PID change.
  • Ganglia shows CPU & loadavg spikes (small), no significant change in net traffic
  • Etherpad is backed by db1001, no significant impact observed in Ganglia
  • Unclear what HTTP response code was sent during the outage
  • We are already running the newest release of Etherpad-lite, 1.4.0. No mention of security concerns from upstream https://github.com/ether/etherpad-lite/ (though they do have 255 issues open!), no relevant entries in CVE DB.
  • No suggestion of security vulnerability at this time

Actionables

  • Configure etherpad-lite.log rotation, currently: 786MB since 2013-11-23

The above has been done. The new package 1.4.0-2 now has logrotate support and now etherpad-lite finally no longer uses console logging but logs directly to /var/log. Relevant changes are: https://gerrit.wikimedia.org/r/#/c/151115/ and https://gerrit.wikimedia.org/r/#/c/151106/ . Due to the migration to upstart that was adder in the latter, downtime due to the software malfunctioning will be shorter (faster restarts by upstart)

Etherpad-lite stack trace

 [2014-07-14 19:53:43.458] [ERROR] console - Error: Failed assertion: Can't
 have op.lines of 2 in attribution lines
     at Object.error
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:39:11)
     at Object.assert
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:53:13)
     at outputMutOp
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:1167:15)
     at Object.exports.mutateAttributionLines
 (/usr/share/etherpad-lite/src/static/js/Changeset.js:1213:9)
     at async.series.callback.forwardsChangesets
 (/usr/share/etherpad-lite/src/node/handler/PadMessageHandler.js:1392:19)
     at async.series.results
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:486:21)
     at _asyncMap
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:185:13)
     at async.forEachSeries.iterate
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:108:13)
     at async.forEachSeries.iterate
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:119:25)
     at _asyncMap
 (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:187:17)
 [2014-07-14 19:53:43.458] [INFO] console - graceful shutdown...
 RESTART!

The above has been reported in https://github.com/ether/etherpad-lite/issues/2215 . No reaction after 2 days