Incidents/20140714-Etherpad
Appearance
Summary
Someone ran some sort of scanner script against this service, causing the backend daemon to crash twice. Service recovered on its own, total downtime ~10 mins.
Timeline
- 19:48 Apparent vulnerability scan from 83.206.173.212 (212-173.206-83.static-ip.oleane.fr.) begins, about 2000 404s seen in zirconium:/var/log/apache/error.log:
[Mon Jul 14 19:59:42 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webavis [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webmail [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webnews [Mon Jul 14 19:59:43 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webroot [Mon Jul 14 19:59:44 2014] [error] [client 83.206.173.212] File does not exist: /var/www/planet/en/webyep-system
- 19:50 zirconium:/var/log/etherpad-lite/etherpad-lite.log starts saying strange things, daemon restarts twice:
[2014-07-14 19:50:31.068] [ERROR] console - Error: Failed assertion: Can't have op.lines of 2 in attribution lines at Object.error (/usr/share/etherpad-lite/src/static/js/Changeset.js:39:11) [2014-07-14 19:50:31.068] [INFO] console - graceful shutdown... RESTART! start... [...] [2014-07-14 19:53:43.458] [INFO] console - graceful shutdown... RESTART! start...
- 19:53 Etherpad-lite daemon at zirconium:9001 stops accepting connections.
Apache error log: [Mon Jul 14 19:53:46 2014] [error] [client 76.254.63.162] (20014)Internal error: proxy: error reading status line from remote server 127.0.0.1:9001 [Mon Jul 14 19:53:46 2014] [error] [client 76.254.63.162] proxy: Error reading from remote server returned by /p/SULfinalisationengineering [Mon Jul 14 19:53:46 2014] [error] [client 140.221.131.106] (20014)Internal error: proxy: error reading status line from remote server 127.0.0.1:9001, referer: https://etherpad.wikimedia.org/p/Reproducibility_Workshop
- 19:54: Greg-g reports on IRC that etherpad.wikimedia.org is down, giving a "Service Temporarily Unavailable" message.
- 19:59: Scanning ends, according to logs
- 20:01: Service is available again
Conclusions
- Service recovered without administrative action
- Icinga didn't detect this outage because it only checks for a running process rather than an HTTP 200 response from port 9001. Plus since Etherpad auto-restarted, Icinga wouldn't have caught it anyway unless itchecked for a PID change.
- Ganglia shows CPU & loadavg spikes (small), no significant change in net traffic
- Etherpad is backed by db1001, no significant impact observed in Ganglia
- Unclear what HTTP response code was sent during the outage
- We are already running the newest release of Etherpad-lite, 1.4.0. No mention of security concerns from upstream https://github.com/ether/etherpad-lite/ (though they do have 255 issues open!), no relevant entries in CVE DB.
- No suggestion of security vulnerability at this time
Actionables
- Configure etherpad-lite.log rotation, currently: 786MB since 2013-11-23
The above has been done. The new package 1.4.0-2 now has logrotate support and now etherpad-lite finally no longer uses console logging but logs directly to /var/log. Relevant changes are: https://gerrit.wikimedia.org/r/#/c/151115/ and https://gerrit.wikimedia.org/r/#/c/151106/ . Due to the migration to upstart that was adder in the latter, downtime due to the software malfunctioning will be shorter (faster restarts by upstart)
Etherpad-lite stack trace
[2014-07-14 19:53:43.458] [ERROR] console - Error: Failed assertion: Can't have op.lines of 2 in attribution lines at Object.error (/usr/share/etherpad-lite/src/static/js/Changeset.js:39:11) at Object.assert (/usr/share/etherpad-lite/src/static/js/Changeset.js:53:13) at outputMutOp (/usr/share/etherpad-lite/src/static/js/Changeset.js:1167:15) at Object.exports.mutateAttributionLines (/usr/share/etherpad-lite/src/static/js/Changeset.js:1213:9) at async.series.callback.forwardsChangesets (/usr/share/etherpad-lite/src/node/handler/PadMessageHandler.js:1392:19) at async.series.results (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:486:21) at _asyncMap (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:185:13) at async.forEachSeries.iterate (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:108:13) at async.forEachSeries.iterate (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:119:25) at _asyncMap (/usr/share/etherpad-lite/src/node_modules/async/lib/async.js:187:17) [2014-07-14 19:53:43.458] [INFO] console - graceful shutdown... RESTART!
The above has been reported in https://github.com/ether/etherpad-lite/issues/2215 . No reaction after 2 days