Monitoring/Long running screens

From Wikitech

The Icinga check for "long running screen or tmux" processes was added in https://phabricator.wikimedia.org/T165348.

The reasoning was: "We should flag/alert long-running screen sessions, these are usually a sign of work which was forgotten or should rather be puppetised or launched by cron"


There are two ways to resolve an alert like this. Either determine that the session is indeed forgotten work and can be closed, or determine that this is a kind of host where long-running screens are expected, in which case the host should be whitelisted to exclude it from the monitoring check.


The user name and PID of the process in question are part of the check script's output, so if the user name matches the owner's IRC nick, they should already get highlighted.
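To see the same kind of information directly on the host, something like the following can help. It is only a sketch and not the Icinga check itself: the function name long_running_sessions and the one-day default threshold are assumptions made for illustration, and the check's real thresholds may differ.

```shell
#!/bin/sh
# Filter "user pid elapsed_seconds command" lines (read from stdin) down to
# screen/tmux processes that are older than a threshold in seconds.
# NOTE: hypothetical helper; the default of 86400 seconds (one day) is an
# assumption, not the threshold used by the Icinga check.
long_running_sessions() {
    threshold="${1:-86400}"
    awk -v t="$threshold" '
        # $3 = elapsed seconds, $4 = first word of the command name
        # (screen shows up as "screen" or "SCREEN", tmux as "tmux: ...")
        ($3 + 0) > (t + 0) && tolower($4) ~ /^(screen|tmux)/ {
            print $1, $2, $3
        }'
}

# Usage on a host, assuming a procps-ng ps that supports "etimes":
#   ps -e -o user= -o pid= -o etimes= -o comm= | long_running_sessions 86400
```

Each output line is the user name, the PID, and the age of the process in seconds, which is enough to know whom to ping.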

What you can do:

  • if a WARNING: ignore it, unless it is your own process. Warnings exist to give the owner a window to notice the session and close it.

if already a CRITICAL:

  • ping the user, ask whether they still need the screen/tmux, and ask them to either close it or ack it on Icinga (note: there are very few reasons to just ack it, so a good reason has to be given, e.g. an ongoing outage; prefer the permanent whitelist below)

or

  • go to the host in question yourself and check what is in the screen
    • trick to get into the screen of another user: sudo -s; su $username; script /dev/null ; screen -x (why? [1])
    • if things look inactive or forgotten, close the screen; otherwise go back to asking the user

or

  • whitelist the host in question because long-running screens are expected
    • make a Puppet change and set "monitor_screens: false" in Hiera for a role or host
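
As a minimal sketch, the Hiera change could look like the fragment below. The key is written exactly as quoted above. The file path in the comment is a hypothetical example, and the real key may need a class prefix in the Puppet repository, so check existing uses before copying it.

```yaml
# Hypothetical location: hieradata/role/common/<your_role>.yaml
# (the path is an assumption; use the Hiera file for the affected role or host)
monitor_screens: false
```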