Monitoring procedure

From Wikitech

Proposed monitoring procedure

Daily:

  • Check Nagios for new alerts.
  • Fix simple issues such as daemons that need restarting or servers that can be rebooted remotely.
  • Record any issues that need on-site attention under datacentre tasks.
  • Hand over any more complex software issues to a staff member with the relevant expertise.
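
The daily alert check could be partly scripted. The following is a minimal sketch, not a description of the actual setup: it assumes Nagios writes a status file made of "servicestatus { key=value ... }" blocks, with current_state 0 meaning OK, and the helper names (parse_service_blocks, non_ok_services) and the SAMPLE text are invented for illustration.

```python
import re

# State codes as used in Nagios status output (assumed: 0 = OK).
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_service_blocks(text):
    """Return one dict of key=value fields per 'servicestatus { ... }' block."""
    blocks = []
    for match in re.finditer(r"servicestatus\s*\{(.*?)\n\s*\}", text, re.DOTALL):
        fields = {}
        for line in match.group(1).splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:  # skip blank lines and anything without '='
                fields[key] = value
        blocks.append(fields)
    return blocks


def non_ok_services(text):
    """List (host, service, state name) for every service not in the OK state."""
    alerts = []
    for fields in parse_service_blocks(text):
        state = int(fields.get("current_state", 3))
        if state != 0:
            alerts.append((
                fields.get("host_name", "?"),
                fields.get("service_description", "?"),
                STATE_NAMES.get(state, "UNKNOWN"),
            ))
    return alerts


# Hypothetical sample data, not taken from a real server.
SAMPLE = """
servicestatus {
    host_name=db1
    service_description=Disk Space
    current_state=2
    }
servicestatus {
    host_name=web1
    service_description=HTTP
    current_state=0
    }
"""

if __name__ == "__main__":
    for host, service, state in non_ok_services(SAMPLE):
        print(f"{host}: {service} is {state}")
```

Comparing today's list with the previous day's (for example with a set difference) would leave only the alerts that are new.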

Weekly:

  • Capacity check. Make sure key metrics such as application CPU utilisation and disk space usage are not approaching dangerous limits.
  • Publish a report detailing the times at which Nagios was checked, the issues noted, and any people notified. Alternatively, make this information available continuously so that it can be reviewed weekly.
  • Another team member should check the report and confirm that the monitoring met an appropriate standard.
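
As an illustration of the weekly capacity check, the sketch below compares metrics against warning limits. The limits (80% CPU, 85% disk) and the names LIMITS, disk_usage_fraction and capacity_warnings are assumptions made for this example; the procedure itself does not define what counts as a dangerous limit.

```python
import shutil

# Hypothetical warning limits, as fractions of capacity. Tune per service.
LIMITS = {"cpu_utilisation": 0.80, "disk_usage": 0.85}


def disk_usage_fraction(path="/"):
    """Return the used fraction of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def capacity_warnings(metrics, limits=LIMITS):
    """Return a message for every metric that is at or above its limit."""
    warnings = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            warnings.append(f"{name} at {value:.0%} (limit {limit:.0%})")
    return warnings


if __name__ == "__main__":
    # CPU utilisation would come from whatever metrics system is in use;
    # only the disk figure is measured locally here.
    metrics = {"disk_usage": disk_usage_fraction("/")}
    for message in capacity_warnings(metrics):
        print(message)
```

Keeping the comparison separate from the data collection means the same check can be fed figures from any metrics source.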

Every one to two months:

  • Capacity review. Analyse capacity metrics and report your findings. Notify the team of upcoming performance bottlenecks which might require hardware purchases.
  • Report any long-term issues which have been left unfixed.
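
One way to spot an upcoming bottleneck during the capacity review is to extrapolate a usage trend. The sketch below fits a straight line to (day, usage) samples by least squares and estimates how many days remain before a chosen limit is reached. The function name days_until_limit, the 90% limit and the sample figures are hypothetical, and real growth is rarely perfectly linear, so treat the result as a rough guide only.

```python
def days_until_limit(samples, limit=1.0):
    """Estimate days from the last sample until usage reaches `limit`.

    `samples` is a list of (day, usage_fraction) pairs. A straight line is
    fitted by least squares. Returns None if usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = max(x for x, _ in samples)
    # Day on which the fitted line crosses the limit, relative to the last sample.
    return max(0.0, (limit - intercept) / slope - last_day)


if __name__ == "__main__":
    # Hypothetical weekly disk usage readings: (day number, fraction used).
    readings = [(0, 0.50), (7, 0.55), (14, 0.60)]
    remaining = days_until_limit(readings, limit=0.90)
    print(f"About {remaining:.0f} days until 90% full")
```

If the estimate falls within the lead time for a hardware purchase, that is a reasonable trigger for notifying the team.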