SMART

As part of bug T86552 we are now exporting SMART attributes from disks installed on all physical hosts.

Overview

We want to track SMART attribute values over time, both for operational purposes (e.g. a disk is reporting errors) and for auditing purposes (e.g. establishing the temperature range disks usually operate in). To this end, each host exports the attributes known to smartmontools as Prometheus metrics. The metrics are then collected as part of the regular monitoring pipeline and are available for alerting, graphing, and querying. The grunt work is done by a custom Python script (smart-data-dump) that periodically enumerates physical disks (including those behind RAID controllers), parses the smartctl output for each disk, and writes the resulting metrics to a plaintext file that node_exporter picks up and serves over HTTP.
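The last step of that pipeline, writing metrics to a plaintext file for node_exporter, could look roughly like the sketch below. It is not taken from smart-data-dump: the function name write_prom_file and the sample file name are hypothetical, and the atomic rename is an assumption about how one would avoid node_exporter reading a half-written file.

```python
import os
import tempfile


def write_prom_file(metrics, path):
    """Write (name, labels, value) samples in Prometheus text format.

    Hypothetical helper for illustration; not the real smart-data-dump code.
    """
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join(
            '%s="%s"' % (key, val) for key, val in sorted(labels.items())
        )
        lines.append("%s{%s} %s" % (name, label_str, float(value)))

    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory and rename it into
    # place, so a reader never sees a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.replace(tmp_path, path)
```

A caller would pass a path inside the directory that node_exporter's textfile collector is configured to read, for example write_prom_file(samples, "/some/textfile/dir/device_smart.prom"), where the directory shown is a placeholder.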

Metrics

Each attribute is exported as a metric whose name is the attribute name (as reported by smartctl), prefixed with device_smart_. All metrics carry at least the instance and device tags; the latter uniquely identifies the disk on a host: sdX for directly attached disks, or DRIVERNAME,DISK_ID for disks behind RAID controllers. For disks behind RAID controllers, the device tag is the value passed as the argument to smartctl -d.

For example:

# smart-data-dump | grep -v ^# | grep sda
device_smart_udma_crc_error_count{device="sda"} 0.0
device_smart_raw_read_error_rate{device="sda"} 205493718.0
device_smart_media_wearout_indicator{device="sda"} 365324.0
device_smart_healthy{device="sda"} 1.0
device_smart_used_rsvd_blk_cnt_tot{device="sda"} 0.0
device_smart_offline_uncorrectable{device="sda"} 0.0
device_smart_power_cycle_count{device="sda"} 24.0
device_smart_power_on_hours{device="sda"} 14065.0
device_smart_read_soft_error_rate{device="sda"} 124759545302.0
device_smart_reallocated_sector_ct{device="sda"} 0.0
device_smart_end_to_end_error{device="sda"} 0.0
device_smart_info{device="sda",firmware="G201DL2B",model="INTEL SSDSC2BX200G4R"} 1.0
device_smart_erase_fail_count_total{device="sda"} 0.0
device_smart_hardware_ecc_recovered{device="sda"} 0.0
device_smart_unused_rsvd_blk_cnt_tot{device="sda"} 5455.0
device_smart_program_fail_cnt_total{device="sda"} 0.0
device_smart_temperature_celsius{device="sda"} 27.0
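
The naming scheme visible in the output above can be approximated with the sketch below. The helper names metric_name and format_sample are hypothetical, and the normalization rule (lowercase, with anything other than letters, digits, and underscores replaced by an underscore) is an assumption inferred from the sample metrics, not the script's confirmed behavior.

```python
import re


def metric_name(attribute_name, prefix="device_smart_"):
    """Turn a smartctl attribute name into a metric name (assumed rule)."""
    normalized = re.sub(r"[^a-z0-9_]", "_", attribute_name.lower())
    return prefix + normalized


def format_sample(attribute_name, device, raw_value):
    """Render one sample line with the device tag attached."""
    return '%s{device="%s"} %s' % (
        metric_name(attribute_name),
        device,
        float(raw_value),
    )


print(format_sample("Power_Cycle_Count", "sda", 24))
# prints: device_smart_power_cycle_count{device="sda"} 24.0
```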

Operations

The smart-data-dump script can be invoked from the command line to print the resulting metrics on stdout and to debug any problems. Since the script is normally invoked by cron, most non-metric output is logged to syslog, for easier auditing and to avoid cronspam.

Some attributes are purposefully not reported as metrics because they are not useful or interesting. Invoking the script with --debug shows which attributes are deliberately ignored and which are output by smartmontools but otherwise not reported.

To run a full audit of unknown attributes, it is sufficient to run smart-data-dump in debug and syslog mode across the fleet:

 cumin -s20 -b50 '*' 'if [ -x /usr/local/sbin/smart-data-dump ]; then /usr/local/sbin/smart-data-dump --debug --syslog >/dev/null; fi'

Alerts

SMART not healthy

This alert checks whether disks are healthy according to smartmontools. The alert lists the affected device(s), which can be passed to smartctl to get the full SMART status. For example, for an alert on db1064 such as cluster=mysql device={megaraid,2,megaraid,6} instance=db1064:9100 job=node site=eqiad:

root@db1068:~# smart-data-dump --debug >/dev/null
...
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --info --health -d megaraid,2 /dev/bus/0
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --attributes -d megaraid,2 /dev/bus/0
root@db1068:~# /usr/sbin/smartctl --info --health -d megaraid,2 /dev/bus/0
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST3600057SS
Revision:             0008
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        15000 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50076ad15a3
Serial number:        6SL8SR0K0000N45004XL
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue May 15 11:11:18 2018 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

root@db1068:~# /usr/sbin/smartctl --attributes -d megaraid,2 /dev/bus/0
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     43 C
Drive Trip Temperature:        68 C

Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 443295929
  Blocks received from initiator = 2624070826
  Blocks read from cache and sent to initiator = 784713905
  Number of read and write commands whose size <= segment size = 130821891
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 7375.38
  number of minutes until next internal SMART test = 45
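
The step from a device tag in the alert to the smartctl invocations shown above can be sketched as follows. The function smartctl_commands is hypothetical; the /dev/bus/0 path is copied from the debug output above and the /dev/sdX form for directly attached disks is an assumption, so on a real host the commands printed by smart-data-dump --debug remain the authoritative source.

```python
def smartctl_commands(device, bus_path="/dev/bus/0"):
    """Build smartctl command lines for a device tag taken from an alert.

    Assumptions: tags containing a comma (e.g. "megaraid,2") are passed to
    -d together with bus_path; any other tag is treated as a /dev/ node.
    """
    if "," in device:
        target = ["-d", device, bus_path]
    else:
        target = ["/dev/" + device]
    base = ["/usr/sbin/smartctl"]
    return [
        base + ["--info", "--health"] + target,
        base + ["--attributes"] + target,
    ]


for command in smartctl_commands("megaraid,2"):
    print(" ".join(command))
# prints:
# /usr/sbin/smartctl --info --health -d megaraid,2 /dev/bus/0
# /usr/sbin/smartctl --attributes -d megaraid,2 /dev/bus/0
```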

TODO

  • We currently export only the raw attribute values; consider exporting normalized values too
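
To illustrate the difference between raw and normalized values, the sketch below parses one row of the ATA attribute table printed by smartctl --attributes, whose columns are ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED and RAW_VALUE. The function parse_attribute_row and the sample row are illustrative assumptions, not code or output from smart-data-dump, and SCSI/SAS disks such as the one shown above do not print this table.

```python
def parse_attribute_row(line):
    """Parse one ATA attribute row; return None for non-attribute lines."""
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    return {
        "name": fields[1],
        # VALUE, WORST and THRESH are the vendor-normalized numbers.
        "normalized": int(fields[3]),
        "worst": int(fields[4]),
        "threshold": int(fields[5]),
        # RAW_VALUE may carry trailing annotations; keep the first token.
        "raw": fields[9].split()[0],
    }


# Hypothetical sample row, laid out like smartctl's default ATA output.
row = "  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0"
parsed = parse_attribute_row(row)
print(parsed["name"], parsed["normalized"], parsed["raw"])
```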