SMART
As part of bug T86552 we are now exporting SMART attributes from disks installed on all physical hosts.
Overview
We want to track SMART attribute values over time, both for operational purposes (a disk is reporting errors) and auditing purposes (e.g. disks usually are within these temperatures). To this end, from each host we export the attributes known to smartmontools as Prometheus metrics. The metrics are then collected as part of the regular monitoring pipeline and are available for alerting, graphing and querying. The grunt work is done by a custom Python script (smart-data-dump) that periodically enumerates physical disks (including ones behind RAID controllers) and parses smartctl output for those disks, writing the resulting metrics into a plaintext file for node_exporter to pick up and present via HTTP.
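The last step (writing metrics into a plaintext file for node_exporter) can be sketched as below. This is a minimal illustration and not the actual smart-data-dump code: the function names and the output path are assumptions. It renders samples in the Prometheus text format and replaces the file atomically, so that node_exporter never reads a half-written file.

```python
import os
import tempfile


def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    def escape(text):
        # Label values must escape backslash, double quote and newline.
        return str(text).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_str = ",".join(
        '{}="{}"'.format(key, escape(val)) for key, val in sorted(labels.items())
    )
    return "{}{{{}}} {}".format(name, label_str, float(value))


def write_textfile(path, lines):
    """Write all lines to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        # mkstemp creates the file as 0600; make it readable by the exporter.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Hypothetical output path, for illustration only.
    sample = format_metric("device_smart_temperature_celsius", {"device": "sda"}, 27)
    write_textfile("device_smart.prom", [sample])
```

node_exporter's textfile collector only reads files ending in .prom from the directory given with --collector.textfile.directory, which is why the temporary file uses a different suffix until it is complete.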
Metrics
Each attribute is exported as a metric, with the attribute name (as reported by smartctl) as the metric name, prefixed with device_smart_. The tags attached to all metrics are at least instance and device. The latter uniquely identifies the disk on a host: sdX in case of directly-attached disks, or DRIVERNAME,DISK_ID in case of disks behind RAID controllers. In case of RAID controllers, the device tag is what is passed as the argument to smartctl -d.
For example:
# smart-data-dump | grep -v ^# | grep sda
device_smart_udma_crc_error_count{device="sda"} 0.0
device_smart_raw_read_error_rate{device="sda"} 205493718.0
device_smart_media_wearout_indicator{device="sda"} 365324.0
device_smart_healthy{device="sda"} 1.0
device_smart_used_rsvd_blk_cnt_tot{device="sda"} 0.0
device_smart_offline_uncorrectable{device="sda"} 0.0
device_smart_power_cycle_count{device="sda"} 24.0
device_smart_power_on_hours{device="sda"} 14065.0
device_smart_read_soft_error_rate{device="sda"} 124759545302.0
device_smart_reallocated_sector_ct{device="sda"} 0.0
device_smart_end_to_end_error{device="sda"} 0.0
device_smart_info{device="sda",firmware="G201DL2B",model="INTEL SSDSC2BX200G4R"} 1.0
device_smart_erase_fail_count_total{device="sda"} 0.0
device_smart_hardware_ecc_recovered{device="sda"} 0.0
device_smart_unused_rsvd_blk_cnt_tot{device="sda"} 5455.0
device_smart_program_fail_cnt_total{device="sda"} 0.0
device_smart_temperature_celsius{device="sda"} 27.0
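The naming scheme above can be sketched as follows. This is an assumption-laden illustration, not the script's real code: metric_name and parse_attribute_row are hypothetical helpers, and the exact normalization smart-data-dump applies is not shown on this page. The sketch lowercases the attribute name, replaces characters that are not valid in a metric name, and takes the raw value from a row of the ATA attribute table printed by smartctl --attributes.

```python
import re

PREFIX = "device_smart_"


def metric_name(attribute_name):
    """Turn a smartctl attribute name into a prefixed, lowercase metric name."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", attribute_name.strip())
    return PREFIX + cleaned.lower()


def parse_attribute_row(line):
    """Return (attribute_name, raw_value) for one ATA attribute table row.

    Rows have ten columns: ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH,
    TYPE, UPDATED, WHEN_FAILED, RAW_VALUE. Anything else yields None.
    """
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    # RAW_VALUE may carry trailing text, e.g. "27 (Min/Max 20/40)".
    raw = fields[9].split()[0]
    try:
        return fields[1], float(raw)
    except ValueError:
        return None


if __name__ == "__main__":
    # Illustrative row: the flag and normalized columns here are made up.
    row = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0"
    name, value = parse_attribute_row(row)
    print('{}{{device="sda"}} {}'.format(metric_name(name), value))
```

Only the raw value is used here, which matches the note in the TODO section that raw attributes are what is currently exported.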
Operations
The smart-data-dump script can be invoked from the command line to see the resulting metrics on stdout and to debug any problems. Since the script is normally invoked by cron, most of the non-metrics output is logged to syslog for easier auditing and to avoid cronspam.
Some attributes are purposefully not reported as metrics because they are not useful or interesting. Invoking the script with --debug shows which attributes are output by smartmontools but ignored.
To run a full audit of unknown attributes it is sufficient to run smart-data-dump in debug and syslog mode, across the fleet:
cumin -s20 -b50 '*' 'if [ -x /usr/local/sbin/smart-data-dump ]; then /usr/local/sbin/smart-data-dump --debug --syslog >/dev/null; fi'
Alerts
SMART not healthy
This alert checks whether disks are healthy according to smartmontools. The alert contains the affected device(s), which can be passed to smartctl to get the full SMART status. For example, for an alert on db1064 such as cluster=mysql device={megaraid,2,megaraid,6} instance=db1064:9100 job=node site=eqiad:
root@db1068:~# smart-data-dump --debug >/dev/null
...
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --info --health -d megaraid,2 /dev/bus/0
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --attributes -d megaraid,2 /dev/bus/0
root@db1068:~# /usr/sbin/smartctl --info --health -d megaraid,2 /dev/bus/0
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3600057SS
Revision: 0008
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c50076ad15a3
Serial number: 6SL8SR0K0000N45004XL
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue May 15 11:11:18 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
root@db1068:~# /usr/sbin/smartctl --attributes -d megaraid,2 /dev/bus/0
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.0-1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 43 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 443295929
  Blocks received from initiator = 2624070826
  Blocks read from cache and sent to initiator = 784713905
  Number of read and write commands whose size <= segment size = 130821891
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 7375.38
  number of minutes until next internal SMART test = 45
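When a host has many disks, picking the right smartctl invocations out of the debug output by hand gets tedious. The sketch below is a hypothetical helper, not part of smart-data-dump: it assumes the debug lines keep the "Running: " form shown above and that the device from the alert appears as the argument to -d. It returns the smartctl commands for one device with the timeout wrapper stripped.

```python
import os
import shlex

MARKER = "Running: "


def smartctl_commands(debug_lines, device):
    """Return the smartctl argv lists that were run with `-d <device>`."""
    commands = []
    for line in debug_lines:
        _, found, command = line.partition(MARKER)
        if not found:
            continue
        argv = shlex.split(command)
        # Drop any wrapper (e.g. /usr/bin/timeout 60) before smartctl itself.
        start = next(
            (i for i, arg in enumerate(argv) if os.path.basename(arg) == "smartctl"),
            None,
        )
        if start is None:
            continue
        argv = argv[start:]
        # Keep only commands whose -d argument is exactly the alert's device.
        if "-d" in argv[:-1] and argv[argv.index("-d") + 1] == device:
            commands.append(argv)
    return commands


if __name__ == "__main__":
    lines = [
        "DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --info --health -d megaraid,2 /dev/bus/0",
        "DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --attributes -d megaraid,2 /dev/bus/0",
    ]
    for argv in smartctl_commands(lines, "megaraid,2"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Commands that do not pass -d at all are skipped by this sketch, so it may not cover directly-attached disks if the script invokes smartctl differently for them.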
TODO
- We are exporting the raw attributes now, consider exporting normalized values too