SRE/Dc-operations/Hardware Troubleshooting Runbook

This runbook should be applied in the following manner:

All Hardware => Hardware Failure Log Gathering => Type of Hardware Failure

Each step assumes that the step preceding it has failed to resolve the issue.

There are three parts to this runbook. Parts 1 and 3 apply to any/all SRE team members; Part 2 is specific to the DC Operations sub-team.

Part 1: All SRE Teams

All Hardware

These steps apply to all hardware troubleshooting tasks.

  • Open a phabricator task using the Hardware Troubleshooting Phabricator Template.
  • Determine the location of the server via netbox or via Infrastructure_naming_conventions. You will need to apply the project for that site's on-site queue to the task. Example: a server in eqiad needs the #ops-eqiad project applied to its troubleshooting task.
  • Determine the warranty status of the server via netbox and note it on the task.
    • In-warranty items will be replaced with in-warranty replacements from the vendor, which take one business day to ship to us once the on-site engineer has taken the appropriate steps with the vendor.
    • Out-of-warranty items will have their replacement reviewed: if a system is 5+ years old, it will need to be slated for replacement; systems between 3 and 5 years old will have their replacement reviewed at the time of hardware failure.
  • Gather the hardware failure information for the system in question.
  • After setting the status to "failed" in netbox, run the sre.dns.netbox cookbook (see the sketch after this list).
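
A minimal sketch of that last step, assuming the cookbook is run from a cluster management (cumin) host and that it accepts a free-form reason argument (check the cookbook's --help for its exact options; the hostname and task number below are placeholders):

# On the host's Netbox page, set Status to "failed", then propagate the change:
sudo cookbook sre.dns.netbox "example1001 hardware failure - T000000"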

Hardware Failure Log Gathering

  • Please note the following directions require root and mgmt access to the system in question. This limits the steps to the SRE team and very few outside of it.
  • The system hardware log sometimes shows disk failures, and sometimes does not. There are additional log gathering steps later in the runbook for disk logging.

Any System with OS running

  • SSH in and run: sudo ipmi-sel
    • Save the output and paste as comment on the hardware repair/troubleshooting task.
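
If you want to keep a copy of the output while pasting it into the task, one option (the output path is just a suggestion) is:

# Print the system event log and save a copy at the same time.
sudo ipmi-sel | tee "/tmp/$(hostname)_sel_$(date +%F).txt"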

Dell Systems

  • Only use the ilom directions below if the OS is offline and you cannot simply run: sudo ipmi-sel
  • login via ssh to the mgmt/drac interface of the system.
    • You will need to have basic shell access to the cluster (via a bastion host) set up; all mgmt ssh access is restricted to bastions and cumin hosts.
    • The mgmt address is hostname.mgmt.sitename.wmnet. Example: bast1001 = bast1001.mgmt.eqiad.wmnet. So ssh root@bast1001.mgmt.eqiad.wmnet to log in to it.
    • You will need to provide the mgmt password, which is stored in pwstore.
  • Once logged in, run: racadm getsel
    • Save the output and paste as comment on the hardware repair/troubleshooting task.
    • If no log entries match the relevant times, racadm getraclog sometimes has additional info (see the sketch after this list).
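
A sketch of the whole sequence; the hostname below is a placeholder, and it assumes you are starting from a bastion or cumin host (the only places mgmt ssh is allowed from):

# Connect to the iDRAC; the password is the mgmt root password from pwstore.
ssh root@example1001.mgmt.eqiad.wmnet
# On the iDRAC prompt, dump the system event log, and the RAC log if the SEL
# does not cover the relevant time window:
racadm getsel
racadm getraclog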

HP Systems

  • Only use the ilom directions below if the OS is offline and you cannot simply run: sudo ipmi-sel
  • Login via https to the mgmt/iLO interface of the system.
    • Logging into the https mgmt interface requires you to set up an https proxy via a cluster management host. You can do this by using a proxy extension (like FoxyProxy) or just setting your browser to route all traffic via localhost:8080 and running the following command (requires shell access to the cluster management host to already be working): ssh -D 8080 cumin1001.eqiad.wmnet. A quick verification sketch follows this list.
    • Your browser will warn you that the https certificate is not configured properly (it is self signed by the vendor for the ilom interface) and you will have to click 'Advanced' and then confirm the exception and add the certificate. Once you do this, it will load the ilom login screen.
    • You will need to provide the mgmt password, which is stored in pwstore
    • Once https mgmt interface loads, click 'iLO event log', 'View CSV' and then copy/paste the CSV as a comment on the hardware troubleshooting task for the host.
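
If you want to sanity-check the tunnel before touching the browser, one option (assuming curl is available locally; the mgmt hostname is a placeholder) is:

# In one terminal, open the SOCKS proxy via a cluster management host:
ssh -D 8080 cumin1001.eqiad.wmnet
# In another terminal, confirm the iLO answers through the proxy
# (-k skips the vendor's self-signed certificate check):
curl -sk --socks5-hostname localhost:8080 https://example1001.mgmt.eqiad.wmnet/ | head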

iLom Command Failures

  • Systems will sometimes have drac command failures (failure to reboot a PSU, failure to connect to the serial port).
  • First try a racadm racreset to soft-reset the iDRAC (see the sketch after this list).
  • If that fails, a full power removal of the host can sometimes fix it.
  • The host will need time scheduled for it to be taken offline for troubleshooting.
  • Pull up the host and its .mgmt interface in icinga & set the system and mgmt interface to maint mode (no checks/alarms on services) for 5 business days (this will often result in 7 calendar days due to weekends). Set the comment for this change to point to the troubleshooting phabricator task.
    • Please note if a system cannot be offline that long, the person filing the task will need to coordinate with the on-site engineer for the datacenter in question. This coordination is the responsibility of the person filing the task. IE: They need to provide acceptable maintenance windows on the task.
  • Poll syslog and/or /var/log/messages for error messages & paste them into the task comments.
  • Pull the hardware failure logs from the system ilom & paste into task comments. (Directions on how to do that below.)
  • Put task in the 'Hardware Repair / Troubleshooting' column for the appropriate onsite phabricator queue and the on-site engineer for that site will triage its repair.
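
A sketch of the soft reset mentioned above, run against the host's mgmt interface (the hostname is a placeholder; run the ssh from a bastion or cumin host):

# Log into the iDRAC and soft-reset it; the session will drop and the iDRAC
# will take a few minutes to come back before you can retry the failed command.
ssh root@example1001.mgmt.eqiad.wmnet
racadm racreset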

Power Supply Failures

Power supply replacement/repair/troubleshooting can most often be done without any resulting downtime to the system. For these hosts, you won't need to put the host into a maintenance window.

  • Most often, this is simply the power cord becoming unseated on the server or the PDU in the rack.
  • Do not offline the host or services.
  • Basic log gathering steps from earlier should show the power supply failure (if not, are you sure it has failed and that you pulled the info correctly?). You can also poll the PSU sensors directly; see the sketch after this list.
  • Put task in the 'Hardware Repair / Troubleshooting' column for the appropriate onsite phabricator queue and the on-site engineer for that site will triage its repair.
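
To double-check that a PSU has actually failed rather than a mis-read log, you can poll the power supply sensors from the OS. The command below assumes ipmitool is installed on the host (freeipmi tools can provide the same information):

# List power supply sensor readings; a failed or unplugged PSU will typically
# show a non-ok state here.
sudo ipmitool sdr type "Power Supply"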

HDD & SSD Failures

Hard drive and solid state drive replacements can most often be done without any resulting downtime to the system. For these hosts, you won't need to put the host into a maintenance window.

Please note that HDD and SSD failures will need further information gathered from the systems. How this is gathered depends on if it is software raid or hardware raid (and then what hardware raid controller is in use.)

We'll often use software raid on systems with 4 or fewer disks. Anything with more disks tends to be hardware raid. Full details can be gathered on the host via the usual means, or can be referenced via netbox, which shows what phabricator task a system was ordered on; that task includes hardware raid controller details (if included) or notes that software raid was specified.

You will need to determine if a system uses software or hardware raid before you can follow the rest of the runbook.

Determining type of disk raid

  • Log in to the OS and run: cat /proc/mdstat (see the sketch below).
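
If /proc/mdstat lists md arrays, the host is using software raid; if it lists none and the disks sit behind a raid controller, it is hardware raid. A quick sketch for checking both:

# Software raid: any mdX arrays will be listed here.
cat /proc/mdstat
# Hardware raid: look for a PERC / Smart Array style controller on the PCI bus.
lspci | grep -i raid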

Software Raid Information Gathering

  • Login to the system OS via SSH
  • Check the syslog and/or /var/log/messages and parse for info on which disk has failed. Paste the failures as a comment on the task.
  • Check cat /proc/mdstat and paste output into task.
  • Check sudo mdadm --detail /dev/mdX (where X is the number of the failed mdadm array) and paste onto the task (a combined sketch follows this list).
  • If the system is in warranty, please see the steps for 'Gathering Support Logs for Warranty Replacement'
  • Put task in the 'Hardware Repair / Troubleshooting' column for the appropriate onsite phabricator queue and the on-site engineer for that site will triage its repair, see https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ for details.
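
A combined sketch of the information-gathering steps above, using /dev/md0 as a hypothetical failed array (take the real array name from /proc/mdstat; the log path may vary):

# Array overview and per-array detail, including which member is failed/removed:
cat /proc/mdstat
sudo mdadm --detail /dev/md0
# Recent log lines about the failing disk, for pasting into the task:
sudo grep -iE 'md/raid|i/o error|faulty' /var/log/syslog | tail -n 50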

Hardware Raid Information Gathering

Dell Hardware Raid Information Gathering

  • Dell uses the megacli utility to interface with the hardware raid controllers.
  • All of our Dell Systems use the Perc H730P controller for internal raid arrays.
    • External disk arrays are controlled by either the Perc H830 or H840 controllers.
  • Login to the system OS.
  • Run the following to poll the raid controller log and then paste output via comment to task:
sudo megacli -AdpEventLog -GetEvents -f events.log -aALL && cat events.log
  • Run the following to poll the summary data for the virtual disks and determine which has a degraded array:
sudo megacli -LDInfo -Lall -aALL
  • Run the following to list ALL physical disks, which includes their raid state:
sudo megacli -PDList -aALL
  • You'll want to copy/paste the disk info for any disk with a firmware state other than Firmware state: Online, Spun Up. Anything else is likely a bad disk and should be copied via comment to the task (a filtering sketch follows this list).
  • If the system is in warranty, please see the steps for 'Gathering Support Logs for Warranty Replacement'
  • Put task in the 'Hardware Repair / Troubleshooting' column for the appropriate onsite phabricator queue and the on-site engineer for that site will triage its repair.
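
Since the -PDList output is long, a convenience filter (not a replacement for pasting the full info for the bad disk) to spot any disk whose firmware state is not "Online, Spun Up":

# Show each disk's slot number next to its firmware state.
sudo megacli -PDList -aALL | grep -E 'Slot Number|Firmware state'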

HP Hardware Raid Information Gathering

  • HP uses the hpssacli utility to interface with all HP systems' hardware raid controllers.
  • Login to the system OS.
  • Run the following to poll the raid controller's config & paste into task:
hpssacli ctrl all show config
  • Run the following to poll raid controller status & paste into task:
hpssacli ctrl all show status
  • Run the following to show disk status & paste into task:
hpssacli ctrl slot=<controller slot number from show config command, typically 1> pd all show status
    • Any failed disk should have its detailed information gathered and pasted onto the task:
hpssacli ctrl slot=<controller slot number from show config command, typically 1> pd all show detail. Example: sudo hpssacli ctrl slot=1 pd all show detail
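
Putting the above together for a typical single-controller host (slot 1 is an assumption; take the real slot number from the show config output):

# Controller config and status, then per-disk status and detail.
sudo hpssacli ctrl all show config
sudo hpssacli ctrl all show status
sudo hpssacli ctrl slot=1 pd all show status
sudo hpssacli ctrl slot=1 pd all show detail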

All Other Failures

These failures include memory, mainboard, or any other unspecified system hardware errors.

Any of the below failures will require multiple troubleshooting steps, resulting in downtime to the system. For these hosts, you must put the host into a maintenance window in icinga.

  • Pull up the host and its .mgmt interface in icinga & set the system and mgmt interface to maint mode (no checks/alarms on services) for 5 business days (this will often result in 7 calendar days due to weekends). Set the comment for this change to point to the troubleshooting phabricator task.
    • Please note if a system cannot be offline that long, the person filing the task will need to coordinate with the on-site engineer for the datacenter in question. This coordination is the responsibility of the person filing the task. IE: They need to provide acceptable maintenance windows on the task.
  • Poll syslog and/or /var/log/messages for error messages & paste them into the task comments (see the sketch after this list).
  • Pull the hardware failure logs from the system ilom & paste into task comments. (Directions on how to do that below.)
  • Put task in the 'Hardware Repair / Troubleshooting' column for the appropriate onsite phabricator queue and the on-site engineer for that site will triage its repair.
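
For the syslog step, a quick filter for the usual memory and machine-check messages (adjust the patterns as needed; this only covers kernel messages since the last boot):

# Kernel messages mentioning machine checks or ECC/EDAC memory errors,
# for pasting into the task.
sudo dmesg -T | grep -iE 'mce|edac|ecc|memory error'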

Hardware Failure Log Gathering

  • Please note the following directions require root and mgmt access to the system in question. This limits the steps to the SRE team and very few outside of it.
  • The system hardware log sometimes shows disk failures, and sometimes does not. There are additional log gathering steps later in the runbook for disk logging.

Dell Systems

  • login via ssh to the mgmt/drac interface of the system.
    • You will need to have basic shell access to the cluster (via a bastion host) set up; all mgmt ssh access is restricted to bastions and cumin hosts.
    • The mgmt address is hostname.mgmt.sitename.wmnet. Example: bast1001 = bast1001.mgmt.eqiad.wmnet. So ssh root@bast1001.mgmt.eqiad.wmnet to log in to it.
    • You will need to provide the mgmt password, which is stored in pwstore.
  • Once logged in, run: racadm getsel
    • Save the output and paste as comment on the hardware repair/troubleshooting task.

HP Systems

  • Login via https to the mgmt/iLO interface of the system.
    • Logging into the https mgmt interface requires you to set up an https proxy via a cluster management host. You can do this by using a proxy extension (like FoxyProxy) or just setting your browser to route all traffic via localhost:8080 and running the following command (requires shell access to the cluster management host to already be working): ssh -D 8080 cumin1001.eqiad.wmnet.
    • Your browser will warn you that the https certificate is not configured properly (it is self signed by the vendor for the ilom interface) and you will have to click 'Advanced' and then confirm the exception and add the certificate. Once you do this, it will load the ilom login screen.
    • You will need to provide the mgmt password, which is stored in pwstore (https://office.wikimedia.org/wiki/Pwstore).
    • Once https mgmt interface loads, click 'iLO event log', 'View CSV' and then copy/paste the CSV as a comment on the hardware troubleshooting task for the host.

Gathering Support Logs for Warranty Replacement

  • Logging into the https mgmt interface requires you to set up an https proxy via a cluster management host. You can do this by using a proxy extension (like FoxyProxy) or just setting your browser to route all traffic via localhost:8080 and running the following command (requires shell access to the cluster management host to already be working): ssh -D 8080 cumin1001.eqiad.wmnet.
This is how Jaime does it:
  • ssh -L 8080:db2001.mgmt.codfw.wmnet:443 cumin2002.codfw.wmnet (change the target management host and the cluster management according to the right host and datacenter)
  • Point your browser to https://localhost:8080
  • If your browser complains about a wrong certificate, just press "I accept the risk and continue" (traffic to the cluster is ssh-tunneled)
  • Your browser will warn you that the https certificate is not configured properly (it is self signed by the vendor for the ilom interface) and you will have to click 'Advanced' and then confirm the exception and add the certificate. Once you do this, it will load the ilom login screen.
  • You will need to provide the mgmt password, which is stored in pwstore.

Dell Support Assist Report

  • Login to the https interface of the system, using the root mgmt login credentials.
  • What you click next depends on the iLom version:
    • Older versions: navbar on the left; click on Overview > Server > Troubleshooting > Support Assist.
      • Click 'Export SupportAssist Collection'. It will take a few minutes, and then allow you to click Ok to download it to your local machine to attach to the task.
    • Newer versions have the navbar along the top; click on Maintenance > Support Assist.
      • Cancel their automated wizard, as we don't use it.
      • Click on 'Start a Collection' and check 'System Information', 'Storage Logs', 'Debug Logs' and click Ok. It will take 2-10 minutes to run, and then allow you to download it to your local machine to attach to the task. The new systems show a progress meter of the collection service running the report.

HP

  • Login to the https interface of the system, using the root mgmt login credentials.
  • Click on Information > Active Health System Log
    • Enter the date range (usually last 90 days, but longer if needed), and click download.
    • The system will prompt you to download an 'HPE_systemserial_dategenerated.ahs' file; download it and attach it via comment to the task.

Part 2: DC Operations Team Steps

All of the following steps are taken by the DC Ops or onsite engineer for hardware troubleshooting.

Please note that warranty replacements are handled differently depending on the vendor. All of the below steps outline how to test for the hardware failure; you will then need to follow the steps for the vendor in question.

iLom Command Failures

  • Systems will sometimes have drac command failures (failure to reboot a PSU, failure to connect to the serial port).
  • First try a racadm racreset to soft-reset the iDRAC.
  • If that fails, a full power removal of the host can sometimes fix it.
  • Ensure task has noted the system is offline, confirm system offline status before proceeding.
  • Fully remove all power cables, as a full power removal can resolve most iLom issues.
  • Check all cable connections within the chassis.
  • If firmware can be updated, update it across the host.
  • Open case with Vendor for hardware support replacement.

Power Supply Failures

HDD & SSD Failures

  • There is not much to test here, confirm logs and support case info are on the task to submit to the vendor.
  • If the system is in warranty, follow the Warranty Support w/ Vendor steps below (SRE/Dc-operations/Hardware_Troubleshooting_Runbook#Warranty_Support_w/_Vendor) to open a case with the vendor.
    • You'll likely need to update all firmware versions of bios, idrac, and power supplies.
  • If system is out of warranty, it will need to be reviewed for replacement.

All Other Failures

Memory Failures

Warranty Support w/ Vendor

Dell

HP

  • HP Support must be called directly, as there is no self dispatch.

Part 3: All SRE Teams

Getting your server back

Once your system has been repaired and given back to you to be put back in service:

  • Set its Netbox status back to "active"
  • Run the sre.dns.netbox cookbook (same invocation as in the example under "All Hardware" in Part 1)