Management Interfaces

From Wikitech

List of troubleshooting techniques and fixes for the most common IPMI and management interface issues.

General advice

How to execute remote IPMI commands

SSH into one of the hosts with the cluster::management Puppet role applied (cumin[12]001 at the time of writing) and run ipmitool. It will ask for a password, which is stored in pwstore.
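The remote commands on this page all share the same ipmitool options, so they can be wrapped in a small function. This is a minimal sketch: `mgmt_fqdn` and `ipmi_remote` are hypothetical helper names, not existing tools, and the options are simply the ones used in the examples on this page.

```shell
# Hypothetical helpers, not part of any installed tooling.

# Build the management FQDN from a short host name and a datacenter.
mgmt_fqdn() {
    printf '%s.mgmt.%s.wmnet\n' "$1" "$2"
}

# Run an ipmitool subcommand against a host's management interface.
# Usage: ipmi_remote HOST DC SUBCOMMAND...
ipmi_remote() {
    local host="$1" dc="$2"
    shift 2
    sudo ipmitool -I lanplus -H "$(mgmt_fqdn "$host" "$dc")" -U root -E "$@"
}
```

For example, `ipmi_remote db2063 codfw chassis power status` would run the power status check against db2063.mgmt.codfw.wmnet.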

Troubleshooting Commands

Does IPMI work locally?

SSH into the host (not mgmt) and run:

sudo ipmi-chassis --get-chassis-status

The typical error is: ipmi_cmd_get_chassis_status: internal system error or driver timeout

Does IPMI work remotely?

Execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status

The typical failing error is: Error: Unable to establish IPMI v2 / RMCP+ session

If it fails very quickly:

It might be that the IPMI password has gone out of sync with the host one (was the host rebooted recently?). See "Set the IPMI password" below for how to set it again.
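The two typical error messages quoted above can be told apart mechanically when checking many hosts. A minimal sketch, assuming the command output has been captured into a variable; `classify_ipmi_error` and the labels it prints are hypothetical.

```shell
# Hypothetical helper: map captured command output to a coarse label.
# Only the two error strings quoted on this page are recognised.
classify_ipmi_error() {
    case "$1" in
        *"internal system error or driver timeout"*)
            echo "local-ipmi-failure" ;;
        *"Unable to establish IPMI v2 / RMCP+ session"*)
            echo "remote-session-failure" ;;
        *)
            echo "unknown" ;;
    esac
}
```

It could be used as `out="$(some-ipmi-command 2>&1)" || classify_ipmi_error "$out"`, where `some-ipmi-command` stands for one of the checks above.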

Is remote IPMI enabled?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff

If there is no output it means that remote IPMI is enabled and this configuration is good. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.
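The diff, commit and verify cycle described above can be sketched as one function. It relies only on the statement above that empty --diff output means the configuration is good. The function name and the `IPMI_CONFIG_CMD` override variable are hypothetical; the override exists so the logic can be tried without touching a real BMC.

```shell
# Hypothetical helper. IPMI_CONFIG_CMD is an assumed override hook;
# by default the real command from this page is used.
ensure_remote_ipmi_enabled() {
    local cfg="${IPMI_CONFIG_CMD:-sudo ipmi-config}"
    local pending
    set -- --section=Lan_Channel \
        --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" \
        --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available"
    # $cfg is intentionally unquoted so "sudo ipmi-config" splits into words.
    pending="$($cfg "$@" --diff)"
    if [ -z "$pending" ]; then
        echo "remote IPMI already enabled"
        return 0
    fi
    echo "applying changes:"
    echo "$pending"
    $cfg "$@" --commit || return 1
    # Verify: a second diff should now be empty.
    [ -z "$($cfg "$@" --diff)" ]
}
```

The function returns non-zero if the commit fails or if a diff is still reported after the commit.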

See the section below for how to do the same from the web mgmt interface if no local host access is available (e.g. a new host).

Are IPMI permissions set correctly?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff

If there is no output it means that the user is configured correctly. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.

Is there any overriding for next boot?

The BIOS Boot_Device is managed by Puppet, so it should be correct. This can be validated using the custom ipmi_chassis fact. A clean configuration should return NO-OVERRIDE for the ipmi_chassis.boot_flags.device fact:

$ sudo facter -p ipmi_chassis.boot_flags.device

Alternatively, execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5

Typical output for clean settings with no overrides is:

Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0000000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

In case of overrides, the Boot parameter data bitmask will be different from 0000000000 and the lines below it will show the overridden values.
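The bitmask check described above can be scripted. A minimal sketch: `bootparam_is_clean` is a hypothetical helper that reads the output of the bootparam command on stdin and succeeds only when the data line is all zeroes.

```shell
# Hypothetical helper: reads `chassis bootparam get 5` output on stdin.
# Exit status 0 means no override (data bitmask is 0000000000).
bootparam_is_clean() {
    grep -q '^Boot parameter data: 0000000000[[:space:]]*$'
}
```

Piping the remote command into it, as in `... chassis bootparam get 5 | bootparam_is_clean || echo "override present"`, flags hosts that still carry an override.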

The wmf-auto-reimage script automatically checks that the host has set the PXE Boot bit before rebooting it and will print a warning after the reimage if the parameters are not all reset to their default values.

If you need to manually reset it to remove any override, run:

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev none

And then re-check if the change has been applied.

Does IPMI work but SSH to the management console doesn't?

In this case it's possible to reset the management card, see below.

Fix Commands

Set the IPMI password

If you suspect that the IPMI password has got out of sync with the management one, you can set it again. SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 ${PASSWORD}
    
  • for HP hosts:
    set /map1/accounts1/root password=${PASSWORD}
    

Enable remote IPMI access (over LAN) without local host access

For HP ilo4 and lower, the option is under:

Administration > Access Settings > IPMI/DCMI

For HP ilo5, the option is under:

Security > Access Settings > Edit Network settings

Reset the management card

If the management card is unresponsive to IPMI, SSH or ping, but at least one of the following options is available, a card reset can be attempted. It only restarts the card's OS and (in theory) does not affect the underlying host.

From the SSH console

To reset the management card, SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm racreset
    
  • for HP hosts:
    reset /map1
    

From local IPMI

To reset the management card via local IPMI, SSH into the host (not mgmt) and run:

bmc-device --cold-reset; echo $?

It doesn't print anything on success, so the exit code is echoed to confirm that it executed correctly (0 means success).
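For commands that are silent on success, the exit code can be turned into an explicit message. A minimal sketch; `run_and_report` is a hypothetical wrapper, not an existing tool.

```shell
# Hypothetical wrapper: run a command and describe its exit code.
run_and_report() {
    local rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: $*"
    else
        echo "FAILED (exit $rc): $*" >&2
    fi
    return "$rc"
}
```

For example, `run_and_report bmc-device --cold-reset` prints an OK line when the reset command exits with 0 and a FAILED line on stderr otherwise.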

From remote IPMI

To reset the management card via remote IPMI, execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E mc reset cold

For the full list of available commands to manage the management card via remote IPMI see man ipmitool and search for the section titled mc | bmc.
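After a reset the card is expected to be unreachable for a while, so it can be handy to poll until it answers again. A minimal sketch of a generic retry loop; `retry_until` is a hypothetical helper and the attempt count and delay in the usage note are arbitrary values, not recommendations.

```shell
# Hypothetical helper: retry a command until it succeeds.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND...
retry_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry_until 30 10 ping -c 1 "$HOST.mgmt.$DC.wmnet"` would try a single ping up to 30 times, 10 seconds apart, and return 0 as soon as one succeeds.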

Power drain the host

For a full cold reset the host must be shut down and the power cables removed (drain). This is usually the last resort when ping, IPMI and SSH are all failing and every other attempt to fix it has failed.

Change the mgmt password

cookbook

To change the mgmt (SSH) password there is a Spicerack cookbook called sre.hosts.ipmi-password-reset. Connect to a cumin server and run it like:

@cumin1001:~$ sudo cookbook sre.hosts.ipmi-password-reset 'db2*'

In this example we run it on all db hosts in codfw by matching host names with a wildcard. You will be asked to enter the current mgmt password, followed twice by the new password.

ipmitool

After using the cookbook above there might be some failures to "establish an IPMI session". Usually these are HP hosts and it depends on their ILO version.

The next step is to directly run ipmitool yourself. Example:

ipmitool -I lanplus -H db2063.mgmt.codfw.wmnet -U root -E user set password 1 <password> 16

Note that you use the mgmt interface name, not the server name.

Replace <password> with the new password. You will be asked for the current (old) password interactively or you can use -f to read it from a file.

user slot

The "1" before the password means we are using user slot 1. Dell servers usually use slot 2 and HP servers usually use slot 1. To be sure, always check whether your password change was successful by connecting via SSH to the mgmt interface.

You can also use this command to list the user slots, for example:

ipmitool -I lanplus -H ms-be1039.mgmt.eqiad.wmnet -U root -E user list

The ID column in the output of this command should match your slot number.
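Looking up the slot can be scripted. This is a minimal sketch that assumes the user list output is a whitespace-separated table with the numeric ID in the first column and the user name in the second; `user_slot` is a hypothetical helper.

```shell
# Hypothetical helper: reads `ipmitool ... user list` output on stdin and
# prints the ID (slot) of the first row whose Name column equals USER.
# Assumes ID is column 1 and Name is column 2. Exits non-zero if not found.
user_slot() {
    awk -v user="$1" '
        $1 ~ /^[0-9]+$/ && $2 == user { print $1; found = 1; exit }
        END { exit !found }
    '
}
```

Piping the user list command into `user_slot root` would print the slot number to pass to `user set password`.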

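The "16" after the password is the password size argument: ipmitool accepts 16 or 20 (bytes), and it caps the length of the stored password rather than setting a minimum.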

running ipmitool on multiple servers

You can make a simple text file with the list of failed servers (taken from the output of the cookbook, but expanding regular expressions into a list of all FQDNs) and then run ipmitool in a loop, for example:

[cumin1001:~] $ for host in $(cat failures); do echo "$host"; ipmitool -I lanplus -H "$host" -U root -E user set password 1 <PASSWORD> 16; sleep 1; done

Here "failures" is the text file with host names. "1" is the user slot and "16" is the password length again. Replace <PASSWORD> with the actual password. You will be asked for the current password interactively for each host or you can use -f to provide it from a file.
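Because ipmitool must be pointed at the mgmt name rather than the server name, the list can be normalised while looping. A minimal sketch, assuming server FQDNs have the form HOST.DC.wmnet so that inserting "mgmt" after the first label gives the management name; `to_mgmt_fqdn` and `for_each_failed_host` are hypothetical helpers.

```shell
# Hypothetical helper: turn HOST.DC.wmnet into HOST.mgmt.DC.wmnet.
# Names that already contain ".mgmt." are passed through unchanged.
to_mgmt_fqdn() {
    local fqdn="$1"
    case "$fqdn" in
        *.mgmt.*) printf '%s\n' "$fqdn" ;;
        *.*)      printf '%s.mgmt.%s\n' "${fqdn%%.*}" "${fqdn#*.}" ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: call COMMAND once per host, passing the mgmt FQDN
# as the last argument. The file is read on fd 3 so stdin stays free for
# interactive password prompts.
for_each_failed_host() {
    local file="$1" host mgmt
    shift
    while IFS= read -r host <&3; do
        [ -n "$host" ] || continue
        mgmt="$(to_mgmt_fqdn "$host")" || continue
        "$@" "$mgmt"
    done 3< "$file"
}
```

For example, after defining `list_users() { ipmitool -I lanplus -H "$1" -U root -E user list; }`, running `for_each_failed_host failures list_users` would list the user slots on every host in the file.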

racadm (Dell)

There might be some cases where IPMI is not working, because IPMI over LAN is disabled in BIOS or because it needs a reset. In these cases you might get failures using ipmitool but you can still ssh to the mgmt interface using an existing/old password. If it's a Dell server you can change the password there using:

racadm set iDRAC.Users.2.Password <Password>

Where <Password> needs to be replaced and the number "2" refers to the same slot number referred to in the ipmitool section above. (If it doesn't work with slot 2, check if it's slot 1).

HP ILO

If it's an HP server with a newer ILO version, SSH to the mgmt interface and run:

set /map1/accounts1/root password=<Password>


Alternatively it's possible to change the password via a web browser UI.

tunnel to web interface (on a HP)

Create an SSH tunnel through a cumin host, for example:

ssh -L 8000:db2056.mgmt.codfw.wmnet:443 cumin2001.codfw.wmnet

and keep the connection open.

In a browser connect to https://localhost:8000/ and create an exception for the certificate error.
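The tunnel command follows a fixed pattern, so it can be generated per host. A minimal sketch; `mgmt_tunnel_cmd` is a hypothetical helper that only prints the command, and the default local port 8000 is taken from the example above.

```shell
# Hypothetical helper: print the ssh tunnel command for a mgmt web UI.
# Usage: mgmt_tunnel_cmd MGMT_FQDN CUMIN_HOST [LOCAL_PORT]
mgmt_tunnel_cmd() {
    local mgmt="$1" jump="$2" port="${3:-8000}"
    printf 'ssh -L %s:%s:443 %s\n' "$port" "$mgmt" "$jump"
}
```

Choosing a different local port per host allows several tunnels to stay open at once. Adding ssh's -N option keeps the tunnel open without starting a remote shell.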

Other useful ipmitool commands

Force PXE boot

ipmitool -I lanplus -H "$hostname" -U root -E chassis bootdev pxe

Show boot parameter

ipmitool -I lanplus -H "$hostname" -U root -E chassis bootparam get 5