Management Interfaces

From Wikitech

List of troubleshooting techniques and fixes for the most common IPMI and management interface issues.

General advice

How to execute remote IPMI commands

SSH into one of the hosts with the cluster::management Puppet role applied (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet) and run ipmitool. It will ask for the management password, which is stored in pwstore.
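
As a minimal sketch of the invocation pattern used throughout this page, the remote commands can be wrapped in a small shell helper. The function names mgmt_fqdn and remote_ipmi are hypothetical and not part of any existing tooling; the sketch assumes management interfaces follow the $HOST.mgmt.$DC.wmnet naming used in the examples on this page.

```shell
# Hypothetical helpers, not part of any existing tooling.

# Build the management FQDN from a short host name and a datacenter,
# assuming the $HOST.mgmt.$DC.wmnet naming convention.
mgmt_fqdn() {
    printf '%s.mgmt.%s.wmnet\n' "$1" "$2"
}

# Run an arbitrary ipmitool subcommand against a host's management interface.
# -E makes ipmitool look for the password in the IPMI_PASSWORD environment
# variable; as described above, it asks for the password when run by hand.
remote_ipmi() {
    ipmi_host=$(mgmt_fqdn "$1" "$2")
    shift 2
    sudo ipmitool -I lanplus -H "$ipmi_host" -U root -E "$@"
}

# Example (run from a cumin host):
#   remote_ipmi db2063 codfw chassis power status
```

Keeping the FQDN construction in one place avoids accidentally targeting the host itself instead of its management interface.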

Looking for racadm/hpiLO commands?

Then you're most likely in the wrong place; check out SRE/Dc-operations/Platform-specific documentation.

Troubleshooting Commands

Does IPMI work locally?

SSH into the host (not mgmt) and run:

sudo ipmi-chassis --get-chassis-status

The typical error is: ipmi_cmd_get_chassis_status: internal system error or driver timeout

Does IPMI work remotely?

Execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status

The typical failing error is: Error: Unable to establish IPMI v2 / RMCP+ session

If it fails very quickly:

It might be that the IPMI password has gone out of sync with the host one (was the host rebooted recently?), see below on how to set it again.

Is remote IPMI enabled?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff

If there is no output it means that remote IPMI is enabled and this configuration is good. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.
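
The diff, commit, verify cycle described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name ensure_section_keys and the IPMI_CONFIG override variable are hypothetical, and the exit status of --diff is deliberately ignored so that only its output decides whether a commit is needed.

```shell
# Command used to reach the local BMC. The override exists only so the
# logic can be exercised without real hardware (assumption of this sketch).
IPMI_CONFIG=${IPMI_CONFIG:-"sudo ipmi-config"}

# ensure_section_keys SECTION KEY=VALUE [KEY=VALUE...]
# Runs --diff first, commits only when a diff is reported, then verifies.
ensure_section_keys() {
    section=$1
    shift
    # Rewrite each KEY=VALUE argument into --key-pair=SECTION:KEY=VALUE.
    for kv in "$@"; do
        set -- "$@" "--key-pair=${section}:${kv}"
        shift
    done
    # Only the output is inspected; the exit status of --diff is ignored.
    pending=$($IPMI_CONFIG --section="$section" "$@" --diff) || true
    if [ -z "$pending" ]; then
        echo "${section}: already configured"
        return 0
    fi
    echo "${section}: applying changes:"
    echo "$pending"
    $IPMI_CONFIG --section="$section" "$@" --commit || return 1
    # An empty diff after the commit means the new values were stored.
    [ -z "$($IPMI_CONFIG --section="$section" "$@" --diff)" ]
}

# Example:
#   ensure_section_keys Lan_Channel \
#       Volatile_Access_Mode=Always_Available \
#       Non_Volatile_Access_Mode=Always_Available
```

The same function can be pointed at other keys of the same section, since it only changes values that --diff reports as different.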

On some HP systems, this check can suggest the configuration is good even when in fact it isn't. To confirm this, visit the web mgmt interface, select "Security" from the left-hand menu, and check in the "Network" box that "IPMI/DCMI over LAN" is enabled; if not, click the Pencil to go to the Edit interface, check the appropriate box, and scroll to the bottom to click "OK" to apply the change.

See the section below for how to do the same from the web mgmt interface when no local host access is available (e.g. a new host).

Are IPMI permissions set correctly?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff

If there is no output it means that the user is configured correctly. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.

Did you do a reset but are still getting an IPMI connection failure (when using the reimage cookbook)?

Try logging in on mgmt with the regular mgmt password, and then re-set the same password with:

racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 <password>

On some iDRAC firmware versions this command is rejected with a message like:

"racadm config" entered is not supported on iDRAC "4.40.00.00"

In that case try the racadm set form shown in the "racadm (Dell)" section of this page.

Then try again. In at least one case this fixed a remote IPMI connection failure when running the reimage cookbook.

If that fails, reset the iDRAC to factory defaults and run the provision cookbook.

Is there any overriding for next boot?

The BIOS Boot_Device is managed by Puppet, so it should be correct. This can be validated using the custom ipmi_chassis fact: a clean configuration returns NO-OVERRIDE for the ipmi_chassis.boot_flags.device fact.

$ sudo facter -p ipmi_chassis.boot_flags.device

Alternatively, one can execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5

Typical output for clean settings with no overrides is:

Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0000000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

In case of overrides, the Boot parameter data bitmask will differ from 0000000000 and the lines below it will show the overridden values.
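
The check on the Boot parameter data line can be scripted. The following is a minimal sketch, assuming the output format shown above; the function name boot_override_state is hypothetical.

```shell
# Classify the output of "chassis bootparam get 5" read from stdin.
# Prints "clean" when the Boot parameter data is all zeros, "override" when
# any other digit is present, and "unknown" when the line is not found.
boot_override_state() {
    data=$(sed -n 's/^Boot parameter data: *\([0-9A-Fa-f]*\).*/\1/p' | head -n 1)
    case "$data" in
        "")     echo unknown ;;
        *[!0]*) echo override ;;
        *)      echo clean ;;
    esac
}

# Example (run from a cumin host):
#   sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E \
#       chassis bootparam get 5 | boot_override_state
```

Reporting "unknown" separately keeps a failed IPMI connection from being mistaken for a clean configuration.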

The sre.hosts.reimage cookbook script automatically checks that the host has set the PXE Boot bit before rebooting it and will print a warning after the reimage if the parameters are not all reset to their default values.

If you need to manually reset it to remove any override, run:

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev none

And then re-check if the change has been applied.

Does IPMI work but SSH to the management console doesn't?

In this case it's possible to reset the management card, see below.

If you are receiving an error message like the following

Unable to negotiate with UNKNOWN port 65535: no matching key exchange method found. Their offer:
diffie-hellman-group14-sha1,diffie-hellman-group1-sha1

or 

debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(2048<8192<8192) sent
debug1: got SSH2_MSG_KEX_DH_GEX_GROUP
debug2: bits set: 3952/8192
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
Received disconnect from UNKNOWN port 65535:11: Logged out.
Disconnected from UNKNOWN port 65535

it means that the SSH release running on the management console is too old to interoperate with current SSH client defaults. A firmware update will fix this by providing a more recent OpenSSH. Alternatively you can force older ciphers/key exchanges in your SSH client using -oKexAlgorithms=diffie-hellman-group14-sha1 -oCiphers=aes256-cbc.
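
As a convenience, the client-side workaround can be wrapped in a small function. This is a sketch only: the name mgmt_ssh_legacy is hypothetical, and logging in as root is an assumption based on the user name used elsewhere on this page.

```shell
# Hypothetical wrapper: SSH to a management console that only offers
# legacy key exchange methods. Arguments: short host name, datacenter.
# The "root" login is an assumption of this sketch.
mgmt_ssh_legacy() {
    ssh -oKexAlgorithms=diffie-hellman-group14-sha1 \
        -oCiphers=aes256-cbc \
        "root@$1.mgmt.$2.wmnet"
}

# Example:
#   mgmt_ssh_legacy db2056 codfw
```

Limiting the legacy options to a dedicated wrapper keeps them out of the defaults used for every other SSH connection.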

Is the management password wrong?

It could happen that during provisioning the wrong management password was set into the management card. To verify if it's the correct one execute the following.

  • First find which user is enabled, usually it's User2 for DELL and User1 for HP but sometimes might be a different one.
    sudo ipmi-config -g core -S User2 -e "User2:Enable_User=Yes" --diff
    
    This should return nothing and exit with 0. If there is a diff, the user is not enabled; try User1 or higher-numbered users.
  • Then to check if the password is the correct one do the following:
    # At the read step, type the management password (it is not echoed)
    $ sudo -i
    # read -s MGMT_PASSWORD
    # # Adapt the UserN to the actual one
    # ipmi-config -g core -S User2 -e "User2:Password=${MGMT_PASSWORD}" --diff
    User2:Password - input=`test':actual=`<something else>'
    #
    
    If the output is empty and the command exits with 0, the password is the correct one; otherwise a diff like the one shown above will appear. In that case see the Management Interfaces#Local IPMI section below.
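
The search for the enabled user described in the first step can be sketched as a loop. The function name find_enabled_user, the IPMI_CONFIG override variable and the default upper bound of 4 slots are assumptions made for illustration.

```shell
# Command used to reach the local BMC. The override exists only so the
# logic can be exercised without real hardware (assumption of this sketch).
IPMI_CONFIG=${IPMI_CONFIG:-"sudo ipmi-config"}

# find_enabled_user [MAX]
# Tries User1..UserMAX (default 4, an arbitrary bound) and prints the first
# one for which the Enable_User=Yes diff is empty and the exit code is 0.
find_enabled_user() {
    max=${1:-4}
    n=1
    while [ "$n" -le "$max" ]; do
        if out=$($IPMI_CONFIG -g core -S "User$n" -e "User$n:Enable_User=Yes" --diff) \
            && [ -z "$out" ]; then
            echo "User$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Example:
#   find_enabled_user || echo "no enabled user found in the first slots"
```

The printed section name can then be substituted for User2 in the password check.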

Fix Commands

Set the IPMI password

If you suspect that the IPMI password has gone out of sync with the management one, you can set it again. SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 ${PASSWORD}
    
  • for HP hosts:
    set /map1/accounts1/root password=${PASSWORD}
    

Enable remote IPMI access (over LAN) without local host access

Dell

From the racadm console check it with:

racadm>>get iDRAC.IPMILan.Enable
[Key=iDRAC.Embedded.1#IPMILan.1]
Enable=Enabled

Modify it with:

racadm>>set iDRAC.IPMILan.Enable 1
[Key=iDRAC.Embedded.1#IPMILan.1]
Object value modified successfully

HP

For HP ilo4 and lower, the option is under:

Administration > Access Settings > IPMI/DCMI

For HP ilo5, the option is under:

Security > Access Settings > Edit Network settings

Reset the management card

If the management card is unresponsive to IPMI, SSH, or ping, but at least one of the following options is still available, a card reset can be attempted. It only restarts the card OS and should not (in theory) affect the underlying host.

After running any of the commands below, wait at least a couple of minutes for the card OS to restart before proceeding with any test. To verify that it has restarted, try to SSH into the management interface.
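
Waiting for the card to come back can be automated with a generic retry loop. This is a sketch under stated assumptions: the function name retry is hypothetical, and the probe shown in the example comment (a TCP check of the SSH port with nc) is only one possible way to test reachability.

```shell
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between attempts. Returns 0 on the first success, 1 otherwise.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # No need to sleep after the last failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example: poll the mgmt SSH port every 15 seconds for up to 5 minutes.
# The nc-based probe is an assumption; any command that exits 0 once the
# card is reachable works.
#   retry 20 15 nc -z -w 5 "$HOST.mgmt.$DC.wmnet" 22
```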

From the SSH console

To reset the management card, SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm racreset
    
  • for HP hosts:
    reset /map1
    

From local IPMI

To reset the management card via local IPMI, SSH into the host (not mgmt) and run:

bmc-device --cold-reset; echo $?

The command prints nothing on success, so the exit code is echoed to confirm that it ran correctly (0 means success).

From remote IPMI

To reset the management card via remote IPMI, execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E mc reset cold

For the full list of available commands to manage the management card via remote IPMI see man ipmitool and search for the section titled mc | bmc.

Power drain the host

For a full cold reset the host must be shut down and the power cables removed (drain). This is usually the last resort when ping, IPMI and SSH are all failing and any other attempt to fix it didn't work.

Change the mgmt password

cookbook

To change the mgmt (SSH) password there is a Spicerack cookbook called sre.hosts.ipmi-password-reset. Connect to a cumin server and run it like:

@cumin1001:~$ sudo cookbook sre.hosts.ipmi-password-reset 'db2*'

In this example we run it on all db hosts in codfw, selected by host name with a wildcard. You will be asked to enter the current mgmt password, followed twice by the new password.

ipmitool

After using the cookbook above there might be some failures to "establish an IPMI session". Usually these are HP hosts, and it depends on their iLO version.

The next step is to directly run ipmitool yourself. Example:

ipmitool -I lanplus -H db2063.mgmt.codfw.wmnet -U root -E user set password 1 <password> 16

Note that you must use the mgmt interface name, not the server name.

Replace <password> with the new password. You will be asked for the current (old) password interactively or you can use -f to read it from a file.

user slot

The "1" before the password means we are using user slot 1. Dell servers usually use slot 2 and HP servers usually use slot 1. To be sure always check if your password change was successful by connecting via ssh to the mgmt interface.

You can also use this command to list the user slots, for example:

ipmitool -I lanplus -H ms-be1039.mgmt.eqiad.wmnet -U root -E user list

The ID column in the output of this command should match your slot number.
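
Looking up the slot for a given user name can be sketched with awk. This assumes that the user list output starts with a header line and that ID is the first column and the user name the second; the function name slot_for_user is hypothetical.

```shell
# Print the slot ID of a named user from "ipmitool ... user list" output
# read on stdin. Assumes a header line, then ID in column 1 and the user
# name in column 2. Prints nothing if the name is not found.
slot_for_user() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1; exit }'
}

# Example:
#   ipmitool -I lanplus -H ms-be1039.mgmt.eqiad.wmnet -U root -E user list \
#       | slot_for_user root
```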

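The "16" after the password is the password size argument: ipmitool accepts 16 or 20 here, selecting whether the password is stored as a 16-byte or 20-byte value, so it bounds the maximum (not minimum) password length.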

running ipmitool on multiple servers

You can make a simple text file with the list of failed servers (taken from the output of the cookbook, but expanding regular expressions into a list of all FQDNs) and then run ipmitool in a loop, for example:

[cumin1001:~] $ for host in $(cat failures); do echo "$host"; ipmitool -I lanplus -H "$host" -U root -E user set password 1 '<PASSWORD>' 16; sleep 1; done

Here "failures" is the text file with host names, "1" is the user slot, and "16" is the password length argument, as above. Replace <PASSWORD> with the actual password. You will be asked for the current password interactively for each host, or you can use -f to provide it from a file.
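
Since ipmitool must be pointed at the mgmt interface name rather than the server name, the entries of the "failures" file may need converting first. The following is a minimal sketch, assuming host names of the form <name>.<dc>.wmnet; the function name to_mgmt_fqdn is hypothetical.

```shell
# Convert a host FQDN such as db2063.codfw.wmnet into its management
# FQDN db2063.mgmt.codfw.wmnet. Names that do not match the assumed
# <name>.<dc>.wmnet pattern (including names that are already mgmt
# FQDNs) are printed unchanged.
to_mgmt_fqdn() {
    printf '%s\n' "$1" | sed 's/^\([^.]*\)\.\([^.]*\)\.wmnet$/\1.mgmt.\2.wmnet/'
}

# Example: rewrite the whole list before looping over it.
#   while read -r host; do to_mgmt_fqdn "$host"; done < failures > failures.mgmt
```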

racadm (Dell)

There might be some cases where IPMI is not working, because IPMI over LAN is disabled in BIOS or because it needs a reset. In these cases you might get failures using ipmitool but you can still ssh to the mgmt interface using an existing/old password. If it's a Dell server you can change the password there using:

racadm set iDRAC.Users.2.Password <Password>

Where <Password> needs to be replaced and the number "2" refers to the same slot number referred to in the ipmitool section above. (If it doesn't work with slot 2, check if it's slot 1).

HP ILO

If it's an HP server with a newer iLO version, SSH to the mgmt interface and run:

set /map1/accounts1/root password=<Password>


Alternatively it's possible to change the password via a web browser UI.

tunnel to web interface (on an HP)

Create an ssh tunnel to jump via a cumin host, example:

ssh -L 8000:db2056.mgmt.codfw.wmnet:443 cumin2001.codfw.wmnet

and keep the connection open.

In a browser connect to https://localhost:8000/ and create an exception for the certificate error.

Local IPMI

Use the same procedure described in Management Interfaces#Is the management password wrong? just replacing --diff with --commit in the command to check the password.

Other useful ipmitool commands

Force reboot

Force a host reboot, equivalent to "pulling the plug". This is not graceful and should be used only if the host is completely unreachable by all other means.

ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power cycle

Force PXE boot

Forces the host to boot from PXE at the next reboot:

ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev pxe

Show boot parameter

ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5