Management Interfaces
List of troubleshooting techniques and fixes for the most common IPMI and management interface issues.
General advice
How to execute remote IPMI commands
SSH into one of the hosts with the cluster::management Puppet role applied (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet) and run ipmitool. It will ask for the management password, which is stored in pwstore.
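As a convenience, the remote invocation used throughout this page can be wrapped in a small shell helper. This is only a sketch: the function names mgmt_fqdn and remote_ipmi are hypothetical and not part of any existing tooling, and the flags are the same ones used in the commands on this page.

```shell
# Build the management FQDN from a short host name and a datacenter,
# e.g. "db2063" and "codfw" -> "db2063.mgmt.codfw.wmnet".
mgmt_fqdn() {
    printf '%s.mgmt.%s.wmnet\n' "$1" "$2"
}

# Hypothetical wrapper: run any ipmitool subcommand against the
# management interface of a host. The -E flag tells ipmitool to take
# the password from the IPMI_PASSWORD environment variable.
remote_ipmi() {
    local host="$1" dc="$2"
    shift 2
    sudo ipmitool -I lanplus -H "$(mgmt_fqdn "$host" "$dc")" -U root -E "$@"
}
```

For example, remote_ipmi db2063 codfw chassis power status runs the same command shown in the "Does IPMI work remotely?" section.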
Looking for racadm/hpiLO commands?
Then you're most likely in the wrong place; check out SRE/Dc-operations/Platform-specific documentation.
Serial Console
After logging into the management interface using SSH, you can start the serial console redirection using:
-> cd /system1/sol1/
-> start
And exit it with <enter> <esc> <T> (hit the keys in sequence, not together, and that's a capital T so use shift).
Troubleshooting Commands
Does IPMI work locally?
SSH into the host (not mgmt) and run:
sudo ipmi-chassis --get-chassis-status
The typical error is: ipmi_cmd_get_chassis_status: internal system error or driver timeout
Does IPMI work remotely?
Execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):
sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status
The typical failing error is: Error: Unable to establish IPMI v2 / RMCP+ session
If it fails very quickly:
It might be that the IPMI password has gone out of sync with the host one (was the host rebooted recently?), see below on how to set it again.
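When checking several hosts, it can help to classify the result of the command above instead of reading it by eye. The following is a minimal sketch: classify_ipmi_output is a hypothetical name, the failure string is the one quoted above, and the "Chassis Power is on/off" strings are assumed to be what ipmitool prints on success.

```shell
# Classify the combined stdout/stderr of "chassis power status".
# $1 = the captured output of the ipmitool call.
classify_ipmi_output() {
    case "$1" in
        *"Unable to establish IPMI v2 / RMCP+ session"*) echo "session-failed" ;;
        *"Chassis Power is on"*)  echo "power-on" ;;
        *"Chassis Power is off"*) echo "power-off" ;;
        *) echo "unknown" ;;
    esac
}
```

A possible use is classify_ipmi_output "$(sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status 2>&1)".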
Is remote IPMI enabled?
SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):
sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff
If there is no output it means that remote IPMI is enabled and this configuration is good. If there is any diff shown it means that the current values are not the correct ones.
Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.
On some HP systems, this check can suggest the configuration is good even when in fact it isn't. To confirm this, visit the web mgmt interface, select "Security" from the left-hand menu, and check in the "Network" box that "IPMI/DCMI over LAN" is enabled; if not, click the Pencil to go to the Edit interface, check the appropriate box, and scroll to the bottom to click "OK" to apply the change.
See the section below for how to do the same from the web mgmt interface if no local host access is available (e.g. a new host).
Are IPMI permissions set correctly?
SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):
sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff
If there is no output it means that the user is configured correctly. If there is any diff shown it means that the current values are not the correct ones.
Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.
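The diff, commit and verify cycle used for both Lan_Channel checks above could be scripted roughly as follows. This is a sketch under assumptions: ensure_lan_channel and lan_channel are hypothetical helper names, and the IPMI_CONFIG variable is a hypothetical override added only so the logic can be tried without a real BMC. The ipmi-config options are the ones shown above.

```shell
# Hypothetical override; defaults to the command used on this page.
IPMI_CONFIG="${IPMI_CONFIG:-sudo ipmi-config}"

# lan_channel MODE KEY VALUE
# MODE is --diff or --commit; KEY is the key suffix shared by the
# Volatile_ and Non_Volatile_ variants (e.g. Access_Mode).
lan_channel() {
    $IPMI_CONFIG --section=Lan_Channel \
        --key-pair="Lan_Channel:Volatile_$2=$3" \
        --key-pair="Lan_Channel:Non_Volatile_$2=$3" \
        "$1" 2>&1
}

# ensure_lan_channel KEY VALUE
# No diff output means the setting is already correct. Otherwise
# commit the values and run the diff again to verify.
ensure_lan_channel() {
    if [ -z "$(lan_channel --diff "$1" "$2")" ]; then
        echo "ok: $1 is already $2"
        return 0
    fi
    lan_channel --commit "$1" "$2" >/dev/null
    if [ -z "$(lan_channel --diff "$1" "$2")" ]; then
        echo "fixed: $1 set to $2"
    else
        echo "failed: $1 still differs from $2" >&2
        return 1
    fi
}
```

With these helpers, the two checks become ensure_lan_channel Access_Mode Always_Available and ensure_lan_channel Channel_Privilege_Limit Administrator.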
Did you do a reset but are still getting an IPMI connection failure (when using the reimage cookbook)?
Try logging in on mgmt with the regular mgmt password, and then re-set the same password with the command below.
Note: "racadm config" is not supported on iDRAC "4.40.00.00".
racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 <password>
Then try again. We have had at least one case where this fixed a remote IPMI connection failure when running the reimage cookbook.
If that fails, reset the iDRAC to factory defaults and run the provision cookbook.
Is there any overriding for next boot?
The BIOS Boot_Device is managed by Puppet so it should be correct. This can be validated using the custom ipmi_chassis fact. A clean configuration should return NO-OVERRIDE for the ipmi_chassis.boot_flags.device fact:
$ sudo facter -p ipmi_chassis.boot_flags.device
Alternatively one can execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):
sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5
A typical output for a clean configuration that has no overrides is:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0000000000
Boot Flags :
- Boot Flag Invalid
- Options apply to only next boot
- BIOS PC Compatible (legacy) boot
- Boot Device Selector : No override
- Console Redirection control : System Default
- BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
- BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
In case of overrides the Boot parameter data bitmask will be different from 0000000000 and the lines below it will show the overridden values.
The sre.hosts.reimage cookbook automatically checks that the host has the PXE boot bit set before rebooting it, and prints a warning after the reimage if the parameters are not all reset to their default values.
To manually remove any override, run:
sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev none
And then re-check if the change has been applied.
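The re-check can be automated by looking at the Boot parameter data line described above. The following is a minimal sketch: boot_override_state is a hypothetical helper name, and it only inspects that one line of the bootparam get 5 output.

```shell
# Reads the output of "chassis bootparam get 5" on stdin and prints:
#   clean    - Boot parameter data is 0000000000 (no override)
#   override - Boot parameter data has any other value
#   unknown  - no Boot parameter data line was found
boot_override_state() {
    local data
    data=$(sed -n 's/^[[:space:]]*Boot parameter data:[[:space:]]*\([0-9A-Fa-f]*\).*/\1/p')
    case "$data" in
        "")         echo "unknown" ;;
        0000000000) echo "clean" ;;
        *)          echo "override" ;;
    esac
}
```

It can be fed directly from the command above, for example sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5 | boot_override_state.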
Does IPMI work but SSH to the management console doesn't?
In this case it's possible to reset the management card, see below.
If you are receiving an error message like the following:
Unable to negotiate with UNKNOWN port 65535: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
or
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(2048<8192<8192) sent
debug1: got SSH2_MSG_KEX_DH_GEX_GROUP
debug2: bits set: 3952/8192
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
Received disconnect from UNKNOWN port 65535:11: Logged out.
Disconnected from UNKNOWN port 65535
it means that the SSH release running on the management console is too old to interoperate with current SSH client defaults. A firmware update will fix this by providing a more recent OpenSSH. Alternatively you can force older ciphers/key exchanges in your SSH client using -oKexAlgorithms=diffie-hellman-group14-sha1 -oCiphers=aes256-cbc.
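To avoid retyping those options, they can be wrapped in a small function. This is a sketch only: mgmt_ssh_legacy is a hypothetical name, and logging in as root is an assumption based on the root account used elsewhere on this page.

```shell
# SSH to a management interface that only offers old key exchange
# methods, using the client options quoted above.
# $1 = management FQDN, e.g. db2063.mgmt.codfw.wmnet
mgmt_ssh_legacy() {
    ssh -oKexAlgorithms=diffie-hellman-group14-sha1 \
        -oCiphers=aes256-cbc \
        "root@$1"
}
```

Keeping the options in a wrapper rather than in the global SSH client configuration limits the weaker algorithms to the hosts that actually need them.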
Is the management password wrong?
It could happen that during provisioning the wrong management password was set on the management card. To verify if it's the correct one, execute the following.
- First find which user is enabled; usually it's User2 for DELL and User1 for HP, but sometimes it might be a different one.
sudo ipmi-config -g core -S User2 -e "User2:Enable_User=Yes" --diff
This should return nothing and exit with 0. If there is a diff it means that the user is not enabled; try with User1 or other higher integers.
- Then to check if the password is the correct one do the following:
# When prompted insert the management password to set
$ sudo -i
# read -s MGMT_PASSWORD
#
# Adapt the UserN to the actual one
# ipmi-config -g core -S User2 -e "User2:Password=${MGMT_PASSWORD}" --diff
User2:Password - input=`test':actual=`<something else>'
#
If the output is empty and the command exits with 0 it means that the password is the correct one, otherwise a diff like the one shown above will appear. In that case see the Management Interfaces#Local IPMI section below.
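The search for the enabled user can be sketched as a loop over the user sections. The names here are assumptions: find_enabled_user is a hypothetical helper, IPMI_CONFIG is a hypothetical override so the logic can be tried without a BMC, and the upper bound of User4 is arbitrary.

```shell
# Hypothetical override; defaults to the command used on this page.
IPMI_CONFIG="${IPMI_CONFIG:-sudo ipmi-config}"

# Print the first UserN section (tried in order User1..User4) for
# which the Enable_User=Yes diff is empty and the command exits 0.
find_enabled_user() {
    local n out
    for n in 1 2 3 4; do
        if out=$($IPMI_CONFIG -g core -S "User$n" \
                    -e "User$n:Enable_User=Yes" --diff 2>&1) \
            && [ -z "$out" ]; then
            echo "User$n"
            return 0
        fi
    done
    echo "no enabled user found in User1..User4" >&2
    return 1
}
```

The printed section name can then be substituted for User2 in the password check.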
Fix Commands
Set the IPMI password
If the IPMI password appears to have gone out of sync with the management one, it can be set again. SSH into the management interface of the host and run:
- for DELL hosts:
racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 ${PASSWORD}
- for HP hosts:
set /map1/accounts1/root password=${PASSWORD}
Note: the racadm config command above doesn't work on iDRAC versions higher than 4.40.00.00; on those Dell hosts use the racadm set command listed in the racadm (Dell) section below.
Enable remote IPMI access (over LAN) without local host access
Dell
From the racadm console check it with:
racadm>>get iDRAC.IPMILan.Enable
[Key=iDRAC.Embedded.1#IPMILan.1]
Enable=Enabled
Modify it with:
racadm>>set iDRAC.IPMILan.Enable 1
[Key=iDRAC.Embedded.1#IPMILan.1]
Object value modified successfully
HP
For HP ilo4 and lower, the option is under:
Administration > Access Settings > IPMI/DCMI
For HP ilo5, the option is under:
Security > Access Settings > Edit Network settings
Reset the management card
If the management card is unresponsive to IPMI, SSH or ping, but at least one of the following options is available, a card reset can be attempted. It only restarts the card OS and (in theory) does not affect the underlying host.
From the SSH console
To reset the management card, SSH into the management interface of the host and run:
- for DELL hosts:
racadm racreset
- for HP hosts:
reset /map1
From local IPMI
To reset the management card via local IPMI, SSH into the host (not mgmt) and run:
bmc-device --cold-reset; echo $?
It doesn't print anything on success, hence the echo of the exit code to check that it executed correctly.
From remote IPMI
To reset the management card via remote IPMI, execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):
sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E mc reset cold
For the full list of available commands to manage the management card via remote IPMI see man ipmitool and search for the section titled mc | bmc.
Power drain the host
For a full cold reset the host must be shut down and the power cables removed (drain). This is usually the last resort when ping, IPMI and SSH are all failing and no other attempt to fix it has worked.
Change the mgmt password
cookbook
To change the mgmt (SSH) password there is a Spicerack cookbook called sre.hosts.ipmi-password-reset. Connect to a cumin server and run it like:
@cumin1001:~$ sudo cookbook sre.hosts.ipmi-password-reset 'db2*'
In this example we are running it on all db hosts in codfw, simply by host name with a wildcard. You will be asked to enter the current mgmt password, followed twice by the new password.
ipmitool
After using the cookbook above there might be some failures to "establish an IPMI session". Usually these are HP hosts and it depends on their iLO version.
The next step is to run ipmitool directly yourself. Example:
ipmitool -I lanplus -H db2063.mgmt.codfw.wmnet -U root -E user set password 1 <password> 16
Note you use the mgmt interface name, not the server name.
Replace <password> with the new password. You will be asked for the current (old) password interactively or you can use -f to read it from a file.
user slot
The "1" before the password means we are using user slot 1. Dell servers usually use slot 2 and HP servers usually use slot 1. To be sure always check if your password change was successful by connecting via ssh to the mgmt interface.
You can also use this command to list the user slots. Example:
ipmitool -I lanplus -H ms-be1039.mgmt.eqiad.wmnet -U root -E user list
The ID column in the output of this command should match your slot number.
The "16" after the password is the password size passed to ipmitool, which accepts 16 or 20 (bytes); it is the maximum length of the stored password, not a minimum.
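To pick the slot number without reading the table by eye, the user list output can be filtered with awk. This is a sketch only: user_slot is a hypothetical helper, and it assumes the usual ipmitool layout with a header line, the ID in the first column and the Name in the second.

```shell
# Reads "ipmitool ... user list" output on stdin and prints the ID
# of the first row whose Name column equals $1.
user_slot() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1; exit }'
}
```

For example, ipmitool -I lanplus -H ms-be1039.mgmt.eqiad.wmnet -U root -E user list | user_slot root prints the slot to pass to user set password. Nothing is printed if no row matches.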
running ipmitool on multiple servers
You can make a simple text file with the list of failed servers (taken from the output of the cookbook, but turning regular expressions into a list of all FQDNs) and then run ipmitool in a loop, for example:
[cumin1001:~] $ for host in $(cat failures) ; do echo $host; ipmitool -I lanplus -H $host -U root -E user set password 1 <PASSWORD> 16; sleep 1; done
Here "failures" is the text file with host names. "1" is the user slot and "16" is the password length again. Replace <PASSWORD> with the actual password. You will be asked for the current password interactively for each host or you can use -f to provide it from a file.
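Since ipmitool needs the mgmt interface name rather than the server name, a list of host FQDNs can be converted before running the loop. The following is a minimal sketch: to_mgmt_names is a hypothetical helper, and it assumes the mgmt name is formed by inserting "mgmt" after the first label, as in db2063.mgmt.codfw.wmnet.

```shell
# Reads host names on stdin, one per line, and prints the matching
# management names. Blank lines are dropped and names that already
# contain ".mgmt." are passed through unchanged.
to_mgmt_names() {
    sed -e '/^[[:space:]]*$/d' \
        -e '/\.mgmt\./!s/^\([^.]*\)\./\1.mgmt./'
}
```

For example, to_mgmt_names < hosts > failures produces a file suitable for the loop above, where hosts is a hypothetical file with one host FQDN per line.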
racadm (Dell)
There might be some cases where IPMI is not working, because IPMI over LAN is disabled in BIOS or because it needs a reset. In these cases you might get failures using ipmitool but you can still ssh to the mgmt interface using an existing/old password. If it's a Dell server you can change the password there using:
racadm set iDRAC.Users.2.Password <Password>
Where <Password> needs to be replaced and the number "2" refers to the same slot number referred to in the ipmitool section above. (If it doesn't work with slot 2, check if it's slot 1).
HP ILO
If it's an HP server with a newer iLO version, ssh to the mgmt interface and run:
set /map1/accounts1/root password=<Password>
Alternatively it's possible to change the password via a web browser UI.
tunnel to web interface (on a HP)
Create an ssh tunnel to jump via a cumin host, example:
ssh -L 8000:db2056.mgmt.codfw.wmnet:443 cumin2001.codfw.wmnet
and keep the connection open.
In a browser connect to https://localhost:8000/ and create an exception for the certificate error.
Local IPMI
Use the same procedure described in Management Interfaces#Is the management password wrong?, just replacing --diff with --commit in the command to check the password.
Other useful ipmitool commands
Force reboot
Force a host reboot "pulling the plug". This is not graceful and should be used only if the host is completely unreachable by all other means.
ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power cycle
Force PXE boot
Force the host to boot from PXE at the next reboot:
ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev pxe
Show boot parameter
ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5