Management Interfaces


A list of troubleshooting techniques and fixes for the most common IPMI and management interface issues.

General advice

How to execute remote IPMI commands

SSH into one of the hosts with the cluster::management Puppet role applied (cumin[12]001 at the time of writing) and run ipmitool. It will ask for a password, which is stored in pwstore.

Troubleshooting Commands

Does IPMI work locally?

SSH into the host (not mgmt) and run:

sudo ipmi-chassis --get-chassis-status

The typical error is: ipmi_cmd_get_chassis_status: internal system error or driver timeout

Does IPMI work remotely?

Execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status

The typical error on failure is: Error: Unable to establish IPMI v2 / RMCP+ session

If it fails very quickly:

The IPMI password might have gone out of sync with the host one (was the host rebooted recently?). See "Set the IPMI password" below for how to set it again.
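
Every remote command on this page shares the same ipmitool prefix, so a small shell wrapper can save retyping it. The following is a minimal sketch, not an existing tool: the function names mgmt_fqdn and remote_ipmi and the RUNNER dry-run hook are hypothetical, and example1001 is a placeholder hostname. The -E option tells ipmitool to read the password from the IPMI_PASSWORD environment variable; as noted above, it asks for the password otherwise.

```shell
# Hypothetical helpers, meant to be sourced on a host with the
# cluster::management Puppet role applied.

# Build the management FQDN from a short hostname and a datacenter.
mgmt_fqdn() {
    printf '%s.mgmt.%s.wmnet\n' "$1" "$2"
}

# Usage: remote_ipmi HOST DC IPMITOOL_ARGS...
# RUNNER is an assumed hook: it defaults to sudo; set RUNNER=echo to only
# print the command instead of running it.
remote_ipmi() {
    ripmi_host=$1
    ripmi_dc=$2
    shift 2
    ${RUNNER:-sudo} ipmitool -I lanplus \
        -H "$(mgmt_fqdn "$ripmi_host" "$ripmi_dc")" \
        -U root -E "$@"
}
```

For example, remote_ipmi example1001 eqiad chassis power status runs the power status check shown above, and setting RUNNER=echo beforehand only prints the command that would be executed.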

Is remote IPMI enabled?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff

If there is no output it means that remote IPMI is enabled and this configuration is good. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.

See the section below for how to do the same from the web management interface when no local host access is available (e.g. a new host).

Are IPMI permissions set correctly?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff

If there is no output it means that the user is configured correctly. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.
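
The diff, commit, verify cycle described in the two checks above can be scripted. This is a minimal sketch under assumptions: the function names lan_channel_cmd and ensure_lan_channel and the RUNNER dry-run hook are hypothetical, and it relies only on the behaviour stated above, namely that --diff prints nothing when the settings already match. It does not handle the bmc-config case for Ubuntu Trusty.

```shell
# Usage: lan_channel_cmd ACTION VOLATILE_KEY NON_VOLATILE_KEY VALUE
# ACTION is "diff" or "commit". RUNNER is an assumed hook that defaults to
# sudo; set RUNNER=echo to print the command instead of running it.
lan_channel_cmd() {
    ${RUNNER:-sudo} ipmi-config --section=Lan_Channel \
        --key-pair="Lan_Channel:$2=$4" \
        --key-pair="Lan_Channel:$3=$4" \
        "--$1"
}

# Usage: ensure_lan_channel VOLATILE_KEY NON_VOLATILE_KEY VALUE
# Commits the desired value only when --diff reports a difference, then
# verifies with a second --diff. Returns 0 when the settings match.
ensure_lan_channel() {
    lc_diff=$(lan_channel_cmd diff "$1" "$2" "$3")
    if [ -z "$lc_diff" ]; then
        echo "OK: $1 and $2 are already set to $3"
        return 0
    fi
    echo "Current values differ, committing:"
    echo "$lc_diff"
    lan_channel_cmd commit "$1" "$2" "$3" || return 1
    # Verify: a second --diff must now be empty.
    [ -z "$(lan_channel_cmd diff "$1" "$2" "$3")" ]
}
```

The two checks above would then be ensure_lan_channel Volatile_Access_Mode Non_Volatile_Access_Mode Always_Available and ensure_lan_channel Volatile_Channel_Privilege_Limit Non_Volatile_Channel_Privilege_Limit Administrator.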

Is there any overriding for next boot?

The BIOS Boot_Device is managed by Puppet, so it should be correct. This can be validated using the custom ipmi_chassis fact. A clean configuration should return NO-OVERRIDE for the ipmi_chassis.boot_flags.device fact:

$ sudo facter -p ipmi_chassis.boot_flags.device

Alternatively, execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5

Typical output for clean settings with no overrides is:

Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0000000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

In case of overrides the Boot parameter data bitmask will be different from 0000000000 and the lines below it will show the overridden values.
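
The bitmask check described above can be automated by parsing the Boot parameter data line. The following is a minimal sketch; the function name has_boot_override and its exit-code convention are hypothetical, and it assumes the output format shown above.

```shell
# Reads the output of "chassis bootparam get 5" on stdin.
# Exit codes (a convention chosen for this sketch):
#   0 = an override is set (the bitmask contains a non-zero digit)
#   1 = no override (the bitmask is all zeros)
#   2 = no "Boot parameter data" line was found
has_boot_override() {
    # Strip trailing whitespace, then keep only the bitmask value.
    data=$(sed -n -e 's/[[:space:]]*$//' \
                  -e 's/^Boot parameter data:[[:space:]]*//p')
    case "$data" in
        "")     return 2 ;;
        *[!0]*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

It can be fed directly from the remote command, for example by piping the output of the ipmitool invocation above into has_boot_override and checking the exit code.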

The wmf-auto-reimage script automatically checks that the host has set the PXE Boot bit before rebooting it and will print a warning after the reimage if the parameters are not all reset to their default values.

To manually reset it and remove any override, run:

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev none

Then re-check that the change has been applied.

Does IPMI work but SSH to the management console doesn't?

In this case the management card can be reset; see "Reset the management card" below.

Fix Commands

Set the IPMI password

If you suspect that the IPMI password has gone out of sync with the management one, you can set it again. SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 ${PASSWORD}
    
  • for HP hosts:
    set /map1/accounts1/root password=${PASSWORD}
    

Enable remote IPMI access (over LAN) without local host access

For HP iLO 4 and lower, the option is under:

Administration > Access Settings > IPMI/DCMI

For HP iLO 5, the option is under:

Security > Access Settings > Edit Network settings

Reset the management card

If the management card is unresponsive to IPMI, SSH or ping, but at least one of the following options is available, a card reset can be attempted. This only restarts the card's OS and (in theory) does not affect the underlying host.

From the SSH console

To reset the management card, SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm racreset
    
  • for HP hosts:
    reset /map1
    

From local IPMI

To reset the management card via local IPMI, SSH into the host (not mgmt) and run:

bmc-device --cold-reset; echo $?

It prints nothing on success, so the exit code is echoed to confirm that the command ran correctly.
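
Since the command is silent on success, a wrapper that turns the exit code into an explicit message can be easier to read in a terminal or a script. This is a minimal sketch: the function name local_bmc_cold_reset and the RUNNER hook are hypothetical, and sudo is assumed here as for the other local commands on this page.

```shell
# RUNNER is an assumed hook that defaults to sudo; it exists only so the
# function can be exercised without touching a real BMC.
local_bmc_cold_reset() {
    if ${RUNNER:-sudo} bmc-device --cold-reset; then
        echo "BMC cold reset command accepted"
    else
        # In the else branch, $? still holds the exit code of bmc-device.
        rc=$?
        echo "BMC cold reset failed with exit code $rc" >&2
        return "$rc"
    fi
}
```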

From remote IPMI

To reset the management card via remote IPMI, execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E mc reset cold

For the full list of available commands to manage the management card via remote IPMI, see man ipmitool and search for the section titled mc | bmc.
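
After a reset the card typically takes a while to answer again, so it can help to poll until remote IPMI responds. The following is a minimal sketch of a generic retry helper; the function name retry is hypothetical, and any attempt count or delay passed to it is an arbitrary choice, not a value recommended by this page.

```shell
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between attempts. Returns 0 on the first success,
# 1 if every attempt failed.
retry() {
    retry_attempts=$1
    retry_delay=$2
    shift 2
    retry_i=1
    while [ "$retry_i" -le "$retry_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$retry_i" -lt "$retry_attempts" ]; then
            sleep "$retry_delay"
        fi
        retry_i=$((retry_i + 1))
    done
    return 1
}
```

For example, retry 30 10 sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status would try the power status check up to 30 times, 10 seconds apart. Each attempt may prompt for the password.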

Power drain the host

For a full cold reset the host must be shut down and the power cables removed (drain). This is usually the last resort, when ping, IPMI and SSH are all failing and every other attempt to fix it has failed.