SRE/Dc-operations/Platform-specific documentation/HP Documentation
HP ProLiant Server Directions
Please note our datacenter installation currently includes HP ProLiant Gen8, Gen9, and Gen10 servers. This document covers all of them, with details where the commands differ.
- Lights Out Manager: HP iLO 4
Lights Out Management
The SSH implementation on the remote management interfaces only supports legacy SSH algorithms. With a current OpenSSH release you will likely need to explicitly enable a key exchange algorithm supported by iLO, e.g.:
ssh -oKexAlgorithms=+diffie-hellman-group14-sha1 HOST
-or-
ssh -oKexAlgorithms=diffie-hellman-group14-sha1 HOST
-or-
ssh -oKexAlgorithms=diffie-hellman-group14-sha1 -c 3des-cbc HOST
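For scripted access, the same options can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `ilo_ssh_command` (it is not an existing tool); it only builds the argument list and does not connect anywhere. The default user `root` is taken from the mgmt login examples on this page.

```python
def ilo_ssh_command(host, user="root", kex="diffie-hellman-group14-sha1",
                    append_kex=True, cipher=None):
    """Build an ssh argument list for a legacy iLO SSH server (hypothetical helper)."""
    # A leading "+" appends the algorithm to OpenSSH's default list;
    # without it, the default list is replaced by this single algorithm.
    kex_value = ("+" if append_kex else "") + kex
    argv = ["ssh", "-oKexAlgorithms=" + kex_value]
    if cipher is not None:
        # Optional legacy cipher, e.g. "3des-cbc" as in the third variant above.
        argv += ["-c", cipher]
    argv.append(f"{user}@{host}")
    return argv


if __name__ == "__main__":
    print(" ".join(ilo_ssh_command("bast1001.mgmt.eqiad.wmnet")))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.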
Common Actions
Reboot and boot from network then console
These boxes take ages to boot, so it will look like it's hanging when you connect to the serial console. Give it a few minutes (it takes 4-5 minutes just to get some screen feedback).
set /system1/bootconfig1/bootsource5 bootorder=1
power reset
VSP
That will leave it netbooting forever, so at some point return to the iLO and run:
set /system1/bootconfig1/bootsource5 bootorder=5
Note: this is only true if BootFmDisk is bootsource5. Some newer hardware may only have 4 bootsources:
bootsource2=BootFmDisk bootorder=1
bootsource1=BootFmCd bootorder=2
bootsource3=BootFmUSBKey bootorder=3
bootsource4=BootFmNetwork1 bootorder=4
In this case the command would be
set /system1/bootconfig1/bootsource2 bootorder=1
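Because the bootsource number differs between hardware generations, it can help to generate the command sequence from parameters rather than copying it. The sketch below assumes a hypothetical helper `netboot_commands`; the bootsource number and the order to restore afterwards must be checked on the host first, since they are inputs here rather than something the code can discover.

```python
BOOTSOURCE_PREFIX = "/system1/bootconfig1/bootsource"


def netboot_commands(network_bootsource, restore_order):
    """Return (boot, restore) lists of iLO CLI commands (hypothetical helper).

    network_bootsource: number of the bootsource entry to move to the front.
    restore_order: bootorder value to give that entry back afterwards.
    """
    target = f"{BOOTSOURCE_PREFIX}{network_bootsource}"
    boot = [
        f"set {target} bootorder=1",  # put the chosen source first
        "power reset",                # reboot the server
        "vsp",                        # attach to the serial console
    ]
    restore = [f"set {target} bootorder={restore_order}"]
    return boot, restore


if __name__ == "__main__":
    boot, restore = netboot_commands(5, 5)
    print("\n".join(boot + restore))
```

With `netboot_commands(5, 5)` this reproduces the bootsource5 sequence shown at the start of this section.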
Get console
</>hpiLO-> vsp
Virtual Serial Port Active: COM2
Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.
or
start /system1/oemhp_vsp1
Reset Virtual Serial port
Should the message
Virtual Serial Port is currently in use by another session.
be returned, and you know you are not killing someone else's working session, issue
stop /system1/oemhp_vsp1
Reboot and boot into BIOS then console
Connecting to mgmt interface
- Via SSH
- ssh root@servername.mgmt.datacenter.wmnet
- Example: ssh root@bast1001.mgmt.eqiad.wmnet
- Via Browser
- Sometimes useful for troubleshooting, but you must first set up some kind of http(s) proxy into the mgmt network, as demonstrated at Proxy_access_to_cluster
- https://servername.mgmt.datacenter.wmnet
- Please note you will have to override an unknown (self-signed) certificate. Do not save it permanently, as having a few of these saved tends to cause errors when connecting to other management interfaces (e.g. Dell DRAC) via HTTPS.
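The naming pattern above (`servername.mgmt.datacenter.wmnet`) is regular enough to derive in code. A minimal sketch follows; the function names are hypothetical and only illustrate the pattern.

```python
def mgmt_fqdn(servername, datacenter):
    """Return the management-interface hostname for a server."""
    return f"{servername}.mgmt.{datacenter}.wmnet"


def mgmt_ssh_target(servername, datacenter, user="root"):
    """Return the user@host string to pass to ssh."""
    return f"{user}@{mgmt_fqdn(servername, datacenter)}"


def mgmt_url(servername, datacenter):
    """Return the HTTPS URL of the management web interface."""
    return f"https://{mgmt_fqdn(servername, datacenter)}"


if __name__ == "__main__":
    print(mgmt_ssh_target("bast1001", "eqiad"))
    print(mgmt_url("bast1001", "eqiad"))
```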
Connecting to Serial Console
- Attach to the serial console: vsp
- Detach from serial console: esc+(
- BIOS Serial Console Boot Keys:
- ESC+9 for ROM
- ESC+0 for Intelligent Provisioning
- ESC+! for Default Boot Override Options
- ESC+@ for Network Boot
- Crash Cart Boot Keys:
- F9 for ROM-Based Setup Utility
- F10 for Intelligent Provisioning
- F11 for Default Boot Override Options
- F12 for Network Boot
Power cycling
- Login to the iLO command line interface.
- Power commands are as follows:
power -- Displays the current server power state
power on -- Turns the server on
power off -- Turns the server off
power off hard -- Force the server off using press and hold
power reset -- Reset the server
Administrative Actions
Polling for MAC Address
- Login to iLO command line interface.
- Run: show system1/network1/Integrated_NICs
Changing iLO User Password
Changing the iLO Network IP Settings
Enable / Disable IPMI over iLO
Setting a one-time boot option
Get MAC address
show /system1/network1/Integrated_NICs
Unless we are bonding, only the first NIC is used, so the first NIC reported by iLO should be the one that is plugged in.
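When the output is captured from a scripted session, the MAC addresses can be pulled out with a small parser. This is a sketch under assumptions: it expects `name=value` properties whose value looks like a MAC address, and the sample text in the usage section is made up for illustration rather than copied from a real iLO. It returns every match with its property name so that the reader can decide which entry is the first server NIC.

```python
import re

# Matches "<property>=<MAC>" where the MAC is six colon-separated hex octets.
MAC_PROPERTY_RE = re.compile(
    r"(\w+)\s*=\s*((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})\b"
)


def extract_mac_properties(text):
    """Return a list of (property_name, mac) tuples in order of appearance."""
    return [(name, mac.lower()) for name, mac in MAC_PROPERTY_RE.findall(text)]


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = (
        "  Properties\n"
        "    Port1NIC_MACAddress=AA:BB:CC:DD:EE:01\n"
        "    Port2NIC_MACAddress=aa:bb:cc:dd:ee:02\n"
    )
    for name, mac in extract_mac_properties(sample):
        print(name, mac)
```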
Disable PCI device
rbsu> SHOW PCI DEVICE ENABLE/DISABLE
rbsu> SET PCI DEVICE ENABLE/DISABLE <entrynum> 0
Enable/Disable Hyperthreading
rbsu> SHOW CONFIG INTEL(R) HYPERTHREADING OPTIONS
Intel(R) Hyperthreading Options
1|Enabled <=
2|Disabled
rbsu> SET CONFIG INTEL(R) HYPERTHREADING OPTIONS 2
Intel(R) Hyperthreading Options
1|Enabled
2|Disabled <=
Show system event log entries
While at the iLO console:
# show all recorded entries
show /system1/log1
# show a particular entry
show /system1/log1/record15
Also check out the main Sel page
ms-be RAID0 config
An easy way to configure all disks of the swift backend (ms-be) machines as RAID 0 using the console above (the order is important):
First, reboot the system and press 'ESC+9' during boot to enter System Utilities. Once there, select System Configuration, then Slot 3 : Smart Array P840 Controller, then Exit and launch HP Smart Storage Administrator (HPSSA). At the next step an error message 'error: no such device: EMBEDDED250.' will appear; there is nothing to do at this point but wait for the hpssacli prompt (==>)
set target controller slot=3
array all delete forced
create type=arrayr0 drivetype=ss_sata
create type=arrayr0 drivetype=sata
Additional MS-BE RAID details
The ms-be RAID configuration puts each disk in its own RAID 0, starting with the SSDs. The ms-be systems generally come with a total of 14 disks, numbered 0 to 13, with the SSDs in slots 12 and 13. First create a RAID 0 for the SSD in slot 12, then another RAID 0 for the SSD in slot 13, so that the SSDs are named sda and sdb. After that, do the same for the other 12 disks. At the end you will have:
Array A, Array B, Array C, Array D, Array E, Array F, Array G, Array H, Array I, Array J, Array K, Array L, Array M, Array N
Array A being the SSD in slot 12 and Array B the SSD in slot 13
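The resulting layout can be written down as a mapping from array letter to disk slot, which is handy for double-checking the creation order. This is a sketch with one assumption that the text does not state: after the two SSDs, the remaining disks are taken in ascending slot order. The function name is hypothetical.

```python
import string


def msbe_array_layout(total_disks=14, ssd_slots=(12, 13)):
    """Map array names to disk slots, SSDs first (hypothetical helper)."""
    # Assumption: non-SSD disks follow in ascending slot order.
    other_slots = [slot for slot in range(total_disks) if slot not in ssd_slots]
    creation_order = list(ssd_slots) + other_slots
    return {
        f"Array {string.ascii_uppercase[index]}": slot
        for index, slot in enumerate(creation_order)
    }


if __name__ == "__main__":
    for name, slot in msbe_array_layout().items():
        print(f"{name}: slot {slot}")
```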
Once in the BIOS, go to "System Configuration" - "Embedded RAID 1 : HPE Smart Array P816i-a SR Gen10" - "Array Configuration" - "Create Array".
Mark a disk as failed
It might happen that Linux detects errors while writing to a disk but the raid controller itself doesn't see the disk as failed (e.g. https://phabricator.wikimedia.org/T163690). In these cases it is useful to forcefully mark the physical drive as failed as follows:
set target controller slot=3
pd all show
# take note of the disk e.g. 1I:1:5
pd DISK modify disablepd forced
To reenable the LD (not the PD) after the disk has been swapped:
ld NUMBER modify reenable
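To avoid typos in the disk identifier, the command sequence can be generated from the values noted down. A minimal sketch, assuming hypothetical helper names; it only formats the hpssacli commands listed above and does not run them.

```python
def mark_disk_failed_commands(disk, slot=3):
    """Return the hpssacli commands that force a physical drive to failed."""
    return [
        f"set target controller slot={slot}",
        "pd all show",  # list the drives to confirm the identifier first
        f"pd {disk} modify disablepd forced",
    ]


def reenable_ld_command(ld_number):
    """Return the command that re-enables the logical drive after a swap."""
    return f"ld {ld_number} modify reenable"


if __name__ == "__main__":
    print("\n".join(mark_disk_failed_commands("1I:1:5")))
    print(reenable_ld_command(5))
```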
Blink disk led
Via hpssacli:
set target controller slot=3
pd DISK modify led=on
ACPI Errors
On first install and after the first puppet run there might be messages similar to this showing up on console:
ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a523f04f2f8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
This is related to the "power meter" ACPI module being loaded. The module has been blacklisted since https://gerrit.wikimedia.org/r/#/c/356422/, so a reboot will make the messages disappear.
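When scanning console or kernel logs from a fresh install, these messages can be filtered out so that real problems stand out. The sketch below uses simple substring heuristics taken from the three sample lines above; the marker list and function name are assumptions, and the first marker in particular may also match unrelated ACPI errors.

```python
# Substrings taken from the sample messages above (heuristic, not exhaustive).
POWER_METER_MARKERS = (
    "GenericSerialBus write requires Buffer",
    "PMI0._PMM",
    "Evaluating _PMM",
)


def is_power_meter_noise(line):
    """Return True if the log line looks like the known power meter ACPI noise."""
    return any(marker in line for marker in POWER_METER_MARKERS)


def filter_log(lines):
    """Return only the lines that are not known power meter noise."""
    return [line for line in lines if not is_power_meter_noise(line)]


if __name__ == "__main__":
    log = [
        "ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)",
        "EXT4-fs (sda1): mounted filesystem",
    ]
    print(filter_log(log))
```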
Initial System Setup
Gen10: HPs require F9 during boot to get to the BIOS.
Gen8/Gen9: Enter the system setup tool by pressing ESC+9 during boot. The terminal emulation of this tool is lousy, so things will scrawl all over your screen and generally be hard to use.
System Options
Checklist for BIOS settings on HP systems:
- boot to system setup utilities (F9 on post)
- Check the following settings in the System Settings > System Configuration > Bios/Platform Configuration (RBSU):
- System Options > Serial Port Options > Virtual Serial Port : COM 2
- System Options > USB Options > Internal SD Card Slot : Disabled
- Processor Options > Intel(R) Hyper-Threading : Enabled
- Virtualization Options : Disabled on EVERYTHING unless it is a cloudvirt or ganeti server (those are enabled on everything)
- Cloudvirt/Ganeti settings: Virtualization Technology = Enabled, Intel(R) VT-d = Enabled, SR-IOV = Enabled
- Boot Options > Boot Mode : Legacy BIOS Mode
- Legacy BIOS Boot Order > Standard Boot Order (IPL) : Hard drive is located ahead of network port.
- Network Options > Network Boot Options : Ensure only the active/primary ethernet port is set to Network Boot and the other ports set to Disabled. (This does NOT disable the port, ONLY disables it attempting to network boot.)
- System Configuration -> BIOS/Platform Configuration (RBSU) -> System Options -> USB Options -> Embedded User Partition = Disabled
- Save settings, go back to the System Configuration main menu.
- If systems have hardware raid, select the embedded raid controller on the System Configuration Screen and setup the raid.
- Differing systems have differing raid requirements, please see the individual system setup tasks for details on the raid levels.
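The checklist above can also be expressed as data, so that settings read off a host can be compared against it. This is a sketch only: the setting names and values are copied from the checklist, but how the actual values are collected from a server is left open, and the function names are hypothetical. The boot order and network boot items are not modelled because they depend on the individual host.

```python
BASE_SETTINGS = {
    "Virtual Serial Port": "COM 2",
    "Internal SD Card Slot": "Disabled",
    "Intel(R) Hyper-Threading": "Enabled",
    "Boot Mode": "Legacy BIOS Mode",
    "Embedded User Partition": "Disabled",
}

# Enabled only on cloudvirt and ganeti servers, disabled everywhere else.
VIRTUALIZATION_SETTINGS = ("Virtualization Technology", "Intel(R) VT-d", "SR-IOV")


def expected_settings(virtualization_host=False):
    """Return the expected BIOS settings for a host."""
    settings = dict(BASE_SETTINGS)
    value = "Enabled" if virtualization_host else "Disabled"
    for name in VIRTUALIZATION_SETTINGS:
        settings[name] = value
    return settings


def find_mismatches(actual, virtualization_host=False):
    """Return {setting: (expected, actual)} for every setting that differs."""
    expected = expected_settings(virtualization_host)
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }


if __name__ == "__main__":
    observed = expected_settings(virtualization_host=True)
    print(find_mismatches(observed, virtualization_host=False))
```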
Setting Management Network Settings
During POST, hit F8 to get to the iLO setup (it moves slowly).
- Under Network
- DHCP/DNS Setting
- Disable DHCP by pressing the space bar and ensuring it says OFF, then hit F10
- NIC and TCP/IP
- Enter IP/Subnet/Gateway
- Under User
- edit and change Administrator to root
user name: root
login name: root
password: mgmt password
- Ensure all iLO privileges are marked yes
- File
- Exit and iLO will reset
NetBios / Wins ILOM Traffic
If WINS resolution is mistakenly enabled in the dedicated iLOM network port IPv4 settings, it will spam the network with NetBIOS requests. To resolve this, disable it under the dedicated iLOM port IPv4 settings and reset the iLOM.
RAID controller firmware upgrade
Upgrading RAID controller firmware is relatively straightforward and as of Jan 2019 hasn't posed issues, see also bug T141756 for more context.
This guide assumes one of the common controllers we have deployed on the fleet, usually P840.
version=ea3138d8e8-6.88-1.1
cd /tmp
curl apt.wikimedia.org/firmware/firmware-smartarray-${version}.x86_64.tgz | tar zxv
sudo ./usr/lib/x86_64-linux-gnu/firmware-smartarray-${version}/setup
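Both the tarball location and the path of the unpacked setup script are derived from the same version string. The sketch below makes that relationship explicit; the function name is hypothetical and the locations are exactly those used in the commands above, with no scheme added to the download location.

```python
def smartarray_firmware_paths(version):
    """Return the download location and setup path for a firmware version."""
    package = f"firmware-smartarray-{version}"
    return {
        # Download location, written as in the curl command above.
        "tarball": f"apt.wikimedia.org/firmware/{package}.x86_64.tgz",
        # Path of the installer relative to the extraction directory.
        "setup": f"./usr/lib/x86_64-linux-gnu/{package}/setup",
    }


if __name__ == "__main__":
    for key, value in smartarray_firmware_paths("ea3138d8e8-6.88-1.1").items():
        print(f"{key}: {value}")
```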
Setting proper power option
In BIOS:
- Select Service Options
- Set Processor Power Monitoring and choose Disabled
- Press Enter, and ignore the warning message regarding modification by pressing Enter again. Select Disabled and press Enter again.