Sel

From Wikitech

The system event log (SEL) is used to log hardware events directly to the BMC. This allows events to be logged even if the OS is unable to log them. Currently we use ipmiseld. We also export the ipmi_sel_logs_count metric so we can detect when new events have been logged; however, at this point manual investigation is required to decide what action to take.

Additional information

To get additional information, use either the ipmi-sel command on the server or the racadm getsel command. The latter is more likely to give precise information; however, the former often gives enough information to progress the resolution.

ipmi-sel

ipmi-sel comes from the FreeIPMI tools and tries to interpret the messages based on standardised codes. Running the command without any arguments will give you some information, e.g.

$ sudo ipmi-sel                                                             
ID  | Date        | Time     | Name             | Type                        | Event
67  | Feb-14-2023 | 22:45:19 | SBE Log Disabled | Event Logging Disabled      | Correctable Memory Error Logging Disabled ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 40h

However, you can get ipmi-sel to try to infer additional information by adding the parameters --interpret-oem-data and --entity-sensor-names, e.g.


$ sudo ipmi-sel --interpret-oem-data --entity-sensor-names
ID  | Date        | Time     | Name                             | Type                        | Event
67  | Feb-14-2023 | 22:45:19 | System Firmware SBE Log Disabled | Event Logging Disabled      | Correctable Memory Error Logging Disabled ; DIMM A7
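When a host has many SEL entries, the pipe-delimited table can be awkward to scan or paste into a ticket. The following is a minimal sketch, not part of the existing tooling: sel_summary is a hypothetical helper name, and it assumes the six-column layout shown above (ID, Date, Time, Name, Type, Event).

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeIPMI): condense ipmi-sel output read from stdin.
# Assumes the six pipe-separated columns shown above.
# Prints "ID<TAB>Date Time<TAB>Event" for each record and skips the header row.
sel_summary() {
    awk -F'|' '
        # Skip the header row, whose first column is just "ID".
        $1 ~ /^[ \t]*ID[ \t]*$/ { next }
        NF >= 6 {
            # Trim the padding that ipmi-sel adds around each column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            printf "%s\t%s %s\t%s\n", $1, $2, $3, $6
        }
    '
}

# Usage on the affected host:
#   sudo ipmi-sel --interpret-oem-data --entity-sensor-names | sel_summary
```

The output is tab-separated, so it can be passed to cut or grep, for example to pick out entries that mention a particular DIMM.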

racadm

Using racadm is likely to give you more accurate, human-readable information as it uses the Dell interface; however, this requires logging in to the iDRAC interface.

$ ssh cloudvirt1036.mgmt.eqiad.wmnet                                                      
root@cloudvirt1036.mgmt.eqiad.wmnet's password: 
racadm>>getsel
Record:      67
Date/Time:   02/14/2023 22:45:19
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A7.
-------------------------------------------------------------------------------
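If the getsel output is long, it can help to pull out only the entries marked Critical. The sketch below assumes the output has been saved to a local file and that each record follows the field order shown above (Record, Date/Time, Source, Severity, Description). sel_critical is a hypothetical helper name, not a racadm feature.

```shell
#!/bin/sh
# Hypothetical helper: read saved racadm getsel output from stdin and print
# "record<TAB>description" for every entry whose Severity is Critical.
# Assumes Severity appears before Description within each record, as shown above.
sel_critical() {
    awk '
        { sub(/\r$/, "") }                 # drop carriage returns from terminal captures
        /^Record:/      { rec = $2 }
        /^Severity:/    { sev = $2 }
        /^Description:/ {
            sub(/^Description:[ \t]*/, "")
            if (sev == "Critical") print rec "\t" $0
        }
    '
}

# Usage, with the getsel output saved to a file of your choosing:
#   sel_critical < getsel-output.txt
```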

Alerts

When you receive an alert, check the SEL using one of the methods above, create a Phabricator ticket for each actionable event, then clear the log as described under Clearing SEL.

Memory Errors

See Memory correctable errors -EDAC-

Clearing SEL

Once all events have been logged, clear the SEL. This can be performed using sudo ipmi-sel --clear
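Because clearing is irreversible, it may be worth keeping a copy of the log first. The following is a minimal sketch under stated assumptions, not an established procedure: sel_archive_and_clear and the IPMI_SEL_CMD override are hypothetical names introduced here, and the default output file name is only an illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: save the SEL to a file, then clear it.
# IPMI_SEL_CMD is an assumed override hook so the function can be exercised
# without a BMC; by default it runs "sudo ipmi-sel".
sel_archive_and_clear() {
    outfile=${1:-"sel-$(hostname)-$(date +%Y%m%dT%H%M%S).txt"}
    cmd=${IPMI_SEL_CMD:-sudo ipmi-sel}
    # Dump the interpreted log first and stop if the dump fails,
    # so the log is never cleared without a saved copy.
    $cmd --interpret-oem-data --entity-sensor-names > "$outfile" || return 1
    $cmd --clear
}

# Usage on the affected host:
#   sel_archive_and_clear /tmp/sel-before-clear.txt
```

The saved file can then be attached to the Phabricator ticket if more detail is needed later.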