Last reviewed: 06/02/2009
HP iLO2 NMI Watchdog Driver
NMI sourcing for iLO2 based ProLiant Servers
Documentation and Driver by
Thomas Mingarelli <thomas.mingarelli@hp.com>
The HP iLO2 NMI Watchdog driver is a kernel module that provides basic
watchdog functionality and the added benefit of NMI sourcing. Both the
watchdog functionality and the NMI sourcing capability need to be enabled
by the user. The two modes do not depend on one another: a user can have
NMI sourcing without the watchdog timer and vice versa.
Watchdog functionality is enabled as with any other common watchdog driver:
an application must be started that kicks off the watchdog timer. A basic
application, watchdog-test.c, exists in the Documentation/watchdog/src
directory. Simply compile the C file and run it. If the system gets into a
bad state and hangs, the HP ProLiant iLO 2 timer register will not be updated
in a timely fashion and a hardware system reset (also known as an Automatic
Server Recovery (ASR) event) will occur.
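The keepalive pattern described above can be sketched in C as follows. This is a minimal illustration, not the code in watchdog-test.c. It relies on the generic watchdog behaviour documented in watchdog-api.txt, where any write to the open device counts as a ping, and on the Magic Close character 'V', which only disarms the timer on drivers that support it and only when nowayout is not set. The helper names wdt_keepalive, wdt_ping_loop and wdt_release are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Send one keepalive to an already-open watchdog descriptor.
 * Any byte written to the device resets the timer.
 * Returns 0 on success, -1 on failure.
 */
static int wdt_keepalive(int fd)
{
	ssize_t n;

	do {
		n = write(fd, "\0", 1);
	} while (n < 0 && errno == EINTR);

	return n == 1 ? 0 : -1;
}

/*
 * Ping the watchdog `count` times, sleeping `interval_sec` seconds
 * after each ping. Returns -1 as soon as a ping fails, 0 otherwise.
 */
static int wdt_ping_loop(int fd, unsigned int count, unsigned int interval_sec)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		if (wdt_keepalive(fd) != 0)
			return -1;
		if (interval_sec > 0)
			sleep(interval_sec);
	}
	return 0;
}

/*
 * Write the Magic Close character 'V' and close the descriptor.
 * Assumption: the driver honours Magic Close and nowayout is not set;
 * otherwise the timer keeps running after the close.
 * Returns 0 if both the write and the close succeeded, -1 otherwise.
 */
static int wdt_release(int fd)
{
	ssize_t n;

	do {
		n = write(fd, "V", 1);
	} while (n < 0 && errno == EINTR);

	if (close(fd) != 0 || n != 1)
		return -1;
	return 0;
}
```

A real program would obtain the descriptor with open("/dev/watchdog", O_WRONLY) as root, which arms the timer, and would choose an interval comfortably shorter than soft_margin. If the ping loop stops without a successful release, the ASR described above should be expected.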
The hpwdt driver also has three (3) module parameters:
 soft_margin - allows the user to set the watchdog timer value (in seconds)
 allow_kdump - allows the user to save off a kernel dump image after an NMI
 nowayout    - standard watchdog parameter that does not allow the timer to
               be stopped once started, so an impending ASR cannot be escaped
NOTE: More information about watchdog drivers in general, including the ioctl
      interface to /dev/watchdog, can be found in
      Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt.
The NMI sourcing capability is disabled when the driver discovers that the
nmi_watchdog is turned on (nmi_watchdog=1). This is because the Linux kernel
cannot distinguish between "NMI Watchdog Ticks" and "HW generated NMI events".
As a result, the hpwdt NMI handler code is called each time the NMI signal
fires, which could amount to several thousand NMIs in a matter of seconds. If
a user sees the Linux kernel's "dazed and confused" message in the logs, or if
the system gets into a hung state, then the user should reboot with
nmi_watchdog=0:
 1. If the kernel has not been booted with nmi_watchdog turned off, edit
    /boot/grub/menu.lst and place nmi_watchdog=0 at the end of the currently
    booting kernel line.
 2. Reboot the server.
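Before editing the boot loader configuration, it can help to check how the running kernel was booted. The sketch below classifies a kernel command line string, such as the contents of /proc/cmdline, by its nmi_watchdog setting. It is an illustration only and not part of the hpwdt driver: the function and enum names are hypothetical, it assumes the last occurrence of the parameter wins, and it treats any value other than a plain 0 as "on".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum nmi_wd_state {
	NMI_WD_UNSET = -1,	/* parameter not present on the command line */
	NMI_WD_ON    = 0,	/* present with a value other than "0" */
	NMI_WD_OFF   = 1	/* present as nmi_watchdog=0 */
};

/*
 * Scan a whitespace-separated kernel command line for nmi_watchdog=.
 * Only tokens that start with the key are considered, and a later
 * occurrence overrides an earlier one (an assumption of this sketch).
 */
static enum nmi_wd_state nmi_watchdog_state(const char *cmdline)
{
	static const char key[] = "nmi_watchdog=";
	const size_t keylen = sizeof(key) - 1;
	enum nmi_wd_state state = NMI_WD_UNSET;
	const char *p = cmdline;

	assert(cmdline != NULL);

	while (*p) {
		/* Skip separators in front of the next token. */
		p += strspn(p, " \t\n");

		if (strncmp(p, key, keylen) == 0) {
			const char *val = p + keylen;
			size_t vlen = strcspn(val, " \t\n");

			state = (vlen == 1 && val[0] == '0')
				? NMI_WD_OFF : NMI_WD_ON;
		}

		/* Advance to the end of the current token. */
		p += strcspn(p, " \t\n");
	}
	return state;
}
```

A caller would read /proc/cmdline into a buffer and pass it in. NMI_WD_UNSET means the kernel default applies, so adding nmi_watchdog=0 explicitly as in step 1 is the unambiguous choice.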
Now the hpwdt driver can successfully receive and source the NMI and provide
a log message that details the reason for the NMI (as determined by the HP
BIOS). Below is a list of NMIs the HP BIOS understands, along with the
associated code (reason):
 No source found                00h
 Uncorrectable Memory Error     01h
 ASR NMI                        1Bh
 PCI Parity Error               20h
 NMI Button Press               27h
 SB_BUS_NMI                     28h
 ILO Doorbell NMI               29h
 ILO IOP NMI                    2Ah
 ILO Watchdog NMI               2Bh
 Proc Throt NMI                 2Ch
 Front Side Bus NMI             2Dh
 PCI Express Error              2Fh
 DMA controller NMI             30h
 Hypertransport/CSI Error       31h
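When post-processing logs, the table above can be carried as a small lookup structure. The following is a sketch that mirrors the listed codes. The struct and function names are hypothetical and are not taken from the hpwdt source. Codes absent from the table (for example 2Eh) yield NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct nmi_reason {
	unsigned char code;	/* reason code reported by the HP BIOS */
	const char *text;	/* description from the table above */
};

static const struct nmi_reason nmi_reasons[] = {
	{ 0x00, "No source found" },
	{ 0x01, "Uncorrectable Memory Error" },
	{ 0x1B, "ASR NMI" },
	{ 0x20, "PCI Parity Error" },
	{ 0x27, "NMI Button Press" },
	{ 0x28, "SB_BUS_NMI" },
	{ 0x29, "ILO Doorbell NMI" },
	{ 0x2A, "ILO IOP NMI" },
	{ 0x2B, "ILO Watchdog NMI" },
	{ 0x2C, "Proc Throt NMI" },
	{ 0x2D, "Front Side Bus NMI" },
	{ 0x2F, "PCI Express Error" },
	{ 0x30, "DMA controller NMI" },
	{ 0x31, "Hypertransport/CSI Error" },
};

/*
 * Return the description for a reason code, or NULL if the code is
 * not in the table.
 */
static const char *nmi_reason_text(unsigned char code)
{
	size_t i;

	for (i = 0; i < sizeof(nmi_reasons) / sizeof(nmi_reasons[0]); i++) {
		if (nmi_reasons[i].code == code)
			return nmi_reasons[i].text;
	}
	return NULL;
}
```

A linear scan is sufficient for fourteen entries. Callers should handle the NULL case, since the table only covers the reasons listed here.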
 -- Tom Mingarelli
    (thomas.mingarelli@hp.com)