summaryrefslogtreecommitdiff
path: root/Documentation/nmi_watchdog.txt
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@ppc970.osdl.org>2005-04-16 15:20:36 -0700
committerLinus Torvalds <torvalds@ppc970.osdl.org>2005-04-16 15:20:36 -0700
commit1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch)
tree0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/nmi_watchdog.txt
downloadlwn-1da177e4c3f41524e886b7f1b8a0c1fc7321cac2.tar.gz
lwn-1da177e4c3f41524e886b7f1b8a0c1fc7321cac2.zip
Linux-2.6.12-rc2v2.6.12-rc2
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
Diffstat (limited to 'Documentation/nmi_watchdog.txt')
-rw-r--r--Documentation/nmi_watchdog.txt81
1 files changed, 81 insertions, 0 deletions
diff --git a/Documentation/nmi_watchdog.txt b/Documentation/nmi_watchdog.txt
new file mode 100644
index 000000000000..c025a4561c10
--- /dev/null
+++ b/Documentation/nmi_watchdog.txt
@@ -0,0 +1,81 @@
+
+[NMI watchdog is available for x86 and x86-64 architectures]
+
+Is your system locking up unpredictably? No keyboard activity, just
+a frustrating complete hard lockup? Do you want to help us debugging
+such lockups? If all yes then this document is definitely for you.
+
+On many x86/x86-64 type hardware there is a feature that enables
+us to generate 'watchdog NMI interrupts'. (NMI: Non Maskable Interrupt
+which get executed even if the system is otherwise locked up hard).
+This can be used to debug hard kernel lockups. By executing periodic
+NMI interrupts, the kernel can monitor whether any CPU has locked up,
+and print out debugging messages if so.
+
+In order to use the NMI watchdog, you need to have APIC support in your
+kernel. For SMP kernels, APIC support gets compiled in automatically. For
+UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
+APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
+features -> IO-APIC support on uniprocessors) in your kernel config.
+CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
+CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain
+kernel debugging options, such as Kernel Stack Meter or Kernel Tracer,
+may implicitly disable the NMI watchdog.]
+
+For x86-64, the needed APIC is always compiled in, and the NMI watchdog is
+always enabled with I/O-APIC mode (nmi_watchdog=1). Currently, local APIC
+mode (nmi_watchdog=2) does not work on x86-64.
+
+Using local APIC (nmi_watchdog=2) needs the first performance register, so
+you can't use it for other purposes (such as high precision performance
+profiling.) However, at least oprofile and the perfctr driver disable the
+local APIC NMI watchdog automatically.
+
+To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
+parameter. Eg. the relevant lilo.conf entry:
+
+ append="nmi_watchdog=1"
+
+For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1.
+For UP machines without an IO-APIC use nmi_watchdog=2, this only works
+for some processor types. If in doubt, boot with nmi_watchdog=1 and
+check the NMI count in /proc/interrupts; if the count is zero then
+reboot with nmi_watchdog=2 and check the NMI count. If it is still
+zero then log a problem, you probably have a processor that needs to be
+added to the nmi code.
+
+A 'lockup' is the following scenario: if any CPU in the system does not
+execute the period local timer interrupt for more than 5 seconds, then
+the NMI handler generates an oops and kills the process. This
+'controlled crash' (and the resulting kernel messages) can be used to
+debug the lockup. Thus whenever the lockup happens, wait 5 seconds and
+the oops will show up automatically. If the kernel produces no messages
+then the system has crashed so hard (eg. hardware-wise) that either it
+cannot even accept NMI interrupts, or the crash has made the kernel
+unable to print messages.
+
+Be aware that when using local APIC, the frequency of NMI interrupts
+it generates, depends on the system load. The local APIC NMI watchdog,
+lacking a better source, uses the "cycles unhalted" event. As you may
+guess it doesn't tick when the CPU is in the halted state (which happens
+when the system is idle), but if your system locks up on anything but the
+"hlt" processor instruction, the watchdog will trigger very soon as the
+"cycles unhalted" event will happen every clock tick. If it locks up on
+"hlt", then you are out of luck -- the event will not happen at all and the
+watchdog won't trigger. This is a shortcoming of the local APIC watchdog
+-- unfortunately there is no "clock ticks" event that would work all the
+time. The I/O APIC watchdog is driven externally and has no such shortcoming.
+But its NMI frequency is much higher, resulting in a more significant hit
+to the overall system performance.
+
+NOTE: starting with 2.4.2-ac18 the NMI-oopser is disabled by default,
+you have to enable it with a boot time parameter. Prior to 2.4.2-ac18
+the NMI-oopser is enabled unconditionally on x86 SMP boxes.
+
+On x86-64 the NMI oopser is on by default. On 64bit Intel CPUs
+it uses IO-APIC by default and on AMD it uses local APIC.
+
+[ feel free to send bug reports, suggestions and patches to
+ Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
+ list at <linux-smp@vger.kernel.org> ]
+