<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/arch/x86/kernel, branch v3.18.26</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v3.18.26</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v3.18.26'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2015-10-28T20:20:25+00:00</updated>
<entry>
<title>x86: Init per-cpu shadow copy of CR4 on 32-bit CPUs too</title>
<updated>2015-10-28T20:20:25+00:00</updated>
<author>
<name>Steven Rostedt</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2015-02-27T19:50:19+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=4534a1c432551e5588955a5eb8f6cb26c69b047f'/>
<id>urn:sha1:4534a1c432551e5588955a5eb8f6cb26c69b047f</id>
<content type='text'>
[ Upstream commit 5b2bdbc84556774afbe11bcfd24c2f6411cfa92b ]

Commit:

   1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4")

added a shadow CR4 such that reads and writes that do not
modify the CR4 execute much faster than always reading the
register itself.

The change modified cpu_init() in common.c, so that the
shadow CR4 gets initialized before anything uses it.

Unfortunately, there's two cpu_init()s in common.c. There's
one for 64-bit and one for 32-bit. The commit only added
the shadow init to the 64-bit path, but the 32-bit path
needs the init too.

Link: http://lkml.kernel.org/r/20150227125208.71c36402@gandalf.local.home Fixes: 1e02ce4cccdc "x86: Store a per-cpu shadow copy of CR4"
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Acked-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20150227145019.2bdd4354@gandalf.local.home
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/nmi/64: Fix a paravirt stack-clobbering bug in the NMI code</title>
<updated>2015-10-28T02:14:57+00:00</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2015-09-20T23:32:05+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=a70046a8f849fb1a98c5aee0c3504ddf61461eab'/>
<id>urn:sha1:a70046a8f849fb1a98c5aee0c3504ddf61461eab</id>
<content type='text'>
[ Upstream commit 83c133cf11fb0e68a51681447e372489f052d40e ]

The NMI entry code that switches to the normal kernel stack needs to
be very careful not to clobber any extra stack slots on the NMI
stack.  The code is fine under the assumption that SWAPGS is just a
normal instruction, but that assumption isn't really true.  Use
SWAPGS_UNSAFE_STACK instead.

This is part of a fix for some random crashes that Sasha saw.

Fixes: 9b6e6a8334d5 ("x86/nmi/64: Switch stacks on userspace NMI entry")
Reported-and-tested-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/974bc40edffdb5c2950a5c4977f821a446b76178.1442791737.git.luto@kernel.org
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/process: Add proper bound checks in 64bit get_wchan()</title>
<updated>2015-10-28T02:13:12+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2015-09-30T08:38:22+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=09be2e412c80f53854d251975aceac56a621bc29'/>
<id>urn:sha1:09be2e412c80f53854d251975aceac56a621bc29</id>
<content type='text'>
[ Upstream commit eddd3826a1a0190e5235703d1e666affa4d13b96 ]

Dmitry Vyukov reported the following using trinity and the memory
error detector AddressSanitizer
(https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel).

[ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on
address ffff88002e280000
[ 124.576801] ffff88002e280000 is located 131938492886538 bytes to
the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a)
[ 124.578633] Accessed by thread T10915:
[ 124.579295] inlined in describe_heap_address
./arch/x86/mm/asan/report.c:164
[ 124.579295] #0 ffffffff810dd277 in asan_report_error
./arch/x86/mm/asan/report.c:278
[ 124.580137] #1 ffffffff810dc6a0 in asan_check_region
./arch/x86/mm/asan/asan.c:37
[ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0
[ 124.581893] #3 ffffffff8107c093 in get_wchan
./arch/x86/kernel/process_64.c:444

The address checks in the 64bit implementation of get_wchan() are
wrong in several ways:

 - The lower bound of the stack is not the start of the stack
   page. It's the start of the stack page plus sizeof (struct
   thread_info)

 - The upper bound must be:

       top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long).

   The 2 * sizeof(unsigned long) is required because the stack pointer
   points at the frame pointer. The layout on the stack is: ... IP FP
   ... IP FP. So we need to make sure that both IP and FP are in the
   bounds.

Fix the bound checks and get rid of the mix of numeric constants, u64
and unsigned long. Making all unsigned long allows us to use the same
function for 32bit as well.

Use READ_ONCE() when accessing the stack. This does not prevent a
concurrent wakeup of the task and the stack changing, but at least it
avoids TOCTOU.

Also check task state at the end of the loop. Again that does not
prevent concurrent changes, but it avoids walking for nothing.

Add proper comments while at it.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Reported-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Based-on-patch-from: Wolfram Gloger &lt;wmglo@dent.med.uni-muenchen.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Borislav Petkov &lt;bp@alien8.de&gt;
Reviewed-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Andrey Ryabinin &lt;ryabinin.a.a@gmail.com&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Cc: Kostya Serebryany &lt;kcc@google.com&gt;
Cc: Alexander Potapenko &lt;glider@google.com&gt;
Cc: kasan-dev &lt;kasan-dev@googlegroups.com&gt;
Cc: Denys Vlasenko &lt;dvlasenk@redhat.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Wolfram Gloger &lt;wmglo@dent.med.uni-muenchen.de&gt;
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.de
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/asm/entry: Create and use a 'TOP_OF_KERNEL_STACK_PADDING' macro</title>
<updated>2015-10-28T02:13:11+00:00</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@amacapital.net</email>
</author>
<published>2015-03-10T18:05:58+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=6dbba2133ec370305c1c91785f91f415c26164d5'/>
<id>urn:sha1:6dbba2133ec370305c1c91785f91f415c26164d5</id>
<content type='text'>
[ Upstream commit 3ee4298f440c81638cbb5ec06f2497fb7a9a9eb4 ]

x86_32, unlike x86_64, pads the top of the kernel stack, because the
hardware stack frame formats are variable in size.

Document this padding and give it a name.

This should make no change whatsoever to the compiled kernel
image. It also doesn't fix any of the current bugs in this area.

Signed-off-by: Andy Lutomirski &lt;luto@amacapital.net&gt;
Acked-by: Denys Vlasenko &lt;dvlasenk@redhat.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: http://lkml.kernel.org/r/02bf2f54b8dcb76a62a142b6dfe07d4ef7fc582e.1426009661.git.luto@amacapital.net
[ Fixed small details, such as a missed magic constant in entry_32.S pointed out by Denys Vlasenko. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/kexec: Fix kexec crash in syscall kexec_file_load()</title>
<updated>2015-10-28T02:13:10+00:00</updated>
<author>
<name>Lee, Chun-Yi</name>
<email>joeyli.kernel@gmail.com</email>
</author>
<published>2015-09-29T12:58:57+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=9a2a1db52f2acfe671c8192d36b6b84b6ea5c9fa'/>
<id>urn:sha1:9a2a1db52f2acfe671c8192d36b6b84b6ea5c9fa</id>
<content type='text'>
[ Upstream commit e3c41e37b0f4b18cbd4dac76cbeece5a7558b909 ]

The original bug is a page fault crash that sometimes happens
on big machines when preparing ELF headers:

    BUG: unable to handle kernel paging request at ffffc90613fc9000
    IP: [&lt;ffffffff8103d645&gt;] prepare_elf64_ram_headers_callback+0x165/0x260

The bug is caused by us under-counting the number of memory ranges
and subsequently not allocating enough ELF header space for them.
The bug is typically masked on smaller systems, because the ELF header
allocation is rounded up to the next page.

This patch modifies the code in fill_up_crash_elf_data() by using
walk_system_ram_res() instead of walk_system_ram_range() to correctly
count the max number of crash memory ranges. That's because the
walk_system_ram_range() filters out small memory regions that
reside in the same page, but walk_system_ram_res() does not.

Here's how I found the bug:

After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
the code uses walk_system_ram_res() to fill-in crash memory regions information
to the program header, so it counts those small memory regions that
reside in a page area.

But, when the kernel was using walk_system_ram_range() in
fill_up_crash_elf_data() to count the number of crash memory regions,
it filters out small regions.

I printed those small memory regions, for example:

  kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0

Based on the code in walk_system_ram_range(), this memory region
will be filtered out:

  pfn = (0x77592258 + 0x1000 - 1) &gt;&gt; 12 = 0x77593
  end_pfn = (0x77592258 + 0xfc0 -1 + 1) &gt;&gt; 12 = 0x77593
  end_pfn - pfn = 0x77593 - 0x77593 = 0  &lt;=== if (end_pfn &gt; pfn) is FALSE

So, the max_nr_ranges that's counted by the kernel doesn't include
small memory regions - causing us to under-allocate the required space.
That causes the page fault crash that happens in a later code path
when preparing ELF headers.

This bug is not easy to reproduce on small machines that have few
CPUs, because the allocated page aligned ELF buffer has more free
space to cover those small memory regions' PT_LOAD headers.

Signed-off-by: Lee, Chun-Yi &lt;jlee@suse.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Jiang Liu &lt;jiang.liu@linux.intel.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Cc: Takashi Iwai &lt;tiwai@suse.de&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Viresh Kumar &lt;viresh.kumar@linaro.org&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Cc: kexec@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: &lt;stable@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/paravirt: Replace the paravirt nop with a bona fide empty function</title>
<updated>2015-10-28T02:13:08+00:00</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2015-09-20T23:32:04+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=7cb9685d817e262c8dce4e3bd369f243bc6961f8'/>
<id>urn:sha1:7cb9685d817e262c8dce4e3bd369f243bc6961f8</id>
<content type='text'>
[ Upstream commit fc57a7c68020dcf954428869eafd934c0ab1536f ]

PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an
example, trimmed for readability):

    ff 15 00 00 00 00       callq  *0x0(%rip)        # 2796 &lt;nmi+0x6&gt;
              2792: R_X86_64_PC32     pv_irq_ops+0x2c

That's a call through a function pointer to regular C function that
does nothing on native boots, but that function isn't protected
against kprobes, isn't marked notrace, and is certainly not
guaranteed to preserve any registers if the compiler is feeling
perverse.  This is bad news for a CLBR_NONE operation.

Of course, if everything works correctly, once paravirt ops are
patched, it gets nopped out, but what if we hit this code before
paravirt ops are patched in?  This can potentially cause breakage
that is very difficult to debug.

A more subtle failure is possible here, too: if _paravirt_nop uses
the stack at all (even just to push RBP), it will overwrite the "NMI
executing" variable if it's called in the NMI prologue.

The Xen case, perhaps surprisingly, is fine, because it's already
written in asm.

Fix all of the cases that default to paravirt_nop (including
adjust_exception_frame) with a big hammer: replace paravirt_nop with
an asm function that is just a ret instruction.

The Xen case may have other problems, so document them.

This is part of a fix for some random crashes that Sasha saw.

Reported-and-tested-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/platform: Fix Geode LX timekeeping in the generic x86 build</title>
<updated>2015-10-28T02:13:07+00:00</updated>
<author>
<name>David Woodhouse</name>
<email>dwmw2@infradead.org</email>
</author>
<published>2015-09-16T13:10:03+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=6ea48cdcf3c9054b03d2607ead2740d7a0376fa4'/>
<id>urn:sha1:6ea48cdcf3c9054b03d2607ead2740d7a0376fa4</id>
<content type='text'>
[ Upstream commit 03da3ff1cfcd7774c8780d2547ba0d995f7dc03d ]

In 2007, commit 07190a08eef36 ("Mark TSC on GeodeLX reliable")
bypassed verification of the TSC on Geode LX. However, this code
(now in the check_system_tsc_reliable() function in
arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was
set.

OpenWRT has recently started building its generic Geode target
for Geode GX, not LX, to include support for additional
platforms. This broke the timekeeping on LX-based devices,
because the TSC wasn't marked as reliable:
https://dev.openwrt.org/ticket/20531

By adding a runtime check on is_geode_lx(), we can also include
the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus
fixing the problem.

Signed-off-by: David Woodhouse &lt;David.Woodhouse@intel.com&gt;
Cc: Andres Salomon &lt;dilinger@queued.net&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Marcelo Tosatti &lt;marcelo@kvack.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/apic: Serialize LVTT and TSC_DEADLINE writes</title>
<updated>2015-10-28T02:13:06+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2015-07-30T23:24:43+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=51b366f06578c32bc97b7b35d827945cdae6f877'/>
<id>urn:sha1:51b366f06578c32bc97b7b35d827945cdae6f877</id>
<content type='text'>
[ Upstream commit 5d7c631d926b59aa16f3c56eaeb83f1036c81dc7 ]

The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an
MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not
guaranteed that the write to LVTT has reached the APIC before the
TSC_DEADLINE MSR is written. In such a case the write to the MSR is
ignored and as a consequence the local timer interrupt never fires.

The SDM decribes this issue for xAPIC and x2APIC modes. The
serialization methods recommended by the SDM differ.

xAPIC:
 "1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
  2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter.
  3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2.
  4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline."

x2APIC:
 "To allow for efficient access to the APIC registers in x2APIC mode,
  the serializing semantics of WRMSR are relaxed when writing to the
  APIC registers. Thus, system software should not use 'WRMSR to APIC
  registers in x2APIC mode' as a serializing instruction. Read and write
  accesses to the APIC registers will occur in program order. A WRMSR to
  an APIC register may complete before all preceding stores are globally
  visible; software can prevent this by inserting a serializing
  instruction, an SFENCE, or an MFENCE before the WRMSR."

The xAPIC method is to just wait for the memory mapped write to hit
the LVTT by checking whether the MSR write has reached the hardware.
There is no reason why a proper MFENCE after the memory mapped write would
not do the same. Andi Kleen confirmed that MFENCE is sufficient for the
xAPIC case as well.

Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done
unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE
support.

[ tglx: Massaged the changelog ]

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Reviewed-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: &lt;Kernel-team@fb.com&gt;
Cc: &lt;lenb@kernel.org&gt;
Cc: &lt;fenghua.yu@intel.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: stable@vger.kernel.org #v3.7+
Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection</title>
<updated>2015-10-27T13:33:07+00:00</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2015-07-15T17:29:38+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=4bc532d8428f6dd671c66f51ce5e459cc0ff1c86'/>
<id>urn:sha1:4bc532d8428f6dd671c66f51ce5e459cc0ff1c86</id>
<content type='text'>
[ Upstream commit 810bc075f78ff2c221536eb3008eac6a492dba2d ]

We have a tricky bug in the nested NMI code: if we see RSP
pointing to the NMI stack on NMI entry from kernel mode, we
assume that we are executing a nested NMI.

This isn't quite true.  A malicious userspace program can point
RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
happen while RSP is still pointing at the NMI stack.

Fix it with a sneaky trick.  Set DF in the region of code that
the RSP check is intended to detect.  IRET will clear DF
atomically.

( Note: other than paravirt, there's little need for all this
  complexity. We could check RIP instead of RSP. )

Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Reviewed-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Borislav Petkov &lt;bp@suse.de&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
<entry>
<title>x86/nmi/64: Reorder nested NMI checks</title>
<updated>2015-10-27T13:33:06+00:00</updated>
<author>
<name>Andy Lutomirski</name>
<email>luto@kernel.org</email>
</author>
<published>2015-07-15T17:29:37+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=eb0bad52bb1c7754607a49fdfdf1a57082dded64'/>
<id>urn:sha1:eb0bad52bb1c7754607a49fdfdf1a57082dded64</id>
<content type='text'>
[ Upstream commit a27507ca2d796cfa8d907de31ad730359c8a6d06 ]

Check the repeat_nmi .. end_repeat_nmi special case first.  The
next patch will rework the RSP check and, as a side effect, the
RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
we'll need this ordering of the checks.

Note: this is more subtle than it appears.  The check for
repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
instead of adjusting the "iret" frame to force a repeat.  This
is necessary, because the code between repeat_nmi and
end_repeat_nmi sets "NMI executing" and then writes to the
"iret" frame itself.  If a nested NMI comes in and modifies the
"iret" frame while repeat_nmi is also modifying it, we'll end up
with garbage.  The old code got this right, as does the new
code, but the new code is a bit more explicit.

If we were to move the check right after the "NMI executing"
check, then we'd get it wrong and have random crashes.

( Because the "NMI executing" check would jump to the code that would
  modify the "iret" frame without checking if the interrupted NMI was
  currently modifying it. )

Signed-off-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Reviewed-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Borislav Petkov &lt;bp@suse.de&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
</content>
</entry>
</feed>
