summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm
AgeCommit message (Collapse)Author
2018-11-02Merge tag 'powerpc-4.20-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Some things that I missed due to travel, or that came in late. Two fixes also going to stable: - A revert of a buggy change to the 8xx TLB miss handlers. - Our flushing of SPE (Signal Processing Engine) registers on fork was broken. Other changes: - A change to the KVM decrementer emulation to use proper APIs. - Some cleanups to the way we do code patching in the 8xx code. - Expose the maximum possible memory for the system in /proc/powerpc/lparcfg. - Merge some updates from Scott: "a couple device tree updates, and a fix for a missing prototype warning" A few other minor fixes and a handful of fixes for our selftests. Thanks to: Aravinda Prasad, Breno Leitao, Camelia Groza, Christophe Leroy, Felipe Rechia, Joel Stanley, Naveen N. Rao, Paul Mackerras, Scott Wood, Tyrel Datwyler" * tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (21 commits) selftests/powerpc: Fix compilation issue due to asm label selftests/powerpc/cache_shape: Fix out-of-tree build selftests/powerpc/switch_endian: Fix out-of-tree build selftests/powerpc/pmu: Link ebb tests with -no-pie selftests/powerpc/signal: Fix out-of-tree build selftests/powerpc/ptrace: Fix out-of-tree build powerpc/xmon: Relax frame size for clang selftests: powerpc: Fix warning for security subdir selftests/powerpc: Relax L1d miss targets for rfi_flush test powerpc/process: Fix flush_all_to_thread for SPE powerpc/pseries: add missing cpumask.h include file selftests/powerpc: Fix ptrace tm failure KVM: PPC: Use exported tb_to_ns() function in decrementer emulation powerpc/pseries: Export maximum memory value powerpc/8xx: Use patch_site for perf counters setup powerpc/8xx: Use patch_site for memory setup patching powerpc/code-patching: Add a helper to get the address of a patch_site Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP" powerpc/8xx: add missing header in 8xx_mmu.c powerpc/8xx: Add DT node for using the SEC engine of the MPC885 ...
2018-10-31memblock: stop using implicit alignment to SMP_CACHE_BYTESMike Rapoport
When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES. Implicit alignment is done deep in the memblock allocator and it can come as a surprise. Not that such an alignment would be wrong even when used incorrectly but it is better to be explicit for the sake of clarity and the prinicple of the least surprise. Replace all such uses of memblock APIs with the 'align' parameter explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment in the memblock internal allocation functions. For the case when memblock APIs are used via helper functions, e.g. like iommu_arena_new_node() in Alpha, the helper functions were detected with Coccinelle's help and then manually examined and updated where appropriate. The direct memblock APIs users were updated using the semantic patch below: @@ expression size, min_addr, max_addr, nid; @@ ( | - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc(size, 0) + memblock_alloc(size, SMP_CACHE_BYTES) | - memblock_alloc_raw(size, 0) + memblock_alloc_raw(size, SMP_CACHE_BYTES) | - memblock_alloc_from(size, 0, min_addr) + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_nopanic(size, 0) + memblock_alloc_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_low(size, 0) + memblock_alloc_low(size, SMP_CACHE_BYTES) | - memblock_alloc_low_nopanic(size, 0) + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_from_nopanic(size, 0, min_addr) + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_node(size, 0, nid) + memblock_alloc_node(size, SMP_CACHE_BYTES, nid) ) [mhocko@suse.com: changelog update] [akpm@linux-foundation.org: coding-style fixes] [rppt@linux.ibm.com: fix missed uses of implicit alignment] Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: rename free_all_bootmem to memblock_free_allMike Rapoport
The conversion is done using sed -i 's@free_all_bootmem@memblock_free_all@' \ $(git grep -l free_all_bootmem) Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: remove _virt from APIs returning virtual addressMike Rapoport
The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*Mike Rapoport
Make it explicit that the caller gets a physical address rather than a virtual one. This will also allow using meblock_alloc prefix for memblock allocations returning virtual address, which is done in the following patches. The conversion is done using the following semantic patch: @@ expression e1, e2, e3; @@ ( - memblock_alloc(e1, e2) + memblock_phys_alloc(e1, e2) | - memblock_alloc_nid(e1, e2, e3) + memblock_phys_alloc_nid(e1, e2, e3) | - memblock_alloc_try_nid(e1, e2, e3) + memblock_phys_alloc_try_nid(e1, e2, e3) ) Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-29Merge branch 'next' of ↵Michael Ellerman
https://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next Updates from Scott: "This contains a couple device tree updates, and a fix for a missing prototype warning."
2018-10-26Merge tag 'powerpc-4.20-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - A large series to rewrite our SLB miss handling, replacing a lot of fairly complicated asm with much fewer lines of C. - Following on from that, we now maintain a cache of SLB entries for each process and preload them on context switch. Leading to a 27% speedup for our context switch benchmark on Power9. - Improvements to our handling of SLB multi-hit errors. We now print more debug information when they occur, and try to continue running by flushing the SLB and reloading, rather than treating them as fatal. - Enable THP migration on 64-bit Book3S machines (eg. Power7/8/9). - Add support for physical memory up to 2PB in the linear mapping on 64-bit Book3S. We only support up to 512TB as regular system memory, otherwise the percpu allocator runs out of vmalloc space. - Add stack protector support for 32 and 64-bit, with a per-task canary. - Add support for PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP. - Support recognising "big cores" on Power9, where two SMT4 cores are presented to us as a single SMT8 core. - A large series to cleanup some of our ioremap handling and PTE flags. - Add a driver for the PAPR SCM (storage class memory) interface, allowing guests to operate on SCM devices (acked by Dan). - Changes to our ftrace code to handle very large kernels, where we need to use a trampoline to get to ftrace_caller(). And many other smaller enhancements and cleanups. Thanks to: Alan Modra, Alistair Popple, Aneesh Kumar K.V, Anton Blanchard, Aravinda Prasad, Bartlomiej Zolnierkiewicz, Benjamin Herrenschmidt, Breno Leitao, Cédric Le Goater, Christophe Leroy, Christophe Lombard, Dan Carpenter, Daniel Axtens, Finn Thain, Gautham R. Shenoy, Gustavo Romero, Haren Myneni, Hari Bathini, Jia Hongtao, Joel Stanley, John Allen, Laurent Dufour, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Hairgrove, Masahiro Yamada, Michael Bringmann, Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras, Petr Vorel, Rashmica Gupta, Reza Arbab, Rob Herring, Sam Bobroff, Samuel Mendoza-Jonas, Scott Wood, Stan Johnson, Stephen Rothwell, Stewart Smith, Suraj Jitindar Singh, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, YueHaibing, zhong jiang" * tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (221 commits) Revert "selftests/powerpc: Fix out-of-tree build errors" powerpc/msi: Fix compile error on mpc83xx powerpc: Fix stack protector crashes on CPU hotplug powerpc/traps: restore recoverability of machine_check interrupts powerpc/64/module: REL32 relocation range check powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmd selftests/powerpc: Add a test of wild bctr powerpc/mm: Fix page table dump to work on Radix powerpc/mm/radix: Display if mappings are exec or not powerpc/mm/radix: Simplify split mapping logic powerpc/mm/radix: Remove the retry in the split mapping logic powerpc/mm/radix: Fix small page at boundary when splitting powerpc/mm/radix: Fix overuse of small pages in splitting logic powerpc/mm/radix: Fix off-by-one in split mapping logic powerpc/ftrace: Handle large kernel configs powerpc/mm: Fix WARN_ON with THP NUMA migration selftests/powerpc: Fix out-of-tree build errors powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64 powerpc/time: isolate scaled cputime accounting in dedicated functions. ...
2018-10-26powerpc/8xx: Use patch_site for memory setup patchingChristophe Leroy
The 8xx TLB miss routines are patched at startup at several places. This patch uses the new patch_site functionality in order to get a better code readability and avoid a label mess when dumping the code with 'objdump -d' Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-26Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"Christophe Leroy
This reverts commit 4f94b2c7462d9720b2afa7e8e8d4c19446bb31ce. That commit was buggy, as it used rlwinm instead of rlwimi. Instead of fixing that bug, we revert the previous commit in order to reduce the dependency between L1 entries and L2 entries Fixes: 4f94b2c7462d9 ("powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-25Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Radim Krčmář: "ARM: - Improved guest IPA space support (32 to 52 bits) - RAS event delivery for 32bit - PMU fixes - Guest entry hardening - Various cleanups - Port of dirty_log_test selftest PPC: - Nested HV KVM support for radix guests on POWER9. The performance is much better than with PR KVM. Migration and arbitrary level of nesting is supported. - Disable nested HV-KVM on early POWER9 chips that need a particular hardware bug workaround - One VM per core mode to prevent potential data leaks - PCI pass-through optimization - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base s390: - Initial version of AP crypto virtualization via vfio-mdev - Improvement for vfio-ap - Set the host program identifier - Optimize page table locking x86: - Enable nested virtualization by default - Implement Hyper-V IPI hypercalls - Improve #PF and #DB handling - Allow guests to use Enlightened VMCS - Add migration selftests for VMCS and Enlightened VMCS - Allow coalesced PIO accesses - Add an option to perform nested VMCS host state consistency check through hardware - Automatic tuning of lapic_timer_advance_ns - Many fixes, minor improvements, and cleanups" * tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits) KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned Revert "kvm: x86: optimize dr6 restore" KVM: PPC: Optimize clearing TCEs for sparse tables x86/kvm/nVMX: tweak shadow fields selftests/kvm: add missing executables to .gitignore KVM: arm64: Safety check PSTATE when entering guest and handle IL KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips arm/arm64: KVM: Enable 32 bits kvm vcpu events support arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension() KVM: arm64: Fix caching of host MDCR_EL2 value KVM: VMX: enable nested virtualization by default KVM/x86: Use 32bit xor to clear registers in svm.c kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD kvm: vmx: Defer setting of DR6 until #DB delivery kvm: x86: Defer setting of CR2 until #PF delivery kvm: x86: Add payload operands to kvm_multiple_exception kvm: x86: Add exception payload fields to kvm_vcpu_events kvm: x86: Add has_payload and payload to kvm_queued_exception KVM: Documentation: Fix omission in struct kvm_vcpu_events KVM: selftests: add Enlightened VMCS test ...
2018-10-24Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo updates from Eric Biederman: "I have been slowly sorting out siginfo and this is the culmination of that work. The primary result is in several ways the signal infrastructure has been made less error prone. The code has been updated so that manually specifying SEND_SIG_FORCED is never necessary. The conversion to the new siginfo sending functions is now complete, which makes it difficult to send a signal without filling in the proper siginfo fields. At the tail end of the patchset comes the optimization of decreasing the size of struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit. The fundamental observation that enables this is by definition none of the known ways to use struct siginfo uses the extra bytes. This comes at the cost of a small user space observable difference. For the rare case of siginfo being injected into the kernel only what can be copied into kernel_siginfo is delivered to the destination, the rest of the bytes are set to 0. For cases where the signal and the si_code are known this is safe, because we know those bytes are not used. For cases where the signal and si_code combination is unknown the bits that won't fit into struct kernel_siginfo are tested to verify they are zero, and the send fails if they are not. I made an extensive search through userspace code and I could not find anything that would break because of the above change. If it turns out I did break something it will take just the revert of a single change to restore kernel_siginfo to the same size as userspace siginfo. Testing did reveal dependencies on preferring the signo passed to sigqueueinfo over si->signo, so bit the bullet and added the complexity necessary to handle that case. Testing also revealed bad things can happen if a negative signal number is passed into the system calls. Something no sane application will do but something a malicious program or a fuzzer might do. So I have fixed the code that performs the bounds checks to ensure negative signal numbers are handled" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits) signal: Guard against negative signal numbers in copy_siginfo_from_user32 signal: Guard against negative signal numbers in copy_siginfo_from_user signal: In sigqueueinfo prefer sig not si_signo signal: Use a smaller struct siginfo in the kernel signal: Distinguish between kernel_siginfo and siginfo signal: Introduce copy_siginfo_from_user and use it's return value signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE signal: Fail sigqueueinfo if si_signo != sig signal/sparc: Move EMT_TAGOVF into the generic siginfo.h signal/unicore32: Use force_sig_fault where appropriate signal/unicore32: Generate siginfo in ucs32_notify_die signal/unicore32: Use send_sig_fault where appropriate signal/arc: Use force_sig_fault where appropriate signal/arc: Push siginfo generation into unhandled_exception signal/ia64: Use force_sig_fault where appropriate signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn signal/ia64: Use the generic force_sigsegv in setup_frame signal/arm/kvm: Use send_sig_mceerr signal/arm: Use send_sig_fault where appropriate signal/arm: Use force_sig_fault where appropriate ...
2018-10-22powerpc/8xx: add missing header in 8xx_mmu.cChristophe Leroy
arch/powerpc/mm/8xx_mmu.c:174:6: error: no previous prototype for ‘set_context’ [-Werror=missing-prototypes] void set_context(unsigned long id, pgd_t *pgd) Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Scott Wood <oss@buserror.net>
2018-10-20powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmdNicholas Piggin
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm: Fix page table dump to work on RadixMichael Ellerman
When we're running on Book3S with the Radix MMU enabled the page table dump currently prints the wrong addresses because it uses the wrong start address. Fix it to use PAGE_OFFSET rather than KERN_VIRT_START. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm/radix: Display if mappings are exec or notMichael Ellerman
At boot we print the ranges we've mapped for the linear mapping and what page size we've used. Also track whether the range is mapped executable or not and display that as well. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm/radix: Simplify split mapping logicMichael Ellerman
If we look closely at the logic in create_physical_mapping(), when we're doing STRICT_KERNEL_RWX, we do the following steps: - determine the gap from where we are to the end of the range - choose an appropriate mapping_size based on the gap - check if that mapping_size would overlap the __init_begin boundary, and if not choose an appropriate mapping_size We can simplify the logic by taking the __init_begin boundary into account when we calculate the initial gap. So add a next_boundary() function which tells us what the next boundary is, either the __init_begin boundary or end. In future we can add more boundaries. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm/radix: Remove the retry in the split mapping logicMichael Ellerman
When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the linear mapping at the text/data boundary so we can map the kernel text read only. The current logic uses a goto inside the for loop, which works, but is hard to reason about. When we hit the goto retry case we set max_mapping_size to PMD_SIZE and go back to the start. Setting max_mapping_size means we skip the PUD case and go to the PMD case. We know we will pass the alignment and gap checks because the only reason we are there is we hit the goto retry, and that is guarded by mapping_size == PUD_SIZE, which means addr is PUD aligned and gap is greater or equal to PUD_SIZE. So the only part of the check that can fail is the mmu_psize_defs check for the 2M page size. If we just duplicate that check we can avoid the goto, and we get the same result. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm/radix: Fix small page at boundary when splittingMichael Ellerman
When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the linear mapping at the text/data boundary so we can map the kernel text read only. Currently we always use a small page at the text/data boundary, even when that's not necessary: Mapped 0x0000000000000000-0x0000000000e00000 with 2.00 MiB pages Mapped 0x0000000000e00000-0x0000000001000000 with 64.0 KiB pages Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages This is because the check that the mapping crosses the __init_begin boundary is too strict, it also returns true when we map exactly up to the boundary. So fix it to check that the mapping would actually map past __init_begin, and with that we see: Mapped 0x0000000000000000-0x0000000040000000 with 2.00 MiB pages Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm/radix: Fix overuse of small pages in splitting logicMichael Ellerman
When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the linear mapping at the text/data boundary so we can map the kernel text read only. But the current logic uses small pages for the entire text section, regardless of whether a larger page size would fit. eg. with the boundary at 16M we could use 2M pages, but instead we use 64K pages up to the 16M boundary: Mapped 0x0000000000000000-0x0000000001000000 with 64.0 KiB pages Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages This is because the test is checking if addr is < __init_begin and addr + mapping_size is >= _stext. But that is true for all pages between _stext and __init_begin. Instead what we want to check is if we are crossing the text/data boundary, which is at __init_begin. With that fixed we see: Mapped 0x0000000000000000-0x0000000000e00000 with 2.00 MiB pages Mapped 0x0000000000e00000-0x0000000001000000 with 64.0 KiB pages Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages ie. we're correctly using 2MB pages below __init_begin, but we still drop down to 64K pages unnecessarily at the boundary. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm/radix: Fix off-by-one in split mapping logicMichael Ellerman
When we have CONFIG_STRICT_KERNEL_RWX enabled, we try to split the kernel linear (1:1) mapping so that the kernel text is in a separate page to kernel data, so we can mark the former read-only. We could achieve that just by always using 64K pages for the linear mapping, but we try to be smarter. Instead we use huge pages when possible, and only switch to smaller pages when necessary. However we have an off-by-one bug in that logic, which causes us to calculate the wrong boundary between text and data. For example with the end of the kernel text at 16M we see: radix-mmu: Mapped 0x0000000000000000-0x0000000001200000 with 64.0 KiB pages radix-mmu: Mapped 0x0000000001200000-0x0000000040000000 with 2.00 MiB pages radix-mmu: Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages ie. we mapped from 0 to 18M with 64K pages, even though the boundary between text and data is at 16M. With the fix we see we're correctly hitting the 16M boundary: radix-mmu: Mapped 0x0000000000000000-0x0000000001000000 with 64.0 KiB pages radix-mmu: Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages radix-mmu: Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm: Fix WARN_ON with THP NUMA migrationAneesh Kumar K.V
WARNING: CPU: 12 PID: 4322 at /arch/powerpc/mm/pgtable-book3s64.c:76 set_pmd_at+0x4c/0x2b0 Modules linked in: CPU: 12 PID: 4322 Comm: qemu-system-ppc Tainted: G W 4.19.0-rc3-00758-g8f0c636b0542 #36 NIP: c0000000000872fc LR: c000000000484eec CTR: 0000000000000000 REGS: c000003fba876fe0 TRAP: 0700 Tainted: G W (4.19.0-rc3-00758-g8f0c636b0542) MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24282884 XER: 00000000 CFAR: c000000000484ee8 IRQMASK: 0 GPR00: c000000000484eec c000003fba877268 c000000001f0ec00 c000003fbd229f80 GPR04: 00007c8fe8e00000 c000003f864c5a38 860300853e0000c0 0000000000000080 GPR08: 0000000080000000 0000000000000001 0401000000000080 0000000000000001 GPR12: 0000000000002000 c000003fffff5400 c000003fce292000 00007c9024570000 GPR16: 0000000000000000 0000000000ffffff 0000000000000001 c000000001885950 GPR20: 0000000000000000 001ffffc0004807c 0000000000000008 c000000001f49d05 GPR24: 00007c8fe8e00000 c0000000020f2468 ffffffffffffffff c000003fcd33b090 GPR28: 00007c8fe8e00000 c000003fbd229f80 c000003f864c5a38 860300853e0000c0 NIP [c0000000000872fc] set_pmd_at+0x4c/0x2b0 LR [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20 Call Trace: [c000003fba877268] [c00000000045931c] mpol_misplaced+0x1bc/0x230 (unreliable) [c000003fba8772c8] [c000000000484eec] do_huge_pmd_numa_page+0xb1c/0xc20 [c000003fba877398] [c00000000040d344] __handle_mm_fault+0x5e4/0x2300 [c000003fba8774d8] [c00000000040f400] handle_mm_fault+0x3a0/0x420 [c000003fba877528] [c0000000003ff6f4] __get_user_pages+0x2e4/0x560 [c000003fba877628] [c000000000400314] get_user_pages_unlocked+0x104/0x2a0 [c000003fba8776c8] [c000000000118f44] __gfn_to_pfn_memslot+0x284/0x6a0 [c000003fba877748] [c0000000001463a0] kvmppc_book3s_radix_page_fault+0x360/0x12d0 [c000003fba877838] [c000000000142228] kvmppc_book3s_hv_page_fault+0x48/0x1300 [c000003fba877988] [c00000000013dc08] kvmppc_vcpu_run_hv+0x1808/0x1b50 [c000003fba877af8] [c000000000126b44] kvmppc_vcpu_run+0x34/0x50 [c000003fba877b18] [c000000000123268] kvm_arch_vcpu_ioctl_run+0x288/0x2d0 [c000003fba877b98] [c00000000011253c] kvm_vcpu_ioctl+0x1fc/0x8c0 [c000003fba877d08] [c0000000004e9b24] do_vfs_ioctl+0xa44/0xae0 [c000003fba877db8] [c0000000004e9c44] ksys_ioctl+0x84/0xf0 [c000003fba877e08] [c0000000004e9cd8] sys_ioctl+0x28/0x80 We removed the pte_protnone check earlier with the understanding that we mark the pte invalid before the set_pte/set_pmd usage. But the huge pmd autonuma still use the set_pmd_at directly. This is ok because a protnone pte won't have translation cache in TLB. Fixes: da7ad366b497 ("powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm: fix always true/false warning in slice.cChristophe Leroy
This patch fixes the following warnings (obtained with make W=1). arch/powerpc/mm/slice.c: In function 'slice_range_to_mask': arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (start < SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c:81:20: error: comparison is always false due to limited range of data type [-Werror=type-limits] if ((start + len) > SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c: In function 'slice_mask_for_free': arch/powerpc/mm/slice.c:136:17: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (high_limit <= SLICE_LOW_TOP) ^ arch/powerpc/mm/slice.c: In function 'slice_check_range_fits': arch/powerpc/mm/slice.c:185:12: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (start < SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c:195:39: error: comparison is always false due to limited range of data type [-Werror=type-limits] if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) { ^ arch/powerpc/mm/slice.c: In function 'slice_scan_available': arch/powerpc/mm/slice.c:306:11: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (addr < SLICE_LOW_TOP) { ^ arch/powerpc/mm/slice.c: In function 'get_slice_psize': arch/powerpc/mm/slice.c:709:11: error: comparison is always true due to limited range of data type [-Werror=type-limits] if (addr < SLICE_LOW_TOP) { ^ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm: fix missing prototypes in slice.cChristophe Leroy
This patch fixes the following warnings (obtained with make W=1). arch/powerpc/mm/slice.c: At top level: arch/powerpc/mm/slice.c:682:15: error: no previous prototype for 'arch_get_unmapped_area' [-Werror=missing-prototypes] unsigned long arch_get_unmapped_area(struct file *filp, ^ arch/powerpc/mm/slice.c:692:15: error: no previous prototype for 'arch_get_unmapped_area_topdown' [-Werror=missing-prototypes] unsigned long arch_get_unmapped_area_topdown(struct file *filp, ^ Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm: Trace tlbia instructionChristophe Leroy
Add a trace point for tlbia (Translation Lookaside Buffer Invalidate All) instruction. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/mm: Add missing tracepoint for tlbieChristophe Leroy
commit 0428491cba927 ("powerpc/mm: Trace tlbie(l) instructions") added tracepoints for tlbie calls, but _tlbil_va() was forgotten Fixes: 0428491cba927 ("powerpc/mm: Trace tlbie(l) instructions") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20powerpc/book3s64: fix dump_linuxpagetables "present" flagChristophe Leroy
Since commit bd0dbb73e013 ("powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid."), _PAGE_PRESENT doesn't mean exactly that a page is present. A page is also considered preset when _PAGE_INVALID is set. This patch changes the meaning of "present" and adds a status "valid" associated to the _PAGE_PRESENT flag. Fixes: bd0dbb73e013 ("powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-19powerpc: Add -Werror at arch/powerpc levelMichael Ellerman
Back when I added -Werror in commit ba55bd74360e ("powerpc: Add configurable -Werror for arch/powerpc") I did it by adding it to most of the arch Makefiles. At the time we excluded math-emu, because apparently it didn't build cleanly. But that seems to have been fixed somewhere in the interim. So move the -Werror addition to the top-level of the arch, this saves us from repeating it in every Makefile and means we won't forget to add it to any new sub-dirs. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm: Increase the max addressable memory to 2PBAneesh Kumar K.V
Currently we limit the max addressable memory to 128TB. This patch increase the limit to 2PB. We can have devices like nvdimm which adds memory above 512TB limit. We still don't support regular system ram above 512TB. One of the challenge with that is the percpu allocator, that allocates per node memory and use the max distance between them as the percpu offsets. This means with large gap in address space ( system ram above 1PB) we will run out of vmalloc space to map the percpu allocation. In order to support addressable memory above 512TB, kernel should be able to linear map this range. To do that with hash translation we now add 4 context to kernel linear map region. Our per context addressable range is 512TB. We still keep VMALLOC and VMEMMAP region to old size. SLB miss handlers is updated to validate these limit. We also limit this update to SPARSEMEM_VMEMMAP and SPARSEMEM_EXTREME Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm/hash: Rename get_ea_context to get_user_contextAneesh Kumar K.V
We will be adding get_kernel_context later. Update function name to indicate this handle context allocation user space address. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/64s/hash: Add some SLB debugging testsNicholas Piggin
This adds CONFIG_DEBUG_VM checks to ensure: - The kernel stack is in the SLB after it's flushed and bolted. - We don't insert an SLB for an address that is aleady in the SLB. - The kernel SLB miss handler does not take an SLB miss. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/64s/hash: Simplify slb_flush_and_rebolt()Nicholas Piggin
slb_flush_and_rebolt() is misleading, it is called in virtual mode, so it can not possibly change the stack, so it should not be touching the shadow area. And since vmalloc is no longer bolted, it should not change any bolted mappings at all. Change the name to slb_flush_and_restore_bolted(), and have it just load the kernel stack from what's currently in the shadow SLB area. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/64s/hash: Add a SLB preload cacheNicholas Piggin
When switching processes, currently all user SLBEs are cleared, and a few (exec_base, pc, and stack) are preloaded. In trivial testing with small apps, this tends to miss the heap and low 256MB segments, and it will also miss commonly accessed segments on large memory workloads. Add a simple round-robin preload cache that just inserts the last SLB miss into the head of the cache and preloads those at context switch time. Every 256 context switches, the oldest entry is removed from the cache to shrink the cache and require fewer slbmte if they are unused. Much more could go into this, including into the SLB entry reclaim side to track some LRU information etc, which would require a study of large memory workloads. But this is a simple thing we can do now that is an obvious win for common workloads. With the full series, process switching speed on the context_switch benchmark on POWER9/hash (with kernel speculation security masures disabled) increases from 140K/s to 178K/s (27%). POWER8 does not change much (within 1%), it's unclear why it does not see a big gain like POWER9. Booting to busybox init with 256MB segments has SLB misses go down from 945 to 69, and with 1T segments 900 to 21. These could almost all be eliminated by preloading a bit more carefully with ELF binary loading. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/64s/hash: Provide arch_setup_exec() hooks for hash slice setupNicholas Piggin
This will be used by the SLB code in the next patch, but for now this sets the slb_addr_limit to the correct size for 32-bit tasks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/64s/hash: Add SLB allocation status bitmapsNicholas Piggin
Add 32-entry bitmaps to track the allocation status of the first 32 SLB entries, and whether they are user or kernel entries. These are used to allocate free SLB entries first, before resorting to the round robin allocator. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/64s/hash: Convert SLB miss handlers to CNicholas Piggin
This patch moves SLB miss handlers completely to C, using the standard exception handler macros to set up the stack and branch to C. This can be done because the segment containing the kernel stack is always bolted, so accessing it with relocation on will not cause an SLB exception. Arbitrary kernel memory must not be accessed when handling kernel space SLB misses, so care should be taken there. However user SLB misses can access any kernel memory, which can be used to move some fields out of the paca (in later patches). User SLB misses could quite easily reconcile IRQs and set up a first class kernel environment and exit via ret_from_except, however that doesn't seem to be necessary at the moment, so we only do that if a bad fault is encountered. [ Credit to Aneesh for bug fixes, error checks, and improvements to bad address handling, etc ] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Disallow tracing for all of slb.c for now.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/8xx: change name of a few page flags to avoid confusionChristophe Leroy
_PAGE_PRIVILEGED corresponds to the SH bit which doesn't protect against user access but only disables ASID verification on kernel accesses. User access is controlled with _PMD_USER flag. Name it _PAGE_SH instead of _PAGE_PRIVILEGED _PAGE_HUGE corresponds to the SPS bit which doesn't really tells that's it is a huge page but only that it is not a 4k page. Name it _PAGE_SPS instead of _PAGE_HUGE Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm: Split dump_pagelinuxtables flag_array tableChristophe Leroy
To reduce the complexity of flag_array, and allow the removal of default 0 value of non existing flags, lets have one flag_array table for each platform family with only the really existing flags. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm: use pte helpers in generic codeChristophe Leroy
Get rid of platform specific _PAGE_XXXX in powerpc common code and use helpers instead. mm/dump_linuxpagetables.c will be handled separately Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm: don't use _PAGE_EXEC for calling hash_preload()Christophe Leroy
The 'access' parameter of hash_preload() is either 0 or _PAGE_EXEC. Among the two versions of hash_preload(), only the PPC64 one is doing something with this 'access' parameter. In order to remove the use of _PAGE_EXEC outside platform code, 'access' parameter is replaced by 'is_exec' which will be either true of false, and the PPC64 version of hash_preload() creates the access flag based on 'is_exec'. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm: don't use _PAGE_EXEC in book3s/32Christophe Leroy
book3s/32 doesn't define _PAGE_EXEC, so no need to use it. All other platforms define _PAGE_EXEC so no need to check it is not NUL when not book3s/32. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc: handover page flags with a pgprot_t parameterChristophe Leroy
In order to avoid multiple conversions, handover directly a pgprot_t to map_kernel_page() as already done for radix. Do the same for __ioremap_caller() and __ioremap_at(). Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm: properly set PAGE_KERNEL flags in ioremap()Christophe Leroy
Set PAGE_KERNEL directly in the caller and do not rely on a hack adding PAGE_KERNEL flags when _PAGE_PRESENT is not set. As already done for PPC64, use pgprot_cache() helpers instead of _PAGE_XXX flags in PPC32 ioremap() derived functions. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-13powerpc/32: Add ioremap_wt() and ioremap_coherent()Christophe Leroy
Other arches have ioremap_wt() to map IO areas write-through. Implement it on PPC as well in order to avoid drivers using __ioremap(_PAGE_WRITETHRU) Also implement ioremap_coherent() to avoid drivers using __ioremap(_PAGE_COHERENT) Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-13powerpc/pseries/mobility: Extend start/stop topology update scopeMichael Bringmann
The powerpc mobility code may receive RTAS requests to perform PRRN (Platform Resource Reassignment Notification) topology changes at any time, including during LPAR migration operations. In some configurations where the affinity of CPUs or memory is being changed on that platform, the PRRN requests may apply or refer to outdated information prior to the complete update of the device-tree. This patch changes the duration for which topology updates are suppressed during LPAR migrations from just the rtas_ibm_suspend_me() / 'ibm,suspend-me' call(s) to cover the entire migration_store() operation to allow all changes to the device-tree to be applied prior to accepting and applying any PRRN requests. For tracking purposes, pr_info notices are added to the functions start_topology_update() and stop_topology_update() of 'numa.c'. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09Merge branch 'fixes' into nextMichael Ellerman
Merge our fixes branch. It has a few important fixes that are needed for futher testing and also some commits that will conflict with content in next.
2018-10-09KVM: PPC: Book3S HV: Handle page fault for a nested guestSuraj Jitindar Singh
Consider a normal (L1) guest running under the main hypervisor (L0), and then a nested guest (L2) running under the L1 guest which is acting as a nested hypervisor. L0 has page tables to map the address space for L1 providing the translation from L1 real address -> L0 real address; L1 | | (L1 -> L0) | ----> L0 There are also page tables in L1 used to map the address space for L2 providing the translation from L2 real address -> L1 read address. Since the hardware can only walk a single level of page table, we need to maintain in L0 a "shadow_pgtable" for L2 which provides the translation from L2 real address -> L0 real address. Which looks like; L2 L2 | | | (L2 -> L1) | | | ----> L1 | (L2 -> L0) | | | (L1 -> L0) | | | ----> L0 --------> L0 When a page fault occurs while running a nested (L2) guest we need to insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To do this we need to: 1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and provide a page fault to L1 if this mapping doesn't exist. 2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address, or try to insert a pte for that mapping if it doesn't exist. 3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable Once this mapping exists we can take rc faults when hardware is unable to automatically set the reference and change bits in the pte. On these we need to: 1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect the fault down to L1. 2. Set the rc bits in the L1 -> L0 pte which corresponds to the same host page. 3. Set the rc bits in the L2 -> L0 pte. As we reuse a large number of functions in book3s_64_mmu_radix.c for this we also needed to refactor a number of these functions to take an lpid parameter so that the correct lpid is used for tlb invalidations. The functionality however has remained the same. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-07Merge tag 'powerpc-4.19-4' of ↵Greg Kroah-Hartman
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Michael writes: "powerpc fixes for 4.19 #4 Four regression fixes. A fix for a change to lib/xz which broke our zImage loader when building with XZ compression. OK'ed by Herbert who merged the original patch. The recent fix we did to avoid patching __init text broke some 32-bit machines, fix that. Our show_user_instructions() could be tricked into printing kernel memory, add a check to avoid that. And a fix for a change to our NUMA initialisation logic, which causes crashes in some kdump configurations. Thanks to: Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis Roos, Murilo Opsfelder Araujo, Srikar Dronamraju." * tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/numa: Skip onlining a offline node in kdump path powerpc: Don't print kernel instructions in show_user_instructions() powerpc/lib: fix book3s/32 boot failure due to code patching lib/xz: Put CRC32_POLY_LE in xz_private.h
2018-10-05powerpc/numa: Skip onlining a offline node in kdump pathSrikar Dronamraju
With commit 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot"), kdump kernel on shared LPAR may crash. The necessary conditions are - Shared LPAR with at least 2 nodes having memory and CPUs. - Memory requirement for kdump kernel must be met by the first N-1 nodes where there are at least N nodes with memory and CPUs. Example numactl of such a machine. $ numactl -H available: 5 nodes (0,2,5-7) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: node 2 size: 255 MB node 2 free: 189 MB node 5 cpus: 24 25 26 27 28 29 30 31 node 5 size: 4095 MB node 5 free: 4024 MB node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 node 6 size: 6353 MB node 6 free: 5998 MB node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 node 7 size: 7640 MB node 7 free: 7164 MB node distances: node 0 2 5 6 7 0: 10 40 40 40 40 2: 40 10 40 40 40 5: 40 40 10 40 40 6: 40 40 40 10 20 7: 40 40 40 20 10 Steps to reproduce. 1. Load / start kdump service. 2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger) When booting a kdump kernel with 2048M: kexec: Starting switchover sequence. I'm in purgatory Using 1TB segments hash-mmu: Initializing hash mmu with SLB Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018 Found initrd at 0xc000000009e70000:0xc00000000ae554b4 Using pSeries machine description ----------------------------------------------------- ppc64_pft_size = 0x1e phys_mem_size = 0x88000000 dcache_bsize = 0x80 icache_bsize = 0x80 cpu_features = 0x000000ff8f5d91a7 possible = 0x0000fbffcf5fb1a7 always = 0x0000006f8b5c91a1 cpu_user_features = 0xdc0065c2 0xef000000 mmu_features = 0x7c006001 firmware_features = 0x00000007c45bfc57 htab_hash_mask = 0x7fffff physical_start = 0x8000000 ----------------------------------------------------- numa: NODE_DATA [mem 0x87d5e300-0x87d67fff] numa: NODE_DATA(0) on node 6 numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff] Top of RAM: 0x88000000, Total RAM: 0x88000000 Memory hole size: 0MB Zone ranges: DMA [mem 0x0000000000000000-0x0000000087ffffff] DMA32 empty Normal empty Movable zone start for each node Early memory node ranges node 6: [mem 0x0000000000000000-0x0000000087ffffff] Could not find start_pfn for node 0 Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000] On node 0 totalpages: 0 Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff] On node 6 totalpages: 34816 Unable to handle kernel paging request for data at address 0x00000060 Faulting instruction address: 0xc000000008703a54 Oops: Kernel access of bad area, sig: 11 [#1] LE SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1 NIP: c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000 REGS: c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+) MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24022022 XER: 20000002 CFAR: c0000000086fc238 IRQMASK: 0 GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000 GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220 GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008 GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000 GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008 GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78 GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400 NIP [c000000008703a54] bus_add_device+0x84/0x1e0 LR [c000000008703a38] bus_add_device+0x68/0x1e0 Call Trace: [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable) [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0 [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240 [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180 [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90 [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190 [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580 [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108 [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450 [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160 [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80 Instruction dump: 60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378 ---[ end trace 593577668c2daa65 ]--- However a regular kernel with 4096M (2048 gets reserved for crash kernel) boots properly. Unlike regular kernels, which mark all available nodes as online, kdump kernel only marks just enough nodes as online and marks the rest as offline at boot. However kdump kernel boots with all available CPUs. With Commit 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot"), all CPUs are onlined on their respective nodes at boot time. try_online_node() tries to online the offline nodes but fails as all needed subsystems are not yet initialized. As part of fix, detect and skip early onlining of a offline node. Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot") Reported-by: Pavithra Prakash <pavrampu@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidationNicholas Piggin
Local radix TLB flush operations that operate on congruence classes have explicit ERAT flushes for POWER9. The process scoped LPID flush did not have a flush, so add it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>