summaryrefslogtreecommitdiff
path: root/arch/s390/mm/pgtable.c
AgeCommit message (Collapse)Author
2013-08-22s390/mm: introduce ptep_flush_lazy helperMartin Schwidefsky
Isolate the logic of IDTE vs. IPTE flushing of ptes in two functions, ptep_flush_lazy and __tlb_flush_mm_lazy. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-08-22s390/mm: cleanup page table definitionsMartin Schwidefsky
Improve the encoding of the different pte types and the naming of the page, segment table and region table bits. Due to the different pte encoding the hugetlbfs primitives need to be adapted as well. To improve compatability with common code make the huge ptes use the encoding of normal ptes. The conversion between the pte and pmd encoding for a huge pte is done with set_huge_pte_at and huge_ptep_get. Overall the code is now easier to understand. Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-07-29KVM: s390: fix task size checkMartin Schwidefsky
The gmap_map_segment function uses PGDIR_SIZE in the check for the maximum address in the tasks address space. This incorrectly limits the amount of memory usable for a kvm guest to 4TB. The correct limit is (1UL << 53). As the TASK_SIZE has different values (4TB vs 8PB) dependent on the existance of the fourth page table level, create a new define 'TASK_MAX_SIZE' for (1UL << 53). Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29KVM: s390: allow sie enablement for multi-threaded programsMartin Schwidefsky
Improve the code to upgrade the standard 2K page tables to 4K page tables with PGSTEs to allow the operation to happen when the program is already multi-threaded. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-04Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "This is the powerpc changes for the 3.11 merge window. In addition to the usual bug fixes and small updates, the main highlights are: - Support for transparent huge pages by Aneesh Kumar for 64-bit server processors. This allows the use of 16M pages as transparent huge pages on kernels compiled with a 64K base page size. - Base VFIO support for KVM on power by Alexey Kardashevskiy - Wiring up of our nvram to the pstore infrastructure, including putting compressed oopses in there by Aruna Balakrishnaiah - Move, rework and improve our "EEH" (basically PCI error handling and recovery) infrastructure. It is no longer specific to pseries but is now usable by the new "powernv" platform as well (no hypervisor) by Gavin Shan. - I fixed some bugs in our math-emu instruction decoding and made it usable to emulate some optional FP instructions on processors with hard FP that lack them (such as fsqrt on Freescale embedded processors). - Support for Power8 "Event Based Branch" facility by Michael Ellerman. This facility allows what is basically "userspace interrupts" for performance monitor events. - A bunch of Transactional Memory vs. Signals bug fixes and HW breakpoint/watchpoint fixes by Michael Neuling. And more ... I appologize in advance if I've failed to highlight something that somebody deemed worth it." * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) pstore: Add hsize argument in write_buf call of pstore_ftrace_call powerpc/fsl: add MPIC timer wakeup support powerpc/mpic: create mpic subsystem object powerpc/mpic: add global timer support powerpc/mpic: add irq_set_wake support powerpc/85xx: enable coreint for all the 64bit boards powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig powerpc/mpic: Add get_version API both for internal and external use powerpc: Handle both new style and old style reserve maps powerpc/hw_brk: Fix off by one error when validating DAWR region end powerpc/pseries: Support compression of oops text via pstore powerpc/pseries: Re-organise the oops compression code pstore: Pass header size in the pstore write callback powerpc/powernv: Fix iommu initialization again powerpc/pseries: Inform the hypervisor we are using EBB regs powerpc/perf: Add power8 EBB support powerpc/perf: Core EBB support for 64-bit book3s powerpc/perf: Drop MMCRA from thread_struct powerpc/perf: Don't enable if we have zero events ...
2013-07-03Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "On the x86 side, there are some optimizations and documentation updates. The big ARM/KVM change for 3.11, support for AArch64, will come through Catalin Marinas's tree. s390 and PPC have misc cleanups and bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits) KVM: PPC: Ignore PIR writes KVM: PPC: Book3S PR: Invalidate SLB entries properly KVM: PPC: Book3S PR: Allow guest to use 1TB segments KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry KVM: PPC: Book3S PR: Fix proto-VSID calculations KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL KVM: Fix RTC interrupt coalescing tracking kvm: Add a tracepoint write_tsc_offset KVM: MMU: Inform users of mmio generation wraparound KVM: MMU: document fast invalidate all mmio sptes KVM: MMU: document fast invalidate all pages KVM: MMU: document fast page fault KVM: MMU: document mmio page fault KVM: MMU: document write_flooding_count KVM: MMU: document clear_spte_count KVM: MMU: drop kvm_mmu_zap_mmio_sptes KVM: MMU: init kvm generation close to mmio wrap-around value KVM: MMU: add tracepoint for check_mmio_spte KVM: MMU: fast invalidate all mmio sptes ...
2013-06-26s390/kvm: Provide function for setting the guest storage keyChristian Borntraeger
From time to time we need to set the guest storage key. Lets provide a helper function that handles the changes with all the right locking and checking. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-06-20mm/THP: add pmd args to pgtable deposit and withdraw APIsAneesh Kumar K.V
This will be later used by powerpc THP support. In powerpc we want to use pgtable for storing the hash index values. So instead of adding them to mm_context list, we would like to store them in the second half of pmd Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-17KVM: s390: Provide function for setting the guest storage keyChristian Borntraeger
From time to time we need to set the guest storage key. Lets provide a helper function that handles the changes with all the right locking and checking. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-05-31s390/pgtable: Fix gmap notifier addressChristian Borntraeger
The address of the gmap notifier was broken, resulting in unhandled validity intercepts in KVM. Fix the rmap->vmaddr to be on a segment boundary. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-05-21s390: fix gmap_ipte_notifier vs. software dirty pagesChristian Borntraeger
On heavy paging load some guest cpus started to loop in gmap_ipte_notify. This was visible as stalled cpus inside the guest. The gmap_ipte_notifier tries to map a user page and then made sure that the pte is valid and writable. Turns out that with the software change bit tracking the pte can become read-only (and only software writable) if the page is clean. Since we loop in this code, the page would stay clean and, therefore, be never writable again. Let us just use fixup_user_fault, that guarantees to call handle_mm_fault. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21s390/kvm: rename RCP_xxx defines to PGSTE_xxxMartin Schwidefsky
The RCP byte is a part of the PGSTE value, the existing RCP_xxx names are inaccurate. As the defines describe bits and pieces of the PGSTE, the names should start with PGSTE_. The KVM_UR_BIT and KVM_UC_BIT are part of the PGSTE as well, give them better names as well. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-15s390: fix gmap_ipte_notifier vs. software dirty pagesChristian Borntraeger
On heavy paging load some guest cpus started to loop in gmap_ipte_notify. This was visible as stalled cpus inside the guest. The gmap_ipte_notifier tries to map a user page and then made sure that the pte is valid and writable. Turns out that with the software change bit tracking the pte can become read-only (and only software writable) if the page is clean. Since we loop in this code, the page would stay clean and, therefore, be never writable again. Let us just use fixup_user_fault, that guarantees to call handle_mm_fault. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-05-03s390/mm: add pte invalidation notifier for kvmMartin Schwidefsky
Add a notifier for kvm to get control before a page table entry is invalidated. The notifier is only called for ptes of an address space with pgstes that have been explicitly marked to require notification. Kvm will use this to get control before prefix pages of virtual CPU are unmapped. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-04-23s390/mm,gmap: segment mapping raceMartin Schwidefsky
The gmap_map_segment function creates a special invalid segment table entry with the address of the requested target location in the process address space. The first access will create the connection between the gmap segment table and the target page table of the main process. If two threads do this concurrently both will walk the page tables and allocate a gmap_rmap structure for the same segment table entry. To avoid the race recheck the segment table entry after taking to page table lock. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-04-23s390/mm,gmap: implement gmap_translate()Heiko Carstens
Implement gmap_translate() function which translates a guest absolute address to a user space process address without establishing the guest page table entries. This is useful for kvm guest address translations where no memory access is expected to happen soon (e.g. tprot exception handler). Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-11-23s390/mm: move kernel_page_present/kernel_map_pages to page_attr.cHeiko Carstens
Keep related functions together and move to appropriate file. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-10-09thp, s390: architecture backend for thp on s390Gerald Schaefer
This implements the architecture backend for transparent hugepages on s390. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09thp, s390: disable thp for kvm host on s390Gerald Schaefer
This patch is part of the architecture backend for thp on s390. It disables thp for kvm hosts, because there is no kvm host hugepage support so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT, and future thp mappings are prevented by setting VM_NOHUGEPAGE in mm->def_flags. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09thp, s390: thp pagetable pre-allocation for s390Gerald Schaefer
This patch is part of the architecture backend for thp on s390. It provides the pagetable pre-allocation functions pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike other archs, s390 has no struct page * as pgtable_t, but rather a pointer to the page table. So instead of saving the pagetable pre- allocation list info inside the struct page, it is being saved within the pagetable itself. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09thp, s390: thp splitting backend for s390Gerald Schaefer
This patch is part of the architecture backend for thp on s390. It provides the functions related to thp splitting, including serialization against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb flushing operation to serialize against gup on s390, because that wouldn't be stopped by the disabled IRQs. So instead, smp_call_function() is called with an empty function, which will have the expected effect. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-26s390: add uninitialized_var() to suppress false positive compiler warningsHeiko Carstens
Get rid of these: arch/s390/kernel/smp.c:134:19: warning: ‘status’ may be used uninitialized in this function [-Wuninitialized] arch/s390/mm/pgtable.c:641:10: warning: ‘table’ may be used uninitialized in this function [-Wuninitialized] arch/s390/mm/pgtable.c:644:12: warning: ‘page’ may be used uninitialized in this function [-Wuninitialized] drivers/s390/cio/cio.c:1037:14: warning: ‘schid’ may be used uninitialized in this function [-Wuninitialized] Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-09-26s390/mm: rename addressing_mode to s390_user_modeHeiko Carstens
Renaming the globally visible variable "user_mode" to "addressing_mode" in order to fix a name clash was not a good idea. (Commit 37fe1d73 "s390/mm: rename user_mode variable to addressing_mode") Looking at the code after a couple of weeks one thinks: addressing mode of what? So rename the variable again. This time to s390_user_mode. Which hopefully makes more sense. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-07-30s390/mm: rename user_mode variable to addressing_modeHeiko Carstens
Fix name clash with user_mode() define which is also used in common code. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-07-26s390/mm: downgrade page table after fork of a 31 bit processMartin Schwidefsky
The downgrade of the 4 level page table created by init_new_context is currently done only in start_thread31. If a 31 bit process forks the new mm uses a 4 level page table, including the task size of 2<<42 that goes along with it. This is incorrect as now a 31 bit process can map memory beyond 2GB. Define arch_dup_mmap to do the downgrade after fork. Cc: stable@vger.kernel.org Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-07-20s390/comments: unify copyright messages and remove file namesHeiko Carstens
Remove the file name from the comment at top of many files. In most cases the file name was wrong anyway, so it's rather pointless. Also unify the IBM copyright statement. We did have a lot of sightly different statements and wanted to change them one after another whenever a file gets touched. However that never happened. Instead people start to take the old/"wrong" statements to use as a template for new files. So unify all of them in one go. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-05-16s390/kvm: bad rss-counter stateChristian Borntraeger
commit c3f0327f8e9d7a503f0d64573c311eddd61f197d mm: add rss counters consistency check detected the following problem with kvm on s390: BUG: Bad rss-counter state mm:00000004f73ef000 idx:0 val:-10 BUG: Bad rss-counter state mm:00000004f73ef000 idx:1 val:-5 We have to make sure that we accumulate all rss values into the mm before we replace the mm to avoid triggering this (harmless) bug message. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-04-11[S390] fix tlb flushing for page table pagesMartin Schwidefsky
Git commit 36409f6353fc2d7b6516e631415f938eadd92ffa "use generic RCU page-table freeing code" introduced a tlb flushing bug. Partially revert the above git commit and go back to s390 specific page table flush code. For s390 the TLB can contain three types of entries, "normal" TLB page-table entries, TLB combined region-and-segment-table (CRST) entries and real-space entries. Linux does not use real-space entries which leaves normal TLB entries and CRST entries. The CRST entries are intermediate steps in the page-table translation called translation paths. For example a 4K page access in a three-level page table setup will create two CRST TLB entries and one page-table TLB entry. The advantage of that approach is that a page access next to the previous one can reuse the CRST entries and needs just a single read from memory to create the page-table TLB entry. The disadvantage is that the TLB flushing rules are more complicated, before any page-table may be freed the TLB needs to be flushed. In short: the generic RCU page-table freeing code is incorrect for the CRST entries, in particular the check for mm_users < 2 is troublesome. This is applicable to 3.0+ kernels. Cc: <stable@vger.kernel.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2012-03-28Disintegrate asm/system.h for S390David Howells
Disintegrate asm/system.h for S390. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-s390@vger.kernel.org
2012-02-17[S390] incorrect PageTables counter for kvm page tablesMartin Schwidefsky
The page_table_free_pgste function is used for kvm processes to free page tables that have the pgste extension. It calls pgtable_page_ctor instead of pgtable_page_dtor which increases NR_PAGETABLE instead of decreasing it. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-12-27[S390] add support for physical memory > 4TBMartin Schwidefsky
The kernel address space of a 64 bit kernel currently uses a three level page table and the vmemmap array has a fixed address and a fixed maximum size. A three level page table is good enough for systems with less than 3.8TB of memory, for bigger systems four page table levels need to be used. Each page table level costs a bit of performance, use 3 levels for normal systems and 4 levels only for the really big systems. To avoid bloating sparse.o too much set MAX_PHYSMEM_BITS to 46 for a maximum of 64TB of memory. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-12-27[S390] kvm: fix sleeping function ... at mm/page_alloc.c:2260Christian Borntraeger
commit cc772456ac9b460693492b3a3d89e8c81eda5874 [S390] fix list corruption in gmap reverse mapping added a potential dead lock: BUG: sleeping function called from invalid context at mm/page_alloc.c:2260 in_atomic(): 1, irqs_disabled(): 0, pid: 1108, name: qemu-system-s39 3 locks held by qemu-system-s39/1108: #0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm] #1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298 #2: (&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298 CPU: 0 Not tainted 3.1.3 #45 Process qemu-system-s39 (pid: 1108, task: 00000004f8b3cb30, ksp: 00000004fd5978d0) 00000004fd5979a0 00000004fd597920 0000000000000002 0000000000000000 00000004fd5979c0 00000004fd597938 00000004fd597938 0000000000617e96 0000000000000000 00000004f8b3cf58 0000000000000000 0000000000000000 000000000000000d 000000000000000c 00000004fd597988 0000000000000000 0000000000000000 0000000000100a18 00000004fd597920 00000004fd597960 Call Trace: ([<0000000000100926>] show_trace+0xee/0x144) [<0000000000131f3a>] __might_sleep+0x12a/0x158 [<0000000000217fb4>] __alloc_pages_nodemask+0x224/0xadc [<0000000000123086>] gmap_alloc_table+0x46/0x114 [<000000000012395c>] gmap_map_segment+0x268/0x298 [<000003e00486b014>] kvm_arch_commit_memory_region+0x44/0x6c [kvm] [<000003e004866414>] __kvm_set_memory_region+0x3b0/0x4a4 [kvm] [<000003e004866554>] kvm_set_memory_region+0x4c/0x6c [kvm] [<000003e004867c7a>] kvm_vm_ioctl+0x14a/0x314 [kvm] [<0000000000292100>] do_vfs_ioctl+0x94/0x588 [<0000000000292688>] SyS_ioctl+0x94/0xac [<000000000061e124>] sysc_noemu+0x22/0x28 [<000003fffcd5e7ca>] 0x3fffcd5e7ca 3 locks held by qemu-system-s39/1108: #0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm] #1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298 #2: (&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298 Fix this by freeing the lock on the alloc path. This is ok, since the gmap table is never freed until we call gmap_free, so the table we are walking cannot go. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-10-30[S390] kvm: Handle diagnose 0x10 (release pages)Christian Borntraeger
Linux on System z uses a ballooner based on diagnose 0x10. (aka as collaborative memory management). This patch implements diagnose 0x10 on the guest address space. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-10-30[S390] take mmap_sem when walking guest page tableCarsten Otte
gmap_fault needs to walk the guest page table. However, parts of that may change if some other thread does munmap. In that case gmap_unmap_notifier will also unmap the corresponding parts from the guest page table. We need to take mmap_sem in order to serialize these operations. do_exception now calls __gmap_fault with mmap_sem held which does not get exported to modules. The exported function, which is called from KVM, now takes mmap_sem. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-10-30[S390] fix list corruption in gmap reverse mappingCarsten Otte
This introduces locking via mm->page_table_lock to protect the rmap list for guest mappings from being corrupted by concurrent operations. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-10-30[S390] fix possible deadlock in gmap_map_segmentCarsten Otte
Fix possible deadlock reported by lockdep: qemu-system-s39/2963 is trying to acquire lock: (&mm->mmap_sem){++++++}, at: gmap_alloc_table+0x9c/0x120 but task is already holding lock: (&mm->mmap_sem){++++++}, at: gmap_map_segment+0xa6/0x27c Actually gmap_alloc_table is the only called in gmap_map_segment with mmap_sem held, thus it's safe to simply remove the inner lock. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-10-30[S390] memory leak with RCU_TABLE_FREEMartin Schwidefsky
The rcu page table free code uses a couple of bits in the page table pointer passed to tlb_remove_table to discern the different page table types. __tlb_remove_table extracts the type with an incorrect mask which leads to memory leaks. The correct mask is ((FRAG_MASK << 4) | FRAG_MASK). Cc: stable@kernel.org Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-09-26[S390] gmap: always up mmap_sem properlyCarsten Otte
If gmap_unmap_segment figures that the segment was not mapped in the first place, it need to up mmap_sem on exit. Cc: <stable@kernel.org> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-09-20[S390] kvm: fix address mode switchingChristian Borntraeger
598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address spaces for kvm guest images) changed kvm to use a separate address space for kvm guests. This address space was switched in __vcpu_run In some cases (preemption, page fault) there is the possibility that this address space switch is lost. The typical symptom was a huge amount of validity intercepts or random guest addressing exceptions. Fix this by doing the switch in sie_loop and sie_exit and saving the address space in the gmap structure itself. Also use the preempt notifier. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Avi Kivity <avi@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2011-08-03[S390] missing return in page_table_alloc_pgsteJan Glauber
Fix the following compile warning for !CONFIG_PGSTE: CC arch/s390/mm/pgtable.o arch/s390/mm/pgtable.c: In function ‘page_table_alloc_pgste’: arch/s390/mm/pgtable.c:531:1: warning: no return statement in function returning non-void [-Wreturn-type] Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2011-07-24[S390] kvm guest address space mappingMartin Schwidefsky
Add code that allows KVM to control the virtual memory layout that is seen by a guest. The guest address space uses a second page table that shares the last level pte-tables with the process page table. If a page is unmapped from the process page table it is automatically unmapped from the guest page table as well. The guest address space mapping starts out empty, KVM can map any individual 1MB segments from the process virtual memory to any 1MB aligned location in the guest virtual memory. If a target segment in the process virtual memory does not exist or is unmapped while a guest mapping exists the desired target address is stored as an invalid segment table entry in the guest page table. The population of the guest page table is fault driven. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-06-06[S390] use generic RCU page-table freeing codeMartin Schwidefsky
Replace the s390 specific rcu page-table freeing code with the generic variant. This requires to duplicate the definition for the struct mmu_table_batch as s390 does not use the generic tlb flush code. While we are at it remove the restriction that page table fragments can not be reused after a single fragment has been freed with rcu and split out allocation and freeing of page tables with pgstes. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-05-29[S390] mm: fix mmu_gather reworkHeiko Carstens
Quite a few functions that get called from the tlb gather code require that preemption must be disabled. So disable preemption inside of the called functions instead. The only drawback is that rcu_table_freelist_finish() doesn't get necessarily called on the cpu(s) that filled the free lists. So we may see a delay, until we finally see an rcu callback. However over time this shouldn't matter. So we get rid of lots of "BUG: using smp_processor_id() in preemptible" messages. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2011-05-25mm: now that all old mmu_gather code is gone, remove the storagePeter Zijlstra
Fold all the mmu_gather rework patches into one for submission Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-23[S390] Remove data execution protectionMartin Schwidefsky
The noexec support on s390 does not rely on a bit in the page table entry but utilizes the secondary space mode to distinguish between memory accesses for instructions vs. data. The noexec code relies on the assumption that the cpu will always use the secondary space page table for data accesses while it is running in the secondary space mode. Up to the z9-109 class machines this has been the case. Unfortunately this is not true anymore with z10 and later machines. The load-relative-long instructions lrl, lgrl and lgfrl access the memory operand using the same addressing-space mode that has been used to fetch the instruction. This breaks the noexec mode for all user space binaries compiled with march=z10 or later. The only option is to remove the current noexec support. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-01-31[S390] pgtable_list corruptionMartin Schwidefsky
After page_table_free_rcu removed a page from the pgtable_list page_table_free better not add it again. Otherwise a page_table_alloc can reuse a page table fragment that is still in the rcu process. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-10-25[S390] kvm: Fix badness at include/asm/mmu_context.h:83Christian Borntraeger
commit 050eef364ad700590a605a0749f825cab4834b1e [S390] fix tlb flushing vs. concurrent /proc accesses broke KVM on s390x. On every schedule a Badness at include/asm/mmu_context.h:83 appears. s390_enable_sie replaces the mm on the __running__ task, therefore, we have to increase the attach count of the new mm. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-10-25[S390] lockless get_user_pages_fast()Martin Schwidefsky
Implement get_user_pages_fast without locking in the fastpath on s390. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2009-12-07[S390] Improve address space mode selection.Martin Schwidefsky
Introduce user_mode to replace the two variables switch_amode and s390_noexec. There are three valid combinations of the old values: 1) switch_amode == 0 && s390_noexec == 0 2) switch_amode == 1 && s390_noexec == 0 3) switch_amode == 1 && s390_noexec == 1 They get replaced by 1) user_mode == HOME_SPACE_MODE 2) user_mode == PRIMARY_SPACE_MODE 3) user_mode == SECONDARY_SPACE_MODE The new kernel parameter user_mode=[primary,secondary,home] lets you choose the address space mode the user space processes should use. In addition the CONFIG_S390_SWITCH_AMODE config option is removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>