summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2014-09-22Merge branch 'for-linus' into for-3.18/coreJens Axboe
Moving patches from for-linus to 3.18 instead, pull in this changes that will go to Linus today.
2014-09-11Merge branch 'for-linus' into for-3.18/coreJens Axboe
A bit of churn on the for-linus side that would be nice to have in the core bits for 3.18, so pull it in to catch us up and make forward progress easier. Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/scsi_ioctl.c
2014-09-10mm/mmap.c: use pr_emerg when printing BUG related informationSasha Levin
Make sure we actually see the output of validate_mm() and browse_rb() before triggering a BUG(). pr_info isn't shown by default so the reason for the BUG() isn't obvious. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10mem-hotplug: let memblock skip the hotpluggable memory regions in ↵Xishi Qiu
__next_mem_range() Let memblock skip the hotpluggable memory regions in __next_mem_range(), it is used to to prevent memblock from allocating hotpluggable memory for the kernel at early time. The code is the same as __next_mem_range_rev(). Clear hotpluggable flag before releasing free pages to the buddy allocator. If we don't clear hotpluggable flag in free_low_memory_core_early(), the memory which marked hotpluggable flag will not free to buddy allocator. Because __next_mem_range() will skip them. free_low_memory_core_early for_each_free_mem_range for_each_mem_range __next_mem_range [akpm@linux-foundation.org: fix warning] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-08bdi: reimplement bdev_inode_switch_bdi()Tejun Heo
A block_device may be attached to different gendisks and thus different bdis over time. bdev_inode_switch_bdi() is used to switch the associated bdi. The function assumes that the inode could be dirty and transfers it between bdis if so. This is a bit nasty in that it reaches into bdi internals. This patch reimplements the function so that it writes out the inode if dirty. This is a lot simpler and can be implemented without exposing bdi internals. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08bdi: explain the dirty list transferring in bdi_destroy()Tejun Heo
bdi_destroy() has code to transfer the remaining dirty inodes to the default_backing_dev_info; however, given the shutdown sequence, it isn't clear how such condition would happen. Also, it isn't a full solution as the transferred inodes stlil point to the bdi which is being destroyed. Operations on those inodes can end up accessing already released fields such as the percpu stat fields. Digging through the history, it seems that the code was added as a quick workaround for a bug report without fully root-causing the issue. We probably want to remove the code in time but for now let's add a comment noting that it is a quick workaround. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08bdi: make backing_dev_info->wb.dwork canceling stricterTejun Heo
Canceling of bdi->wb.dwork is currently a bit mushy. bdi_wb_shutdown() performs cancel_delayed_work_sync() at the end after shutting down and flushing the delayed_work and bdi_destroy() tries yet again after bdi_unregister(). bdi->wb.dwork is queued only after checking BDI_registered while holding bdi->wb_lock and bdi_wb_shutdown() clears the flag while holding the same lock and then flushes the delayed_work. There's no way the delayed_work can be queued again after that. Replace the two unnecessary cancel_delayed_work_sync() invocations with WARNs on pending. This simplifies and clarifies the code a bit and will help future changes in further isolating bdi_writeback handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08bdi: remove bdi->wb_lock locking around bdi->dev clearing in bdi_unregister()Tejun Heo
The only places where NULL test on bdi->dev is used are bdi_[un]register(). The functions can't be called in parallel anyway and there's no point in protecting bdi->dev clearing with a lock. Remove bdi->wb_lock grabbing around bdi->dev clearing and move it after device_unregister() call so that bdi->dev doesn't have to be cached in a local variable. This patch shouldn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-07Merge branch 'for-3.17-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu fixes from Tejun Heo: "One patch to fix a failure path in the alloc path. The bug is dangerous but probably not too likely to actually trigger in the wild given that there hasn't been any report yet. The other two are low impact fixes" * 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: free percpu allocation info for uniprocessor system percpu: perform tlb flush after pcpu_map_pages() failure percpu: fix pcpu_alloc_pages() failure path
2014-09-05mm: memcontrol: revert use of root_mem_cgroup res_counterJohannes Weiner
Dave Hansen reports a massive scalability regression in an uncontained page fault benchmark with more than 30 concurrent threads, which he bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter") and pin-pointed on res_counter spinlock contention. That change relied on the per-cpu charge caches to mostly swallow the res_counter costs, but it's apparent that the caches don't scale yet. Revert memcg back to bypassing res_counters on the root level in order to restore performance for uncontained workloads. Reported-by: Dave Hansen <dave@sr71.net> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29x86,mm: fix pte_special versus pte_numaHugh Dickins
Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: where zap_pte_range() checks page->mapping to see if PageAnon(page). Those addresses fit struct pages for pfns d2001 and d2000, and in each dump a register or a stack slot showed d2001730 or d2000730: pte flags 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has a hole between cfffffff and 100000000, which would need special access. Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL pte no longer passes the pte_special() test, so zap_pte_range() goes on to try to access a non-existent struct page. Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A hint that this was a problem was that c46a7c817e66 added pte_numa() test to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast path: This was papering over a pte_special() snag when the zero page was encountered during zap. This patch reverts vm_normal_page() to how it was before, relying on pte_special(). It still appears that this patch may be incomplete: aren't there other places which need to be handling PROTNONE along with PRESENT? For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE area, that would make it pte_special(). This is side-stepped by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be interesting. Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: <stable@vger.kernel.org> [3.16] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29hugetlb_cgroup: use lockdep_assert_held rather than spin_is_lockedMichal Hocko
spin_lock may be an empty struct for !SMP configurations and so arch_spin_is_locked may return unconditional 0 and trigger the VM_BUG_ON even when the lock is held. Replace spin_is_locked by lockdep_assert_held. We will not BUG anymore but it is questionable whether crashing makes a lot of sense in the uncharge path. Uncharge happens after the last page reference was released so nobody should touch the page and the function doesn't update any shared state except for res counter which uses synchronization of its own. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29mm/zpool: use prefixed module loadingKees Cook
To avoid potential format string expansion via module parameters, do not use the zpool type directly in request_module() without a format string. Additionally, to avoid arbitrary modules being loaded via zpool API (e.g. via the zswap_zpool_type module parameter) add a "zpool-" prefix to the requested module, as well as module aliases for the existing zpool types (zbud and zsmalloc). Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29mm: actually clear pmd_numa before invalidatingMatthew Wilcox
Commit 67f87463d3a3 ("mm: clear pmd_numa before invalidating") cleared the NUMA bit in a copy of the PMD entry, but then wrote back the original Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29memblock, memhotplug: fix wrong type in memblock_find_in_range_node().Tang Chen
In memblock_find_in_range_node(), we defined ret as int. But it should be phys_addr_t because it is used to store the return value from __memblock_find_range_bottom_up(). The bug has not been triggered because when allocating low memory near the kernel end, the "int ret" won't turn out to be negative. When we started to allocate memory on other nodes, and the "int ret" could be minus. Then the kernel will panic. A simple way to reproduce this: comment out the following code in numa_init(), memblock_set_bottom_up(false); and the kernel won't boot. Reported-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Tested-by: Xishi Qiu <qiuxishi@huawei.com> Cc: <stable@vger.kernel.org> [3.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-16percpu: free percpu allocation info for uniprocessor systemHonggang Li
Currently, only SMP system free the percpu allocation info. Uniprocessor system should free it too. For example, one x86 UML virtual machine with 256MB memory, UML kernel wastes one page memory. Signed-off-by: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2014-08-15percpu: perform tlb flush after pcpu_map_pages() failureTejun Heo
If pcpu_map_pages() fails midway, it unmaps the already mapped pages. Currently, it doesn't flush tlb after the partial unmapping. This may be okay in most cases as the established mapping hasn't been used at that point but it can go wrong and when it goes wrong it'd be extremely difficult to track down. Flush tlb after the partial unmapping. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2014-08-15percpu: fix pcpu_alloc_pages() failure pathTejun Heo
When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to free what has already been allocated. The invocation is across the whole requested range and pcpu_free_pages() will try to free all non-NULL pages; unfortunately, this is incorrect as pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't clear the pages array and thus the array may have entries from the previous invocations making the partial failure path free incorrect pages. Fix it by open-coding the partial freeing of the already allocated pages. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2014-08-14Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds
Merge leftovers from Andrew Morton: "A few leftovers. I have a bunch of OCFS2 patches which are still out for review and which I might sneak along after -rc1. Partly my fault - I should send my review pokes out earlier" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: fix CROSS_MEMORY_ATTACH help text grammar drivers/mfd/rtsx_usb.c: export device table mm, hugetlb_cgroup: align hugetlb cgroup limit to hugepage size
2014-08-14mm, hugetlb_cgroup: align hugetlb cgroup limit to hugepage sizeDavid Rientjes
Memcg aligns memory.limit_in_bytes to PAGE_SIZE as part of the resource counter since it makes no sense to allow a partial page to be charged. As a result of the hugetlb cgroup using the resource counter, it is also aligned to PAGE_SIZE but makes no sense unless aligned to the size of the hugepage being limited. Align hugetlb cgroup limit to hugepage size. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Stuff in here: - acct.c fixes and general rework of mnt_pin mechanism. That allows to go for delayed-mntput stuff, which will permit mntput() on deep stack without worrying about stack overflows - fs shutdown will happen on shallow stack. IOW, we can do Eric's umount-on-rmdir series without introducing tons of stack overflows on new mntput() call chains it introduces. - Bruce's d_splice_alias() patches - more Miklos' rename() stuff. - a couple of regression fixes (stable fodder, in the end of branch) and a fix for API idiocy in iov_iter.c. There definitely will be another pile, maybe even two. I'd like to get Eric's series in this time, but even if we miss it, it'll go right in the beginning of for-next in the next cycle - the tricky part of prereqs is in this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) fix copy_tree() regression __generic_file_write_iter(): fix handling of sync error after DIO switch iov_iter_get_pages() to passing maximal number of pages fs: mark __d_obtain_alias static dcache: d_splice_alias should detect loops exportfs: update Exporting documentation dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED dcache: remove unused d_find_alias parameter dcache: d_obtain_alias callers don't all want DISCONNECTED dcache: d_splice_alias should ignore DCACHE_DISCONNECTED dcache: d_splice_alias mustn't create directory aliases dcache: close d_move race in d_splice_alias dcache: move d_splice_alias namei: trivial fix to vfs_rename_dir comment VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode. cifs: support RENAME_NOREPLACE hostfs: support rename flags shmem: support RENAME_EXCHANGE shmem: support RENAME_NOREPLACE btrfs: add RENAME_NOREPLACE ...
2014-08-11__generic_file_write_iter(): fix handling of sync error after DIOAl Viro
If DIO results in short write and sync write fails, we want to bugger off whether the DIO part has written anything or not; the logics on the return will take care of the right return value. Cc: stable@vger.kernel.org [3.16] Reported-by: Anton Altaparmakov <aia21@cam.ac.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-08shm: wait for pins to be released when sealingDavid Herrmann
If we set SEAL_WRITE on a file, we must make sure there cannot be any ongoing write-operations on the file. For write() calls, we simply lock the inode mutex, for mmap() we simply verify there're no writable mappings. However, there might be pages pinned by AIO, Direct-IO and similar operations via GUP. We must make sure those do not write to the memfd file after we set SEAL_WRITE. As there is no way to notify GUP users to drop pages or to wait for them to be done, we implement the wait ourself: When setting SEAL_WRITE, we check all pages for their ref-count. If it's bigger than 1, we know there's some user of the page. We then mark the page and wait for up to 150ms for those ref-counts to be dropped. If the ref-counts are not dropped in time, we refuse the seal operation. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08shm: add memfd_create() syscallDavid Herrmann
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It can support sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it. memfd_create() returns the raw shmem file, so calls like ftruncate() can be used to modify the underlying inode. Also calls like fstat() will return proper information and mark the file as regular file. If you want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not supported (like on all other regular files). Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to a filesystem size limit. It is still properly accounted to memcg limits, though, and to the same overcommit or no-overcommit accounting as all user memory. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08shm: add sealing APIDavid Herrmann
If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the file. You cannot remove seals again. The fcntl() handler is currently specific to shmem and disabled on all files. A file needs to explicitly support sealing for this interface to work. A separate syscall is added in a follow-up, which creates files that support sealing. There is no intention to support this on other file-systems. Semantics are unclear for non-volatile files and we lack any use-case right now. Therefore, the implementation is specific to shmem. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm: allow drivers to prevent new writable mappingsDavid Herrmann
This patch (of 6): The i_mmap_writable field counts existing writable mappings of an address_space. To allow drivers to prevent new writable mappings, make this counter signed and prevent new writable mappings if it is negative. This is modelled after i_writecount and DENYWRITE. This will be required by the shmem-sealing infrastructure to prevent any new writable mappings after the WRITE seal has been set. In case there exists a writable mapping, this operation will fail with EBUSY. Note that we rely on the fact that iff you already own a writable mapping, you can increase the counter without using the helpers. This is the same that we do for i_writecount. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate areaAndy Lutomirski
The core mm code will provide a default gate area based on FIXADDR_USER_START and FIXADDR_USER_END if !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR). This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit UML, and x86_32 have their own code just to disable it. arm, 32-bit UML, and x86_64 have gate areas, but they have their own implementations. This gets rid of the default and moves the code into ia64. This should save some code on architectures without a gate area: it's now possible to inline the gate_area functions in the default case. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Nathan Lynch <nathan_lynch@mentor.com> Acked-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle] Acked-by: Richard Weinberger <richard@nod.at> [for um] Acked-by: Will Deacon <will.deacon@arm.com> [for arm64] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Nathan Lynch <Nathan_Lynch@mentor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm/zswap.c: add __init to zswap_entry_cache_destroy()Fabian Frederick
zswap_entry_cache_destroy() is only called by __init init_zswap(). This patch also fixes function name zswap_entry_cache_ s/destory/destroy Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm: memcontrol: avoid charge statistics churn during page migrationJohannes Weiner
Charge migration currently disables IRQs twice to update the charge statistics for the old page and then again for the new page. But migration is a seamless transition of a charge from one physical page to another one of the same size, so this should be a non-event from an accounting point of view. Leave the statistics alone. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm: memcontrol: use page lists for uncharge batchingJohannes Weiner
Pages are now uncharged at release time, and all sources of batched uncharges operate on lists of pages. Directly use those lists, and get rid of the per-task batching state. This also batches statistics accounting, in addition to the res counter charges, to reduce IRQ-disabling and re-enabling. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm: memcontrol: rewrite uncharge APIJohannes Weiner
The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile. Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary: - Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging. - Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged. - On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache(). But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore. For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration. mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward. Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim. Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically. [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable] [vdavydov@parallels.com: fix flags definition] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Tested-by: Jet Chen <jet.chen@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Tested-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm: memcontrol: rewrite charge APIJohannes Weiner
These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 4): The memcg charge API charges pages before they are rmapped - i.e. have an actual "type" - and so every callsite needs its own set of charge and uncharge functions to know what type is being operated on. Worse, uncharge has to happen from a context that is still type-specific, rather than at the end of the page's lifetime with exclusive access, and so requires a lot of synchronization. Rewrite the charge API to provide a generic set of try_charge(), commit_charge() and cancel_charge() transaction operations, much like what's currently done for swap-in: mem_cgroup_try_charge() attempts to reserve a charge, reclaiming pages from the memcg if necessary. mem_cgroup_commit_charge() commits the page to the charge once it has a valid page->mapping and PageAnon() reliably tells the type. mem_cgroup_cancel_charge() aborts the transaction. This reduces the charge API and enables subsequent patches to drastically simplify uncharging. As pages need to be committed after rmap is established but before they are added to the LRU, page_add_new_anon_rmap() must stop doing LRU additions again. Revive lru_cache_add_active_or_unevictable(). [hughd@google.com: fix shmem_unuse] [hughd@google.com: Add comments on the private use of -EAGAIN] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08vm_is_stack: use for_each_thread() rather then buggy while_each_thread()Oleg Nesterov
Aleksei hit the soft lockup during reading /proc/PID/smaps. David investigated the problem and suggested the right fix. while_each_thread() is racy and should die, this patch updates vm_is_stack(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Aleksei Besogonov <alex.besogonov@gmail.com> Tested-by: Aleksei Besogonov <alex.besogonov@gmail.com> Suggested-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08Revert "slab: remove BAD_ALIEN_MAGIC"Joonsoo Kim
This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC"). commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that the system with !CONFIG_NUMA has only one memory node. But, it turns out to be false by the report from Geert. His system, m68k, has many memory nodes and is configured in !CONFIG_NUMA. So it couldn't boot with above change. Here goes his failure report. With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM: enable_cpucache failed for radix_tree_node, error 12. kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522! *** TRAP #7 *** FORMAT=0 Current process id is 0 BAD KERNEL TRAP: 00000000 Modules linked in: PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c SR: 2200 SP: 00345f90 a2: 0034c2e8 d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942 d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682 Process swapper (pid: 0, task=0034c2e8) Frame format=0 Stack from 00345fc4: 002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000 00000000 00000000 00000000 00000000 00000000 003ac942 00000000 003912d6 Call Trace: [<000360fa>] parse_args+0x0/0x2ca [<0017d806>] strlen+0x0/0x1a [<003921d4>] start_kernel+0x23c/0x428 [<003912d6>] _sinittext+0x2d6/0x95e Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c Disabling lock debugging due to kernel taint Kernel panic - not syncing: Attempted to kill the idle task! Although there is a alternative way to fix this issue such as disabling use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better to me in this time. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-07switch iov_iter_get_pages() to passing maximal number of pagesAl Viro
... instead of maximal size. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07shmem: support RENAME_EXCHANGEMiklos Szeredi
This is really simple in tmpfs since the VFS already takes care of shuffling the dentries. Just adjust nlink on parent directories and touch c & mtimes. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07shmem: support RENAME_NOREPLACEMiklos Szeredi
Implement ->rename2 instead of ->rename. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-06mm/zpool: update zswap to use zpoolDan Streetman
Change zswap to use the zpool api instead of directly using zbud. Add a boot-time param to allow selecting which zpool implementation to use, with zbud as the default. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/zpool: zbud/zsmalloc implement zpoolDan Streetman
Update zbud and zsmalloc to implement the zpool api. [fengguang.wu@intel.com: make functions static] Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/zpool: implement common zpool api to zbud/zsmallocDan Streetman
Add zpool api. zpool provides an interface for memory storage, typically of compressed memory. Users can select what backend to use; currently the only implementations are zbud, a low density implementation with up to two compressed pages per storage page, and zsmalloc, a higher density implementation with multiple compressed pages per storage page. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/zbud: change zbud_alloc size type to size_tDan Streetman
Change the type of the zbud_alloc() size param from unsigned int to size_t. Technically, this should not make any difference, as the zbud implementation already restricts the size to well within either type's limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use size_t, this brings the size parameter type in line with zsmalloc/zpool. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/highmem: make kmap cache coloring awareMax Filippov
User-visible effect: Architectures that choose this method of maintaining cache coherency (MIPS and xtensa currently) are able to use high memory on cores with aliasing data cache. Without this fix such architectures can not use high memory (in case of xtensa it means that at most 128 MBytes of physical memory is available). The problem: VIPT cache with way size larger than MMU page size may suffer from aliasing problem: a single physical address accessed via different virtual addresses may end up in multiple locations in the cache. Virtual mappings of a physical address that always get cached in different cache locations are said to have different colors. L1 caching hardware usually doesn't handle this situation leaving it up to software. Software must avoid this situation as it leads to data corruption. What can be done: One way to handle this is to flush and invalidate data cache every time page mapping changes color. The other way is to always map physical page at a virtual address with the same color. Low memory pages already have this property. Giving architecture a way to control color of high memory page mapping allows reusing of existing low memory cache alias handling code. How this is done with this patch: Provide hooks that allow architectures with aliasing cache to align mapping address of high pages according to their color. Such architectures may enforce similar coloring of low- and high-memory page mappings and reuse existing cache management functions to support highmem. This code is based on the implementation of similar feature for MIPS by Leonid Yegoshin. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Cc: Chris Zankel <chris@zankel.net> Cc: Marc Gauthier <marc@cadence.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Hill <Steven.Hill@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mmu_notifier: add call_srcu and sync function for listener to delay call and ↵Peter Zijlstra
sync When kernel device drivers or subsystems want to bind their lifespan to t= he lifespan of the mm_struct, they usually use one of the following methods: 1. Manually calling a function in the interested kernel module. The funct= ion call needs to be placed in mmput. This method was rejected by several ker= nel maintainers. 2. Registering to the mmu notifier release mechanism. The problem with the latter approach is that the mmu_notifier_release cal= lback is called from__mmu_notifier_release (called from exit_mmap). That functi= on iterates over the list of mmu notifiers and don't expect the release call= back function to remove itself from the list. Therefore, the callback function= in the kernel module can't release the mmu_notifier_object, which is actuall= y the kernel module's object itself. As a result, the destruction of the kernel module's object must to be done in a delayed fashion. This patch adds support for this delayed callback, by adding a new mmu_notifier_call_srcu function that receives a function ptr and calls th= at function with call_srcu. In that function, the kernel module releases its object. To use mmu_notifier_call_srcu, the calling module needs to call b= efore that a new function called mmu_notifier_unregister_no_release that as its= name implies, unregisters a notifier without calling its notifier release call= back. This patch also adds a function that will call barrier_srcu so those kern= el modules can sync with mmu_notifier. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Oded Gabbay <oded.gabbay@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: clean up reclaim size variable use in try_charge()Johannes Weiner
Charge reclaim and OOM currently use the charge batch variable, but batching is already disabled at that point. To simplify the charge logic, the batch variable is reset to the original request size when reclaim is entered, so it's functionally equal, but it's misleading. Switch reclaim/OOM to nr_pages, which is the original request size. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: change confusing #ifdef use in __access_remote_vmRik van Riel
This patch changes confusing #ifdef use in __access_remote_vm into merely ugly #ifdef use. Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651 Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: mark fault_around_bytes __read_mostlyKirill A. Shutemov
fault_around_bytes can only be changed via debugfs. Let's mark it read-mostly. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: close race between do_fault_around() and fault_around_bytes_set()Kirill A. Shutemov
Things can go wrong if fault_around_bytes will be changed under do_fault_around(): between fault_around_mask() and fault_around_pages(). Let's read fault_around_bytes only once during do_fault_around() and calculate mask based on the reading. Note: fault_around_bytes can only be updated via debug interface. Also I've tried but was not able to trigger a bad behaviour without the patch. So I would not consider this patch as urgent. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memcg, vmscan: Fix forced scan of anonymous pagesJerome Marchand
When memory cgoups are enabled, the code that decides to force to scan anonymous pages in get_scan_count() compares global values (free, high_watermark) to a value that is restricted to a memory cgroup (file). It make the code over-eager to force anon scan. For instance, it will force anon scan when scanning a memcg that is mainly populated by anonymous page, even when there is plenty of file pages to get rid of in others memcgs, even when swappiness == 0. It breaks user's expectation about swappiness and hurts performance. This patch makes sure that forced anon scan only happens when there not enough file pages for the all zone, not just in one random memcg. [hannes@cmpxchg.org: cleanups] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, vmscan: fix an outdated comment still mentioning get_scan_ratioJerome Marchand
Quite a while ago, get_scan_ratio() has been renamed get_scan_count(), however a comment in shrink_active_list() still mention it. This patch fixes the outdated comment. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, oom: remove unnecessary exit_state checkDavid Rientjes
The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>