summaryrefslogtreecommitdiff
path: root/mm/internal.h
AgeCommit message (Collapse)Author
11 daysmm: memory-failure: update ttu flag inside unmap_poisoned_folioMa Wupeng
Patch series "mm: memory_failure: unmap poisoned folio during migrate properly", v3. Fix two bugs during folio migration if the folio is poisoned. This patch (of 3): Commit 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") introduce TTU_HWPOISON to replace TTU_IGNORE_HWPOISON in order to stop send SIGBUS signal when accessing an error page after a memory error on a clean folio. However during page migration, anon folio must be set with TTU_HWPOISON during unmap_*(). For pagecache we need some policy just like the one in hwpoison_user_mappings to set this flag. So move this policy from hwpoison_user_mappings to unmap_poisoned_folio to handle this warning properly. Warning will be produced during unamp poison folio with the following log: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 365 at mm/rmap.c:1847 try_to_unmap_one+0x8fc/0xd3c Modules linked in: CPU: 1 UID: 0 PID: 365 Comm: bash Tainted: G W 6.13.0-rc1-00018-gacdb4bbda7ab #42 Tainted: [W]=WARN Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : try_to_unmap_one+0x8fc/0xd3c lr : try_to_unmap_one+0x3dc/0xd3c Call trace: try_to_unmap_one+0x8fc/0xd3c (P) try_to_unmap_one+0x3dc/0xd3c (L) rmap_walk_anon+0xdc/0x1f8 rmap_walk+0x3c/0x58 try_to_unmap+0x88/0x90 unmap_poisoned_folio+0x30/0xa8 do_migrate_range+0x4a0/0x568 offline_pages+0x5a4/0x670 memory_block_action+0x17c/0x374 memory_subsys_offline+0x3c/0x78 device_offline+0xa4/0xd0 state_store+0x8c/0xf0 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x118/0x1a8 vfs_write+0x3a8/0x4bc ksys_write+0x6c/0xf8 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x44/0x100 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x30/0xd0 el0t_64_sync_handler+0xc8/0xcc el0t_64_sync+0x198/0x19c ---[ end trace 0000000000000000 ]--- [mawupeng1@huawei.com: unmap_poisoned_folio(): remove shadowed local `mapping', per Miaohe] Link: https://lkml.kernel.org/r/20250219060653.3849083-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-2-mawupeng1@huawei.com Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/truncate: add folio_unmap_invalidate() helperJens Axboe
Add a folio_unmap_invalidate() helper, which unmaps and invalidates a given folio. The caller must already have locked the folio. Embed the old invalidate_complete_folio2() helper in there as well, as nobody else calls it. Use this new helper in invalidate_inode_pages2_range(), rather than duplicate the code there. In preparation for using this elsewhere as well, have it take a gfp_t mask rather than assume GFP_KERNEL is the right choice. This bubbles back to invalidate_complete_folio2() as well. Link: https://lkml.kernel.org/r/20241220154831.1086649-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mseal: remove can_do_mseal()Jeff Xu
No code logic change. can_do_mseal() is called exclusively by mseal.c, and mseal.c is compiled only when CONFIG_64BIT flag is set in makefile. Therefore, it is unnecessary to have 32 bit stub function in the header file, remove this function and merge the logic into do_mseal(). Link: https://lkml.kernel.org/r/20241206013934.2782793-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20241206194839.3030596-2-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)Qi Zheng
Now in order to pursue high performance, applications mostly use some high-performance user-mode memory allocators, such as jemalloc or tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause huge page table memory usage. The following are a memory usage snapshot of one process which actually happened on our server: VIRT: 55t RES: 590g VmPTE: 110g In this case, most of the page table entries are empty. For such a PTE page where all entries are empty, we can actually free it back to the system for others to use. As a first step, this commit aims to synchronously free the empty PTE pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than madvise(MADV_DONTNEED). Once an empty PTE is detected, we first try to hold the pmd lock within the pte lock. If successful, we clear the pmd entry directly (fast path). Otherwise, we wait until the pte lock is released, then re-hold the pmd and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect whether the PTE page is empty and free it (slow path). For other cases such as madvise(MADV_FREE), consider scanning and freeing empty PTE pages asynchronously in the future. The following code snippet can show the effect of optimization: mmap 50G while (1) { for (; i < 1024 * 25; i++) { touch 2M memory madvise MADV_DONTNEED 2M } } As we can see, the memory usage of VmPTE is reduced: before after VIRT 50.0 GB 50.0 GB RES 3.1 MB 3.1 MB VmPTE 102640 KB 240 KB [zhengqi.arch@bytedance.com: fix uninitialized symbol 'ptl'] Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com Link: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/ Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: make __alloc_contig_migrate_range() staticDavid Hildenbrand
The single user is in page_alloc.c. Link: https://lkml.kernel.org/r/20241203094732.200195-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/mempolicy: add alloc_frozen_pages()Matthew Wilcox (Oracle)
Provide an interface to allocate pages from the page allocator without incrementing their refcount. This saves an atomic operation on free, which may be beneficial to some users (eg slab). Link: https://lkml.kernel.org/r/20241125210149.2976098-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: add __alloc_frozen_pages()Matthew Wilcox (Oracle)
Defer the initialisation of the page refcount to the new __alloc_pages() wrapper and turn the old __alloc_pages() into __alloc_frozen_pages(). Link: https://lkml.kernel.org/r/20241125210149.2976098-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of post_alloc_hook()Matthew Wilcox (Oracle)
In preparation for allocating frozen pages, stop initialising the page refcount in post_alloc_hook(). Link: https://lkml.kernel.org/r/20241125210149.2976098-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: export free_frozen_pages() instead of free_unref_page()Matthew Wilcox (Oracle)
We already have the concept of "frozen pages" (eg page_ref_freeze()), so let's not complicate things by also having the concept of "unref pages". Link: https://lkml.kernel.org/r/20241125210149.2976098-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-30mm, madvise: fix potential workingset node list_lru leaksKairui Song
Since commit 5abc1e37afa0 ("mm: list_lru: allocate list_lru_one only when needed"), all list_lru users need to allocate the items using the new infrastructure that provides list_lru info for slab allocation, ensuring that the corresponding memcg list_lru is allocated before use. For workingset shadow nodes (which are xa_node), users are converted to use the new infrastructure by commit 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node"). The xas->xa_lru will be set correctly for filemap users. However, there is a missing case: xa_node allocations caused by madvise(..., MADV_COLLAPSE). madvise(..., MADV_COLLAPSE) will also read in the absent parts of file map, and there will be xa_nodes allocated for the caller's memcg (assuming it's not rootcg). However, these allocations won't trigger memcg list_lru allocation because the proper xas info was not set. If nothing else has allocated other xa_nodes for that memcg to trigger list_lru creation, and memory pressure starts to evict file pages, workingset_update_node will try to add these xa_nodes to their corresponding memcg list_lru, and it does not exist (NULL). So they will be added to rootcg's list_lru instead. This shouldn't be a significant issue in practice, but it is indeed unexpected behavior, and these xa_nodes will not be reclaimed effectively. And may lead to incorrect counting of the list_lru->nr_items counter. This problem wasn't exposed until recent commit 28e98022b31ef ("mm/list_lru: simplify reparenting and initial allocation") added a sanity check: only dying memcg could have a NULL list_lru when list_lru_{add,del} is called. This problem triggered this WARNING. So make madvise(..., MADV_COLLAPSE) also call xas_set_lru() to pass the list_lru which we may want to insert xa_node into later. And move mapping_set_update to mm/internal.h, and turn into a macro to avoid including extra headers in mm/internal.h. Link: https://lkml.kernel.org/r/20241222122936.67501-1-ryncsn@gmail.com Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node") Reported-by: syzbot+38a0cbd267eff2d286ff@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/675d01e9.050a0220.37aaf.00be.GAE@google.com/ Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sasha Levin <sashal@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18mm: use clear_user_(high)page() for arch with special user folio handlingZi Yan
Some architectures have special handling after clearing user folios: architectures, which set cpu_dcache_is_aliasing() to true, require flushing dcache; arc, which sets cpu_icache_is_aliasing() to true, changes folio->flags to make icache coherent to dcache. So __GFP_ZERO using only clear_page() is not enough to zero user folios and clear_user_(high)page() must be used. Otherwise, user data will be corrupted. Fix it by always clearing user folios with clear_user_(high)page() when cpu_dcache_is_aliasing() is true or cpu_icache_is_aliasing() is true. Rename alloc_zeroed() to user_alloc_needs_zeroing() and invert the logic to clarify its intend. Link: https://lkml.kernel.org/r/20241209182326.2955963-2-ziy@nvidia.com Fixes: 5708d96da20b ("mm: avoid zeroing user movable page twice with init_on_alloc=1") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Closes: https://lore.kernel.org/linux-mm/CAMuHMdV1hRp_NtR5YnJo=HsfgKQeH91J537Gh4gKk3PFZhSkbA@mail.gmail.com/ Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Potapenko <glider@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-30Merge tag 'kbuild-v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add generic support for built-in boot DTB files - Enable TAB cycling for dialog buttons in nconfig - Fix issues in streamline_config.pl - Refactor Kconfig - Add support for Clang's AutoFDO (Automatic Feedback-Directed Optimization) - Add support for Clang's Propeller, a profile-guided optimization. - Change the working directory to the external module directory for M= builds - Support building external modules in a separate output directory - Enable objtool for *.mod.o and additional kernel objects - Use lz4 instead of deprecated lz4c - Work around a performance issue with "git describe" - Refactor modpost * tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (85 commits) kbuild: rename .tmp_vmlinux.kallsyms0.syms to .tmp_vmlinux0.syms gitignore: Don't ignore 'tags' directory kbuild: add dependency from vmlinux to resolve_btfids modpost: replace tdb_hash() with hash_str() kbuild: deb-pkg: add python3:native to build dependency genksyms: reduce indentation in export_symbol() modpost: improve error messages in device_id_check() modpost: rename alias symbol for MODULE_DEVICE_TABLE() modpost: rename variables in handle_moddevtable() modpost: move strstarts() to modpost.h modpost: convert do_usb_table() to a generic handler modpost: convert do_of_table() to a generic handler modpost: convert do_pnp_device_entry() to a generic handler modpost: convert do_pnp_card_entries() to a generic handler modpost: call module_alias_printf() from all do_*_entry() functions modpost: pass (struct module *) to do_*_entry() functions modpost: remove DEF_FIELD_ADDR_VAR() macro modpost: deduplicate MODULE_ALIAS() for all drivers modpost: introduce module_alias_printf() helper modpost: remove unnecessary check in do_acpi_entry() ...
2024-11-27Rename .data.once to .data..once to fix resetting WARN*_ONCEMasahiro Yamada
Commit b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") added support for clearing the state of once warnings. However, it is not functional when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG is enabled, because .data.once matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro. Commit cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case, providing a minimal fix for stable backporting. We were aware this did not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The plan was to apply correct fixes and then revert cb87481ee89d. [1] Seven years have passed since then, yet the #ifdef workaround remains in place. Meanwhile, commit b1fca27d384e introduced the .data.once section, and commit dc5723b02e52 ("kbuild: add support for Clang LTO") extended the #ifdef. Using a ".." separator in the section name fixes the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG. [1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/ Fixes: b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") Fixes: dc5723b02e52 ("kbuild: add support for Clang LTO") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-11mm: move ``get_order_from_str()`` to internal.hMaíra Canal
In order to implement a kernel parameter similar to ``thp_anon=`` for shmem, we'll need the function ``get_order_from_str()``. Instead of duplicating the function, move the function to a shared header, in which both mm/shmem.c and mm/huge_memory.c will be able to use it. Link: https://lkml.kernel.org/r/20241101165719.1074234-5-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: pagewalk: add the ability to install PTEsLorenzo Stoakes
Patch series "implement lightweight guard pages", v4. Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process. The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region. With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted. This series takes a different approach - an idea suggested by Vlastimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges. Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code. We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging. This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'. After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed. By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs. This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning. The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL which installs page table-level guard page markers - and MADV_GUARD_REMOVE - which clears them. Guard markers can be installed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the guard page markers in the page tables. There is no concept of 'nested' guard markers, multiple attempts to install guard markers in a range will, after the first attempt, have no effect. Importantly, removing guard markers over a range that contains both guard markers and ordinary backed memory has no effect on anything but the guard markers (including leaving huge pages un-split), so a user can safely remove guard markers over a range of memory leaving the rest intact. The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism. Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it. We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse). We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not remove guard markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics). It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the markers if they wish. We simply implement it in such a way as to provide the least surprising behaviour. An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of guard ranges. This patch (of 5): The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks. This is useful for traversing existing page table entries, but precludes the ability to establish new ones. Existing mechanism for performing a walk which also installs page table entries if necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere. Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal. If a THP huge page is encountered at either the PMD or PUD level we split it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level), otherwise if there is only an ops->install_pte(), we avoid the unnecessary split. We do not support hugetlb at this stage. If this function returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges. If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range. If install_pte() is not defined, then there is no functional difference to this patch, so all existing logic will work precisely as it did before. As we only permit the installation of PTEs where a mapping does not already exist there is no need for TLB management, however we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs. We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h. We take the opportunity to additionally clean up the page walker logic to be a little easier to follow. Link: https://lkml.kernel.org/r/cover.1730123433.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/51b432ebef013e3fdf9f92101533435de1bffadf.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Vlastimil Babka <vbabkba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm: mass constification of folio/page pointersMatthew Wilcox (Oracle)
Now that page_pgoff() takes const pointers, we can constify the pointers to a lot of functions. Link: https://lkml.kernel.org/r/20241005200121.3231142-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm: renovate page_address_in_vma()Matthew Wilcox (Oracle)
This function doesn't modify any of its arguments, so if we make a few other functions take const pointers, we can make page_address_in_vma() take const pointers too. All of its callers have the containing folio already, so pass that in as an argument instead of recalculating it. Also add kernel-doc Link: https://lkml.kernel.org/r/20241005200121.3231142-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07alloc_tag: populate memory for module tags as neededSuren Baghdasaryan
The memory reserved for module tags does not need to be backed by physical pages until there are tags to store there. Change the way we reserve this memory to allocate only virtual area for the tags and populate it with physical pages as needed when we load a module. [surenb@google.com: avoid execmem_vmap() when !MMU] Link: https://lkml.kernel.org/r/20241031233611.3833002-1-surenb@google.com Link: https://lkml.kernel.org/r/20241023170759.999909-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xiongwei Song <xiongwei.song@windriver.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07execmem: add support for cache of large ROX pagesMike Rapoport (Microsoft)
Using large pages to map text areas reduces iTLB pressure and improves performance. Extend execmem_alloc() with an ability to use huge pages with ROX permissions as a cache for smaller allocations. To populate the cache, a writable large page is allocated from vmalloc with VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as ROX. The direct map alias of that large page is exculded from the direct map. Portions of that large page are handed out to execmem_alloc() callers without any changes to the permissions. When the memory is freed with execmem_free() it is invalidated again so that it won't contain stale instructions. An architecture has to implement execmem_fill_trapping_insns() callback and select ARCH_HAS_EXECMEM_ROX configuration option to be able to use the ROX cache. The cache is enabled on per-range basis when an architecture sets EXECMEM_ROX_CACHE flag in definition of an execmem_range. Link: https://lkml.kernel.org/r/20241023162711.2579610-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: avoid zeroing user movable page twice with init_on_alloc=1Zi Yan
Commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") forces allocated page to be zeroed in post_alloc_hook() when init_on_alloc=1. For order-0 folios, if arch does not define vma_alloc_zeroed_movable_folio(), the default implementation again zeros the page return from the buddy allocator. So the page is zeroed twice. Fix it by passing __GFP_ZERO instead to avoid double page zeroing. At the moment, s390,arm64,x86,alpha,m68k are not impacted since they define their own vma_alloc_zeroed_movable_folio(). For >0 order folios (mTHP and PMD THP), folio_zero_user() is called to zero the folio again. Fix it by calling folio_zero_user() only if init_on_alloc is set. All arch are impacted. Add alloc_zeroed() helper to encapsulate the init_on_alloc check. [ziy@nvidia.com: comment fixes, per David] Link: https://lkml.kernel.org/r/97DB52E1-C594-49B5-9736-89AC302FAB01@nvidia.com Link: https://lkml.kernel.org/r/20241011150304.709590-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove PageKsm()Matthew Wilcox (Oracle)
All callers have been converted to use folio_test_ksm() or PageAnonNotKsm(), so we can remove this wrapper. Link: https://lkml.kernel.org/r/20241002152533.1350629-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: unconditionally close VMAs on errorLorenzo Stoakes
Incorrect invocation of VMA callbacks when the VMA is no longer in a consistent state is bug prone and risky to perform. With regards to the important vm_ops->close() callback We have gone to great lengths to try to track whether or not we ought to close VMAs. Rather than doing so and risking making a mistake somewhere, instead unconditionally close and reset vma->vm_ops to an empty dummy operations set with a NULL .close operator. We introduce a new function to do so - vma_close() - and simplify existing vms logic which tracked whether we needed to close or not. This simplifies the logic, avoids incorrect double-calling of the .close() callback and allows us to update error paths to simply call vma_close() unconditionally - making VMA closure idempotent. Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm: avoid unsafe VMA hook invocation when error arises on mmap hookLorenzo Stoakes
Patch series "fix error handling in mmap_region() and refactor (hotfixes)", v4. mmap_region() is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. This series goes to great lengths to simplify how mmap_region() works and to avoid unwinding errors late on in the process of setting up the VMA for the new mapping, and equally avoids such operations occurring while the VMA is in an inconsistent state. The patches in this series comprise the minimal changes required to resolve existing issues in mmap_region() error handling, in order that they can be hotfixed and backported. There is additionally a follow up series which goes further, separated out from the v1 series and sent and updated separately. This patch (of 5): After an attempted mmap() fails, we are no longer in a situation where we can safely interact with VMA hooks. This is currently not enforced, meaning that we need complicated handling to ensure we do not incorrectly call these hooks. We can avoid the whole issue by treating the VMA as suspect the moment that the file->f_ops->mmap() function reports an error by replacing whatever VMA operations were installed with a dummy empty set of VMA operations. We do so through a new helper function internal to mm - mmap_file() - which is both more logically named than the existing call_mmap() function and correctly isolates handling of the vm_op reassignment to mm. All the existing invocations of call_mmap() outside of mm are ultimately nested within the call_mmap() from mm, which we now replace. It is therefore safe to leave call_mmap() in place as a convenience function (and to avoid churn). The invokers are: ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap() coda_file_operations -> mmap -> coda_file_mmap() shm_file_operations -> shm_mmap() shm_file_operations_huge -> shm_mmap() dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops -> i915_gem_dmabuf_mmap() None of these callers interact with vm_ops or mappings in a problematic way on error, quickly exiting out. Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05mm/thp: fix deferred split unqueue naming and lockingHugh Dickins
Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases. Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
2024-09-17mm: change vmf_anon_prepare() to __vmf_anon_prepare()Vishal Moola (Oracle)
Some callers of vmf_anon_prepare() may not want us to release the per-VMA lock ourselves. Rename vmf_anon_prepare() to __vmf_anon_prepare() and let the callers drop the lock when desired. Also, make vmf_anon_prepare() a wrapper that releases the per-VMA lock itself for any callers that don't care. This is in preparation to fix this bug reported by syzbot: https://lore.kernel.org/linux-mm/00000000000067c20b06219fbc26@google.com/ Link: https://lkml.kernel.org/r/20240914194243.245-1-vishal.moola@gmail.com Fixes: 9acad7ba3e25 ("hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()") Reported-by: syzbot+2dab93857ee95f2eeb08@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/00000000000067c20b06219fbc26@google.com/ Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: remove putback_lru_page()Kefeng Wang
There are no more callers of putback_lru_page(), remove it. Link: https://lkml.kernel.org/r/20240826065814.1336616-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: remove isolate_lru_page()Kefeng Wang
There are no more callers of isolate_lru_page(), remove it. [wangkefeng.wang@huawei.com: convert page to folio in comment and document, per Matthew] Link: https://lkml.kernel.org/r/20240826144114.1928071-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240826065814.1336616-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: migrate_device: convert to migrate_device_coherent_folio()Kefeng Wang
Patch series "mm: finish isolate/putback_lru_page()". Convert to use more folios in migrate_device.c, then we could remove isolate_lru_page() and putback_lru_page(). This patch (of 6): Save a few calls to compound_head() and use folio throughout. Link: https://lkml.kernel.org/r/20240826065814.1336616-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240826065814.1336616-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: memory-failure: add unmap_poisoned_folio()Kefeng Wang
Add unmap_poisoned_folio() helper which will be reused by do_migrate_range() from memory hotplug soon. [akpm@linux-foundation.org: whitespace tweak, per Miaohe Lin] Link: https://lkml.kernel.org/r/1f80c7e3-c30d-1ac1-6a36-d1a5f5907f7c@huawei.com Link: https://lkml.kernel.org/r/20240827114728.3212578-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03printf: remove %pGt supportMatthew Wilcox (Oracle)
Patch series "Increase the number of bits available in page_type". Kent wants more than 16 bits in page_type, so I resurrected this old patch and expanded it a bit. It's a bit more efficient than our current scheme (1 4-byte insn vs 3 insns of 13 bytes total) to test a single page type. This patch (of 4): An upcoming patch will convert page type from being a bitfield to a single byte, so we will not be able to use %pG to print the page type any more. The printing of the symbolic name will be restored in that patch. Link: https://lkml.kernel.org/r/20240821173914.2270383-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240821173914.2270383-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: remove can_modify_mm()Pedro Falcato
With no more users in the tree, we can finally remove can_modify_mm(). Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-6-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mseal: replace can_modify_mm_madv with a vma variantPedro Falcato
Replace can_modify_mm_madv() with a single vma variant, and associated checks in madvise. While we're at it, also invert the order of checks in: if (unlikely(is_ro_anon(vma) && !can_modify_vma(vma)) Checking if we can modify the vma itself (through vm_flags) is certainly cheaper than is_ro_anon() due to arch_vma_access_permitted() looking at e.g pkeys registers (with extra branches) in some architectures. This patch allows for partial madvise success when finding a sealed VMA, which historically has been allowed in Linux. Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-5-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: attempt to batch free swap entries for zap_pte_range()Barry Song
Zhiguo reported that swap release could be a serious bottleneck during process exits[1]. With mTHP, we have the opportunity to batch free swaps. Thanks to the work of Chris and Kairui[2], I was able to achieve this optimization with minimal code changes by building on their efforts. If swap_count is 1, which is likely true as most anon memory are private, we can free all contiguous swap slots all together. Ran the below test program for measuring the bandwidth of munmap using zRAM and 64KiB mTHP: #include <sys/mman.h> #include <sys/time.h> #include <stdlib.h> unsigned long long tv_to_ms(struct timeval tv) { return tv.tv_sec * 1000 + tv.tv_usec / 1000; } main() { struct timeval tv_b, tv_e; int i; #define SIZE 1024*1024*1024 void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (!p) { perror("fail to get memory"); exit(-1); } madvise(p, SIZE, MADV_HUGEPAGE); memset(p, 0x11, SIZE); /* write to get mem */ madvise(p, SIZE, MADV_PAGEOUT); gettimeofday(&tv_b, NULL); munmap(p, SIZE); gettimeofday(&tv_e, NULL); printf("munmap in bandwidth: %ld bytes/ms\n", SIZE/(tv_to_ms(tv_e) - tv_to_ms(tv_b))); } The result is as below (munmap bandwidth): mm-unstable mm-unstable-with-patch round1 21053761 63161283 round2 21053761 63161283 round3 21053761 63161283 round4 20648881 67108864 round5 20648881 67108864 munmap bandwidth becomes 3X faster. [1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/ [2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ [v-songbaohua@oppo.com: check all swaps belong to same swap_cgroup in swap_pte_batch()] Link: https://lkml.kernel.org/r/20240815215308.55233-1-21cnbao@gmail.com [hughd@google.com: add mem_cgroup_disabled() check] Link: https://lkml.kernel.org/r/33f34a88-0130-5444-9b84-93198eeb50e7@google.com [21cnbao@gmail.com: add missing zswap_invalidate()] Link: https://lkml.kernel.org/r/20240821054921.43468-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240807215859.57491-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: add a helper to accept pageKirill A. Shutemov
Accept a given struct page and add it free list. The help is useful for physical memory scanners that want to use free unaccepted memory. Link: https://lkml.kernel.org/r/20240809114854.3745464-7-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)Zi Yan
do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To reduce redundancy, move common code to numa_migrate_prep() and rename the function to numa_migrate_check() to reflect its functionality. Now do_huge_pmd_numa_page() also checks shared folios to set TNF_SHARED flag. Link: https://lkml.kernel.org/r/20240809145906.1513458-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: move internal core VMA manipulation functions to own fileLorenzo Stoakes
This patch introduces vma.c and moves internal core VMA manipulation functions to this file from mmap.c. This allows us to isolate VMA functionality in a single place such that we can create userspace testing code that invokes this functionality in an environment where we can implement simple unit tests of core functionality. This patch ensures that core VMA functionality is explicitly marked as such by its presence in mm/vma.h. It also places the header includes required by vma.c in vma_internal.h, which is simply imported by vma.c. This makes the VMA functionality testable, as userland testing code can simply stub out functionality as required. Link: https://lkml.kernel.org/r/c77a6aafb4c42aaadb8e7271a853658cbdca2e22.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: move vma_shrink(), vma_expand() to internal headerLorenzo Stoakes
The vma_shrink() and vma_expand() functions are internal VMA manipulation functions which we ought to abstract for use outside of memory management code. To achieve this, we replace shift_arg_pages() in fs/exec.c with an invocation of a new relocate_vma_down() function implemented in mm/mmap.c, which enables us to also move move_page_tables() and vma_iter_prev_range() to internal.h. The purpose of doing this is to isolate key VMA manipulation functions in order that we can both abstract them and later render them easily testable. Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01mm: move vma_modify() and helpers to internal headerLorenzo Stoakes
These are core VMA manipulation functions which invoke VMA splitting and merging and should not be directly accessed from outside of mm/. Link: https://lkml.kernel.org/r/5efde0c6342a8860d5ffc90b415f3989fd8ed0b2.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-06Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fixAndrew Morton
crashes from deferred split racing folio migration", needed by "mm: migrate: split folio_migrate_mapping()".
2024-07-06mm: gup: stop abusing try_grab_folioYang Shi
A kernel warning was reported when pinning folio in CMA memory when launching SEV virtual machine. The splat looks like: [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 [ 464.325515] Call Trace: [ 464.325520] <TASK> [ 464.325523] ? __get_user_pages+0x423/0x520 [ 464.325528] ? __warn+0x81/0x130 [ 464.325536] ? __get_user_pages+0x423/0x520 [ 464.325541] ? report_bug+0x171/0x1a0 [ 464.325549] ? handle_bug+0x3c/0x70 [ 464.325554] ? exc_invalid_op+0x17/0x70 [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 [ 464.325567] ? __get_user_pages+0x423/0x520 [ 464.325575] __gup_longterm_locked+0x212/0x7a0 [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 [ 464.325590] pin_user_pages_fast+0x47/0x60 [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] Per the analysis done by yangge, when starting the SEV virtual machine, it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the page is in CMA area, so fast GUP will fail then fallback to the slow path due to the longterm pinnalbe check in try_grab_folio(). The slow path will try to pin the pages then migrate them out of CMA area. But the slow path also uses try_grab_folio() to pin the page, it will also fail due to the same check then the above warning is triggered. In addition, the try_grab_folio() is supposed to be used in fast path and it elevates folio refcount by using add ref unless zero. We are guaranteed to have at least one stable reference in slow path, so the simple atomic add could be used. The performance difference should be trivial, but the misuse may be confusing and misleading. Redefined try_grab_folio() to try_grab_folio_fast(), and try_grab_page() to try_grab_folio(), and use them in the proper paths. This solves both the abuse and the kernel warning. The proper naming makes their usecase more clear and should prevent from abusing in the future. peterx said: : The user will see the pin fails, for gpu-slow it further triggers the WARN : right below that failure (as in the original report): : : folio = try_grab_folio(page, page_increm - 1, : foll_flags); : if (WARN_ON_ONCE(!folio)) { <------------------------ here : /* : * Release the 1st page ref if the : * folio is problematic, fail hard. : */ : gup_put_folio(page_folio(page), 1, : foll_flags); : ret = -EFAULT; : goto out; : } [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/ [shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast] Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: yangge <yangge1116@126.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: refactor folio_undo_large_rmappable()Kefeng Wang
Folios of order <= 1 are not in deferred list, the check of order is added into folio_undo_large_rmappable() from commit 8897277acfef ("mm: support order-1 folios in the page cache"), but there is a repeated check for small folio (order 0) during each call of the folio_undo_large_rmappable(), so only keep folio_order() check inside the function. In addition, move all the checks into header file to save a function call for non-large-rmappable or empty deferred_list folio. Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: pass meminit_context to __free_pages_core()David Hildenbrand
Patch series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". This can be a considered a long-overdue follow-up to some parts of [1]. The patches are based on [2], but they are not strictly required -- just makes it clearer why we can use adjust_managed_page_count() for memory hotplug without going into details about highmem. We stop initializing pages with PageReserved() in memory hotplug code -- except when dealing with ZONE_DEVICE for now. Instead, we use PageOffline(): all pages are initialized to PageOffline() when onlining a memory section, and only the ones actually getting exposed to the system/page allocator will get PageOffline cleared. This way, we enlighten memory hotplug more about PageOffline() pages and can cleanup some hacks we have in virtio-mem code. What about ZONE_DEVICE? PageOffline() is wrong, but we might just stop using PageReserved() for them later by simply checking for is_zone_device_page() at suitable places. That will be a separate patch set / proposal. This primarily affects virtio-mem, HV-balloon and XEN balloon. I only briefly tested with virtio-mem, which benefits most from these cleanups. [1] https://lore.kernel.org/all/20191024120938.11237-1-david@redhat.com/ [2] https://lkml.kernel.org/r/20240607083711.62833-1-david@redhat.com This patch (of 3): In preparation for further changes, let's teach __free_pages_core() about the differences of memory hotplug handling. Move the memory hotplug specific handling from generic_online_page() to __free_pages_core(), use adjust_managed_page_count() on the memory hotplug path, and spell out why memory freed via memblock cannot currently use adjust_managed_page_count(). [david@redhat.com: add missed CONFIG_DEFERRED_STRUCT_PAGE_INIT] Link: https://lkml.kernel.org/r/b72e6efd-fb0a-459c-b1a0-88a98e5b19e2@redhat.com [david@redhat.com: fix up the memblock comment, per Oscar] Link: https://lkml.kernel.org/r/2ed64218-7f3b-4302-a5dc-27f060654fe2@redhat.com [david@redhat.com: add the parameter name also in the declaration] Link: https://lkml.kernel.org/r/ca575956-f0dd-4fb9-a307-6b7621681ed9@redhat.com Link: https://lkml.kernel.org/r/20240607090939.89524-1-david@redhat.com Link: https://lkml.kernel.org/r/20240607090939.89524-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Marco Elver <elver@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: rename alloc_demote_folio to alloc_migrate_folioHonggyu Kim
The alloc_demote_folio can also be used for general migration including both demotion and promotion so it'd be better to rename it from alloc_demote_folio to alloc_migrate_folio. Link: https://lkml.kernel.org/r/20240614030010.751-3-honggyu.kim@sk.com Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Hyeongtak Ji <hyeongtak.ji@sk.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: make alloc_demote_folio externally invokable for migrationHonggyu Kim
Patch series "DAMON based tiered memory management for CXL memory", v6. Introduction ============ With the advent of CXL/PCIe attached DRAM, which will be called simply as CXL memory in this cover letter, some systems are becoming more heterogeneous having memory systems with different latency and bandwidth characteristics. They are usually handled as different NUMA nodes in separate memory tiers and CXL memory is used as slow tiers because of its protocol overhead compared to local DRAM. In this kind of systems, we need to be careful placing memory pages on proper NUMA nodes based on the memory access frequency. Otherwise, some frequently accessed pages might reside on slow tiers and it makes performance degradation unexpectedly. Moreover, the memory access patterns can be changed at runtime. To handle this problem, we need a way to monitor the memory access patterns and migrate pages based on their access temperature. The DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation Schemes) can be useful features for monitoring and migrating pages. DAMOS provides multiple actions based on DAMON monitoring results and it can be used for proactive reclaim, which means swapping cold pages out with DAMOS_PAGEOUT action, but it doesn't support migration actions such as demotion and promotion between tiered memory nodes. This series supports two new DAMOS actions; DAMOS_MIGRATE_HOT for promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers. This prevents hot pages from being stuck on slow tiers, which makes performance degradation and cold pages can be proactively demoted to slow tiers so that the system can increase the chance to allocate more hot pages to fast tiers. The DAMON provides various tuning knobs but we found that the proactive demotion for cold pages is especially useful when the system is running out of memory on its fast tier nodes. Our evaluation result shows that it reduces the performance slowdown compared to the default memory policy from 11% to 3~5% when the system runs under high memory pressure on its fast tier DRAM nodes. DAMON configuration =================== The specific DAMON configuration doesn't have to be in the scope of this patch series, but some rough idea is better to be shared to explain the evaluation result. The DAMON provides many knobs for fine tuning but its configuration file is generated by HMSDK[3]. It includes gen_config.py script that generates a json file with the full config of DAMON knobs and it creates multiple kdamonds for each NUMA node when the DAMON is enabled so that it can run hot/cold based migration for tiered memory. Evaluation Workload =================== The performance evaluation is done with redis[4], which is a widely used in-memory database and the memory access patterns are generated via YCSB[5]. We have measured two different workloads with zipfian and latest distributions but their configs are slightly modified to make memory usage higher and execution time longer for better evaluation. The idea of evaluation using these migrate_{hot,cold} actions covers system-wide memory management rather than partitioning hot/cold pages of a single workload. The default memory allocation policy creates pages to the fast tier DRAM node first, then allocates newly created pages to the slow tier CXL node when the DRAM node has insufficient free space. Once the page allocation is done then those pages never move between NUMA nodes. It's not true when using numa balancing, but it is not the scope of this DAMON based tiered memory management support. If the working set of redis can be fit fully into the DRAM node, then the redis will access the fast DRAM only. Since the performance of DRAM only is faster than partially accessing CXL memory in slow tiers, this environment is not useful to evaluate this patch series. To make pages of redis be distributed across fast DRAM node and slow CXL node to evaluate our migrate_{hot,cold} actions, we pre-allocate some cold memory externally using mmap and memset before launching redis-server. We assumed that there are enough amount of cold memory in datacenters as TMO[6] and TPP[7] papers mentioned. The evaluation sequence is as follows. 1. Turn on DAMON with DAMOS_MIGRATE_COLD action for DRAM node and DAMOS_MIGRATE_HOT action for CXL node. It demotes cold pages on DRAM node and promotes hot pages on CXL node in a regular interval. 2. Allocate a huge block of cold memory by calling mmap and memset at the fast tier DRAM node, then make the process sleep to make the fast tier has insufficient space for redis-server. 3. Launch redis-server and load prebaked snapshot image, dump.rdb. The redis-server consumes 52GB of anon pages and 33GB of file pages, but due to the cold memory allocated at 2, it fails allocating the entire memory of redis-server on the fast tier DRAM node so it partially allocates the remaining on the slow tier CXL node. The ratio of DRAM:CXL depends on the size of the pre-allocated cold memory. 4. Run YCSB to make zipfian or latest distribution of memory accesses to redis-server, then measure its execution time when it's completed. 5. Repeat 4 over 50 times to measure the average execution time for each run. 6. Increase the cold memory size then repeat goes to 2. For each test at 4 took about a minute so repeating it 50 times almost took about 1 hour for each test with a specific cold memory from 440GB to 500GB in 10GB increments for each evaluation. So it took about more than 10 hours for both zipfian and latest workloads to get the entire evaluation results. Repeating the same test set multiple times doesn't show much difference so I think it might be enough to make the result reliable. Evaluation Results ================== All the result values are normalized to DRAM-only execution time because the workload cannot be faster than DRAM-only unless the workload hits the peak bandwidth but our redis test doesn't go beyond the bandwidth limit. So the DRAM-only execution time is the ideal result without affected by the gap between DRAM and CXL performance difference. The NUMA node environment is as follows. node0 - local DRAM, 512GB with a CPU socket (fast tier) node1 - disabled node2 - CXL DRAM, 96GB, no CPU attached (slow tier) The following is the result of generating zipfian distribution to redis-server and the numbers are averaged by 50 times of execution. 1. YCSB zipfian distribution read only workload memory pressure with cold memory on node0 with 512GB of local DRAM. ====================+================================================+========= | cold memory occupied by mmap and memset | | 0G 440G 450G 460G 470G 480G 490G 500G | ====================+================================================+========= Execution time normalized to DRAM-only values | GEOMEAN --------------------+------------------------------------------------+--------- DRAM-only | 1.00 - - - - - - - | 1.00 CXL-only | 1.19 - - - - - - - | 1.19 default | - 1.00 1.05 1.08 1.12 1.14 1.18 1.18 | 1.11 DAMON tiered | - 1.03 1.03 1.03 1.03 1.03 1.07 *1.05 | 1.04 DAMON lazy | - 1.04 1.03 1.04 1.05 1.06 1.06 *1.06 | 1.05 ====================+================================================+========= CXL usage of redis-server in GB | AVERAGE --------------------+------------------------------------------------+--------- DRAM-only | 0.0 - - - - - - - | 0.0 CXL-only | 51.4 - - - - - - - | 51.4 default | - 0.6 10.6 20.5 30.5 40.5 47.6 50.4 | 28.7 DAMON tiered | - 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2 DAMON lazy | - 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5 ====================+================================================+========= Each test result is based on the execution environment as follows. DRAM-only: redis-server uses only local DRAM memory. CXL-only: redis-server uses only CXL memory. default: default memory policy(MPOL_DEFAULT). numa balancing disabled. DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and DAMOS_MIGRATE_HOT for CXL nodes. DAMON lazy: same as DAMON tiered, but turn on DAMON just before making memory access request via YCSB. The above result shows the "default" execution time goes up as the size of cold memory is increased from 440G to 500G because the more cold memory used, the more CXL memory is used for the target redis workload and this makes the execution time increase. However, "DAMON tiered" and other DAMON results show less slowdown because the DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated cold memory to CXL node and this free space at DRAM increases more chance to allocate hot or warm pages of redis-server to fast DRAM node. Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages of redis-server to DRAM node actively. As a result, it makes more memory of redis-server stay in DRAM node compared to "default" memory policy and this makes the performance improvement. Please note that the result numbers of "DAMON tiered" and "DAMON lazy" at 500G are marked with * stars, which means their test results are replaced with reproduced tests that didn't have OOM issue. That was needed because sometimes the test processes get OOM when DRAM has insufficient space. The DAMOS_MIGRATE_HOT doesn't kick reclaim but just gives up migration when there is not enough space at DRAM side. The problem happens when there is competition between normal allocation and migration and the migration is done before normal allocation, then the completely unrelated normal allocation can trigger reclaim, which incurs OOM. Because of this issue, I have also tested more cases with "demotion_enabled" flag enabled to make such reclaim doesn't trigger OOM, but just demote reclaimed pages. The following test results show more tests with "kswapd" marked. 2. YCSB zipfian distribution read only workload (with demotion_enabled true) memory pressure with cold memory on node0 with 512GB of local DRAM. ====================+================================================+========= | cold memory occupied by mmap and memset | | 0G 440G 450G 460G 470G 480G 490G 500G | ====================+================================================+========= Execution time normalized to DRAM-only values | GEOMEAN --------------------+------------------------------------------------+--------- DAMON tiered | - 1.03 1.03 1.03 1.03 1.03 1.07 1.05 | 1.04 DAMON lazy | - 1.04 1.03 1.04 1.05 1.06 1.06 1.06 | 1.05 DAMON tiered kswapd | - 1.03 1.03 1.03 1.03 1.02 1.02 1.03 | 1.03 DAMON lazy kswapd | - 1.04 1.04 1.04 1.03 1.05 1.04 1.05 | 1.04 ====================+================================================+========= CXL usage of redis-server in GB | AVERAGE --------------------+------------------------------------------------+--------- DAMON tiered | - 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2 DAMON lazy | - 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5 DAMON tiered kswapd | - 0.0 0.0 0.4 0.5 0.1 0.8 1.0 | 0.4 DAMON lazy kswapd | - 4.2 4.6 5.3 1.7 6.8 8.1 5.8 | 5.2 ====================+================================================+========= Each test result is based on the exeuction environment as follows. DAMON tiered: same as before DAMON lazy: same as before DAMON tiered kswapd: same as DAMON tiered, but turn on /sys/kernel/mm/numa/demotion_enabled to make kswapd or direct reclaim does demotion. DAMON lazy kswapd: same as DAMON lazy, but turn on /sys/kernel/mm/numa/demotion_enabled to make kswapd or direct reclaim does demotion. The "DAMON tiered kswapd" and "DAMON lazy kswapd" didn't trigger OOM at all unlike other tests because kswapd and direct reclaim from DRAM node can demote reclaimed pages to CXL node independently from DAMON actions and their results are slightly better than without having "demotion_enabled". In summary, the evaluation results show that DAMON memory management with DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown compared to the "default" memory policy from 11% to 3~5% when the system runs with high memory pressure on its fast tier DRAM nodes. Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make tiered memory systems run more efficiently under high memory pressures. This patch (of 7): The alloc_demote_folio can be used out of vmscan.c so it'd be better to remove static keyword from it. Link: https://lkml.kernel.org/r/20240614030010.751-1-honggyu.kim@sk.com Link: https://lkml.kernel.org/r/20240614030010.751-2-honggyu.kim@sk.com Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Hyeongtak Ji <hyeongtak.ji@sk.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory-failure: move some function declarations into internal.hMiaohe Lin
There are some functions only used inside mm. Move them into internal.h. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-11-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202405251049.hxjwX7zO-lkp@intel.com/ Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: introduce pmd|pte_needs_soft_dirty_wp helpers for softdirty write-protectBarry Song
Patch series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them", v2. This patchset introduces the pte_need_soft_dirty_wp and pmd_need_soft_dirty_wp helpers to determine if write protection is required for softdirty tracking. These helpers enhance code readability and improve the overall appearance. They are then utilized in gup, mprotect, swap, and other related functions. This patch (of 2): This patch introduces the pte_needs_soft_dirty_wp and pmd_needs_soft_dirty_wp helpers to determine if write protection is required for softdirty tracking. This can enhance code readability and improve its overall appearance. These new helpers are then utilized in gup, huge_memory, and mprotect. Link: https://lkml.kernel.org/r/20240607211358.4660-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240607211358.4660-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: introduce pte_move_swp_offset() helper which can move offset bidirectionallyBarry Song
There could arise a necessity to obtain the first pte_t from a swap pte_t located in the middle. For instance, this may occur within the context of do_swap_page(), where a page fault can potentially occur in any PTE of a large folio. To address this, the following patch introduces pte_move_swp_offset(), a function capable of bidirectional movement by a specified delta argument. Consequently, pte_next_swp_offset() will directly invoke it with delta = 1. Link: https://lkml.kernel.org/r/20240529082824.150954-4-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: batch unlink_file_vma calls in free_pgd_rangeMateusz Guzik
Execs of dynamically linked binaries at 20-ish cores are bottlenecked on the i_mmap_rwsem semaphore, while the biggest singular contributor is free_pgd_range inducing the lock acquire back-to-back for all consecutive mappings of a given file. Tracing the count of said acquires while building the kernel shows: [1, 2) 799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [2, 3) 0 | | [3, 4) 3009 | | [4, 5) 3009 | | [5, 6) 326442 |@@@@@@@@@@@@@@@@@@@@@ | So in particular there were 326442 opportunities to coalesce 5 acquires into 1. Doing so increases execs per second by 4% (~50k to ~52k) when running the benchmark linked below. The lock remains the main bottleneck, I have not looked at other spots yet. Bench can be found here: http://apollo.backplane.com/DFlyMisc/doexec.c $ cc -O2 -o shared-doexec doexec.c $ ./shared-doexec $(nproc) Note this particular test makes sure binaries are separate, but the loader is shared. Stats collected on the patched kernel (+ "noinline") with: bpftrace -e 'kprobe:unlink_file_vma_batch_process { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }' Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24/proc/pid/smaps: add mseal info for vmaJeff Xu
Add sl in /proc/pid/smaps to indicate vma is sealed Link: https://lkml.kernel.org/r/20240614232014.806352-2-jeffxu@google.com Fixes: 8be7258aad44 ("mseal: add mseal syscall") Signed-off-by: Jeff Xu <jeffxu@chromium.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>