path: root/include
Age | Commit message | Author
2026-04-05 | mm: change the interface of prep_compound_tail() | Kiryl Shutsemau
Instead of passing down the head page and tail page index, pass the tail and head pages directly, as well as the order of the compound page. This is a preparation for changing how the head position is encoded in the tail page. Link: https://lkml.kernel.org/r/20260227194302.274384-3-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: move MAX_FOLIO_ORDER definition to mmzone.h | Kiryl Shutsemau
Patch series "mm: Eliminate fake head pages from vmemmap optimization", v7.

This series removes "fake head pages" from the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page. It simplifies compound_head() and page_ref_add_unless(). Both are in the hot path.

Background
==========

HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages and remapping the freed virtual addresses to a single physical page. Previously, all tail page vmemmap entries were remapped to the first vmemmap page (containing the head struct page), creating "fake heads" - tail pages that appear to have PG_head set when accessed through the deduplicated vmemmap. This required special handling in compound_head() to detect and work around fake heads, adding complexity and overhead to a very hot path.

New Approach
============

For architectures/configs where sizeof(struct page) is a power of 2 (the common case), this series changes how the position of the head page is encoded in the tail pages. Instead of storing a pointer to the head page, the ->compound_info (renamed from ->compound_head) now stores a mask. The mask can be applied to any tail page's virtual address to compute the head page address.

Critically, all tail pages of the same order now have identical compound_info values, regardless of which compound page they belong to.

In v7, these shared tail pages are allocated per-zone. This ensures that zone information (stored in page->flags) is correct even for shared tail pages, removing the need for the special-casing in page_zonenum() proposed in earlier versions.

To support per-zone shared pages for boot-allocated gigantic pages, the vmemmap population is deferred until zones are initialized. This simplifies the logic significantly and allows the removal of vmemmap_undo_hvo().

Benefits
========

1. Simplified compound_head(): No fake head detection needed, can be implemented in a branchless manner.
2. Simplified page_ref_add_unless(): RCU protection removed since there's no race with fake head remapping.
3. Cleaner architecture: The shared tail pages are truly read-only and contain valid tail page metadata.

If sizeof(struct page) is not power-of-2, there are no functional changes. HVO is not supported in this configuration.

I had hoped to see performance improvement, but my testing thus far has shown either no change or only a slight improvement within the noise.

Series Organization
===================

Patch 1: Move MAX_FOLIO_ORDER definition to mmzone.h.
Patches 2-4: Refactoring of field names and interfaces.
Patches 5-6: Architecture alignment for LoongArch and RISC-V.
Patch 7: Mask-based compound_head() implementation.
Patch 8: Add memmap alignment checks.
Patch 9: Branchless compound_head() optimization.
Patch 10: Defer vmemmap population for bootmem hugepages.
Patch 11: Refactor vmemmap_walk.
Patch 12: x86 vDSO build fix.
Patch 13: Eliminate fake heads with per-zone shared tail pages.
Patches 14-16: Cleanup of fake head infrastructure.
Patch 17: Documentation update.
Patch 18: Use compound_head() in page_slab().

This patch (of 17):

Move MAX_FOLIO_ORDER definition from mm.h to mmzone.h. This is preparation for adding the vmemmap_tails array to struct zone, which requires MAX_FOLIO_ORDER to be available in mmzone.h.
Link: https://lkml.kernel.org/r/20260227194302.274384-1-kas@kernel.org Link: https://lkml.kernel.org/r/20260227194302.274384-2-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
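The mask-based encoding described in the cover letter above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the series' actual code: struct toy_page, toy_tail_info(), toy_compound_head() and toy_memmap_alloc() are hypothetical names, the use of bit 0 as a tail marker is an assumption made for this toy, and the model relies on the memmap being aligned to the size of one compound page's worth of struct pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for struct page: 64 bytes, a power of 2 (assumption). */
struct toy_page {
	uintptr_t compound_info;	/* 0 for non-tail pages in this model */
	unsigned char pad[64 - sizeof(uintptr_t)];
};

_Static_assert(sizeof(struct toy_page) == 64, "toy page must be 64 bytes");

/*
 * Hypothetical encoding: bit 0 marks a tail page, the remaining bits are a
 * mask that clears the offset of the page within its compound page. The
 * value depends only on the order, never on which compound page it is in.
 */
static uintptr_t toy_tail_info(unsigned int order)
{
	uintptr_t span = (uintptr_t)sizeof(struct toy_page) << order;

	return ~(span - 1) | 1;
}

/*
 * Branchless head lookup: for a non-tail page (bit 0 clear) the mask
 * becomes all ones, so the page maps to itself.
 */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	uintptr_t info = page->compound_info;
	uintptr_t tail = info & 1;
	uintptr_t mask = info | (tail - 1);

	return (struct toy_page *)((uintptr_t)page & mask);
}

/*
 * Allocate a toy memmap holding nr_compound compound pages of the given
 * order. The memmap is aligned to one compound page's span so that the
 * mask lands on the head. The caller frees the result with free().
 */
static struct toy_page *toy_memmap_alloc(size_t nr_compound, unsigned int order)
{
	size_t per = (size_t)1 << order;
	size_t span = sizeof(struct toy_page) << order;
	struct toy_page *map = aligned_alloc(span, nr_compound * span);

	if (!map)
		return NULL;
	for (size_t i = 0; i < nr_compound * per; i++)
		map[i].compound_info = (i % per) ? toy_tail_info(order) : 0;
	return map;
}
```

In this model, every tail page of a given order carries the same compound_info value, which is the property that lets a single shared page of tail metadata back many compound pages.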
2026-04-05 | folio_batch: rename PAGEVEC_SIZE to FOLIO_BATCH_SIZE | Tal Zussman
struct pagevec no longer exists. Rename the macro appropriately. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-4-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | folio_batch: rename pagevec.h to folio_batch.h | Tal Zussman
struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct pagevec"). Rename include/linux/pagevec.h to reflect reality and update includes tree-wide. Add the new filename to MAINTAINERS explicitly, as it no longer matches the "include/linux/page[-_]*" pattern in MEMORY MANAGEMENT - CORE. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-3-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: remove stray references to struct pagevec | Tal Zussman
Patch series "mm: Remove stray references to pagevec", v2. struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct pagevec"). Remove any stray references to it and rename relevant files and macros accordingly. While at it, remove unnecessary #includes of pagevec.h (now folio_batch.h) in .c files. There are probably more of these that could be removed in .h files, but those are more complex to verify. This patch (of 4): struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct pagevec"). Remove remaining forward declarations and change __folio_batch_release()'s declaration to match its definition. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-0-716868cc2d11@columbia.edu Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-1-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: introduce vm_mmap_shadow_stack() as a helper for VM_SHADOW_STACK mappings | Catalin Marinas
Patch series "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE", v2. A series to extract the common shadow stack mmap into a separate helper for arm64, riscv and x86. This patch (of 5): arm64, riscv and x86 use a similar pattern for mapping the user shadow stack (cloned from x86). Extract this into a helper to facilitate code reuse. The call to do_mmap() from the new helper uses PROT_READ|PROT_WRITE prot bits instead of the PROT_READ with an explicit VM_WRITE vm_flag. The x86 intent was to avoid PROT_WRITE implying normal write since the shadow stack is not writable by normal stores. However, from a kernel perspective, the vma is writeable. Functionally there is no difference. Link: https://lkml.kernel.org/r/20260225161404.3157851-1-catalin.marinas@arm.com Link: https://lkml.kernel.org/r/20260225161404.3157851-2-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Paul Walmsley <pjw@kernel.org> Cc: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails | Lance Yang
When freeing page tables, we try to batch them. If batch allocation fails (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without batching. On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single process is unmapping memory. IPI broadcast was reported to hurt RT workloads[1]. tlb_remove_table_sync_one() synchronizes with lockless page-table walkers (e.g. GUP-fast) that rely on IRQ disabling. These walkers use local_irq_disable(), which is also an RCU read-side critical section. This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace period (synchronize_rcu()) instead of IPI broadcast. This provides the same guarantee as IPI but without disrupting all CPUs. Since batch allocation already failed, we are in a slow path where sleeping is acceptable - we are in process context (unmap_region, exit_mmap) with only mmap_lock held. tlb_remove_table_sync_one() is retained for other callers (e.g., khugepaged after pmdp_collapse_flush(), tlb_finish_mmu() when tlb->fully_unshared_tables) that are not slow paths. Converting those may require different approaches such as targeted IPIs. 
Link: https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/ [1] Link: https://lore.kernel.org/linux-mm/20260202150957.GD1282955@noisy.programming.kicks-ass.net/ Link: https://lore.kernel.org/linux-mm/dfdfeac9-5cd5-46fc-a5c1-9ccf9bd3502a@intel.com/ Link: https://lore.kernel.org/linux-mm/bc489455-bb18-44dc-8518-ae75abda6bec@kernel.org/ Link: https://lkml.kernel.org/r/20260224142101.20500-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: vmscan: add PIDs to vmscan tracepoints | Thomas Ballasi
This change adds additional tracepoint variables to help debuggers attribute vmscan events to specific processes. Link: https://lkml.kernel.org/r/20260316160908.42727-4-tballasi@linux.microsoft.com Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: vmscan: add cgroup IDs to vmscan tracepoints | Thomas Ballasi
Memory reclaim events are currently difficult to attribute to specific cgroups, making debugging memory pressure issues challenging. This patch adds memory cgroup ID (memcg_id) to key vmscan tracepoints to enable better correlation and analysis. For operations not associated with a specific cgroup, the field is defaulted to 0. Link: https://lkml.kernel.org/r/20260316160908.42727-3-tballasi@linux.microsoft.com Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | tracing: add __event_in_*irq() helpers | Steven Rostedt
Patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints", v8. This patch (of 3): Some trace events want to expose in their output if they were triggered in an interrupt or softirq context. Instead of recording this in the event structure itself, as this information is stored in the flags portion of the event header, add helper macros that can be used in the print format: TP_printk("val=%d %s", __entry->val, __event_in_irq() ? "(in-irq)" : "") This will output "(in-irq)" for the event in the trace data if the event was triggered in hard or soft interrupt context. Link: https://lkml.kernel.org/r/20260316160908.42727-1-tballasi@linux.microsoft.com Link: https://lore.kernel.org/all/20251229132942.31a2b583@gandalf.local.home/ Link: https://lkml.kernel.org/r/20260316160908.42727-2-tballasi@linux.microsoft.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
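The idea in the commit above, deriving the interrupt context from flags already stored in the event header rather than recording it in each event, can be modelled in plain C. This is a rough userspace sketch, not the tracing subsystem's implementation: the toy_* names, the header layout, and the flag bit values are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative flag bits; names and values are assumptions for this toy. */
#define TOY_FLAG_HARDIRQ 0x08u
#define TOY_FLAG_SOFTIRQ 0x10u

/* Hypothetical common header that every recorded event carries. */
struct toy_event_header {
	unsigned char flags;
};

/* Hypothetical event: note that it has no "in irq" field of its own. */
struct toy_event {
	struct toy_event_header hdr;
	int val;
};

/* True if the event was recorded in hard or soft interrupt context. */
static int toy_event_in_irq(const struct toy_event *ev)
{
	return (ev->hdr.flags & (TOY_FLAG_HARDIRQ | TOY_FLAG_SOFTIRQ)) != 0;
}

/* Mirrors the print format quoted in the commit message. */
static int toy_format(const struct toy_event *ev, char *buf, size_t len)
{
	return snprintf(buf, len, "val=%d %s", ev->val,
			toy_event_in_irq(ev) ? "(in-irq)" : "");
}
```

The design point is that the event payload stays small: the context bit is read from the shared header at print time instead of being duplicated in every event structure.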
2026-04-05 | mm: memcontrol: switch to native NR_VMALLOC vmstat counter | Johannes Weiner
Eliminates the custom memcg counter and results in a single, consolidated accounting call in vmalloc code. Link: https://lkml.kernel.org/r/20260223160147.3792777-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: vmalloc: streamline vmalloc memory accounting | Johannes Weiner
Use a vmstat counter instead of a custom, open-coded atomic. This has the added benefit of making the data available per-node, and prepares for cleaning up the memcg accounting as well. Link: https://lkml.kernel.org/r/20260223160147.3792777-1-hannes@cmpxchg.org Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | kho: adopt radix tree for preserved memory tracking | Jason Miu
Patch series "Make KHO Stateless", v9. This series transitions KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel. The key motivations for this change are to: - Eliminate the need for data serialization before kexec. - Remove the KHO finalize state. - Pass preservation metadata more directly to the next kernel via the FDT. The new approach uses a radix tree to mark preserved pages. A page's physical address and its order are encoded into a single value. The tree is composed of multiple levels of page-sized tables, with leaf nodes being bitmaps where each set bit represents a preserved page. The physical address of the radix tree's root is passed in the FDT, allowing the next kernel to reconstruct the preserved memory map. This series is broken down into the following patches: 1. kho: Adopt radix tree for preserved memory tracking: Replaces the xarray-based tracker with the new radix tree implementation and increments the ABI version. 2. kho: Remove finalize state and clients: Removes the now-obsolete kho_finalize() function and its usage from client code and debugfs. This patch (of 2): Introduce a radix tree implementation for tracking preserved memory pages and switch the KHO memory tracking mechanism to use it. This lays the groundwork for a stateless KHO implementation that eliminates the need for serialization and the associated "finalize" state. This patch introduces the core radix tree data structures and constants to the KHO ABI. It adds the radix tree node and leaf structures, along with documentation for the radix tree key encoding scheme that combines a page's physical address and order. To support broader use by other kernel subsystems, such as hugetlb preservation, the core radix tree manipulation functions are exported as a public API. The xarray-based memory tracking is replaced with this new radix tree implementation. 
The core KHO preservation and unpreservation functions are wired up to use the radix tree helpers. On boot, the second kernel restores the preserved memory map by walking the radix tree whose root physical address is passed via the FDT. The ABI `compatible` version is bumped to "kho-v2" to reflect the structural changes in the preserved memory map and sub-FDT property names. This includes renaming "fdt" to "preserved-data" to better reflect that preserved state may use formats other than FDT. [ran.xiaokai@zte.com.cn: fix child node parsing for debugfs in/sub_fdts] Link: https://lkml.kernel.org/r/20260309033530.244508-1-ranxiaokai627@163.com Link: https://lkml.kernel.org/r/20260206021428.3386442-1-jasonmiu@google.com Link: https://lkml.kernel.org/r/20260206021428.3386442-2-jasonmiu@google.com Signed-off-by: Jason Miu <jasonmiu@google.com> Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
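One way to fold a page's physical address and order into a single radix tree key, as the commit above describes, is to place a marker bit above the order-shifted address. The sketch below is a hypothetical encoding chosen for illustration and is not claimed to match the KHO ABI: toy_encode_key(), toy_decode_key(), TOY_PAGE_SHIFT and the 4 KiB page size are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12u			/* assumes 4 KiB pages */
#define TOY_KEY_BITS	(64u - TOY_PAGE_SHIFT)	/* bits left for a page frame number */

/*
 * Hypothetical key layout: the address shifted right by (PAGE_SHIFT + order)
 * needs at most (TOY_KEY_BITS - order) bits, so a marker bit placed directly
 * above it identifies the order. Valid for order < TOY_KEY_BITS.
 */
static uint64_t toy_encode_key(uint64_t phys, unsigned int order)
{
	uint64_t marker = UINT64_C(1) << (TOY_KEY_BITS - order);

	return marker | (phys >> (TOY_PAGE_SHIFT + order));
}

/* Recover the address and order from the position of the highest set bit. */
static void toy_decode_key(uint64_t key, uint64_t *phys, unsigned int *order)
{
	unsigned int top = 0;
	uint64_t k = key;

	while (k >>= 1)
		top++;

	*order = TOY_KEY_BITS - top;
	*phys = (key ^ (UINT64_C(1) << top)) << (TOY_PAGE_SHIFT + *order);
}
```

With a layout like this, keys for different orders never collide, so a single tree can hold pages of every order, and the low bits of the key can index a leaf bitmap in which each set bit marks one preserved page.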
2026-04-05 | mm: khugepaged: skip lazy-free folios | Vernon Yang
For example, create three tasks: hot1 -> cold -> hot2. After all three tasks are created, each allocates 128 MB of memory. The hot1 and hot2 tasks continuously access their 128 MB of memory, while the cold task only accesses its memory briefly and then calls madvise(MADV_FREE). However, khugepaged still prioritizes scanning the cold task and only scans the hot2 task after completing the scan of the cold task.

All folios in VM_DROPPABLE mappings are lazyfree. Collapsing maintains that property, so we can just collapse, and memory pressure in the future will free the memory up. In contrast, collapsing in !VM_DROPPABLE mappings does not maintain that property: the collapsed folio will not be lazyfree and memory pressure in the future will not be able to free it up.

So if the user has explicitly informed us via MADV_FREE that this memory will be freed, and the vma does not have the VM_DROPPABLE flag, it is appropriate for khugepaged to simply skip it, thereby avoiding unnecessary scan and collapse operations and reducing wasted CPU time.

Here are the performance test results (for Throughput, bigger is better; for the other rows, smaller is better):

Testing on x86_64 machine:

| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.14 sec      | 2.93 sec      | -6.69%  |
| cycles per access   | 4.96          | 2.21          | -55.44% |
| Throughput          | 104.38 M/sec  | 111.89 M/sec  | +7.19%  |
| dTLB-load-misses    | 284814532     | 69597236      | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.35 sec      | 2.96 sec      | -11.64% |
| cycles per access   | 7.29          | 2.07          | -71.60% |
| Throughput          | 97.67 M/sec   | 110.77 M/sec  | +13.41% |
| dTLB-load-misses    | 241600871     | 3216108       | -98.67% |

[vernon2gm@gmail.com: add comment about VM_DROPPABLE in code, make it clearer]
Link: https://lkml.kernel.org/r/i4uowkt4h2ev47obm5h2vtd4zbk6fyw5g364up7kkjn2vmcikq@auepvqethj5r
Link:
https://lkml.kernel.org/r/20260221093918.1456187-5-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
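The skip rule argued for in the commit above reduces to a small predicate. The following is a minimal userspace sketch, not khugepaged's code: the toy_* names and the flag value are hypothetical, and modelling a lazy-free folio as "anonymous and not swap-backed" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_VM_DROPPABLE 0x1ul	/* hypothetical flag bit for this toy */

/* Hypothetical, minimal view of the folio state that matters here. */
struct toy_folio {
	bool anon;
	bool swapbacked;
};

/* Assumption: a lazy-free folio is anonymous and no longer swap-backed. */
static bool toy_folio_test_lazyfree(const struct toy_folio *folio)
{
	return folio->anon && !folio->swapbacked;
}

/*
 * Skip lazy-free folios unless the mapping is droppable: in a droppable
 * mapping the collapsed folio stays lazy-free, so collapsing is harmless.
 */
static bool toy_khugepaged_should_skip(const struct toy_folio *folio,
				       unsigned long vm_flags)
{
	if (vm_flags & TOY_VM_DROPPABLE)
		return false;
	return toy_folio_test_lazyfree(folio);
}
```

Checking the mapping flag first keeps the droppable case on the collapse path regardless of folio state, which matches the reasoning that collapsing there preserves the lazy-free property.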
2026-04-05 | mm: add folio_test_lazyfree helper | Vernon Yang
Add folio_test_lazyfree() function to identify lazy-free folios to improve code readability. Link: https://lkml.kernel.org/r/20260221093918.1456187-4-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: khugepaged: add trace_mm_khugepaged_scan event | Vernon Yang
Patch series "Improve khugepaged scan logic", v8.

This series improves the khugepaged scan logic and reduces CPU consumption by prioritizing scanning tasks that access memory frequently.

The following data is traced by bpftrace[1] on a desktop system. After the system has been left idle for 10 minutes upon booting, a lot of SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan by khugepaged.

@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
@scan_pmd_status[6]: 2 ## SCAN_EXCEED_SHARED_PTE
@scan_pmd_status[3]: 142 ## SCAN_PMD_MAPPED
@scan_pmd_status[2]: 178 ## SCAN_NO_PTE_TABLE
total progress size: 674 MB
Total time : 419 seconds ## include khugepaged_scan_sleep_millisecs

khugepaged exhibits the following behavior: the khugepaged list is scanned in a FIFO manner, and as long as a task is not destroyed,

1. a task that no longer has memory that can be collapsed into hugepages continues to be scanned every time.
2. a cold task at the front of the khugepaged scan list is still scanned first.
3. every task is scanned at intervals of khugepaged_scan_sleep_millisecs (default 10s).

If we always scan the first two cases first, valid scans have to wait a long time.

For the first case, when the memory is either SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE or SCAN_PTE_MAPPED_HUGEPAGE [5], just skip it.

For the second case, if the user has explicitly informed us via MADV_FREE that these folios will be freed, just skip them.

Below are some performance test results.
kernbench results (testing on x86_64 machine):

| metric            | baseline w/o patches | test w/ patches    |
|-------------------|----------------------|--------------------|
| Amean user-32     | 18522.51 ( 0.00%)    | 18333.64 * 1.02%*  |
| Amean syst-32     | 1137.96 ( 0.00%)     | 1113.79 * 2.12%*   |
| Amean elsp-32     | 666.04 ( 0.00%)      | 659.44 * 0.99%*    |
| BAmean-95 user-32 | 18520.01 ( 0.00%)    | 18323.57 ( 1.06%)  |
| BAmean-95 syst-32 | 1137.68 ( 0.00%)     | 1110.50 ( 2.39%)   |
| BAmean-95 elsp-32 | 665.92 ( 0.00%)      | 659.06 ( 1.03%)    |
| BAmean-99 user-32 | 18520.01 ( 0.00%)    | 18323.57 ( 1.06%)  |
| BAmean-99 syst-32 | 1137.68 ( 0.00%)     | 1110.50 ( 2.39%)   |
| BAmean-99 elsp-32 | 665.92 ( 0.00%)      | 659.06 ( 1.03%)    |

Create three tasks[2]: hot1 -> cold -> hot2. After all three tasks are created, each allocates 128 MB of memory. The hot1 and hot2 tasks continuously access their 128 MB of memory, while the cold task only accesses its memory briefly and then calls madvise(MADV_FREE).

Here are the performance test results (for Throughput, bigger is better; for the other rows, smaller is better):

Testing on x86_64 machine:

| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.14 sec      | 2.93 sec      | -6.69%  |
| cycles per access   | 4.96          | 2.21          | -55.44% |
| Throughput          | 104.38 M/sec  | 111.89 M/sec  | +7.19%  |
| dTLB-load-misses    | 284814532     | 69597236      | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.35 sec      | 2.96 sec      | -11.64% |
| cycles per access   | 7.29          | 2.07          | -71.60% |
| Throughput          | 97.67 M/sec   | 110.77 M/sec  | +13.41% |
| dTLB-load-misses    | 241600871     | 3216108       | -98.67% |

This patch (of 4):

Add the mm_khugepaged_scan event to track the total time of a full khugepaged scan and the total number of pages scanned.
Link: https://lkml.kernel.org/r/20260221093918.1456187-2-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | mm: cache struct page for empty_zero_page and return it from ZERO_PAGE() | Mike Rapoport (Microsoft)
For most architectures every invocation of ZERO_PAGE() does virt_to_page(empty_zero_page). But empty_zero_page is in BSS and it is enough to get its struct page once at initialization time and then use it whenever a zero page should be accessed. Add yet another __zero_page variable that will be initialized as virt_to_page(empty_zero_page) for most architectures in a weak arch_setup_zero_pages() function. For architectures that use colored zero pages (MIPS and s390) rename their setup_zero_pages() to arch_setup_zero_pages() and make it global rather than static. For architectures that cannot use virt_to_page() for BSS (arm64 and sparc64) add override of arch_setup_zero_pages(). Link: https://lkml.kernel.org/r/20260211103141.3215197-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. 
Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05 | arch, mm: consolidate empty_zero_page | Mike Rapoport (Microsoft)
Reduce 22 declarations of empty_zero_page to 3 and 23 declarations of ZERO_PAGE() to 4. Every architecture defines empty_zero_page one way or another, but for most of them it is always a page-aligned page in BSS and most definitions of ZERO_PAGE do virt_to_page(empty_zero_page). Move Linus vetted x86 definition of empty_zero_page and ZERO_PAGE() to the core MM and drop these definitions in architectures that do not implement colored zero pages (those that do are MIPS and s390). ZERO_PAGE() remains a macro because turning it into a wrapper for a static inline causes severe pain in header dependencies. For the most part the change is mechanical, with these being noteworthy: * alpha: aliased empty_zero_page with ZERO_PGE that was also used for boot parameters. Switching to a generic empty_zero_page removes the aliasing and keeps ZERO_PGE for boot parameters only. * arm64: uses __pa_symbol() in ZERO_PAGE() so that definition of ZERO_PAGE() is kept intact. * m68k/parisc/um: allocated empty_zero_page from memblock, although they do not support zero page coloring and having it in BSS will work fine. * sparc64 can have empty_zero_page in BSS rather than allocate it, but it can't use virt_to_page() for BSS. Keep its definition of ZERO_PAGE() but instead of allocating it, make mem_map_zero point to empty_zero_page. * sh: used empty_zero_page for boot parameters at the very early boot. Rename the parameters page to boot_params_page and let sh use the generic empty_zero_page. * hexagon: had an amusing comment about empty_zero_page /* A handy thing to have if one has the RAM. 
Declared in head.S */ that unfortunately had to go :) Link: https://lkml.kernel.org/r/20260211103141.3215197-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Helge Deller <deller@gmx.de> [parisc] Tested-by: Helge Deller <deller@gmx.de> [parisc] Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Magnus Lindholm <linmag7@gmail.com> [alpha] Acked-by: Dinh Nguyen <dinguyen@kernel.org> [nios2] Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: rename my_zero_pfn() to zero_pfn()Mike Rapoport (Microsoft)
my_zero_pfn() is a silly name. Rename zero_pfn variable to zero_page_pfn and my_zero_pfn() function to zero_pfn(). While on it, move extern declarations of zero_page_pfn outside the functions that use it and add a comment about what ZERO_PAGE is. Link: https://lkml.kernel.org/r/20260211103141.3215197-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: don't special case !MMU for is_zero_pfn() and my_zero_pfn()Mike Rapoport (Microsoft)
Patch series "arch, mm: consolidate empty_zero_page", v3. These patches cleanup handling of ZERO_PAGE() and zero_pfn. This patch (of 4): nommu architectures have empty_zero_page and define ZERO_PAGE() and although they don't really use it to populate page tables, there is no reason to hardwire !MMU implementation of is_zero_pfn() and my_zero_pfn() to 0. Drop #ifdef CONFIG_MMU around implementations of is_zero_pfn() and my_zero_pfn() and remove !MMU version. While on it, make zero_pfn __ro_after_init. Link: https://lkml.kernel.org/r/20260211103141.3215197-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20260211103141.3215197-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Christophe Leroy (CS GROUP) 
<chleroy@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: name the anonymous MMOP enum as enum mmopGregory Price
Give the MMOP enum (MMOP_OFFLINE, MMOP_ONLINE, etc) a proper type name so the compiler can help catch invalid values being assigned to variables of this type. Leave the existing functions returning int alone to allow for value-or-error pattern to remain unchanged without churn. mmop_default_online_type is left as int because it uses the -1 sentinel value to signal it hasn't been initialized yet. Keep the uint8_t buffer in offline_and_remove_memory() as-is for space efficiency, with an explicit cast when we consume the value. Move the enum definition before the CONFIG_MEMORY_HOTPLUG guard so it is unconditionally available for struct memory_block in memory.h. No functional change. Link: https://lore.kernel.org/linux-mm/3424eba7-523b-4351-abd0-3a888a3e5e61@kernel.org/ Link: https://lkml.kernel.org/r/20260211215447.2194189-1-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com> Suggested-by: "David Hildenbrand (arm)" <david@kernel.org> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
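The typing pattern described in this message can be illustrated with a small userspace sketch. It is not the kernel code: only MMOP_OFFLINE and MMOP_ONLINE are named in the message, so the other enum members, all numeric values, and the helper names parse_online_type(), mmop_name() and mmop_from_byte() are assumptions made for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Named enum. MMOP_OFFLINE and MMOP_ONLINE come from the commit message;
 * the remaining members and all numeric values are assumptions.
 */
enum mmop {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

/*
 * Hypothetical parser that keeps the value-or-error convention: the return
 * type stays int so that a negative errno can share the return channel with
 * a valid enum mmop value.
 */
static int parse_online_type(const char *s)
{
	if (!strcmp(s, "offline"))
		return MMOP_OFFLINE;
	if (!strcmp(s, "online"))
		return MMOP_ONLINE;
	if (!strcmp(s, "online_kernel"))
		return MMOP_ONLINE_KERNEL;
	if (!strcmp(s, "online_movable"))
		return MMOP_ONLINE_MOVABLE;
	return -EINVAL;
}

/* Hypothetical consumer: the typed parameter documents the valid range. */
static const char *mmop_name(enum mmop op)
{
	switch (op) {
	case MMOP_OFFLINE:
		return "offline";
	case MMOP_ONLINE:
		return "online";
	case MMOP_ONLINE_KERNEL:
		return "online_kernel";
	case MMOP_ONLINE_MOVABLE:
		return "online_movable";
	}
	return "unknown";
}

/* Compact uint8_t storage, widened with an explicit cast when consumed. */
static enum mmop mmop_from_byte(uint8_t raw)
{
	return (enum mmop)raw;
}
```

A caller would check the int result of parse_online_type() for a negative value first and only then pass it on as an enum mmop, which is why the function-level return types can stay int while variables and parameters gain the named type.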
2026-04-05mm: zswap: add per-memcg stat for incompressible pagesJiayuan Chen
Patch series "mm: zswap: add per-memcg stat for incompressible pages", v3. In containerized environments, knowing which cgroup is contributing incompressible pages to zswap is essential for effective resource management. This series adds a new per-memcg stat 'zswap_incomp' to track incompressible pages, along with a selftest. This patch (of 2): The global zswap_stored_incompressible_pages counter was added in commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page as-is") to track how many pages are stored in raw (uncompressed) form in zswap. However, in containerized environments, knowing which cgroup is contributing incompressible pages is essential for effective resource management [1]. Add a new memcg stat 'zswap_incomp' to track incompressible pages per cgroup. This helps administrators and orchestrators to: 1. Identify workloads that produce incompressible data (e.g., encrypted data, already-compressed media, random data) and may not benefit from zswap. 2. Make informed decisions about workload placement - moving incompressible workloads to nodes with larger swap backing devices rather than relying on zswap. 3. Debug zswap efficiency issues at the cgroup level without needing to correlate global stats with individual cgroups. While the compression ratio can be estimated from existing stats (zswap / zswapped * PAGE_SIZE), this doesn't distinguish between "uniformly poor compression" and "a few completely incompressible pages mixed with highly compressible ones". The zswap_incomp stat provides direct visibility into the latter case. 
Link: https://lkml.kernel.org/r/20260213071827.5688-1-jiayuan.chen@linux.dev Link: https://lkml.kernel.org/r/20260213071827.5688-2-jiayuan.chen@linux.dev Link: https://lore.kernel.org/linux-mm/CAF8kJuONDFj4NAksaR4j_WyDbNwNGYLmTe-o76rqU17La=nkOw@mail.gmail.com/ [1] Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
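The argument that an average compression ratio cannot separate "uniformly poor compression" from "a few incompressible pages" can be checked with a small arithmetic sketch. Everything in it is assumed for illustration: the 4096-byte page size, the rule that a page counts as incompressible when its compressed size is not smaller than a page, and the names zswap_sketch_stats, account_pages() and fill_sizes(). It models the accounting idea only, not the zswap implementation.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical per-cgroup totals, loosely modelled on the stats discussed. */
struct zswap_sketch_stats {
	size_t stored_bytes;	/* bytes occupied in the compressed pool */
	size_t stored_pages;	/* pages swapped out into the pool */
	size_t incomp_pages;	/* pages stored raw because they did not shrink */
};

/* Account n pages given the size each one compressed to. */
static struct zswap_sketch_stats account_pages(const unsigned int *sizes,
					       size_t n)
{
	struct zswap_sketch_stats st = { 0, 0, 0 };

	for (size_t i = 0; i < n; i++) {
		unsigned int sz = sizes[i];

		/* Assumption: no gain from compression means "store as-is". */
		if (sz >= SKETCH_PAGE_SIZE) {
			sz = SKETCH_PAGE_SIZE;
			st.incomp_pages++;
		}
		st.stored_bytes += sz;
		st.stored_pages++;
	}
	return st;
}

/* Set sizes[from..to) to the same compressed size. */
static void fill_sizes(unsigned int *sizes, size_t from, size_t to,
		       unsigned int value)
{
	for (size_t i = from; i < to; i++)
		sizes[i] = value;
}
```

With 100 pages that all compress to 2048 bytes, and with 20 raw pages plus 80 pages that compress to 1536 bytes, the pool holds 204800 bytes in both cases, so the average ratio is identical. Only the incompressible-page count (0 versus 20) tells the two workloads apart.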
2026-04-05mm/damon: remove unused target param of get_scheme_score()Asier Gutierrez
damon_target is not used by get_scheme_score operations, neither with virtual nor with physical addresses. Link: https://lkml.kernel.org/r/20260213145032.1740407-1-gutierrez.asier@huawei-partners.com Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Quanmin Yan <yanquanmin1@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: memfd_luo: preserve file sealsPratyush Yadav (Google)
File seals are used on memfd for making shared memory communication with untrusted peers safer and simpler. Seals provide a guarantee that certain operations won't be allowed on the file such as writes or truncations. Maintaining these guarantees across a live update will help keeping such use cases secure. These guarantees will also be needed for IOMMUFD preservation with LUO. Normally when IOMMUFD maps a memfd, it pins all its pages to make sure any truncation operations on the memfd don't lead to IOMMUFD using freed memory. This doesn't work with LUO since the preserved memfd might have completely different pages after a live update, and mapping them back to the IOMMUFD will cause all sorts of problems. Using and preserving the seals allows IOMMUFD preservation logic to trust the memfd. Since the uABI defines seals as an int, preserve them by introducing a new u32 field. There are currently only 6 possible seals, so the extra bits are unused and provide room for future expansion. Since the seals are uABI, it is safe to use them directly in the ABI. While at it, also add a u32 flags field. It makes sure the struct is nicely aligned, and can be used later to support things like MFD_CLOEXEC. Since the serialization structure is changed, bump the version number to "memfd-v2". It is important to note that the memfd-v2 version only supports seals that existed when this version was defined. This set is defined by MEMFD_LUO_ALL_SEALS. Any new seal might bring a completely different semantic with it and the parser for memfd-v2 cannot be expected to deal with that. If there are any future seals added, they will need another version bump. 
Link: https://lkml.kernel.org/r/20260216185946.1215770-3-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Tested-by: Samiullah Khawaja <skhawaja@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
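The versioning rule described above (a memfd-v2 parser accepts only the seals that existed when the version was defined) can be sketched as a mask check. This is a userspace illustration, not the memfd_luo code: the struct name, its reduced two-field layout, the SKETCH_ constants and sketch_check_seals() are hypothetical. The bit values mirror the six F_SEAL_* uAPI constants, and SKETCH_ALL_SEALS stands in for MEMFD_LUO_ALL_SEALS, whose real definition is not shown in the message.

```c
#include <errno.h>
#include <stdint.h>

/* Local copies of the six seal bits, using the F_SEAL_* uAPI values. */
#define SKETCH_SEAL_SEAL		0x0001u
#define SKETCH_SEAL_SHRINK		0x0002u
#define SKETCH_SEAL_GROW		0x0004u
#define SKETCH_SEAL_WRITE		0x0008u
#define SKETCH_SEAL_FUTURE_WRITE	0x0010u
#define SKETCH_SEAL_EXEC		0x0020u

/* Assumed stand-in for MEMFD_LUO_ALL_SEALS: every seal known to this version. */
#define SKETCH_ALL_SEALS						\
	(SKETCH_SEAL_SEAL | SKETCH_SEAL_SHRINK | SKETCH_SEAL_GROW |	\
	 SKETCH_SEAL_WRITE | SKETCH_SEAL_FUTURE_WRITE | SKETCH_SEAL_EXEC)

/*
 * Hypothetical slice of the serialized record. Only the two fields the
 * message talks about are modelled; the real structure has other members.
 */
struct memfd_ser_sketch {
	uint32_t seals;	/* seal bits preserved across the live update */
	uint32_t flags;	/* reserved for later use, keeps 8-byte alignment */
};

/*
 * Reject a record carrying a seal this format version does not know about,
 * because a newer seal may have semantics the parser cannot honour.
 */
static int sketch_check_seals(const struct memfd_ser_sketch *ser)
{
	if (ser->seals & ~SKETCH_ALL_SEALS)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits instead of silently dropping them keeps the guarantee intact: a restored memfd never ends up with weaker seals than the one that was preserved.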
2026-04-05memfd: export memfd_{add,get}_seals()Pratyush Yadav (Google)
Patch series "mm: memfd_luo: preserve file seals", v2. This series adds support for preserving file seals when preserving a memfd using LUO. Patch 1 exports some memfd seal manipulation functions and patch 2 adds support for preserving them. Since it makes changes to the serialized data structure for memfd, it also bumps the version number. This patch (of 2): Support for preserving file seals will be added to memfd preservation using the Live Update Orchestrator (LUO). Export memfd_{add,get}_seals() so memfd_luo can use them to manipulate the seals. Link: https://lkml.kernel.org/r/20260216185946.1215770-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260216185946.1215770-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Samiullah Khawaja <skhawaja@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm, swap: use the swap table to track the swap countKairui Song
Now all the infrastructures are ready, switch to using the swap table only. This is unfortunately a large patch because the whole old counting mechanism, especially SWP_CONTINUED, has to be gone and switch to the new mechanism together, with no intermediate steps available. The swap table is capable of holding up to SWP_TB_COUNT_MAX - 1 counts in the higher bits of each table entry, so using that, the swap_map can be completely dropped. swap_map also had a limit of SWAP_CONT_MAX. Any value beyond that limit will require a COUNT_CONTINUED page. COUNT_CONTINUED is a bit complex to maintain, so for the swap table, a simpler approach is used: when the count goes beyond SWP_TB_COUNT_MAX - 1, the cluster will have an extend_table allocated, which is a swap cluster-sized array of unsigned int. The counting is basically offloaded there until the count drops below SWP_TB_COUNT_MAX again. Both the swap table and the extend table are cluster-based, so they exhibit good performance and sparsity. To make the switch from swap_map to swap table clean, this commit cleans up and introduces a new set of functions based on the swap table design, for manipulating swap counts: - __swap_cluster_dup_entry, __swap_cluster_put_entry, __swap_cluster_alloc_entry, __swap_cluster_free_entry: Increase/decrease the count of a swap slot, or alloc / free a swap slot. This is the internal routine that does the counting work based on the swap table and handles all the complexities. The caller will need to lock the cluster before calling them. All swap count-related update operations are wrapped by these four helpers. - swap_dup_entries_cluster, swap_put_entries_cluster: Increase/decrease the swap count of one or a set of swap slots in the same cluster range. These two helpers serve as the common routines for folio_dup_swap & swap_dup_entry_direct, or folio_put_swap & swap_put_entries_direct. And use these helpers to replace all existing callers. 
This helps to simplify the count tracking by a lot, and the swap_map is gone. [ryncsn@gmail.com: fix build] Link: https://lkml.kernel.org/r/aZWuLZi-vYi3vAWe@KASONG-MC4 Link: https://lkml.kernel.org/r/20260218-swap-table-p3-v3-9-f4e34be021a7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <lkp@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
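The overflow scheme described above (a small in-entry count that spills into a lazily allocated per-cluster extend table) can be sketched in userspace C. This is a toy model under loud assumptions: the real swap table packs the count into the high bits of each entry, whereas the sketch uses a separate byte per slot; SKETCH_COUNT_MAX and SKETCH_CLUSTER_SLOTS are tiny stand-ins for SWP_TB_COUNT_MAX and the cluster size; the struct and the sketch_ functions are hypothetical; and the kernel's locking and extend-table freeing policy are not modelled.

```c
#include <errno.h>
#include <stdlib.h>

#define SKETCH_CLUSTER_SLOTS	8u	/* toy cluster size */
#define SKETCH_COUNT_MAX	4u	/* toy stand-in for SWP_TB_COUNT_MAX */

/* Hypothetical cluster: small per-slot counts plus an optional overflow array. */
struct cluster_sketch {
	unsigned char inline_count[SKETCH_CLUSTER_SLOTS];
	unsigned int *extend_table;	/* allocated on first overflow */
};

/* An inline value of SKETCH_COUNT_MAX means "the real count is offloaded". */
static unsigned int sketch_count(const struct cluster_sketch *c,
				 unsigned int slot)
{
	if (c->inline_count[slot] == SKETCH_COUNT_MAX)
		return c->extend_table[slot];
	return c->inline_count[slot];
}

/* Increase the count of one slot, spilling into the extend table on overflow. */
static int sketch_dup(struct cluster_sketch *c, unsigned int slot)
{
	if (c->inline_count[slot] < SKETCH_COUNT_MAX - 1) {
		c->inline_count[slot]++;
		return 0;
	}
	if (c->inline_count[slot] == SKETCH_COUNT_MAX - 1) {
		if (!c->extend_table) {
			c->extend_table = calloc(SKETCH_CLUSTER_SLOTS,
						 sizeof(*c->extend_table));
			if (!c->extend_table)
				return -ENOMEM;
		}
		/* Offload: the extend table now owns the full count. */
		c->extend_table[slot] = SKETCH_COUNT_MAX;
		c->inline_count[slot] = SKETCH_COUNT_MAX;
		return 0;
	}
	c->extend_table[slot]++;
	return 0;
}

/* Decrease the count, moving it back inline once it fits again. */
static int sketch_put(struct cluster_sketch *c, unsigned int slot)
{
	if (sketch_count(c, slot) == 0)
		return -EINVAL;
	if (c->inline_count[slot] == SKETCH_COUNT_MAX) {
		if (--c->extend_table[slot] < SKETCH_COUNT_MAX) {
			c->inline_count[slot] =
				(unsigned char)c->extend_table[slot];
			c->extend_table[slot] = 0;
		}
		return 0;
	}
	c->inline_count[slot]--;
	return 0;
}

/* Drop the overflow array; a real implementation would free it when unused. */
static void sketch_release(struct cluster_sketch *c)
{
	free(c->extend_table);
	c->extend_table = NULL;
}
```

Because the overflow array exists only for clusters that actually contain a heavily shared slot, the common case pays nothing beyond the inline bits, which is the sparsity property the message attributes to the cluster-based design.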
2026-04-05mm: move pgscan, pgsteal, pgrefill to node statsJP Kobryn (Meta)
There are situations where reclaim kicks in on a system with free memory. One possible cause is a NUMA imbalance scenario where one or more nodes are under pressure. It would help if we could easily identify such nodes. Move the pgscan, pgsteal, and pgrefill counters from vm_event_item to node_stat_item to provide per-node reclaim visibility. With these counters as node stats, the values are now displayed in the per-node section of /proc/zoneinfo, which allows for quick identification of the affected nodes. /proc/vmstat continues to report the same counters, aggregated across all nodes. But the ordering of these items within the readout changes as they move from the vm events section to the node stats section. Memcg accounting of these counters is preserved. The relocated counters remain visible in memory.stat alongside the existing aggregate pgscan and pgsteal counters. However, this change affects how the global counters are accumulated. Previously, the global event count update was gated on !cgroup_reclaim(), excluding memcg-based reclaim from /proc/vmstat. Now that mod_lruvec_state() is being used to update the counters, the global counters will include all reclaim. This is consistent with how pgdemote counters are already tracked. Finally, the virtio_balloon driver is updated to use global_node_page_state() to fetch the counters, as they are no longer accessible through the vm_events array. Link: https://lkml.kernel.org/r/20260219235846.161910-1-jp.kobryn@linux.dev Signed-off-by: JP Kobryn <jp.kobryn@linux.dev> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michael S. 
Tsirkin <mst@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@kernel.org> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05maple_tree: start using maple copy node for destinationLiam R. Howlett
Stop using the maple subtree state and big node in favour of using three destinations in the maple copy node. That is, expand the way leaves were handled to all levels of the tree and use the maple copy node to track the new nodes. Extract out the sibling init into the data calculation since this is where the insufficient data can be detected. The remainder of the sibling code to shift the next iteration is moved to the spanning_ascend() function, since it is not always needed. Next introduce the dst_setup() function which will decide how many nodes are needed to contain the data at this level. Using the destination count, populate the copy node's dst array with the new nodes and set d_count to the correct value. Note that this can be tricky in the case of a leaf node with exactly enough room because of the rule against NULLs at the end of leaves. Once the destinations are ready, copy the data by altering the cp_data_write() function to copy from the sources to the destinations directly. This eliminates the use of the big node in this code path. On node completion, node_finalise() will zero out the remaining area and set the metadata, if necessary. spanning_ascend() is used to decide if the operation is complete. It may create a new root, converge into one destination, or continue upwards by ascending the left and right write maple states. One test case setup needed to be tweaked so that the targeted node was surrounded by full nodes. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20260130205935.2559335-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. 
Howlett <Liam.Howlett@oracle.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05maple_tree: add gap support, slot and pivot sizes for maple copyLiam R. Howlett
Add plumbing work for using maple copy as a normal node for a source of copy operations. This is needed later. Link: https://lkml.kernel.org/r/20260130205935.2559335-17-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05maple_tree: change initial big node setup in mas_wr_spanning_rebalance()Liam R. Howlett
Instead of copying the data into the big node and finding out that the data may need to be moved or appended to, calculate the data space up front (in the maple copy node) and set up another source for the copy. The additional copy source is tracked in the maple state sib (short for sibling), and is put into the maple write states for future operations after the data is in the big node. To facilitate the newly moved node, some initial setup of the maple subtree state are relocated after the potential shift caused by the new way of rebalancing against a sibling. Link: https://lkml.kernel.org/r/20260130205935.2559335-15-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05maple_tree: introduce maple_copy node and use it in mas_spanning_rebalance()Liam R. Howlett
Introduce an internal-memory only node type called maple_copy to facilitate internal copy operations. Use it in mas_spanning_rebalance() for just the leaf nodes. Initially, the maple_copy node is used to configure the source nodes and copy the data into the big_node. The maple_copy contains a list of source entries with start and end offsets. One of the maple_copy entries can be itself with an offset of 0 to 2, representing the data where the store partially overwrites entries, or fully overwrites the entry. The side effect is that the source nodes no longer have to worry about partially copying the existing offset if it is not fully overwritten. This is in preparation of removal of the maple big_node, but for the time being the data is copied to the big node to limit the change size. Link: https://lkml.kernel.org/r/20260130205935.2559335-12-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-24Merge tag 'mm-hotfixes-stable-2026-03-23-17-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "6 hotfixes. 2 are cc:stable. All are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-03-23-17-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/damon/stat: monitor all System RAM resources mm/zswap: add missing kunmap_local() mailmap: update email address for Muhammad Usama Anjum zram: do not slot_free() written-back slots mm/damon/core: avoid use of half-online-committed context mm/rmap: clear vma->anon_vma on error
2026-03-23Merge tag 'xsa482-7.0-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Restrict the xen privcmd driver in unprivileged domU to only allow hypercalls to target domain when using secure boot" * tag 'xsa482-7.0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/privcmd: add boot control for restricted usage in domU xen/privcmd: restrict usage in unprivileged domU
2026-03-22Merge tag 'trace-v7.0-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Revert "tracing: Remove pid in task_rename tracing output" A change was made to remove the pid field from the task_rename event because it was thought that it was always done for the current task and recording the pid would be redundant. This turned out to be incorrect: there are a few corner cases where this is not true, and the change caused some regressions in tooling. - Fix the reading from user space for migration The reading of user space uses a seq lock type of logic where it uses a per-cpu temporary buffer and disables migration, then enables preemption, does the copy from user space, disables preemption, enables migration and checks if there were any schedule switches while preemption was enabled. If there was a context switch, then it is considered that the per-cpu buffer could be corrupted and it tries again. A protection check issues a warning and exits after a hundred tries to prevent a live lock. This was triggered because the task was selected by the load balancer to be migrated to another CPU. Every time preemption was enabled, the migration task would schedule in and try to migrate the task, but could not because migration was disabled, and let it run again. This caused the scheduler to schedule out the task every time it enabled preemption and made the loop never exit (until the 100 iteration test triggered). Fix this by enabling and disabling preemption and keeping migration enabled if the reading from user space needs to be done again. This will let the migration thread migrate the task and the copy from user space will likely pass on the next iteration. - Fix trace_marker copy option freeing The "copy_trace_marker" option allows a tracing instance to get a copy of a write to the trace_marker file of the top level instance. This is managed by a link list protected by RCU. 
When an instance is removed, a check is made if the option is set, and if so synchronize_rcu() is called. The problem is that the iteration that resets all the flags to what they were when the instance was created (to perform clean ups) was done before the check of the copy_trace_marker option, so that option was already cleared and synchronize_rcu() was never called. Move the clearing of all the flags after the check of copy_trace_marker to do synchronize_rcu() so that the option is still set if it was before and the synchronization is performed. - Fix entries setting when validating the persistent ring buffer When validating the persistent ring buffer on boot up, the number of events per sub-buffer is added to the sub-buffer meta page. The validator was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu buffer) and not the "head_page" variable that was iterating the sub-buffers. This was causing the first sub-buffer to be assigned the entries for each sub-buffer and not the sub-buffer that was supposed to be updated. - Use "hash" value to update the direct callers When updating the ftrace direct callers, it assigned a temporary callback to all the callback functions of the ftrace ops and not just the functions represented by the passed in hash. This causes an unnecessary slow down of the functions of the ftrace_ops that are not being modified. Only update the functions that are going to be modified to call the ftrace loop function so that the update can be made on those functions. * tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod ring-buffer: Fix to update per-subbuf entries of persistent ring buffer tracing: Fix trace_marker copy link list updates tracing: Fix failure to read user space from system call trace events tracing: Revert "tracing: Remove pid in task_rename tracing output"
2026-03-22Merge tag 'locking-urgent-2026-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix a sparse build error regression in <linux/local_lock_internal.h> caused by the locking context-analysis changes" * tag 'locking-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: include/linux/local_lock_internal.h: Make this header file again compatible with sparse
2026-03-21mm/damon/core: avoid use of half-online-committed contextSeongJae Park
One major usage of damon_call() is online DAMON parameters update. It is done by calling damon_commit_ctx() inside the damon_call() callback function. damon_commit_ctx() can fail for two reasons: 1) invalid parameters and 2) internal memory allocation failures. In case of failures, the damon_ctx that attempted to be updated (commit destination) can be partially updated (or, corrupted from a perspective), and therefore shouldn't be used anymore. The function only ensures the damon_ctx object can be safely deallocated using damon_destroy_ctx(). The API callers are, however, calling damon_commit_ctx() only after asserting the parameters are valid, to avoid damon_commit_ctx() failing due to invalid input parameters. But it can still theoretically fail if the internal memory allocation fails. In that case, DAMON may run with the partially updated damon_ctx. This can result in unexpected behaviors including even NULL pointer dereference in case of damos_commit_dests() failure [1]. Such allocation failure is arguably too small to fail, so the real world impact would be rare. But, given the bad consequence, this needs to be fixed. Avoid such partially-committed (maybe-corrupted) damon_ctx use by saving the damon_commit_ctx() failure on the damon_ctx object. For this, introduce damon_ctx->maybe_corrupted field. damon_commit_ctx() sets it when it fails. kdamond_call() checks if the field is set after each damon_call_control->fn() is executed. If it is set, ignore remaining callback requests and return. All kdamond_call() callers including kdamond_fn() also check the maybe_corrupted field right after kdamond_call() invocations. If the field is set, break the kdamond_fn() main loop so that DAMON still doesn't use the context that might be corrupted. 
[sj@kernel.org: let kdamond_call() with cancel regardless of maybe_corrupted] Link: https://lkml.kernel.org/r/20260320031553.2479-1-sj@kernel.org Link: https://sashiko.dev/#/patchset/20260319145218.86197-1-sj%40kernel.org Link: https://lkml.kernel.org/r/20260319145218.86197-1-sj@kernel.org Link: https://lore.kernel.org/20260319043309.97966-1-sj@kernel.org [1] Fixes: 3301f1861d34 ("mm/damon/sysfs: handle commit command using damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
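The maybe_corrupted scheme described above can be sketched in user space as a "poison flag" on the object that a failed commit leaves half-updated. Everything here (struct ctx_sketch, commit_sketch(), run_calls(), worker_must_stop() and the two sample callbacks) is a hypothetical simplification for illustration, not DAMON's real API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the context being updated online. */
struct ctx_sketch {
	int param;
	bool maybe_corrupted;	/* set once a commit has failed midway */
};

/* A commit that may fail after the destination was already modified. */
static int commit_sketch(struct ctx_sketch *dst, int new_param, bool alloc_fails)
{
	dst->param = new_param;		/* destination is already partly updated */
	if (alloc_fails) {
		dst->maybe_corrupted = true;	/* remember the failure on the object */
		return -ENOMEM;
	}
	return 0;
}

typedef int (*call_fn)(struct ctx_sketch *);

/* Sample callbacks: one commit that succeeds, one that fails. */
static int update_ok(struct ctx_sketch *c)   { return commit_sketch(c, 1, false); }
static int update_fail(struct ctx_sketch *c) { return commit_sketch(c, 2, true); }

/* Run queued callbacks, ignoring the rest once the context is flagged.
 * Returns the number of callbacks that were executed. */
static size_t run_calls(struct ctx_sketch *c, const call_fn *fns, size_t n)
{
	size_t done = 0;

	while (done < n) {
		fns[done++](c);
		if (c->maybe_corrupted)
			break;
	}
	return done;
}

/* The worker's main loop checks this right after run_calls() and exits. */
static bool worker_must_stop(const struct ctx_sketch *c)
{
	return c->maybe_corrupted;
}
```

The design point is that the failure is recorded on the object itself, so every later user can test one field instead of relying on each caller to propagate the commit's return value.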
2026-03-21Merge tag 'driver-core-7.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Generalize driver_override in the driver core, providing a common sysfs implementation and concurrency-safe accessors for bus implementations - Do not use driver_override as IRQ name in the hwmon axi-fan driver - Remove an unnecessary driver_override check in sh platform_early - Migrate the platform bus to use the generic driver_override infrastructure, fixing a UAF condition caused by accessing the driver_override field without proper locking in the platform_match() callback * tag 'driver-core-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: driver core: platform: use generic driver_override infrastructure sh: platform_early: remove pdev->driver_override check hwmon: axi-fan: don't use driver_override as IRQ name docs: driver-model: document driver_override driver core: generalize driver_override in struct device
2026-03-21tracing: Revert "tracing: Remove pid in task_rename tracing output"Xuewen Yan
This reverts commit e3f6a42272e028c46695acc83fc7d7c42f2750ad. The commit says that the tracepoint only deals with the current task, however the following case is not current task: comm_write() { p = get_proc_task(inode); if (!p) return -ESRCH; if (same_thread_group(current, p)) set_task_comm(p, buffer); } where set_task_comm() calls __set_task_comm() which records the update of p and not current. So revert the patch to show pid. Cc: <mhiramat@kernel.org> Cc: <mathieu.desnoyers@efficios.com> Cc: <elver@google.com> Cc: <kees@kernel.org> Link: https://patch.msgid.link/20260306075954.4533-1-xuewen.yan@unisoc.com Fixes: e3f6a42272e0 ("tracing: Remove pid in task_rename tracing output") Reported-by: Guohua Yan <guohua.yan@unisoc.com> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-20Merge tag 'execve-v7.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fixes from Kees Cook: - binfmt_elf_fdpic: fix AUXV size calculation (Andrei Vagin) - fs/tests: exec: Remove bad test vector * tag 'execve-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: fs/tests: exec: Remove bad test vector binfmt_elf_fdpic: fix AUXV size calculation for ELF_HWCAP3 and ELF_HWCAP4
2026-03-20Merge tag 'tty-7.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are some small tty/vt and serial driver fixes for 7.0-rc5. Included in here are: - 8250 driver fixes for reported problems - serial core lockup fix - uartlite driver bugfix - vt save/restore bugfix All of these have been in linux-next for over a week with no reported problems" * tag 'tty-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: vt: save/restore unicode screen buffer for alternate screen serial: 8250_dw: Ensure BUSY is deasserted serial: 8250: Add late synchronize_irq() to shutdown to handle DW UART BUSY serial: 8250_dw: Rework IIR_NO_INT handling to stop interrupt storm serial: 8250_dw: Rework dw8250_handle_irq() locking and IIR handling serial: 8250: Add serial8250_handle_irq_locked() serial: 8250_dw: Avoid unnecessary LCR writes serial: 8250: Protect LCR write in shutdown serial: 8250_pci: add support for the AX99100 serial: core: fix infinite loop in handle_tx() for PORT_UNKNOWN serial: uartlite: fix PM runtime usage count underflow on probe serial: 8250: always disable IRQ during THRE test serial: 8250: Fix TX deadlock when using DMA
2026-03-20Merge tag 'io_uring-7.0-20260320' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - A bit of a work-around for AF_UNIX recv multishot, as the in-kernel implementation doesn't properly signal EOF. We'll likely rework this one going forward, but the fix is sufficient for now - Two fixes for incrementally consumed buffers, for non-pollable files and for 0 byte reads * tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/kbuf: propagate BUF_MORE through early buffer commit path io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF io_uring/poll: fix multishot recv missing EOF on wakeup race
2026-03-20Merge tag 'iommu-fixes-v7.0-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu fixes from Joerg Roedel: "Intel VT-d: - Abort all pending requests on dev_tlb_inv timeout to avoid hardlockup - Limit IOPF handling to PRI-capable device to avoid SVA attach failure AMD-Vi: - Make sure identity domain is not used when SNP is active Core fixes: - Handle mapping IOVA 0x0 correctly - Fix crash in SVA code - Kernel-doc fix in IO-PGTable code" * tag 'iommu-fixes-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: iommu/amd: Block identity domain when SNP enabled iommu/sva: Fix crash in iommu_sva_unbind_device() iommu/io-pgtable: fix all kernel-doc warnings in io-pgtable.h iommu: Fix mapping check for 0x0 to avoid re-mapping it iommu/vt-d: Only handle IOPF for SVA when PRI is supported iommu/vt-d: Fix intel iommu iotlb sync hardlockup and retry
2026-03-20Merge tag 'hyperv-fixes-signed-20260319' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull Hyper-V fixes from Wei Liu: - Fix ARM64 MSHV support (Anirudh Rayabharam) - Fix MSHV driver memory handling issues (Stanislav Kinsburskii) - Update maintainers for Hyper-V DRM driver (Saurabh Sengar) - Misc clean up in MSHV crashdump code (Ard Biesheuvel, Uros Bizjak) - Minor improvements to MSHV code (Mukesh R, Wei Liu) - Revert not yet released MSHV scrub partition hypercall (Wei Liu) * tag 'hyperv-fixes-signed-20260319' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: mshv: Fix error handling in mshv_region_pin MAINTAINERS: Update maintainers for Hyper-V DRM driver mshv: Fix use-after-free in mshv_map_user_memory error path mshv: pass struct mshv_user_mem_region by reference x86/hyperv: Use any general-purpose register when saving %cr2 and %cr8 x86/hyperv: Use current_stack_pointer to avoid asm() in hv_hvcrash_ctxt_save() x86/hyperv: Save segment registers directly to memory in hv_hvcrash_ctxt_save() x86/hyperv: Use __naked attribute to fix stackless C function Revert "mshv: expose the scrub partition hypercall" mshv: add arm64 support for doorbell & intercept SINTs mshv: refactor synic init and cleanup x86/hyperv: print out reserved vectors in hexadecimal
2026-03-20xen/privcmd: add boot control for restricted usage in domUJuergen Gross
When running in an unprivileged domU under Xen, the privcmd driver is restricted to allow only hypercalls against a target domain, for which the current domU is acting as a device model. Add a boot parameter "unrestricted" to allow all hypercalls (the hypervisor will still refuse destructive hypercalls affecting other guests). Make this new parameter effective only in case the domU wasn't started using secure boot, as otherwise hypercalls targeting the domU itself might result in violating the secure boot functionality. This is achieved by adding another lockdown reason, which can be tested for not being set when applying the "unrestricted" option. This is part of XSA-482. Signed-off-by: Juergen Gross <jgross@suse.com> --- V2: - new patch
2026-03-19io_uring/kbuf: propagate BUF_MORE through early buffer commit pathJens Axboe
When io_should_commit() returns true (e.g. for non-pollable files), buffer commit happens at buffer selection time and sel->buf_list is set to NULL. When __io_put_kbufs() generates CQE flags at completion time, it calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence cannot determine whether the buffer was consumed or not. This means that IORING_CQE_F_BUF_MORE is never set for non-pollable input with incrementally consumed buffers. Likewise for io_buffers_select(), which always commits upfront and discards the return value of io_kbuf_commit(). Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early commit. Then __io_put_kbuf_ring() can check this flag and set IORING_CQE_F_BUF_MORE accordingly. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk>
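The idea of carrying the early commit's result in a request flag can be sketched as below. This is a user-space model only: the types, the *_SKETCH flag names and the convention that sketch_commit() returns true when the buffer still has room are assumptions made for this illustration and do not mirror the kernel's io_kbuf_commit() signature or return convention.

```c
#include <stdbool.h>
#include <stddef.h>

#define REQ_F_BUF_MORE_SKETCH	(1u << 0)	/* hypothetical request flag */
#define CQE_F_BUF_MORE_SKETCH	(1u << 1)	/* hypothetical completion flag */

struct buf_list_sketch { unsigned int remaining; };

struct req_sketch {
	unsigned int flags;
	struct buf_list_sketch *buf_list;	/* NULL after an early commit */
};

/* Consume len bytes; returns true if the buffer still has room left. */
static bool sketch_commit(struct buf_list_sketch *bl, unsigned int len)
{
	if (len > bl->remaining)
		len = bl->remaining;
	bl->remaining -= len;
	return bl->remaining != 0;
}

/* Early commit at selection time: keep the result in the request flags,
 * because the buffer list pointer is dropped right away. */
static void early_commit(struct req_sketch *r, struct buf_list_sketch *bl,
			 unsigned int len)
{
	if (sketch_commit(bl, len))
		r->flags |= REQ_F_BUF_MORE_SKETCH;
	r->buf_list = NULL;
}

/* Completion time: commit now if still possible, else use the saved flag. */
static unsigned int put_kbuf(struct req_sketch *r, unsigned int len)
{
	bool more;

	if (r->buf_list)
		more = sketch_commit(r->buf_list, len);
	else
		more = (r->flags & REQ_F_BUF_MORE_SKETCH) != 0;

	r->flags &= ~REQ_F_BUF_MORE_SKETCH;
	return more ? CQE_F_BUF_MORE_SKETCH : 0;
}
```

Without the saved flag, the completion path in this model would see a NULL buffer list and have no way to report that more data can still land in the same buffer.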
2026-03-19Merge tag 'net-7.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireless, Bluetooth and netfilter. Nothing too exciting here, mostly fixes for corner cases. Current release - fix to a fix: - bonding: prevent potential infinite loop in bond_header_parse() Current release - new code bugs: - wifi: mac80211: check tdls flag in ieee80211_tdls_oper Previous releases - regressions: - af_unix: give up GC if MSG_PEEK intervened - netfilter: conntrack: add missing netlink policy validations - NFC: nxp-nci: allow GPIOs to sleep" * tag 'net-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits) MPTCP: fix lock class name family in pm_nl_create_listen_socket icmp: fix NULL pointer dereference in icmp_tag_validation() net: dsa: bcm_sf2: fix missing clk_disable_unprepare() in error paths net: shaper: protect from late creation of hierarchy net: shaper: protect late read accesses to the hierarchy net: mvpp2: guard flow control update with global_tx_fc in buffer switching nfnetlink_osf: validate individual option lengths in fingerprints netfilter: nf_tables: release flowtable after rcu grace period on error netfilter: bpf: defer hook memory release until rcu readers are done net: bonding: fix NULL deref in bond_debug_rlb_hash_show udp_tunnel: fix NULL deref caused by udp_sock_create6 when CONFIG_IPV6=n net/mlx5e: Fix race condition during IPSec ESN update net/mlx5e: Prevent concurrent access to IPSec ASO context net/mlx5: qos: Restrict RTNL area to avoid a lock cycle ipv6: add NULL checks for idev in SRv6 paths NFC: nxp-nci: allow GPIOs to sleep net: macb: fix uninitialized rx_fs_lock net: macb: fix use-after-free access to PTP clock netdevsim: drop PSP ext ref on forward failure wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure ...
2026-03-18Merge tag 'wireless-2026-03-18' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Just a few updates: - cfg80211: - guarantee pmsr work is cancelled - mac80211: - reject TDLS operations on non-TDLS stations - fix crash in AP_VLAN bandwidth change - fix leak or double-free on some TX preparation failures - remove keys needed for beacons _after_ stopping those - fix debugfs static branch race - avoid underflow in inactive time - fix another NULL dereference in mesh on invalid frames - ti/wlcore: avoid infinite realloc loop * tag 'wireless-2026-03-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure wifi: wlcore: Return -ENOMEM instead of -EAGAIN if there is not enough headroom wifi: mac80211: fix NULL deref in mesh_matches_local() wifi: mac80211: check tdls flag in ieee80211_tdls_oper wifi: cfg80211: cancel pmsr_free_wk in cfg80211_pmsr_wdev_down wifi: mac80211: Fix static_branch_dec() underflow for aql_disable. mac80211: fix crash in ieee80211_chan_bw_change for AP_VLAN stations wifi: mac80211: use jiffies_delta_to_msecs() for sta_info inactive times wifi: mac80211: remove keys after disabling beaconing wifi: mac80211_hwsim: fully initialise PMSR capabilities ==================== Link: https://patch.msgid.link/20260318172515.381148-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-18udp_tunnel: fix NULL deref caused by udp_sock_create6 when CONFIG_IPV6=nXiang Mei
When CONFIG_IPV6 is disabled, the udp_sock_create6() function returns 0 (success) without actually creating a socket. Callers such as fou_create() then proceed to dereference the uninitialized socket pointer, resulting in a NULL pointer dereference. The captured NULL deref crash: BUG: kernel NULL pointer dereference, address: 0000000000000018 RIP: 0010:fou_nl_add_doit (net/ipv4/fou_core.c:590 net/ipv4/fou_core.c:764) [...] Call Trace: <TASK> genl_family_rcv_msg_doit.constprop.0 (net/netlink/genetlink.c:1114) genl_rcv_msg (net/netlink/genetlink.c:1194 net/netlink/genetlink.c:1209) [...] netlink_rcv_skb (net/netlink/af_netlink.c:2550) genl_rcv (net/netlink/genetlink.c:1219) netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sock_sendmsg (net/socket.c:727 (discriminator 1) net/socket.c:742 (discriminator 1)) __sys_sendto (./include/linux/file.h:62 (discriminator 1) ./include/linux/file.h:83 (discriminator 1) net/socket.c:2183 (discriminator 1)) __x64_sys_sendto (net/socket.c:2213 (discriminator 1) net/socket.c:2209 (discriminator 1) net/socket.c:2209 (discriminator 1)) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (net/arch/x86/entry/entry_64.S:130) This patch makes udp_sock_create6 return -EPFNOSUPPORT instead, so callers correctly take their error paths. There is only one caller of the vulnerable function and only privileged users can trigger it. Fixes: fd384412e199b ("udp_tunnel: Seperate ipv6 functions into its own file.") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260317010241.1893893-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
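The failure mode above, a compiled-out stub that reports success without producing an object, can be sketched in user space as follows. The names (struct sock_sketch, create6_stub_buggy(), create6_stub_fixed(), caller_sketch()) are hypothetical and only model the pattern; the fallback definition of EPFNOSUPPORT is an assumption for platforms whose errno.h lacks it.

```c
#include <errno.h>
#include <stddef.h>

#ifndef EPFNOSUPPORT
#define EPFNOSUPPORT 96	/* assumed fallback; the value used on Linux */
#endif

struct sock_sketch { int family; };

/* Stub used when the feature is compiled out: claims success, sets nothing. */
static int create6_stub_buggy(struct sock_sketch **sockp)
{
	(void)sockp;
	return 0;
}

/* Fixed stub: reports that the protocol family is unsupported. */
static int create6_stub_fixed(struct sock_sketch **sockp)
{
	(void)sockp;
	return -EPFNOSUPPORT;
}

typedef int (*create_fn)(struct sock_sketch **);

/* Caller that trusts the return value. Instead of really dereferencing a
 * NULL pointer, this model reports that it would have done so. */
static int caller_sketch(create_fn create, int *would_deref_null)
{
	struct sock_sketch *sock = NULL;
	int err = create(&sock);

	*would_deref_null = 0;
	if (err < 0)
		return err;		/* error path taken, nothing dereferenced */
	if (!sock) {
		*would_deref_null = 1;	/* "success" but no socket: the crash case */
		return 0;
	}
	return sock->family;
}
```

With the buggy stub the caller reaches the point where it would use the NULL socket; with the fixed stub it takes its error path and returns -EPFNOSUPPORT.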
2026-03-18wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failureFelix Fietkau
ieee80211_tx_prepare_skb() has three error paths, but only two of them free the skb. The first error path (ieee80211_tx_prepare() returning TX_DROP) does not free it, while invoke_tx_handlers() failure and the fragmentation check both do. Add kfree_skb() to the first error path so all three are consistent, and remove the now-redundant frees in callers (ath9k, mt76, mac80211_hwsim) to avoid double-free. Document the skb ownership guarantee in the function's kdoc. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/20260314065455.2462900-1-nbd@nbd.name Fixes: 06be6b149f7e ("mac80211: add ieee80211_tx_prepare_skb() helper function") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
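The ownership rule described above (the helper releases the buffer on every error path, so callers must not release it again) can be modelled as below. The types and names are hypothetical; releases are counted rather than performed so that a leak shows up as a count of 0 and a double free as a count of 2.

```c
#include <stdbool.h>

/* Hypothetical packet that counts how many times it has been released. */
struct pkt_sketch { int releases; };

static void pkt_release(struct pkt_sketch *p)
{
	p->releases++;
}

/* The three failure points named in the text, plus the success case. */
enum outcome_sketch { FAIL_PREPARE, FAIL_HANDLERS, FAIL_FRAGMENTED, SUCCEED };

/* Returns true on success. On any failure the packet has been released
 * exactly once, so ownership never returns to the caller. */
static bool prepare_pkt(struct pkt_sketch *p, enum outcome_sketch o)
{
	if (o == FAIL_PREPARE) {
		pkt_release(p);		/* the path that previously leaked */
		return false;
	}
	if (o == FAIL_HANDLERS) {
		pkt_release(p);
		return false;
	}
	if (o == FAIL_FRAGMENTED) {
		pkt_release(p);
		return false;
	}
	return true;
}

/* Caller after the fix: no release on failure, since the helper did it. */
static void send_pkt(struct pkt_sketch *p, enum outcome_sketch o)
{
	if (!prepare_pkt(p, o))
		return;
	/* ... transmit ... then drop the caller's reference on success */
	pkt_release(p);
}
```

Whatever the outcome, each packet passed to send_pkt() ends up released exactly once, which is the invariant the commit establishes by fixing the helper and removing the callers' extra frees.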
2026-03-17Merge tag 'hid-for-linus-2026031701' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Jiri Kosina: - various fixes dealing with (intentionally) broken devices in HID core, logitech-hidpp and multitouch drivers (Lee Jones) - fix for OOB in wacom driver (Benoît Sevens) - fix for potentially HID-bpf-induced buffer overflow in () (Benjamin Tissoires) - various other small fixes and device ID / quirk additions * tag 'hid-for-linus-2026031701' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: multitouch: Check to ensure report responses match the request HID: logitech-hidpp: Prevent use-after-free on force feedback initialisation failure HID: bpf: prevent buffer overflow in hid_hw_request selftests/hid: fix compilation when bpf_wq and hid_device are not exported HID: core: Mitigate potential OOB by removing bogus memset() HID: intel-thc-hid: Set HID_PHYS with PCI BDF HID: appletb-kbd: add .resume method in PM HID: logitech-hidpp: Enable MX Master 4 over bluetooth HID: input: Add HID_BATTERY_QUIRK_DYNAMIC for Elan touchscreens HID: input: Drop Asus UX550* touchscreen ignore battery quirks HID: asus: add xg mobile 2022 external hardware support HID: wacom: fix out-of-bounds read in wacom_intuos_bt_irq