From f58df566524ebcdfa394329c64f47e3c9257516e Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Tue, 17 Mar 2026 17:29:55 +0800 Subject: mm: filemap: fix nr_pages calculation overflow in filemap_map_pages() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When running stress-ng on my Arm64 machine with v7.0-rc3 kernel, I encountered some very strange crash issues showing up as "Bad page state": " [ 734.496287] BUG: Bad page state in process stress-ng-env pfn:415735fb [ 734.496427] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0x4cf316 pfn:0x415735fb [ 734.496434] flags: 0x57fffe000000800(owner_2|node=1|zone=2|lastcpupid=0x3ffff) [ 734.496439] raw: 057fffe000000800 0000000000000000 dead000000000122 0000000000000000 [ 734.496440] raw: 00000000004cf316 0000000000000000 0000000000000000 0000000000000000 [ 734.496442] page dumped because: nonzero mapcount " After analyzing this page’s state, it is hard to understand why the mapcount is not 0 while the refcount is 0, since this page is not where the issue first occurred. By enabling the CONFIG_DEBUG_VM config, I can reproduce the crash as well and captured the first warning where the issue appears: " [ 734.469226] page: refcount:33 mapcount:0 mapping:00000000bef2d187 index:0x81a0 pfn:0x415735c0 [ 734.469304] head: order:5 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 734.469315] memcg:ffff000807a8ec00 [ 734.469320] aops:ext4_da_aops ino:100b6f dentry name(?):"stress-ng-mmaptorture-9397-0-2736200540" [ 734.469335] flags: 0x57fffe400000069(locked|uptodate|lru|head|node=1|zone=2|lastcpupid=0x3ffff) ...... [ 734.469364] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1), const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *: (struct folio *)_compound_head(page + nr_pages - 1))) != folio) [ 734.469390] ------------[ cut here ]------------ [ 734.469393] WARNING: ./include/linux/rmap.h:351 at folio_add_file_rmap_ptes+0x3b8/0x468, CPU#90: stress-ng-mlock/9430 [ 734.469551] folio_add_file_rmap_ptes+0x3b8/0x468 (P) [ 734.469555] set_pte_range+0xd8/0x2f8 [ 734.469566] filemap_map_folio_range+0x190/0x400 [ 734.469579] filemap_map_pages+0x348/0x638 [ 734.469583] do_fault_around+0x140/0x198 ...... [ 734.469640] el0t_64_sync+0x184/0x188 " The code that triggers the warning is: "VM_WARN_ON_FOLIO(page_folio(page + nr_pages - 1) != folio, folio)", which indicates that set_pte_range() tried to map beyond the large folio’s size. By adding more debug information, I found that 'nr_pages' had overflowed in filemap_map_pages(), causing set_pte_range() to establish mappings for a range exceeding the folio size, potentially corrupting fields of pages that do not belong to this folio (e.g., page->_mapcount). After above analysis, I think the possible race is as follows: CPU 0 CPU 1 filemap_map_pages() ext4_setattr() //get and lock folio with old inode->i_size next_uptodate_folio() ....... //shrink the inode->i_size i_size_write(inode, attr->ia_size); //calculate the end_pgoff with the new inode->i_size file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1; end_pgoff = min(end_pgoff, file_end); ...... //nr_pages can be overflowed, cause xas.xa_index > end_pgoff end = folio_next_index(folio) - 1; nr_pages = min(end, end_pgoff) - xas.xa_index + 1; ...... //map large folio filemap_map_folio_range() ...... //truncate folios truncate_pagecache(inode, inode->i_size); To fix this issue, move the 'end_pgoff' calculation before next_uptodate_folio(), so the retrieved folio stays consistent with the file end to avoid 'nr_pages' calculation overflow. After this patch, the crash issue is gone. Link: https://lkml.kernel.org/r/1cf1ac59018fc647a87b0dad605d4056a71c14e4.1773739704.git.baolin.wang@linux.alibaba.com Fixes: 743a2753a02e ("filemap: cap PTE range to be created to allowed zero fill in folio_map_range()") Signed-off-by: Baolin Wang Reported-by: Yuanhe Shu Tested-by: Yuanhe Shu Acked-by: Kiryl Shutsemau (Meta) Acked-by: David Hildenbrand (Arm) Cc: Christian Brauner Cc: Daniel Gomez Cc: "Darrick J. Wong" Cc: Dave Chinner Cc: David Howells Cc: Hannes Reinecke Cc: Lorenzo Stoakes (Oracle) Cc: Luis Chamberalin Cc: Matthew Wilcox (Oracle) Cc: Pankaj Raghav Cc: Signed-off-by: Andrew Morton --- mm/filemap.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index 406cef06b684..3c1e785542dd 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3883,14 +3883,19 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned int nr_pages = 0, folio_type; unsigned short mmap_miss = 0, mmap_miss_saved; + /* + * Recalculate end_pgoff based on file_end before calling + * next_uptodate_folio() to avoid races with concurrent + * truncation. + */ + file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1; + end_pgoff = min(end_pgoff, file_end); + rcu_read_lock(); folio = next_uptodate_folio(&xas, mapping, end_pgoff); if (!folio) goto out; - file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1; - end_pgoff = min(end_pgoff, file_end); - /* * Do not allow to map with PMD across i_size to preserve * SIGBUS semantics. -- cgit v1.2.3 From cece9dc61daab6006d3ac9d36a0df2dd58fef18d Mon Sep 17 00:00:00 2001 From: Joanne Koong Date: Thu, 26 Mar 2026 14:51:27 -0700 Subject: mm: reinstate unconditional writeback start in balance_dirty_pages() Commit 64dd89ae01f2 ("mm/block/fs: remove laptop_mode") removed this unconditional writeback start from balance_dirty_pages(): if (unlikely(!writeback_in_progress(wb))) wb_start_background_writeback(wb); This logic needs to be reinstated to prevent performance regressions for strictlimited BDIs and memcg setups. The problem occurs because: a) For strictlimited BDIs, throttling is calculated using per-wb thresholds. The per-wb threshold can be exceeded even when the global dirty threshold was not exceeded (nr_dirty < gdtc->bg_thresh) b) For memcg-based throttling, memcg uses its own dirty count / thresholds and can trigger throttling even when the global threshold isn't exceeded Without the unconditional writeback start, IO is throttled as it waits for dirty pages to be written back but there is no writeback running. This leads to severe stalls. On fuse, buffered write performance dropped from 1400 MiB/s to 2000 KiB/s. Reinstate the unconditional writeback start so that writeback is guaranteed to be running whenever IO needs to be throttled. Link: https://lkml.kernel.org/r/20260326215127.3857682-2-joannelkoong@gmail.com Fixes: 64dd89ae01f2 ("mm/block/fs: remove laptop_mode") Signed-off-by: Joanne Koong Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Acked-by: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Signed-off-by: Andrew Morton --- mm/page-writeback.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 601a5e048d12..c1a4b32af1a7 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1858,6 +1858,27 @@ free_running: break; } + /* + * Unconditionally start background writeback if it's not + * already in progress. We need to do this because the global + * dirty threshold check above (nr_dirty > gdtc->bg_thresh) + * doesn't account for these cases: + * + * a) strictlimit BDIs: throttling is calculated using per-wb + * thresholds. The per-wb threshold can be exceeded even when + * nr_dirty < gdtc->bg_thresh + * + * b) memcg-based throttling: memcg uses its own dirty count and + * thresholds and can trigger throttling even when global + * nr_dirty < gdtc->bg_thresh + * + * Writeback needs to be started else the writer stalls in the + * throttle loop waiting for dirty pages to be written back + * while no writeback is running. + */ + if (unlikely(!writeback_in_progress(wb))) + wb_start_background_writeback(wb); + mem_cgroup_flush_foreign(wb); /* -- cgit v1.2.3 From 0199390a6b92fc21860e1b858abf525c7e73b956 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 26 Mar 2026 17:32:22 -0700 Subject: mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails damon_call() for repeat_call_control of DAMON_SYSFS could fail if somehow the kdamond is stopped before the damon_call(). It could happen, for example, when te damon context was made for monitroing of a virtual address processes, and the process is terminated immediately, before the damon_call() invocation. In the case, the dyanmically allocated repeat_call_control is not deallocated and leaked. Fix the leak by deallocating the repeat_call_control under the damon_call() failure. This issue is discovered by sashiko [1]. Link: https://lkml.kernel.org/r/20260327003224.55752-1-sj@kernel.org Link: https://lore.kernel.org/20260320020630.962-1-sj@kernel.org [1] Fixes: 04a06b139ec0 ("mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control") Signed-off-by: SeongJae Park Cc: [6.17+] Signed-off-by: Andrew Morton --- mm/damon/sysfs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 6a44a2f3d8fc..eefa959aa30a 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1670,7 +1670,8 @@ static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond) repeat_call_control->data = kdamond; repeat_call_control->repeat = true; repeat_call_control->dealloc_on_cancel = true; - damon_call(ctx, repeat_call_control); + if (damon_call(ctx, repeat_call_control)) + kfree(repeat_call_control); return err; } -- cgit v1.2.3 From 2ecbe06abf9bfb2261cd6464a6bc3a3615625402 Mon Sep 17 00:00:00 2001 From: Hao Li Date: Mon, 30 Mar 2026 11:57:49 +0800 Subject: mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug N_NORMAL_MEMORY is initialized from zone population at boot, but memory hotplug currently only updates N_MEMORY. As a result, a node that gains normal memory via hotplug can remain invisible to users iterating over N_NORMAL_MEMORY, while a node that loses its last normal memory can stay incorrectly marked as such. The most visible effect is that /sys/devices/system/node/has_normal_memory does not report a node even after that node has gained normal memory via hotplug. Also, list_lru-based shrinkers can undercount objects on such a node and may skip reclaim on that node entirely, which can lead to a higher memory footprint than expected. Restore N_NORMAL_MEMORY maintenance directly in online_pages() and offline_pages(). Set the bit when a node that currently lacks normal memory onlines pages into a zone <= ZONE_NORMAL, and clear it when offlining removes the last present pages from zones <= ZONE_NORMAL. This restores the intended semantics without bringing back the old status_change_nid_normal notifier plumbing which was removed in 8d2882a8edb8. Current users that benefit include list_lru, zswap, nfsd filecache, hugetlb_cgroup, and has_normal_memory sysfs reporting. Link: https://lkml.kernel.org/r/20260330035941.518186-1-hao.li@linux.dev Fixes: 8d2882a8edb8 ("mm,memory_hotplug: remove status_change_nid_normal and update documentation") Signed-off-by: Hao Li Reviewed-by: Harry Yoo (Oracle) Acked-by: Vlastimil Babka (SUSE) Reviewed-by: Joshua Hahn Acked-by: David Hildenbrand (Arm) Cc: Oscar Salvador Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton --- mm/memory_hotplug.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index bc805029da51..05a47953ef21 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1209,6 +1209,13 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, if (node_arg.nid >= 0) node_set_state(nid, N_MEMORY); + /* + * Check whether we are adding normal memory to the node for the first + * time. + */ + if (!node_state(nid, N_NORMAL_MEMORY) && zone_idx(zone) <= ZONE_NORMAL) + node_set_state(nid, N_NORMAL_MEMORY); + if (need_zonelists_rebuild) build_all_zonelists(NULL); @@ -1908,6 +1915,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, unsigned long flags; char *reason; int ret; + unsigned long normal_pages = 0; + enum zone_type zt; /* * {on,off}lining is constrained to full memory sections (or more @@ -2055,6 +2064,17 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, /* reinitialise watermarks and update pcp limits */ init_per_zone_wmark_min(); + /* + * Check whether this operation removes the last normal memory from + * the node. We do this before clearing N_MEMORY to avoid the possible + * transient "!N_MEMORY && N_NORMAL_MEMORY" state. + */ + if (zone_idx(zone) <= ZONE_NORMAL) { + for (zt = 0; zt <= ZONE_NORMAL; zt++) + normal_pages += pgdat->node_zones[zt].present_pages; + if (!normal_pages) + node_clear_state(node, N_NORMAL_MEMORY); + } /* * Make sure to mark the node as memory-less before rebuilding the zone * list. Otherwise this node would still appear in the fallback lists. -- cgit v1.2.3 From 894f99eb535edc4514f756818f3c4f688ba53a59 Mon Sep 17 00:00:00 2001 From: Sechang Lim Date: Tue, 31 Mar 2026 18:08:11 +0000 Subject: mm/vma: fix memory leak in __mmap_region() commit 605f6586ecf7 ("mm/vma: do not leak memory when .mmap_prepare swaps the file") handled the success path by skipping get_file() via file_doesnt_need_get, but missed the error path. When /dev/zero is mmap'd with MAP_SHARED, mmap_zero_prepare() calls shmem_zero_setup_desc() which allocates a new shmem file to back the mapping. If __mmap_new_vma() subsequently fails, this replacement file is never fput()'d - the original is released by ksys_mmap_pgoff(), but nobody releases the new one. Add fput() for the swapped file in the error path. Reproducible with fault injection. FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 1 CPU: 2 UID: 0 PID: 366 Comm: syz.7.14 Not tainted 7.0.0-rc6 #2 PREEMPT(full) Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: dump_stack_lvl+0x164/0x1f0 should_fail_ex+0x525/0x650 should_failslab+0xdf/0x140 kmem_cache_alloc_noprof+0x78/0x630 vm_area_alloc+0x24/0x160 __mmap_region+0xf6b/0x2660 mmap_region+0x2eb/0x3a0 do_mmap+0xc79/0x1240 vm_mmap_pgoff+0x252/0x4c0 ksys_mmap_pgoff+0xf8/0x120 __x64_sys_mmap+0x12a/0x190 do_syscall_64+0xa9/0x580 entry_SYSCALL_64_after_hwframe+0x76/0x7e kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak) BUG: memory leak unreferenced object 0xffff8881118aca80 (size 360): comm "syz.7.14", pid 366, jiffies 4294913255 hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff c0 28 4d ae ff ff ff ff .........(M..... backtrace (crc db0f53bc): kmem_cache_alloc_noprof+0x3ab/0x630 alloc_empty_file+0x5a/0x1e0 alloc_file_pseudo+0x135/0x220 __shmem_file_setup+0x274/0x420 shmem_zero_setup_desc+0x9c/0x170 mmap_zero_prepare+0x123/0x140 __mmap_region+0xdda/0x2660 mmap_region+0x2eb/0x3a0 do_mmap+0xc79/0x1240 vm_mmap_pgoff+0x252/0x4c0 ksys_mmap_pgoff+0xf8/0x120 __x64_sys_mmap+0x12a/0x190 do_syscall_64+0xa9/0x580 entry_SYSCALL_64_after_hwframe+0x76/0x7e Found by syzkaller. Link: https://lkml.kernel.org/r/20260331180811.1333348-1-rhkrqnwk98@gmail.com Fixes: 605f6586ecf7 ("mm/vma: do not leak memory when .mmap_prepare swaps the file") Signed-off-by: Sechang Lim Reviewed-by: Lorenzo Stoakes (Oracle) Acked-by: Vlastimil Babka (SUSE) Cc: Jann Horn Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Pedro Falcato Cc: Signed-off-by: Andrew Morton --- mm/vma.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'mm') diff --git a/mm/vma.c b/mm/vma.c index be64f781a3aa..c8df5f561ad7 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -2781,6 +2781,13 @@ unacct_error: if (map.charged) vm_unacct_memory(map.charged); abort_munmap: + /* + * This indicates that .mmap_prepare has set a new file, differing from + * desc->vm_file. But since we're aborting the operation, only the + * original file will be cleaned up. Ensure we clean up both. + */ + if (map.file_doesnt_need_get) + fput(map.file); vms_abort_munmap_vmas(&map.vms, &map.mas_detach); return error; } -- cgit v1.2.3 From 4c04c6b47c361612b1d70cec8f7a60b1482d1400 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 2 Apr 2026 06:44:17 -0700 Subject: mm/damon/stat: deallocate damon_call() failure leaking damon_ctx MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit damon_stat_start() always allocates the module's damon_ctx object (damon_stat_context). Meanwhile, if damon_call() in the function fails, the damon_ctx object is not deallocated. Hence, if the damon_call() is failed, and the user writes Y to “enabled” again, the previously allocated damon_ctx object is leaked. This cannot simply be fixed by deallocating the damon_ctx object when damon_call() fails. That's because damon_call() failure doesn't guarantee the kdamond main function, which accesses the damon_ctx object, is completely finished. In other words, if damon_stat_start() deallocates the damon_ctx object after damon_call() failure, the not-yet-terminated kdamond could access the freed memory (use-after-free). Fix the leak while avoiding the use-after-free by keeping returning damon_stat_start() without deallocating the damon_ctx object after damon_call() failure, but deallocating it when the function is invoked again and the kdamond is completely terminated. If the kdamond is not yet terminated, simply return -EAGAIN, as the kdamond will soon be terminated. The issue was discovered [1] by sashiko. Link: https://lkml.kernel.org/r/20260402134418.74121-1-sj@kernel.org Link: https://lore.kernel.org/20260401012428.86694-1-sj@kernel.org [1] Fixes: 405f61996d9d ("mm/damon/stat: use damon_call() repeat mode instead of damon_callback") Signed-off-by: SeongJae Park Cc: # 6.17.x Signed-off-by: Andrew Morton --- mm/damon/stat.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'mm') diff --git a/mm/damon/stat.c b/mm/damon/stat.c index cf2c5a541eee..60351a719460 100644 --- a/mm/damon/stat.c +++ b/mm/damon/stat.c @@ -245,6 +245,12 @@ static int damon_stat_start(void) { int err; + if (damon_stat_context) { + if (damon_is_running(damon_stat_context)) + return -EAGAIN; + damon_destroy_ctx(damon_stat_context); + } + damon_stat_context = damon_stat_build_ctx(); if (!damon_stat_context) return -ENOMEM; @@ -261,6 +267,7 @@ static void damon_stat_stop(void) { damon_stop(&damon_stat_context, 1); damon_destroy_ctx(damon_stat_context); + damon_stat_context = NULL; } static int damon_stat_enabled_store( -- cgit v1.2.3