path: root/mm
Age | Commit message | Author
12 hoursMerge tag 'mm-hotfixes-stable-2026-04-19-00-14' of ↵HEADmasterLinus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "7 hotfixes. 6 are cc:stable and all are for MM. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/damon/core: disallow non-power of two min_region_sz on damon_start() mm/vmalloc: take vmap_purge_lock in shrinker mm: call ->free_folio() directly in folio_unmap_invalidate() mm: blk-cgroup: fix use-after-free in cgwb_release_workfn() mm/zone_device: do not touch device folio after calling ->folio_free() mm/damon/core: disallow time-quota setting zero esz mm/mempolicy: fix weighted interleave auto sysfs name
18 hoursMerge tag 'mm-stable-2026-04-18-02-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, 
simplifications and speedups to the memfd_luo code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ...
27 hoursmm/damon/core: disallow non-power of two min_region_sz on damon_start()SeongJae Park
Commit d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") introduced a bug that allows unaligned DAMON region address ranges. Commit c80f46ac228b ("mm/damon/core: disallow non-power of two min_region_sz") fixed it, but only for damon_commit_ctx() use case. Still, DAMON sysfs interface can emit non-power of two min_region_sz via damon_start(). Fix the path by adding the is_power_of_2() check on damon_start(). The issue was discovered by sashiko [1]. Link: https://lore.kernel.org/20260411213638.77768-1-sj@kernel.org Link: https://lore.kernel.org/20260403155530.64647-1-sj@kernel.org [1] Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
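The power-of-two requirement described above can be illustrated with a minimal userspace sketch. The helper below mirrors the well-known bit trick behind the kernel's is_power_of_2(); check_min_region_sz() and align_down() are hypothetical names used only for illustration and are not the actual damon_start() code.

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's is_power_of_2() helper. */
static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical validation step: reject sizes that are not a power of two. */
static int check_min_region_sz(unsigned long min_region_sz)
{
	if (!is_power_of_2(min_region_sz))
		return -EINVAL;
	return 0;
}

/*
 * Mask-based alignment is only correct when sz is a power of two,
 * which is why an unchecked size can yield unaligned region ranges.
 */
static unsigned long align_down(unsigned long addr, unsigned long sz)
{
	return addr & ~(sz - 1);
}
```

With a valid size such as 4096 the check passes and align_down(0x12345, 4096) yields 0x12000; a size such as 4097 is rejected before any masking is attempted.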
27 hoursmm/vmalloc: take vmap_purge_lock in shrinkerUladzislau Rezki (Sony)
decay_va_pool_node() can be invoked concurrently from two paths: __purge_vmap_area_lazy() when pools are being purged, and the shrinker via vmap_node_shrink_scan(). However, decay_va_pool_node() is not safe to run concurrently, and the shrinker path currently lacks serialization, leading to races and possible leaks. Protect decay_va_pool_node() by taking vmap_purge_lock in the shrinker path to ensure serialization with purge users. Link: https://lore.kernel.org/20260413192646.14683-1-urezki@gmail.com Fixes: 7679ba6b36db ("mm: vmalloc: add a shrinker to drain vmap pools") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <baoquan.he@linux.dev> Cc: chenyichong <chenyichong@uniontech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
27 hoursmm: call ->free_folio() directly in folio_unmap_invalidate()Matthew Wilcox (Oracle)
We can only call filemap_free_folio() if we have a reference to (or hold a lock on) the mapping. Otherwise, we've already removed the folio from the mapping so it no longer pins the mapping and the mapping can be removed, causing a use-after-free when accessing mapping->a_ops. Follow the same pattern as __remove_mapping() and load the free_folio function pointer before dropping the lock on the mapping. That lets us make filemap_free_folio() static as this was the only caller outside filemap.c. Link: https://lore.kernel.org/20260413184314.3419945-1-willy@infradead.org Fixes: fb7d3bc41493 ("mm/filemap: drop streaming/uncached pages when writeback completes") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-501448199@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
27 hoursmm: blk-cgroup: fix use-after-free in cgwb_release_workfn()Breno Leitao
cgwb_release_workfn() calls css_put(wb->blkcg_css) and then later accesses wb->blkcg_css again via blkcg_unpin_online(). If css_put() drops the last reference, the blkcg can be freed asynchronously (css_free_rwork_fn -> blkcg_css_free -> kfree) before blkcg_unpin_online() dereferences the pointer to access blkcg->online_pin, resulting in a use-after-free: BUG: KASAN: slab-use-after-free in blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367) Write of size 4 at addr ff11000117aa6160 by task kworker/71:1/531 Workqueue: cgwb_release cgwb_release_workfn Call Trace: <TASK> blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367) cgwb_release_workfn (mm/backing-dev.c:629) process_scheduled_works (kernel/workqueue.c:3278 kernel/workqueue.c:3385) Freed by task 1016: kfree (./include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6246 mm/slub.c:6561) css_free_rwork_fn (kernel/cgroup/cgroup.c:5542) process_scheduled_works (kernel/workqueue.c:3302 kernel/workqueue.c:3385) ** Stack based on commit 66672af7a095 ("Add linux-next specific files for 20260410") I am seeing this crash sporadically in Meta fleet across multiple kernel versions. A full reproducer is available at: https://github.com/leitao/debug/blob/main/reproducers/repro_blkcg_uaf.sh (The race window is narrow. To make it easily reproducible, inject a msleep(100) between css_put() and blkcg_unpin_online() in cgwb_release_workfn(). With that delay and a KASAN-enabled kernel, the reproducer triggers the splat reliably in less than a second.) Fix this by moving blkcg_unpin_online() before css_put(), so the cgwb's CSS reference keeps the blkcg alive while blkcg_unpin_online() accesses it. 
Link: https://lore.kernel.org/20260413-blkcg-v1-1-35b72622d16c@debian.org Fixes: 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: JP Kobryn <inwardvessel@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
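The ordering rule behind this fix (finish the last access before dropping the reference that keeps the object alive) can be sketched in userspace. Everything below is a simplified model: struct blkcg_model, model_put() and the other names are assumptions for illustration, not the kernel's css or blkcg API.

```c
#include <stdlib.h>

/* Simplified model of a refcounted object; not the kernel's struct blkcg. */
struct blkcg_model {
	int refcnt;     /* stands in for the css reference count */
	int online_pin; /* stands in for blkcg->online_pin */
};

static int freed_count; /* number of models freed so far */

/* Drop one reference; free the object when the last one goes away. */
static void model_put(struct blkcg_model *b)
{
	if (--b->refcnt == 0) {
		freed_count++;
		free(b);
	}
}

/* Touches the object, so it must run while a reference is still held. */
static void model_unpin_online(struct blkcg_model *b)
{
	b->online_pin--;
}

/* Fixed ordering: last access first, then drop the reference. */
static void release_fixed(struct blkcg_model *b)
{
	model_unpin_online(b);
	model_put(b);
}

/* Small driver: allocate one model, release it, report how many were freed. */
static int demo_release(void)
{
	struct blkcg_model *b = malloc(sizeof(*b));

	if (!b)
		return -1;
	b->refcnt = 1;
	b->online_pin = 1;
	release_fixed(b);
	return freed_count;
}
```

Swapping the two calls in release_fixed() would reproduce the buggy pattern: model_put() could free the object and model_unpin_online() would then write to freed memory.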
27 hoursmm/zone_device: do not touch device folio after calling ->folio_free()Matthew Brost
The contents of a device folio can immediately change after calling ->folio_free(), as the folio may be reallocated by a driver with a different order. Instead of touching the folio again to extract the pgmap, use the local stack variable when calling percpu_ref_put_many(). Link: https://lore.kernel.org/20260410230346.4009855-1-matthew.brost@intel.com Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
27 hoursmm/damon/core: disallow time-quota setting zero eszSeongJae Park
When the throughput of a DAMOS scheme is very slow, DAMOS time quota can make the effective size quota smaller than damon_ctx->min_region_sz. In the case, damos_apply_scheme() will skip applying the action, because the action is tried at region level, which requires >=min_region_sz size. That is, the quota is effectively exceeded for the quota charge window. Because no action will be applied, the total_charged_sz and total_charged_ns are also not updated. damos_set_effective_quota() will try to update the effective size quota before starting the next charge window. However, because the total_charged_sz and total_charged_ns have not updated, the throughput and effective size quota are also not changed. Since effective size quota can only be decreased, other effective size quota update factors including DAMOS quota goals and size quota cannot make any change, either. As a result, the scheme is unexpectedly deactivated until the user notices and mitigates the situation. The users can mitigate this situation by changing the time quota online or re-install the scheme. While the mitigation is somewhat straightforward, finding the situation would be challenging, because DAMON is not providing good observabilities for that. Even if such observability is provided, doing the additional monitoring and the mitigation is somewhat cumbersome and not aligned to the intention of the time quota. The time quota was intended to help reduce the user's administration overhead. Fix the problem by setting time quota-modified effective size quota be at least min_region_sz always. The issue was discovered [1] by sashiko. Link: https://lore.kernel.org/20260407003153.79589-1-sj@kernel.org Link: https://lore.kernel.org/20260405192504.110014-1-sj@kernel.org [1] Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 5.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
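The effect of the fix can be approximated with a small arithmetic sketch. The function below is a deliberately simplified model, assuming the time-quota-derived size is throughput multiplied by the time quota; the name time_quota_esz() and its parameters are hypothetical and do not match the real damos_set_effective_quota() code.

```c
/*
 * Simplified model: derive an effective size quota from a measured
 * throughput and a time quota, then clamp it so it never drops below
 * min_region_sz. Without the clamp, a slow scheme could end up with a
 * quota too small to apply the action to even one region.
 */
static unsigned long time_quota_esz(unsigned long bytes_per_ms,
				    unsigned long quota_ms,
				    unsigned long min_region_sz)
{
	unsigned long esz = bytes_per_ms * quota_ms;

	return esz < min_region_sz ? min_region_sz : esz;
}
```

For example, a throughput of 100 bytes/ms with a 10 ms quota gives 1000 bytes, which is raised to a 4096-byte min_region_sz, so the scheme can still make progress and refresh its charged statistics.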
27 hoursmm/mempolicy: fix weighted interleave auto sysfs nameJoshua Hahn
The __ATTR macro is a utility that makes defining kobj_attributes easier by stringifying the name, verifying the mode, and setting the show/store fields in a single initializer. It takes a raw token as the first value, rather than a string, so that __ATTR family macros like __ATTR_RW can token-paste it for inferring the _show / _store function names. Commit e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") used the __ATTR macro to define the "auto" sysfs for weighted interleave. A few months later, commit 2fb6915fa22d ("compiler_types.h: add "auto" as a macro for "__auto_type"") introduced a #define macro which expanded auto into __auto_type. This led to the "auto" token passed into __ATTR to be expanded out into __auto_type, and the sysfs entry to be displayed as __auto_type as well. Expand out the __ATTR macro and directly pass a string "auto" instead of the raw token 'auto' to prevent it from being expanded out. Also bypass the VERIFY_OCTAL_PERMISSIONS check by triple checking that 0664 is indeed the intended permissions for this sysfs file. Before: $ ls /sys/kernel/mm/mempolicy/weighted_interleave __auto_type node0 After: $ ls /sys/kernel/mm/mempolicy/weighted_interleave/ auto node0 Link: https://lore.kernel.org/20260407141415.3080960-1-joshua.hahnjy@gmail.com Fixes: 2fb6915fa22d ("compiler_types.h: add "auto" as a macro for "__auto_type"") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Rakie Kim <rakie.kim@sk.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
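The macro interaction described above can be reproduced in a few lines of userspace C. This is a sketch, not the kernel's headers: STRINGIFY(), struct attr_model and ATTR_MODEL() are simplified stand-ins assumed for illustration, modelled on the two-level stringification that __ATTR relies on.

```c
#include <string.h>

/* Two-level stringification: the argument is macro-expanded first. */
#define STRINGIFY_1(x) #x
#define STRINGIFY(x) STRINGIFY_1(x)

/* Simplified stand-in for struct kobj_attribute (name and mode only). */
struct attr_model {
	const char *name;
	unsigned int mode;
};

/* Simplified stand-in for __ATTR: takes a raw token and stringifies it. */
#define ATTR_MODEL(_name, _mode) { .name = STRINGIFY(_name), .mode = (_mode) }

/* Mimics the later definition that turned "auto" into a macro. */
#define auto __auto_type

/* The raw token is expanded before stringification: name is "__auto_type". */
static const struct attr_model via_token = ATTR_MODEL(auto, 0664);

/* Passing a string literal sidesteps macro expansion: name stays "auto". */
static const struct attr_model via_string = { .name = "auto", .mode = 0664 };

/* Keep the keyword usable for any code that follows this sketch. */
#undef auto
```

Here via_token.name ends up as "__auto_type" while via_string.name is "auto", which matches the before and after sysfs listings in the commit message.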
39 hoursMerge tag 'memblock-v7.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - improve debuggability of reserve_mem kernel parameter handling with print outs in case of a failure and debugfs info showing what was actually reserved - Make memblock_free_late() and free_reserved_area() use the same core logic for freeing the memory to buddy and ensure it takes care of updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled. * tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: x86/alternative: delay freeing of smp_locks section memblock: warn when freeing reserved memory before memory map is initialized memblock, treewide: make memblock_free() handle late freeing memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y memblock: extract page freeing from free_reserved_area() into a helper memblock: make free_reserved_area() more robust mm: move free_reserved_area() to mm/memblock.c powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact() powerpc: fadump: pair alloc_pages_exact() with free_pages_exact() memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name() memblock: move reserve_bootmem_range() to memblock.c and make it static memblock: Add reserve_mem debugfs info memblock: Print out errors on reserve_mem parser
2 daysmm/vmscan: avoid false-positive -Wuninitialized warningArnd Bergmann
When the -fsanitize=bounds sanitizer is enabled, gcc-16 sometimes runs into a corner case in the read_ctrl_pos() pos function, where it sees possible undefined behavior from the 'tier' index overflowing, presumably in the case that this was called with a negative tier: In function 'get_tier_idx', inlined from 'isolate_folios' at mm/vmscan.c:4671:14: mm/vmscan.c: In function 'isolate_folios': mm/vmscan.c:4645:29: error: 'pv.refaulted' is used uninitialized [-Werror=uninitialized] Part of the problem seems to be that read_ctrl_pos() has unusual calling conventions since commit 37a260870f2c ("mm/mglru: rework type selection") where passing MAX_NR_TIERS makes it accumulate all tiers but passing a smaller positive number makes it read a single tier instead. Shut up the warning by adding a fake initialization to the two instances of this variable that can run into that corner case. Link: https://lore.kernel.org/all/CAJHvVcjtFW86o5FoQC8MMEXCHAC0FviggaQsd5EmiCHP+1fBpg@mail.gmail.com/ Link: https://lore.kernel.org/20260414065206.3236176-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Koichiro Den <koichiro.den@canonical.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/migrate_device: remove dead migration entry check in ↵Davidlohr Bueso
migrate_vma_collect_huge_pmd() The softleaf_is_migration() check is unreachable as entries that are not device_private are filtered out. Similarly, the PTE-level equivalent in migrate_vma_collect_pmd() skips migration entries. This dead branch also contained a double spin_unlock(ptl) bug. Link: https://lore.kernel.org/20260212014611.416695-1-dave@stgolabs.net Fixes: a30b48bf1b244 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/vmstat: fix vmstat_shepherd double-scheduling vmstat_updateBreno Leitao
vmstat_shepherd uses delayed_work_pending() to check whether vmstat_update is already scheduled for a given CPU before queuing it. However, delayed_work_pending() only tests WORK_STRUCT_PENDING_BIT, which is cleared the moment a worker thread picks up the work to execute it. This means that while vmstat_update is actively running on a CPU, delayed_work_pending() returns false. If need_update() also returns true at that point (per-cpu counters not yet zeroed mid-flush), the shepherd queues a second invocation with delay=0, causing vmstat_update to run again immediately after finishing. On a 72-CPU system this race is readily observable: before the fix, many CPUs show invocation gaps well below 500 jiffies (the minimum round_jiffies_relative() can produce), with the most extreme cases reaching 0 jiffies—vmstat_update called twice within the same jiffy. Fix this by replacing delayed_work_pending() with work_busy(), which returns non-zero for both WORK_BUSY_PENDING (timer armed or work queued) and WORK_BUSY_RUNNING (work currently executing). The shepherd now correctly skips a CPU in all busy states. After the fix, all sub-jiffy and most sub-100-jiffie gaps disappear. The remaining early invocations have gaps in the 700–999 jiffie range, attributable to round_jiffies_relative() aligning to a nearer jiffie-second boundary rather than to this race. Each spurious vmstat_update invocation has a measurable side effect: refresh_cpu_vm_stats() calls decay_pcp_high() for every zone, which drains idle per-CPU pages back to the buddy allocator via free_pcppages_bulk(), taking the zone spinlock each time. Eliminating the double-scheduling therefore reduces zone lock contention directly. 
On a 72-CPU stress-ng workload measured with perf lock contention: free_pcppages_bulk contention count: ~55% reduction free_pcppages_bulk total wait time: ~57% reduction free_pcppages_bulk max wait time: ~47% reduction Note: work_busy() is inherently racy—between the check and the subsequent queue_delayed_work_on() call, vmstat_update can finish execution, leaving the work neither pending nor running. In that narrow window the shepherd can still queue a second invocation. After the fix, this residual race is rare and produces only occasional small gaps, a significant improvement over the systematic double-scheduling seen with delayed_work_pending(). Link: https://lore.kernel.org/20260409-vmstat-v2-1-e9d9a6db08ad@debian.org Fixes: 7b8da4c7f07774 ("vmstat: get rid of the ugly cpu_stat_off variable") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/hugetlb: fix early boot crash on parameters without '=' separatorThorsten Blum
If hugepages, hugepagesz, or default_hugepagesz are specified on the kernel command line without the '=' separator, early parameter parsing passes NULL to hugetlb_add_param(), which dereferences it in strlen() and can crash the system during early boot. Reject NULL values in hugetlb_add_param() and return -EINVAL instead. Link: https://lore.kernel.org/20260409105437.108686-4-thorsten.blum@linux.dev Fixes: 5b47c02967ab ("mm/hugetlb: convert cmdline parameters from setup to early") Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
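The shape of the fix, rejecting a NULL value before it reaches strlen(), can be sketched as follows. This is a userspace model only: add_param_model(), the buffer and its size are hypothetical and do not reproduce the real hugetlb_add_param() implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define PARAM_BUF_SIZE 64 /* hypothetical buffer size for this sketch */

static char param_buf[PARAM_BUF_SIZE];
static size_t param_used;

/*
 * Store a parameter value for later parsing. A parameter given without
 * '=' arrives as NULL, so it must be rejected before calling strlen().
 */
static int add_param_model(const char *s)
{
	size_t len;

	if (!s)
		return -EINVAL;

	len = strlen(s) + 1;
	if (param_used + len > sizeof(param_buf))
		return -EINVAL;

	memcpy(param_buf + param_used, s, len);
	param_used += len;
	return 0;
}
```

Calling add_param_model(NULL) returns -EINVAL instead of dereferencing the pointer, while a normal value such as "2M" is stored as before.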
2 daysmm/mprotect: special-case small folios when applying permissionsPedro Falcato
The common order-0 case is important enough to want its own branch, and avoids the hairy, large loop logic that the CPU does not seem to handle particularly well. While at it, encourage the compiler to inline batch PTE logic and resolve constant branches by adding __always_inline strategically. Link: https://lore.kernel.org/20260402141628.3367596-3-pfalcato@suse.de Signed-off-by: Pedro Falcato <pfalcato@suse.de> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/mprotect: move softleaf code out of the main functionPedro Falcato
Patch series "mm/mprotect: micro-optimization work", v3. Micro-optimize the change_protection functionality and the change_pte_range() routine. This set of functions works in an incredibly tight loop, and even small inefficiencies are incredibly evident when spun hundreds, thousands or hundreds of thousands of times. There was an attempt to keep the batching functionality as much as possible, which introduced some part of the slowness, but not all of it. Removing it for !arm64 architectures would speed mprotect() up even further, but could easily pessimize cases where large folios are mapped (which is not as rare as it seems, particularly when it comes to the page cache these days). The micro-benchmark used for the tests was [0] (usable using google/benchmark and g++ -O2 -lbenchmark repro.cpp) This resulted in the following (first entry is baseline): --------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------- mprotect_bench 85967 ns 85967 ns 6935 mprotect_bench 70684 ns 70684 ns 9887 After the patchset we can observe an ~18% speedup in mprotect. Wonderful for the elusive mprotect-based workloads! Testing & more ideas welcome. I suspect there is plenty of improvement possible but it would require more time than what I have on my hands right now. The entire inlined function (which inlines into change_protection()) is gigantic - I'm not surprised this is so finnicky. Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes, exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think there's a properly safe way to go about it since we do depend on the D bit quite a lot. This might not be such an issue on other architectures. Luke Yang reported [1]: : On average, we see improvements ranging from a minimum of 5% to a : maximum of 55%, with most improvements showing around a 25% speed up in : the libmicro/mprot_tw4m micro benchmark. 
This patch (of 2): Move softleaf change_pte_range code into a separate function. This makes the change_pte_range() function a good bit smaller, and lessens cognitive load when reading through the function. Link: https://lore.kernel.org/20260402141628.3367596-1-pfalcato@suse.de Link: https://lore.kernel.org/20260402141628.3367596-2-pfalcato@suse.de Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/ Link: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933 [0] Link: https://lore.kernel.org/CAL2CeBxT4jtJ+LxYb6=BNxNMGinpgD_HYH5gGxOP-45Q2OncqQ@mail.gmail.com [1] Signed-off-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm: remove '!root_reclaim' checking in should_abort_scan()Zhaoyang Huang
Android systems usually use the memory.reclaim interface to implement user space memory management, which expects that the requested reclaim target and the amount of memory actually reclaimed do not diverge by too much. With the current MGLRU implementation there is, however, no bail out when the reclaim target is reached and this could lead to an excessive reclaim that scales with the reclaim hierarchy size. For example, we can get a nr_reclaimed=394/nr_to_reclaim=32 proactive reclaim under a common 1-N cgroup hierarchy. This defect arose from the goal of keeping fairness among memcgs. That is, for try_to_free_mem_cgroup_pages -> shrink_node_memcgs -> shrink_lruvec -> lru_gen_shrink_lruvec -> try_to_shrink_lruvec, the !root_reclaim(sc) check was there for reclaim fairness, which was necessary before commit b82b530740b9 ("mm: vmscan: restore incremental cgroup iteration") because the fairness depended on attempted proportional reclaim from every memcg under the target memcg. However after commit b82b530740b9 there is no longer a need to visit every memcg to ensure fairness. Let's have try_to_shrink_lruvec bail out once nr_reclaimed reaches the reclaim target. Link: https://lore.kernel.org/20260318011558.1696310-1-zhaoyang.huang@unisoc.com Link: https://lore.kernel.org/20260212032111.408865-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Suggested-by: T.J.Mercier <tjmercier@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Qi Zheng <qi.zheng@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()David Carlier
sio_read_complete() uses sio->pages to account global PSWPIN vm events, but sio->pages tracks the number of bvec entries (folios), not base pages. While large folios cannot currently reach this path (SWP_FS_OPS and SWP_SYNCHRONOUS_IO are mutually exclusive, and mTHP swap-in allocation is gated on SWP_SYNCHRONOUS_IO), the accounting is semantically inconsistent with the per-memcg path which correctly uses folio_nr_pages(). Use sio->len >> PAGE_SHIFT instead, which gives the correct base page count since sio->len is accumulated via folio_size(folio). Link: https://lore.kernel.org/20260402061408.36119-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: NeilBrown <neil@brown.name> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
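The difference between counting bvec entries and counting base pages can be shown with a small model. The struct and helpers below are assumptions for illustration (struct sio_model is not the kernel's struct swap_iocb), and the large folio in the example is hypothetical, since the commit message notes that large folios cannot currently reach this path.

```c
#include <stddef.h>

#define PAGE_SHIFT_MODEL 12 /* assumes 4 KiB base pages */

/* Simplified model: one entry per folio plus the accumulated byte length. */
struct sio_model {
	int pages;  /* number of bvec entries (folios), not base pages */
	size_t len; /* total bytes, accumulated from each folio's size */
};

/* Account one folio of the given size in bytes. */
static void sio_add_folio(struct sio_model *sio, size_t folio_size)
{
	sio->pages++;
	sio->len += folio_size;
}

/* Base page count derived from the byte length. */
static size_t sio_base_pages(const struct sio_model *sio)
{
	return sio->len >> PAGE_SHIFT_MODEL;
}

/* Example batch: one 4 KiB folio plus one hypothetical 64 KiB folio. */
static struct sio_model sio_example(void)
{
	struct sio_model sio = { 0, 0 };

	sio_add_folio(&sio, 4096);
	sio_add_folio(&sio, 65536);
	return sio;
}

static int example_entries(void)
{
	return sio_example().pages;
}

static size_t example_base_pages(void)
{
	struct sio_model sio = sio_example();

	return sio_base_pages(&sio);
}
```

In this example the entry count is 2 while the base page count is 17, so an event counter fed from the entry count would under-report as soon as a large folio appeared in the batch.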
2 daysuserfaultfd: mfill_atomic(): remove retry logicMike Rapoport (Microsoft)
Since __mfill_atomic_pte() handles the retry for both anonymous and shmem, there is no need to retry copying the data from userspace in the loop in mfill_atomic(). Drop the retry logic from mfill_atomic(). [rppt@kernel.org: remove safety measure of not returning ENOENT from _copy] Link: https://lore.kernel.org/ac5zcDUY8CFHr6Lw@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysshmem, userfaultfd: implement shmem uffd operations using vm_uffd_opsMike Rapoport (Microsoft)
Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd now does not reference any functions from shmem, drop the include of linux/shmem_fs.h from mm/userfaultfd.c. mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd, so make it static. Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: introduce vm_uffd_ops->alloc_folio()Mike Rapoport (Microsoft)
and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio * update folio contents (either copy from userspace or fill with zeros) * update page tables with the new folio Split a __mfill_atomic_pte() helper that handles both cases and uses newly introduced vm_uffd_ops->alloc_folio() to allocate the folio. Pass the ops structure from the callers to __mfill_atomic_pte() to later allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs. Note that the new ops method is called alloc_folio() rather than folio_alloc() to avoid a clash with the alloc_tag macro folio_alloc(). Link: https://lore.kernel.org/20260402041156.1377214-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysshmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUEMike Rapoport (Microsoft)
When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE it needs to get a folio that already exists in the pagecache backing that VMA. Instead of using shmem_get_folio() for that, add a get_folio_noalloc() method to 'struct vm_uffd_ops' that will return a folio if it exists in the VMA's pagecache at given pgoff. Implement get_folio_noalloc() method for shmem and slightly refactor userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support this new API. Link: https://lore.kernel.org/20260402041156.1377214-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: introduce vm_uffd_opsMike Rapoport (Microsoft)
Current userfaultfd implementation works only with memory managed by core MM: anonymous, shmem and hugetlb. First, there is no fundamental reason to limit userfaultfd support only to the core memory types and userfaults can be handled similarly to regular page faults provided a VMA owner implements appropriate callbacks. Second, historically various code paths were conditioned on vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of these conditions can be expressed as operations implemented by a particular memory type. Introduce vm_uffd_ops extension to vm_operations_struct that will delegate memory type specific operations to a VMA owner. Operations for anonymous memory are handled internally in userfaultfd using anon_uffd_ops that implicitly assigned to anonymous VMAs. Start with a single operation, ->can_userfault() that will verify that a VMA meets requirements for userfaultfd support at registration time. Implement that method for anonymous, shmem and hugetlb and move relevant parts of vma_can_userfault() into the new callbacks. 
[rppt@kernel.org: relocate VM_DROPPABLE test, per Tal] Link: https://lore.kernel.org/adffgfM5ANxtPIEF@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Tal Zussman <tz2294@columbia.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: move vma_can_userfault out of lineMike Rapoport (Microsoft)
vma_can_userfault() has grown pretty big and it's not called on performance critical path. Move it out of line. No functional changes. Link: https://lore.kernel.org/20260402041156.1377214-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()Mike Rapoport (Microsoft)
Implementation of UFFDIO_COPY for anonymous memory might fail to copy data from userspace buffer when the destination VMA is locked (either with mm_lock or with per-VMA lock). In that case, mfill_atomic() releases the locks, retries copying the data with locks dropped and then re-locks the destination VMA and re-establishes PMD. Since this retry-reget dance is only relevant for UFFDIO_COPY and it never happens for other UFFDIO_ operations, make it a part of mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for anonymous memory. As a temporary safety measure to avoid breaking bisection, mfill_atomic_pte_copy() makes sure to never return -ENOENT so that the loop in mfill_atomic() won't retry copying outside of mmap_lock. This is removed later, when the shmem implementation is updated and the loop in mfill_atomic() is adjusted. [akpm@linux-foundation.org: update mfill_copy_folio_retry()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260306171815.3160826-6-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: introduce mfill_get_vma() and mfill_put_vma()Mike Rapoport (Microsoft)
Split the code that finds, locks and verifies VMA from mfill_atomic() into a helper function. This function will be used later during refactoring of mfill_atomic_pte_copy(). Add a counterpart mfill_put_vma() helper that unlocks the VMA and releases map_changing_lock. [avagin@google.com: fix lock leak in mfill_get_vma()] Link: https://lore.kernel.org/20260316173829.1126728-1-avagin@google.com Link: https://lore.kernel.org/20260402041156.1377214-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: introduce mfill_establish_pmd() helperMike Rapoport (Microsoft)
There is a lengthy code chunk in mfill_atomic() that establishes the PMD for UFFDIO operations. This code may be called twice: first time when the copy is performed with VMA/mm locks held and the other time after the copy is retried with locks dropped. Move the code that establishes a PMD into a helper function so it can be reused later during refactoring of mfill_atomic_pte_copy(). Link: https://lore.kernel.org/20260402041156.1377214-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: introduce struct mfill_stateMike Rapoport (Microsoft)
mfill_atomic() passes a lot of parameters down to its callees. Aggregate them all into mfill_state structure and pass this structure to functions that implement various UFFDIO_ commands. Tracking the state in a structure will allow moving the code that retries copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped memory. The mfill_state definition is deliberately local to mm/userfaultfd.c, hence shmem_mfill_atomic_pte() is not updated. [harry.yoo@oracle.com: properly initialize mfill_state.len to fix folio_add_new_anon_rmap() WARN] Link: https://lore.kernel.org/abehBY7QakYF9bK4@hyeyoo Link: https://lore.kernel.org/20260402041156.1377214-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysuserfaultfd: introduce mfill_copy_folio_locked() helperMike Rapoport (Microsoft)
Patch series "mm, kvm: allow uffd support in guest_memfd", v4. These patches enable support for userfaultfd in guest_memfd. As the groundwork I refactored userfaultfd handling of PTE-based memory types (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a folio or getting an existing folio from the page cache. shmem also implements callbacks that add a folio to the page cache after the data passed in UFFDIO_COPY was copied and remove the folio from the page cache if page table update fails. In order for guest_memfd to notify userspace about page faults, there are new VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can return to inform the page fault handler that it needs to call handle_userfault() to complete the fault. Nikita helped to plumb these new goodies into guest_memfd and provided basic tests to verify that guest_memfd works with userfaultfd. The handling of UFFDIO_MISSING in guest_memfd requires the ability to remove a folio from the page cache; the best way I could find was exporting filemap_remove_folio() to KVM. I deliberately left hugetlb out, at least for the most part. hugetlb handles acquisition of the VMA and, more importantly, establishing of the parent page table entry differently than PTE-based memory types. This is a different abstraction level than what vm_uffd_ops provides and people objected to exposing such low level APIs as a part of VMA operations. Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I prefer to delay it until the dust settles after the changes in this set. This patch (of 4): Split copying of data when locks are held from mfill_atomic_pte_copy() into a helper function mfill_copy_folio_locked(). This improves code readability and makes the complex mfill_atomic_pte_copy() function easier to comprehend. No functional change. 
Link: https://lore.kernel.org/20260402041156.1377214-1-rppt@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/memfd_luo: remove folio from page cache when accounting failsChenghao Duan
In memfd_luo_retrieve_folios(), when shmem_inode_acct_blocks() fails after successfully adding the folio to the page cache, the code jumps to unlock_folio without removing the folio from the page cache. While the folio eventually will be freed when the file is released by memfd_luo_retrieve(), it is a good idea to directly remove a folio that was not fully added to the file. This avoids the possibility of accounting mismatches in shmem or filemap core. Fix by adding a remove_from_cache label that calls filemap_remove_folio() before unlocking, matching the error handling pattern in shmem_alloc_and_add_folio(). This issue was identified by AI review: https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn [pratyush@kernel.org: changelog alterations] Link: https://lore.kernel.org/2vxzzf3lfujq.fsf@kernel.org Link: https://lore.kernel.org/20260326084727.118437-7-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/memfd_luo: fix physical address conversion in put_folios cleanupChenghao Duan
In memfd_luo_retrieve_folios()'s put_folios cleanup path: 1. kho_restore_folio() expects a phys_addr_t (physical address) but receives a raw PFN (pfolio->pfn). This causes kho_restore_page() to check the wrong physical address (pfn << PAGE_SHIFT instead of the actual physical address). 2. This loop lacks the !pfolio->pfn check that exists in the main retrieval loop and memfd_luo_discard_folios(), which could incorrectly process sparse file holes where pfn=0. Fix by converting PFN to physical address with PFN_PHYS() and adding the !pfolio->pfn check, matching the pattern used elsewhere in this file. This issue was identified by the AI review. https://sashiko.dev/#/patchset/20260323110747.193569-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-6-duanchenghao@kylinos.cn Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd") Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
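Both parts of the fix described above (converting the PFN to a physical address and skipping holes) can be sketched in a few lines of userspace C. The PFN_PHYS() definition mirrors the usual kernel form; the struct pfolio_model type, collect_restore_addrs(), and the 4 KiB PAGE_SHIFT are hypothetical stand-ins, not the memfd_luo code itself.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB base pages */

typedef uint64_t phys_addr_t;

/* Same shape as the kernel macro: widen first, then shift. */
#define PFN_PHYS(pfn) ((phys_addr_t)(pfn) << PAGE_SHIFT)

/* Hypothetical model of one preserved-folio record. */
struct pfolio_model {
	uint64_t pfn; /* 0 marks a hole in a sparse file */
};

/*
 * Gather the physical addresses a cleanup loop would pass to a
 * restore function, skipping holes. Returns the number collected.
 */
static size_t collect_restore_addrs(const struct pfolio_model *folios,
				    size_t n, phys_addr_t *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (!folios[i].pfn)
			continue; /* sparse hole: nothing to restore */
		out[count++] = PFN_PHYS(folios[i].pfn);
	}
	return count;
}
```

Passing the raw PFN instead of PFN_PHYS(pfn) would hand the callee a value that is smaller than the real address by a factor of the page size, which is the mismatch the changelog describes.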
2 daysmm/memfd_luo: use i_size_write() to set inode size during retrieveChenghao Duan
Use i_size_write() instead of directly assigning to inode->i_size when restoring the memfd size in memfd_luo_retrieve(), to keep code consistency. No functional change intended. Link: https://lore.kernel.org/20260326084727.118437-5-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/memfd_luo: remove unnecessary memset in zero-size memfd pathChenghao Duan
The memset(kho_vmalloc, 0, sizeof(*kho_vmalloc)) call in the zero-size file handling path is unnecessary because the allocation of the ser structure already uses the __GFP_ZERO flag, ensuring the memory is already zero-initialized. Link: https://lore.kernel.org/20260326084727.118437-4-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/memfd_luo: optimize shmem_recalc_inode calls in retrieve pathChenghao Duan
Move shmem_recalc_inode() out of the loop in memfd_luo_retrieve_folios() to improve performance when restoring large memfds. Currently, shmem_recalc_inode() is called for each folio during restore, which adds up to O(n) expensive calls. This patch collects the number of successfully added folios and calls shmem_recalc_inode() once after the loop completes, reducing the number of calls to O(1). Additionally, fix the error path to also call shmem_recalc_inode() for the folios that were successfully added before the error occurred. Link: https://lore.kernel.org/20260326084727.118437-3-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/memfd: use folio_nr_pages() for shmem inode accountingChenghao Duan
Patch series "Modify memfd_luo code", v3. I found several modifiable points while reading the code. This patch (of 6): memfd_luo_retrieve_folios() called shmem_inode_acct_blocks() and shmem_recalc_inode() with hardcoded 1 instead of the actual folio page count. memfd may use large folios (THP/hugepages), causing quota/limit under-accounting and incorrect stat output. Fix by using folio_nr_pages(folio) for both functions. Issue found by AI review and suggested by Pratyush Yadav <pratyush@kernel.org>. https://sashiko.dev/#/patchset/20260319012845.29570-1-duanchenghao%40kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260326084727.118437-2-duanchenghao@kylinos.cn Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn> Suggested-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Haoran Jiang <jianghaoran@kylinos.cn> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/sparse: fix preinited section_mem_map clobbering on failure pathMuchun Song
sparse_init_nid() is careful to leave alone every section whose vmemmap has already been set up by sparse_vmemmap_init_nid_early(); it only clears section_mem_map for the rest: if (!preinited_vmemmap_section(ms)) ms->section_mem_map = 0; A leftover line after that conditional block ms->section_mem_map = 0; was supposed to be deleted but was missed in the failure path, causing the field to be overwritten for all sections when memory allocation fails, effectively destroying the pre-initialization check. Drop the stray assignment so that preinited sections retain their already valid state. Those pre-inited sections (HugeTLB pages) are not activated. However, such failures are extremely rare, so I don't see any major userspace issues. Link: https://lore.kernel.org/20260331113724.2080833-1-songmuchun@bytedance.com Fixes: d65917c42373 ("mm/sparse: allow for alternate vmemmap section init at boot") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed by: Donet Tom <donettom@linux.ibm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
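The intended failure-path behaviour can be modelled in a few lines. The struct section_model type and clear_sections_on_failure() are hypothetical; the preinited flag stands in for whatever preinited_vmemmap_section() checks in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memory section. */
struct section_model {
	unsigned long section_mem_map;
	bool preinited; /* stand-in for preinited_vmemmap_section() */
};

/* Failure path: clear only sections that were not set up early. */
static void clear_sections_on_failure(struct section_model *sections,
				      size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!sections[i].preinited)
			sections[i].section_mem_map = 0;
		/*
		 * The stray unconditional "section_mem_map = 0" sat here
		 * and wiped preinited sections too; the fix removes it.
		 */
	}
}
```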
2 daysmm/mempolicy: fix memory leaks in weighted_interleave_auto_store()Jackie Liu
weighted_interleave_auto_store() fetches old_wi_state inside the if (!input) block only. This causes two memory leaks: 1. When a user writes "false" and the current mode is already manual, the function returns early without freeing the freshly allocated new_wi_state. 2. When a user writes "true", old_wi_state stays NULL because the fetch is skipped entirely. The old state is then overwritten by rcu_assign_pointer() but never freed, since the cleanup path is gated on old_wi_state being non-NULL. A user can trigger this repeatedly by writing "1" in a loop. Fix both leaks by moving the old_wi_state fetch before the input check, making it unconditional. This also allows a unified early return for both "true" and "false" when the requested mode matches the current mode. Link: https://lore.kernel.org/20260401005702.7096-1-liu.yun@linux.dev Link: https://sashiko.dev/#/patchset/20260331100740.84906-1-liu.yun@linux.dev Fixes: e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed by: Donet Tom <donettom@linux.ibm.com> Cc: Gregory Price <gourry@gourry.net> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@kernel.org> Cc: <stable@vger.kernel.org> # v6.16+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/damon/core: use time_in_range_open() for damos quota window startSeongJae Park
damos_adjust_quota() uses time_after_eq() to show if it is time to start a new quota charge window, comparing the current jiffies and the scheduled next charge window start time. If it is, the next charge window start time is updated and the new charge window starts. The time check and next window start time update is skipped while the scheme is deactivated by the watermarks. Let's suppose the deactivation is kept more than LONG_MAX jiffies (assuming CONFIG_HZ of 250, more than 99 days in 32 bit systems and more than one billion years in 64 bit systems), resulting in having the jiffies larger than the next charge window start time + LONG_MAX. Then, the time_after_eq() call can return false until another LONG_MAX jiffies are passed. This means the scheme can continue working after being reactivated by the watermarks. But, soon, the quota will be exceeded and the scheme will again effectively stop working until the next charge window starts. Because the current charge window is extended to up to LONG_MAX jiffies, however, it will look like it stopped unexpectedly and indefinitely, from the user's perspective. Fix this by using !time_in_range_open() instead. The issue was discovered [1] by sashiko. Link: https://lore.kernel.org/20260329152306.45796-1-sj@kernel.org Link: https://lore.kernel.org/20260324040722.57944-1-sj@kernel.org [1] Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 5.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
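The wraparound behaviour described above can be reproduced in userspace. The *_model helpers below follow the well-known signed-difference form of the kernel's jiffies comparison macros; window_expired_old() and window_expired_new() and their parameters are assumptions about how the check is used, not the actual damos_adjust_quota() code.

```c
#include <limits.h>
#include <stdbool.h>

/* Signed-difference comparisons in the style of the jiffies helpers. */
static bool time_after_eq_model(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static bool time_before_model(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* True when b <= a < c in wraparound arithmetic. */
static bool time_in_range_open_model(unsigned long a, unsigned long b,
				     unsigned long c)
{
	return time_after_eq_model(a, b) && time_before_model(a, c);
}

/* Old check: has "now" reached the end of the charge window? */
static bool window_expired_old(unsigned long now, unsigned long start,
			       unsigned long interval)
{
	return time_after_eq_model(now, start + interval);
}

/* New check: is "now" anywhere outside [start, start + interval)? */
static bool window_expired_new(unsigned long now, unsigned long start,
			       unsigned long interval)
{
	return !time_in_range_open_model(now, start, start + interval);
}
```

Once "now" is more than LONG_MAX ticks past the window end, the signed difference in the old check turns negative and the window looks unexpired; the range check still reports expiry because "now" also fails the lower-bound comparison.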
2 daysmm/damon/core: validate damos_quota_goal->nid for node_memcg_{used,free}_bpSeongJae Park
Users can set damos_quota_goal->nid with arbitrary value for node_memcg_{used,free}_bp. But DAMON core is using those for NODE_DATA() without a validation of the value. This can result in out of bounds memory access. The issue can actually be triggered using the DAMON user-space tool (damo), like below. $ sudo mkdir /sys/fs/cgroup/foo $ sudo ./damo start --damos_action stat --damos_quota_interval 1s \ --damos_quota_goal node_memcg_used_bp 50% -1 /foo $ sudo dmesg [...] [ 524.181426] Unable to handle kernel paging request at virtual address 0000000000002c00 Fix this issue by adding the validation of the given node id. If an invalid node id is given, it returns 0% for used memory ratio, and 100% for free memory ratio. Link: https://lore.kernel.org/20260329043902.46163-3-sj@kernel.org Fixes: b74a120bcf50 ("mm/damon/core: implement DAMOS_QUOTA_NODE_MEMCG_USED_BP") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.19.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/damon/core: validate damos_quota_goal->nid for node_mem_{used,free}_bpSeongJae Park
Patch series "mm/damon/core: validate damos_quota_goal->nid". node_mem[cg]_{used,free}_bp DAMOS quota goals receive the node id. The node id is used for si_meminfo_node() and NODE_DATA() without proper validation. As a result, privileged users can trigger an out of bounds memory access using DAMON_SYSFS. Fix the issues. The issue was originally reported [1] with a fix by another author. The original author announced [2] that they will stop working including the fix that was still in the review stage. Hence I'm restarting this. This patch (of 2): Users can set damos_quota_goal->nid with arbitrary value for node_mem_{used,free}_bp. But DAMON core is using those for si_meminfo_node() without the validation of the value. This can result in out of bounds memory access. The issue can actually be triggered using the DAMON user-space tool (damo), like below. $ sudo ./damo start --damos_action stat \ --damos_quota_goal node_mem_used_bp 50% -1 \ --damos_quota_interval 1s $ sudo dmesg [...] [ 65.565986] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000098 Fix this issue by adding the validation of the given node. If an invalid node id is given, it returns 0% for used memory ratio, and 100% for free memory ratio. Link: https://lore.kernel.org/20260329043902.46163-2-sj@kernel.org Link: https://lore.kernel.org/20260325073034.140353-1-objecting@objecting.org [1] Link: https://lore.kernel.org/20260327040924.68553-1-sj@kernel.org [2] Fixes: 0e1c773b501f ("mm/damon/core: introduce damos quota goal metrics for memory node utilization") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
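The validate-then-compute pattern described above might look like the following userspace sketch. Everything here is hypothetical: the node table, MODEL_MAX_NODES, and the helper names are illustration only, and the handling of a node with zero total memory is an assumption rather than something the changelog states. Ratios are in basis points, so 100% is 10000.

```c
#include <stdbool.h>

#define MODEL_MAX_NODES 4 /* hypothetical node count */

/* Hypothetical per-node memory statistics. */
struct node_mem_model {
	unsigned long total;
	unsigned long free;
	bool online;
};

/* Reject negative, out-of-range, and offline node ids. */
static bool nid_valid_model(const struct node_mem_model *nodes, int nid)
{
	return nid >= 0 && nid < MODEL_MAX_NODES && nodes[nid].online;
}

/* Used-memory ratio in basis points; 0 for an invalid node id. */
static unsigned long node_mem_used_bp_model(const struct node_mem_model *nodes,
					    int nid)
{
	if (!nid_valid_model(nodes, nid) || !nodes[nid].total)
		return 0;
	return (nodes[nid].total - nodes[nid].free) * 10000 / nodes[nid].total;
}

/* Free-memory ratio in basis points; 10000 for an invalid node id. */
static unsigned long node_mem_free_bp_model(const struct node_mem_model *nodes,
					    int nid)
{
	if (!nid_valid_model(nodes, nid) || !nodes[nid].total)
		return 10000;
	return nodes[nid].free * 10000 / nodes[nid].total;
}
```

The key point is that the node id is checked before it is ever used as an index, so a value such as -1 from the reproducer never reaches the lookup.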
2 daysmm/damon/stat: fix memory leak on damon_start() failure in damon_stat_start()Jackie Liu
Destroy the DAMON context and reset the global pointer when damon_start() fails. Otherwise, the context allocated by damon_stat_build_ctx() is leaked, and the stale damon_stat_context pointer will be overwritten on the next enable attempt, making the old allocation permanently unreachable. Link: https://lore.kernel.org/20260331101553.88422-1-liu.yun@linux.dev Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.17.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/damon/core: fix damos_walk() vs kdamond_fn() exit raceSeongJae Park
When kdamond_fn() main loop is finished, the function cancels the remaining damos_walk() request and unsets damon_ctx->kdamond so that API callers and API functions themselves can see the context is terminated. damos_walk() adds the caller's request to the queue first. After that, it checks if the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damos_walk() starts waiting for the kdamond's handling of the newly added request. The damos_walk() request registration and the damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damos_walk() could race with the damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damos_walk() request cancelling. Right after that, damos_walk() is called for the context. It registers the new request, and sees the context is still running, because the damon_ctx->kdamond unset is not yet done. Hence the damos_walk() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damos_walk() caller thread waits forever. Fix this by introducing another damon_ctx field, namely walk_control_obsolete. It is protected by the damon_ctx->walk_control_lock, which protects damos_walk() request registration. Initialize (unset) it in kdamond_fn() before letting damon_start() return and set it just before the cancelling of the remaining damos_walk() request is executed. damos_walk() reads the obsolete field under the lock and avoids adding a new request. After this change, only requests that are guaranteed to be handled or cancelled are registered. Hence the after-registration DAMON context termination check is no longer needed. Remove it together. The issue is found by sashiko [1]. 
Link: https://lore.kernel.org/20260327233319.3528-3-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.14.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/damon/core: fix damon_call() vs kdamond_fn() exit raceSeongJae Park
Patch series "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race". damon_call() and damos_walk() can leak memory and/or deadlock when they race with kdamond terminations. Fix those. This patch (of 2): When kdamond_fn() main loop is finished, the function cancels all remaining damon_call() requests and unsets damon_ctx->kdamond so that API callers and API functions themselves can know the context is terminated. damon_call() adds the caller's request to the queue first. After that, it checks if the kdamond of the damon_ctx is still running (damon_ctx->kdamond is set). Only if the kdamond is running, damon_call() starts waiting for the kdamond's handling of the newly added request. The damon_call() request registration and the damon_ctx->kdamond unset are protected by different mutexes, though. Hence, damon_call() could race with the damon_ctx->kdamond unset, and result in deadlocks. For example, let's suppose kdamond successfully finished the damon_call() requests cancelling. Right after that, damon_call() is called for the context. It registers the new request, and sees the context is still running, because the damon_ctx->kdamond unset is not yet done. Hence the damon_call() caller starts waiting for the handling of the request. However, the kdamond is already on the termination steps, so it never handles the new request. As a result, the damon_call() caller thread waits forever. Fix this by introducing another damon_ctx field, namely call_controls_obsolete. It is protected by the damon_ctx->call_controls_lock, which protects damon_call() request registration. Initialize (unset) it in kdamond_fn() before letting damon_start() return and set it just before the cancelling of remaining damon_call() requests is executed. damon_call() reads the obsolete field under the lock and avoids adding a new request. After this change, only requests that are guaranteed to be handled or cancelled are registered. 
Hence the after-registration DAMON context termination check is no longer needed. Remove it together. Note that the deadlock will not happen when damon_call() is called for a repeat mode request. In this case, damon_call() returns instead of waiting for the handling when the request registration succeeds and it sees the kdamond is running. However, if the request also has dealloc_on_cancel, the request memory would be leaked. The issue is found by sashiko [1]. Link: https://lore.kernel.org/20260327233319.3528-1-sj@kernel.org Link: https://lore.kernel.org/20260327233319.3528-2-sj@kernel.org Link: https://lore.kernel.org/20260325141956.87144-1-sj@kernel.org [1] Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.14.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
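The fix pattern shared by the damon_call() and damos_walk() patches, an "obsolete" flag guarded by the same lock that guards request registration, can be sketched in user space with pthreads. All sketch_* names are hypothetical, and the real DAMON structures and functions differ; the sketch only shows why checking the flag under the registration lock closes the window described above.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request and context types (assumptions, not DAMON's). */
struct sketch_request {
	struct sketch_request *next;
	bool canceled;
};

struct sketch_ctx {
	pthread_mutex_t lock;		/* protects queue and obsolete */
	struct sketch_request *queue;
	bool obsolete;			/* set once the worker is exiting */
};

void sketch_ctx_init(struct sketch_ctx *ctx)
{
	pthread_mutex_init(&ctx->lock, NULL);
	ctx->queue = NULL;
	ctx->obsolete = false;
}

/* Register a request only if the worker can still handle or cancel it. */
int sketch_register(struct sketch_ctx *ctx, struct sketch_request *req)
{
	int err = 0;

	pthread_mutex_lock(&ctx->lock);
	if (ctx->obsolete) {
		err = -ECANCELED;
	} else {
		req->canceled = false;
		req->next = ctx->queue;
		ctx->queue = req;
	}
	pthread_mutex_unlock(&ctx->lock);
	return err;
}

/* Worker exit path: mark obsolete first, then cancel what was queued. */
void sketch_terminate(struct sketch_ctx *ctx)
{
	struct sketch_request *req, *next;

	pthread_mutex_lock(&ctx->lock);
	ctx->obsolete = true;
	req = ctx->queue;
	ctx->queue = NULL;
	pthread_mutex_unlock(&ctx->lock);

	for (; req; req = next) {
		next = req->next;	/* read before the owner may reuse req */
		req->canceled = true;
	}
}
```

Because the flag and the queue change under one lock, a request is either queued before the flag is set (and is then cancelled) or rejected outright; no request can be queued after cancellation and wait forever.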
2 daysmm: zswap: tie per-CPU acomp_ctx lifetime to the poolKanchana P. Sridhar
Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU hotplug, and destroyed on pool destruction or CPU hotunplug. This complicates the lifetime management to save memory while a CPU is offlined, which is not very common. Simplify lifetime management by allocating per-CPU acomp_ctx once on pool creation (or CPU hotplug for CPUs onlined later), and keeping them allocated until the pool is destroyed. Refactor cleanup code from zswap_cpu_comp_dead() into acomp_ctx_free() to be used elsewhere. The main benefit of using the CPU hotplug multi state instance startup callback to allocate the acomp_ctx resources is that it prevents the cores from being offlined until the multi state instance addition call returns. From Documentation/core-api/cpu_hotplug.rst: "The node list add/remove operations and the callback invocations are serialized against CPU hotplug operations." Furthermore, zswap_[de]compress() cannot contend with zswap_cpu_comp_prepare() because: - During pool creation/deletion, the pool is not in the zswap_pools list. - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed out. zswap_cpu_comp_prepare() will be run on a control CPU, since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum cpuhp_state". In both these cases, any recursions into zswap reclaim from zswap_cpu_comp_prepare() will be handled by the old pool. The above two observations enable the following simplifications: 1) zswap_cpu_comp_prepare(): a) acomp_ctx mutex locking: If the process gets migrated while zswap_cpu_comp_prepare() is running, it will complete on the new CPU. In case of failures, we pass the acomp_ctx pointer obtained at the start of zswap_cpu_comp_prepare() to acomp_ctx_free(), which again, can only undergo migration. There appear to be no contention scenarios that might cause inconsistent values of acomp_ctx's members. Hence, it seems there is no need for mutex_lock(&acomp_ctx->mutex) in zswap_cpu_comp_prepare(). 
b) acomp_ctx mutex initialization: Since the pool is not yet on zswap_pools list, we don't need to initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This has been restored to occur in zswap_cpu_comp_prepare(). c) Subsequent CPU offline-online transitions: zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is valid. If so, it returns success. This should handle any CPU hotplug online-offline transitions after pool creation is done. 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is migrated to another CPU before the current CPU is dysfunctional. If zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined CPU, that mutex will be released once it completes on the new CPU. Since there is no teardown callback, there is no possibility of UAF. 3) Pool creation/deletion and process migration to another CPU: During pool creation/deletion, the pool is not in the zswap_pools list. Hence it cannot contend with zswap ops on that CPU. However, the process can get migrated. a) Pool creation --> zswap_cpu_comp_prepare() --> process migrated: * Old CPU offline: no-op. * zswap_cpu_comp_prepare() continues to run on the new CPU to finish allocating acomp_ctx resources for the offlined CPU. b) Pool deletion --> acomp_ctx_free() --> process migrated: * Old CPU offline: no-op. * acomp_ctx_free() continues to run on the new CPU to finish de-allocating acomp_ctx resources for the offlined CPU. 4) Pool deletion vis-a-vis CPU onlining: The call to cpuhp_state_remove_instance() cannot race with zswap_cpu_comp_prepare() because of hotplug synchronization. The current acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock() are deleted. Instead, zswap_[de]compress() directly call mutex_[un]lock(&acomp_ctx->mutex). The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU offlining, and only deleting them when the pool is destroyed, is 8.28 KB on x86_64. This cost is only paid when a CPU is offlined, until it is onlined again. 
Link: https://lore.kernel.org/20260331183351.29844-3-kanchanapsridhar2026@gmail.com Co-developed-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com> Signed-off-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosry@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
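Point 1c of the message above (the prepare callback returns success immediately when the context is already set up, and resources are freed only at pool destruction) can be pictured with a small user-space sketch. The sketch_* names, the single buffer member, and the buffer size are hypothetical; the real acomp_ctx holds crypto API objects and a mutex that this sketch leaves out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-CPU context with one resource (assumption). */
struct sketch_acomp_ctx {
	void *buffer;
};

#define SKETCH_BUF_SIZE 8192	/* arbitrary size for the sketch */

/*
 * Called on pool creation and on every CPU online transition.
 * If the context survived an earlier offline, there is nothing to do.
 */
int sketch_cpu_prepare(struct sketch_acomp_ctx *ctx)
{
	if (ctx->buffer)
		return 0;

	ctx->buffer = malloc(SKETCH_BUF_SIZE);
	if (!ctx->buffer)
		return -ENOMEM;
	return 0;
}

/* Called only when the pool is destroyed, never on CPU offline. */
void sketch_ctx_free(struct sketch_acomp_ctx *ctx)
{
	free(ctx->buffer);
	ctx->buffer = NULL;
}
```

The cost of this design is that the resources of an offlined CPU stay allocated until the CPU returns or the pool goes away, which is the per-CPU memory trade-off the message quantifies.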
2 daysmm: zswap: remove redundant checks in zswap_cpu_comp_dead()Kanchana P. Sridhar
Patch series "zswap pool per-CPU acomp_ctx simplifications", v3. This patchset first removes redundant checks on the acomp_ctx and its "req" member in zswap_cpu_comp_dead(). Next, it persists the zswap pool's per-CPU acomp_ctx resources to last until the pool is destroyed. It then simplifies the per-CPU acomp_ctx mutex locking in zswap_compress()/zswap_decompress(). Code comments are added after allocation and before checking to deallocate the per-CPU acomp_ctx's members, based on expected crypto API return values and zswap changes this patchset makes. Patch 2 is an independent submission of patch 23 from [1], to facilitate merging. This patch (of 2): There are presently redundant checks on the per-CPU acomp_ctx and its "req" member in zswap_cpu_comp_dead(): redundant because they are inconsistent with zswap_pool_create() handling of failure in allocating the acomp_ctx, and with the expected NULL return value from the acomp_request_alloc() API when it fails to allocate an acomp_req. Fix these by converting them to be NULL checks. Add comments in zswap_cpu_comp_prepare() clarifying the expected return values of the crypto_alloc_acomp_node() and acomp_request_alloc() API. Link: https://lore.kernel.org/20260331183351.29844-2-kanchanapsridhar2026@gmail.com Link: https://patchwork.kernel.org/project/linux-mm/list/?series=1046677 Signed-off-by: Kanchana P. Sridhar <kanchanapsridhar2026@gmail.com> Suggested-by: Yosry Ahmed <yosry@kernel.org> Acked-by: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmm/alloc_tag: clear codetag for pages allocated before page_ext initializationHao Ge
Due to initialization ordering, page_ext is allocated and initialized relatively late during boot. Some pages have already been allocated and freed before page_ext becomes available, leaving their codetag uninitialized. A clear example is in init_section_page_ext(): alloc_page_ext() calls kmemleak_alloc(). If the slab cache has no free objects, it falls back to the buddy allocator to allocate memory. However, at this point page_ext is not yet fully initialized, so these newly allocated pages have no codetag set. These pages may later be reclaimed by KASAN, which causes the warning to trigger when they are freed because their codetag ref is still empty. Use a global array to track pages allocated before page_ext is fully initialized. The array size is fixed at 8192 entries, and will emit a warning if this limit is exceeded. When page_ext initialization completes, set their codetag to empty to avoid warnings when they are freed later. This warning is only observed with CONFIG_MEM_ALLOC_PROFILING_DEBUG=Y and mem_profiling_compressed disabled: [ 9.582133] ------------[ cut here ]------------ [ 9.582137] alloc_tag was not set [ 9.582139] WARNING: ./include/linux/alloc_tag.h:164 at __pgalloc_tag_sub+0x40f/0x550, CPU#5: systemd/1 [ 9.582190] CPU: 5 UID: 0 PID: 1 Comm: systemd Not tainted 7.0.0-rc4 #1 PREEMPT(lazy) [ 9.582192] Hardware name: Red Hat KVM, BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 9.582194] RIP: 0010:__pgalloc_tag_sub+0x40f/0x550 [ 9.582196] Code: 00 00 4c 29 e5 48 8b 05 1f 88 56 05 48 8d 4c ad 00 48 8d 2c c8 e9 87 fd ff ff 0f 0b 0f 0b e9 f3 fe ff ff 48 8d 3d 61 2f ed 03 <67> 48 0f b9 3a e9 b3 fd ff ff 0f 0b eb e4 e8 5e cd 14 02 4c 89 c7 [ 9.582197] RSP: 0018:ffffc9000001f940 EFLAGS: 00010246 [ 9.582200] RAX: dffffc0000000000 RBX: 1ffff92000003f2b RCX: 1ffff110200d806c [ 9.582201] RDX: ffff8881006c0360 RSI: 0000000000000004 RDI: ffffffff9bc7b460 [ 9.582202] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff3a62324 [ 9.582203] 
R10: ffffffff9d311923 R11: 0000000000000000 R12: ffffea0004001b00 [ 9.582204] R13: 0000000000002000 R14: ffffea0000000000 R15: ffff8881006c0360 [ 9.582206] FS: 00007ffbbcf2d940(0000) GS:ffff888450479000(0000) knlGS:0000000000000000 [ 9.582208] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.582210] CR2: 000055ee3aa260d0 CR3: 0000000148b67005 CR4: 0000000000770ef0 [ 9.582211] PKRU: 55555554 [ 9.582212] Call Trace: [ 9.582213] <TASK> [ 9.582214] ? __pfx___pgalloc_tag_sub+0x10/0x10 [ 9.582216] ? check_bytes_and_report+0x68/0x140 [ 9.582219] __free_frozen_pages+0x2e4/0x1150 [ 9.582221] ? __free_slab+0xc2/0x2b0 [ 9.582224] qlist_free_all+0x4c/0xf0 [ 9.582227] kasan_quarantine_reduce+0x15d/0x180 [ 9.582229] __kasan_slab_alloc+0x69/0x90 [ 9.582232] kmem_cache_alloc_noprof+0x14a/0x500 [ 9.582234] do_getname+0x96/0x310 [ 9.582237] do_readlinkat+0x91/0x2f0 [ 9.582239] ? __pfx_do_readlinkat+0x10/0x10 [ 9.582240] ? get_random_bytes_user+0x1df/0x2c0 [ 9.582244] __x64_sys_readlinkat+0x96/0x100 [ 9.582246] do_syscall_64+0xce/0x650 [ 9.582250] ? __x64_sys_getrandom+0x13a/0x1e0 [ 9.582252] ? __pfx___x64_sys_getrandom+0x10/0x10 [ 9.582254] ? do_syscall_64+0x114/0x650 [ 9.582255] ? ksys_read+0xfc/0x1d0 [ 9.582258] ? __pfx_ksys_read+0x10/0x10 [ 9.582260] ? do_syscall_64+0x114/0x650 [ 9.582262] ? do_syscall_64+0x114/0x650 [ 9.582264] ? __pfx_fput_close_sync+0x10/0x10 [ 9.582266] ? file_close_fd_locked+0x178/0x2a0 [ 9.582268] ? __x64_sys_faccessat2+0x96/0x100 [ 9.582269] ? __x64_sys_close+0x7d/0xd0 [ 9.582271] ? do_syscall_64+0x114/0x650 [ 9.582273] ? do_syscall_64+0x114/0x650 [ 9.582275] ? clear_bhb_loop+0x50/0xa0 [ 9.582277] ? 
clear_bhb_loop+0x50/0xa0 [ 9.582279] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 9.582280] RIP: 0033:0x7ffbbda345ee [ 9.582282] Code: 0f 1f 40 00 48 8b 15 29 38 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 0b 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fa 37 0d 00 f7 d8 64 89 01 48 [ 9.582284] RSP: 002b:00007ffe2ad8de58 EFLAGS: 00000202 ORIG_RAX: 000000000000010b [ 9.582286] RAX: ffffffffffffffda RBX: 000055ee3aa25570 RCX: 00007ffbbda345ee [ 9.582287] RDX: 000055ee3aa25570 RSI: 00007ffe2ad8dee0 RDI: 00000000ffffff9c [ 9.582288] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000001001 [ 9.582289] R10: 0000000000001000 R11: 0000000000000202 R12: 0000000000000033 [ 9.582290] R13: 00007ffe2ad8dee0 R14: 00000000ffffff9c R15: 00007ffe2ad8deb0 [ 9.582292] </TASK> [ 9.582293] ---[ end trace 0000000000000000 ]--- Link: https://lore.kernel.org/20260331081312.123719-1-hao.ge@linux.dev Fixes: dcfe378c81f72 ("lib: introduce support for page allocation tagging") Signed-off-by: Hao Ge <hao.ge@linux.dev> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
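The tracking scheme in the message above (a fixed global array of early pages, a warning when it overflows, and a pass over the array once page_ext is ready) can be sketched as follows. Only the 8192-entry limit comes from the commit message; the sketch_* names, the use of page frame numbers as keys, and the counter that stands in for "set the codetag to empty" are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_EARLY_MAX 8192	/* array size stated in the commit message */

static unsigned long sketch_early_pfns[SKETCH_EARLY_MAX];
static size_t sketch_early_count;
static bool sketch_overflow_warned;
static bool sketch_ext_ready;	/* stands in for "page_ext is initialized" */
static size_t sketch_marked;	/* how many pages were marked empty */

/* Record a page allocated before the metadata is ready. */
bool sketch_track_early(unsigned long pfn)
{
	if (sketch_ext_ready)
		return false;	/* normal tagging applies from here on */

	if (sketch_early_count >= SKETCH_EARLY_MAX) {
		if (!sketch_overflow_warned) {
			fprintf(stderr, "early page tracking array is full\n");
			sketch_overflow_warned = true;
		}
		return false;
	}
	sketch_early_pfns[sketch_early_count++] = pfn;
	return true;
}

/* Hypothetical marker: the kernel would set the page's codetag to empty. */
void sketch_mark_empty(unsigned long pfn)
{
	(void)pfn;
	sketch_marked++;
}

/* Run once when the metadata becomes available; returns pages processed. */
size_t sketch_finish_init(void (*mark)(unsigned long pfn))
{
	size_t i, n = sketch_early_count;

	sketch_ext_ready = true;
	for (i = 0; i < n; i++)
		mark(sketch_early_pfns[i]);
	sketch_early_count = 0;
	return n;
}
```

Pages that do not fit in the array are simply left untracked after the one-time warning, so in this sketch they would still hit the original "alloc_tag was not set" condition when freed.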
2 daysmm/vmscan: prevent MGLRU reclaim from pinning address spaceSuren Baghdasaryan
When shrinking lruvec, MGLRU pins address space before walking it. This is excessive since all it needs for walking the page range is a stable mm_struct to be able to take and release mmap_read_lock and a stable mm->mm_mt tree to walk. This address space pinning results in delays when releasing the memory of a dying process. This also prevents mm reapers (both in-kernel oom-reaper and userspace process_mrelease()) from doing their job during MGLRU scan because they check task_will_free_mem() which will yield negative result due to the elevated mm->mm_users. This affects the system in the sense that if the MM of the killed process is being reclaimed by kswapd then reapers won't be able to reap it. Even the process itself (which might have higher-priority than kswapd) will not free its memory until kswapd drops the last reference. IOW, we delay freeing the memory because kswapd is reclaiming it. In Android the visible result for us is that process_mrelease() (userspace reaper) skips MM in such cases and we see process memory not released for an unusually long time (secs). Replace unnecessary address space pinning with mm_struct pinning by replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained within mm_struct itself, therefore it won't be freed as long as mm_struct is stable and it won't change during the walk because mmap_read_lock is being held. 
Link: https://lore.kernel.org/20260322070843.941997-1-surenb@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
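The difference between the two kinds of pinning can be shown with a toy model of the two reference counts. In the kernel, mmget()/mmput() operate on mm_users (users of the address space) and mmgrab()/mmdrop() on mm_count (references to the mm_struct itself), and all mm_users together hold one mm_count reference. The sketch_* types below are hypothetical and use plain ints instead of atomics.

```c
#include <stdbool.h>

/* Toy model of an mm_struct's two counters (assumption, not kernel code). */
struct sketch_mm {
	int users;			/* models mm_users */
	int count;			/* models mm_count */
	bool address_space_alive;	/* mappings still present */
	bool struct_alive;		/* the structure itself not yet freed */
};

void sketch_mm_init(struct sketch_mm *mm)
{
	mm->users = 1;	/* the owning process */
	mm->count = 1;	/* the single reference held on behalf of all users */
	mm->address_space_alive = true;
	mm->struct_alive = true;
}

/* Pin only the structure. */
void sketch_mmgrab(struct sketch_mm *mm)
{
	mm->count++;
}

void sketch_mmdrop(struct sketch_mm *mm)
{
	if (--mm->count == 0)
		mm->struct_alive = false;
}

/* Pin the whole address space. */
void sketch_mmget(struct sketch_mm *mm)
{
	mm->users++;
}

void sketch_mmput(struct sketch_mm *mm)
{
	if (--mm->users == 0) {
		mm->address_space_alive = false;	/* memory is released */
		sketch_mmdrop(mm);
	}
}
```

In this model, a reclaimer that took sketch_mmget() keeps address_space_alive true after the owner's sketch_mmput(), which is the delay the commit describes; a reclaimer that took sketch_mmgrab() only keeps struct_alive true, so the memory goes away as soon as the owner exits.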
2 daysmm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRUBaolin Wang
The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1. See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback on traditional hierarchies"). Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no longer attempt to write back filesystem folios through reclaim. On large memory systems, the flusher may not be able to write back quickly enough. Consequently, MGLRU will encounter many folios that are already under writeback. Since we cannot reclaim these dirty folios, the system may run out of memory and trigger the OOM killer. Hence, for cgroup v1, let's throttle reclaim after waking up the flusher, which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1"), to avoid unnecessary OOM. The following test program can easily reproduce the OOM issue. With this patch applied, the test passes successfully. $mkdir /sys/fs/cgroup/memory/test $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800 Link: https://lore.kernel.org/3445af0f09e8ca945492e052e82594f8c4f2e2f6.1774606060.git.baolin.wang@linux.alibaba.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Kairui Song <kasong@tencent.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 daysmemfd: implement get_id for memfd_luoPasha Tatashin
Memfds are identified by their underlying inode. Implement get_id for memfd_luo to return the inode pointer. This prevents the same memfd from being managed twice by LUO if the same inode is pointed to by multiple file objects. Link: https://lore.kernel.org/20260326163943.574070-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 dayskho: persist blob size in KHO FDTBreno Leitao
kho_add_subtree() accepts a size parameter but only forwards it to debugfs. The size is not persisted in the KHO FDT, so it is lost across kexec. This makes it impossible for the incoming kernel to determine the blob size without understanding the blob format. Store the blob size as a "blob-size" property in the KHO FDT alongside the "preserved-data" physical address. This allows the receiving kernel to recover the size for any blob regardless of format. Also extend kho_retrieve_subtree() with an optional size output parameter so callers can learn the blob size without needing to understand the blob format. Update all callers to pass NULL for the new parameter. Link: https://lore.kernel.org/20260316-kho-v9-3-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
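The idea of storing the size next to the address, and of letting callers opt out of the size through a NULL output pointer, can be sketched with a small in-memory table. The table stands in for the KHO FDT; the sketch_* names and signatures are hypothetical and are not those of kho_add_subtree() or kho_retrieve_subtree().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_SUBTREES 8	/* arbitrary capacity for the sketch */

/* One record per blob: address and size are kept side by side. */
struct sketch_subtree {
	const char *name;
	uint64_t phys;	/* plays the role of "preserved-data" */
	uint64_t size;	/* plays the role of "blob-size" */
};

static struct sketch_subtree sketch_table[SKETCH_MAX_SUBTREES];
static size_t sketch_nr_subtrees;

int sketch_add_subtree(const char *name, uint64_t phys, uint64_t size)
{
	if (sketch_nr_subtrees == SKETCH_MAX_SUBTREES)
		return -ENOSPC;
	sketch_table[sketch_nr_subtrees].name = name;
	sketch_table[sketch_nr_subtrees].phys = phys;
	sketch_table[sketch_nr_subtrees].size = size;
	sketch_nr_subtrees++;
	return 0;
}

/* size may be NULL when the caller does not need it. */
int sketch_retrieve_subtree(const char *name, uint64_t *phys, uint64_t *size)
{
	size_t i;

	for (i = 0; i < sketch_nr_subtrees; i++) {
		if (strcmp(sketch_table[i].name, name))
			continue;
		*phys = sketch_table[i].phys;
		if (size)
			*size = sketch_table[i].size;
		return 0;
	}
	return -ENOENT;
}
```

Making the size output optional is what lets existing callers be updated mechanically, by passing NULL, without changing their behavior.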