| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-15 12:59:16 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-15 12:59:16 -0700 |
| commit | 334fbe734e687404f346eba7d5d96ed2b44d35ab (patch) | |
| tree | 65d5c8f4de18335209b2529146e6b06960a48b43 /mm | |
| parent | 5bdb4078e1efba9650c03753616866192d680718 (diff) | |
| parent | 3bac01168982ec3e3bf87efdc1807c7933590a85 (diff) | |
Merge tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "maple_tree: Replace big node with maple copy" (Liam Howlett)
Mainly preparatory work for ongoing development, but it does reduce
stack usage and is an improvement.
- "mm, swap: swap table phase III: remove swap_map" (Kairui Song)
Offers memory savings by removing the static swap_map. It also yields
some CPU savings and implements several cleanups.
- "mm: memfd_luo: preserve file seals" (Pratyush Yadav)
Add file seal preservation to LUO's memfd code
- "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
Chen)
Add userspace stats reporting to zswap
- "arch, mm: consolidate empty_zero_page" (Mike Rapoport)
Some cleanups for our handling of ZERO_PAGE() and zero_pfn
- "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
Han)
A robustness improvement and some cleanups in the kmemleak code
- "Improve khugepaged scan logic" (Vernon Yang)
Improve khugepaged scan logic and reduce CPU consumption by
prioritizing scanning tasks that access memory frequently
- "Make KHO Stateless" (Jason Miu)
Simplify Kexec Handover by transitioning KHO from an xarray-based
metadata tracking system with serialization to a radix tree data
structure that can be passed directly to the next kernel
- "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
Ballasi and Steven Rostedt)
Enhance vmscan's tracepointing
- "mm: arch/shstk: Common shadow stack mapping helper and
VM_NOHUGEPAGE" (Catalin Marinas)
Cleanup for the shadow stack code: remove per-arch code in favour of
a generic implementation
- "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)
Fix a WARN() which can be emitted when KHO restores a vmalloc area
- "mm: Remove stray references to pagevec" (Tal Zussman)
Several cleanups, mainly updating references to "struct pagevec",
which became folio_batch three years ago
- "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
Shutsemau)
Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
pages encode their relationship to the head page
- "mm/damon/core: improve DAMOS quota efficiency for core layer
filters" (SeongJae Park)
Improve two problematic behaviors of DAMOS that make it less
efficient when core layer filters are used
- "mm/damon: strictly respect min_nr_regions" (SeongJae Park)
Improve DAMON usability by extending the treatment of the
min_nr_regions user-settable parameter
- "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)
The proper fix for a previously hotfixed SMP=n issue. Code
simplifications and cleanups ensued
- "mm: cleanups around unmapping / zapping" (David Hildenbrand)
A bunch of cleanups around unmapping and zapping. Mostly
simplifications, code movements, documentation and renaming of
zapping functions
- "support batched checking of the young flag for MGLRU" (Baolin Wang)
Batched checking of the young flag for MGLRU. It's partly cleanups; one
benchmark shows large performance benefits for arm64
- "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)
memcg cleanup and robustness improvements
- "Allow order zero pages in page reporting" (Yuvraj Sakshith)
Enhance free page reporting - it presently, and undesirably, leaves out
order-0 pages when reporting free memory.
- "mm: vma flag tweaks" (Lorenzo Stoakes)
Cleanup work following from the recent conversion of the VMA flags to
a bitmap
- "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
Park)
Add some more developer-facing debug checks into DAMON core
- "mm/damon: test and document power-of-2 min_region_sz requirement"
(SeongJae Park)
Add a DAMON kunit test and make some adjustments to the
addr_unit parameter handling
- "mm/damon/core: make passed_sample_intervals comparisons
overflow-safe" (SeongJae Park)
Fix a hard-to-hit time overflow issue in DAMON core
- "mm/damon: improve/fixup/update ratio calculation, test and
documentation" (SeongJae Park)
A batch of misc/minor improvements and fixups for DAMON
- "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
Hildenbrand)
Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
movement was required.
- "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)
A somewhat random mix of fixups, recompression cleanups and
improvements in the zram code
- "mm/damon: support multiple goal-based quota tuning algorithms"
(SeongJae Park)
Extend DAMOS quotas goal auto-tuning to support multiple tuning
algorithms that users can select
- "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)
Fix the khugepaged sysfs handling so we no longer spam the logs with
reams of junk when starting/stopping khugepaged
- "mm: improve map count checks" (Lorenzo Stoakes)
Provide some cleanups and slight fixes in the mremap, mmap and vma
code
- "mm/damon: support addr_unit on default monitoring targets for
modules" (SeongJae Park)
Extend the use of DAMON core's addr_unit tunable
- "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)
Cleanups to khugepaged, which also serve as a base for Nico's planned
khugepaged mTHP support
- "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)
Code movement and cleanups in the memhotplug and sparsemem code
- "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
CONFIG_MIGRATION" (David Hildenbrand)
Rationalize some memhotplug Kconfig support
- "change young flag check functions to return bool" (Baolin Wang)
Cleanups to change all young flag check functions to return bool
- "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
Law and SeongJae Park)
Fix a few potential DAMON bugs
- "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
Stoakes)
Convert a lot of the existing use of the legacy vm_flags_t data type
to the new vma_flags_t type which replaces it. Mainly in the vma
code.
- "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)
Expand the mmap_prepare functionality, which is intended to replace
the deprecated f_op->mmap hook which has been the source of bugs and
security issues for some time. Cleanups, documentation, extension of
mmap_prepare into filesystem drivers
- "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)
Simplify and clean up zap_huge_pmd(). Additional cleanups around
vm_normal_folio_pmd() and the softleaf functionality are performed.
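As an illustration of the "make passed_sample_intervals comparisons
overflow-safe" item above: the diff below swaps plain `<` and `>=` tests on
the sample-interval counters for time_before() and time_after_eq(). The
following is a minimal user-space sketch of that idiom, not the kernel
macros themselves. The helper names counter_after_eq(), counter_before()
and counter_wrap_demo() are made up for this example, and the sketch
assumes two's complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * User-space stand-ins for the kernel's time_after_eq()/time_before().
 * The names counter_after_eq() and counter_before() are hypothetical.
 * The unsigned subtraction wraps modulo 2^N; read as a signed value
 * (two's complement assumed), the difference is non-negative when a is
 * at b or ahead of it by less than half the counter range.
 */
static bool counter_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static bool counter_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Shows where a plain comparison and the wrap-safe one disagree. */
static void counter_wrap_demo(void)
{
	unsigned long deadline = ULONG_MAX - 1;	/* set just before the wrap */
	unsigned long now = deadline + 4;	/* wraps around to 2 */

	assert(now == 2);
	assert(!(now >= deadline));		/* plain compare: looks not due */
	assert(counter_after_eq(now, deadline));	/* wrap-safe: due */
	assert(counter_before(deadline, now));
}
```

The plain comparison only goes wrong once the counter wraps, which is why
the summary calls the issue hard to hit.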
* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm: fix deferred split queue races during migration
mm/khugepaged: fix issue with tracking lock
mm/huge_memory: add and use has_deposited_pgtable()
mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
mm/huge_memory: separate out the folio part of zap_huge_pmd()
mm/huge_memory: use mm instead of tlb->mm
mm/huge_memory: remove unnecessary sanity checks
mm/huge_memory: deduplicate zap deposited table call
mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
mm/huge_memory: add a common exit path to zap_huge_pmd()
mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
mm/huge: avoid big else branch in zap_huge_pmd()
mm/huge_memory: simplify vma_is_specal_huge()
mm: on remap assert that input range within the proposed VMA
mm: add mmap_action_map_kernel_pages[_full]()
uio: replace deprecated mmap hook with mmap_prepare in uio_info
drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
mm: allow handling of stacked mmap_prepare hooks in more drivers
...
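For the "mm/damon: strictly respect min_nr_regions" item in the pull
summary, the splitting step can be pictured with a small user-space sketch.
It is only an approximation of the idea described in the
damon_apply_min_nr_regions() comment in the diff below: compute a per-region
size limit as the total size divided by min_nr_regions, round it up to the
minimum region size, then split anything larger. The struct range type, the
align_up() helper, the split_for_min_nr() function, the output-array
interface and the clamp of a zero limit are all assumptions made for this
example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical half-open address range [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

/* Round x up to a multiple of a; a must be a power of two. */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Copy @in to @out, splitting every range that is larger than
 * total_size / min_nr rounded up to @min_sz.  Returns the number of ranges
 * written, or 0 if @out (capacity @cap) is too small.
 */
static size_t split_for_min_nr(const struct range *in, size_t n_in,
			       unsigned long min_nr, unsigned long min_sz,
			       struct range *out, size_t cap)
{
	unsigned long total = 0, limit;
	size_t i, n_out = 0;

	assert(min_nr > 0 && min_sz > 0 && (min_sz & (min_sz - 1)) == 0);

	for (i = 0; i < n_in; i++)
		total += in[i].end - in[i].start;

	limit = align_up(total / min_nr, min_sz);
	if (limit == 0)		/* assumption: never go below min_sz */
		limit = min_sz;

	for (i = 0; i < n_in; i++) {
		unsigned long s = in[i].start;

		/* Peel off limit-sized pieces while the remainder is too big. */
		while (in[i].end - s > limit) {
			if (n_out == cap)
				return 0;
			out[n_out++] = (struct range){ s, s + limit };
			s += limit;
		}
		if (n_out == cap)
			return 0;
		out[n_out++] = (struct range){ s, in[i].end };
	}
	return n_out;
}
```

With one 409600-byte range, min_nr of 10 and min_sz of 4096, the sketch
yields ten 40960-byte ranges. Because the limit is rounded up, the resulting
count can fall short of min_nr when the sizes do not divide evenly; the
sketch makes no stronger guarantee than the size limit itself.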
Diffstat (limited to 'mm')
85 files changed, 4026 insertions, 3266 deletions
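Several hunks in the mm/damon/core.c diff below replace expressions of the
form `x * 10000 / y` with mult_frac(x, 10000, y). The sketch below shows, in
user space, why dividing first helps. It follows the usual quotient and
remainder decomposition and is not a copy of the kernel macro; the names
mult_frac_u64() and naive_scale() are made up for this example. Note that
the remainder term can still overflow when the divisor itself is very large.

```c
#include <assert.h>
#include <stdint.h>

/* Naive form: the multiplication can wrap before the division happens. */
static uint64_t naive_scale(uint64_t x, uint64_t n, uint64_t d)
{
	assert(d != 0);
	return x * n / d;
}

/*
 * Divide-first form: x * n / d computed as (x / d) * n + (x % d) * n / d.
 * The intermediate products stay small as long as the final result and
 * (d - 1) * n both fit in 64 bits.
 */
static uint64_t mult_frac_u64(uint64_t x, uint64_t n, uint64_t d)
{
	uint64_t q, r;

	assert(d != 0);
	q = x / d;	/* whole multiples of d in x */
	r = x % d;	/* what is left over, always < d */
	return q * n + r * n / d;
}
```

For example, scaling 2^62 by 10000 / 2^40 wraps to 0 in the naive form,
while the divide-first form returns 41943040000.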
diff --git a/mm/Kconfig b/mm/Kconfig index 67a72fe89186..0a43bb80df4f 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -466,14 +466,11 @@ config HAVE_BOOTMEM_INFO_NODE config ARCH_ENABLE_MEMORY_HOTPLUG bool -config ARCH_ENABLE_MEMORY_HOTREMOVE - bool - # eventually, we can have this option just 'select SPARSEMEM' menuconfig MEMORY_HOTPLUG bool "Memory hotplug" select MEMORY_ISOLATION - depends on SPARSEMEM + depends on SPARSEMEM_VMEMMAP depends on ARCH_ENABLE_MEMORY_HOTPLUG depends on 64BIT select NUMA_KEEP_MEMINFO if NUMA @@ -541,8 +538,8 @@ endchoice config MEMORY_HOTREMOVE bool "Allow for memory hot remove" select HAVE_BOOTMEM_INFO_NODE if (X86_64 || PPC64) - depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE - depends on MIGRATION + depends on MEMORY_HOTPLUG + select MIGRATION config MHP_MEMMAP_ON_MEMORY def_bool y @@ -631,20 +628,20 @@ config PAGE_REPORTING those pages to another entity, such as a hypervisor, so that the memory can be freed within the host for other uses. -# -# support for page migration -# -config MIGRATION - bool "Page migration" +config NUMA_MIGRATION + bool "NUMA page migration" default y - depends on (NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA) && MMU - help - Allows the migration of the physical location of pages of processes - while the virtual addresses are not changed. This is useful in - two situations. The first is on NUMA systems to put pages nearer - to the processors accessing. The second is when allocating huge - pages as migration can relocate pages to satisfy a huge page - allocation instead of reclaiming. + depends on NUMA && MMU + select MIGRATION + help + Support the migration of pages to other NUMA nodes, available to + user space through interfaces like migrate_pages(), move_pages(), + and mbind(). Selecting this option also enables support for page + demotion for memory tiering. 
+ +config MIGRATION + bool + depends on MMU config DEVICE_MIGRATION def_bool MIGRATION && ZONE_DEVICE diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c index b0e2a9fa641f..3d7675a3ae04 100644 --- a/mm/bootmem_info.c +++ b/mm/bootmem_info.c @@ -40,57 +40,20 @@ void put_page_bootmem(struct page *page) } } -#ifndef CONFIG_SPARSEMEM_VMEMMAP static void __init register_page_bootmem_info_section(unsigned long start_pfn) { unsigned long mapsize, section_nr, i; struct mem_section *ms; - struct page *page, *memmap; - struct mem_section_usage *usage; - - section_nr = pfn_to_section_nr(start_pfn); - ms = __nr_to_section(section_nr); - - /* Get section's memmap address */ - memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); - - /* - * Get page for the memmap's phys address - * XXX: need more consideration for sparse_vmemmap... - */ - page = virt_to_page(memmap); - mapsize = sizeof(struct page) * PAGES_PER_SECTION; - mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT; - - /* remember memmap's page */ - for (i = 0; i < mapsize; i++, page++) - get_page_bootmem(section_nr, page, SECTION_INFO); - - usage = ms->usage; - page = virt_to_page(usage); - - mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; - - for (i = 0; i < mapsize; i++, page++) - get_page_bootmem(section_nr, page, MIX_SECTION_INFO); - -} -#else /* CONFIG_SPARSEMEM_VMEMMAP */ -static void __init register_page_bootmem_info_section(unsigned long start_pfn) -{ - unsigned long mapsize, section_nr, i; - struct mem_section *ms; - struct page *page, *memmap; struct mem_section_usage *usage; + struct page *page; + start_pfn = SECTION_ALIGN_DOWN(start_pfn); section_nr = pfn_to_section_nr(start_pfn); ms = __nr_to_section(section_nr); - memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); - if (!preinited_vmemmap_section(ms)) - register_page_bootmem_memmap(section_nr, memmap, - PAGES_PER_SECTION); + register_page_bootmem_memmap(section_nr, pfn_to_page(start_pfn), + PAGES_PER_SECTION); usage = 
ms->usage; page = virt_to_page(usage); @@ -100,7 +63,6 @@ static void __init register_page_bootmem_info_section(unsigned long start_pfn) for (i = 0; i < mapsize; i++, page++) get_page_bootmem(section_nr, page, MIX_SECTION_INFO); } -#endif /* !CONFIG_SPARSEMEM_VMEMMAP */ void __init register_page_bootmem_info_node(struct pglist_data *pgdat) { diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig index 8c868f7035fc..34631a44cdec 100644 --- a/mm/damon/Kconfig +++ b/mm/damon/Kconfig @@ -12,6 +12,17 @@ config DAMON See https://www.kernel.org/doc/html/latest/mm/damon/index.html for more information. +config DAMON_DEBUG_SANITY + bool "Check sanity of DAMON code" + depends on DAMON + help + This enables additional DAMON debugging-purpose sanity checks in + DAMON code. This can be useful for finding bugs, but impose + additional overhead. This is therefore recommended to be enabled on + only development and test setups. + + If unsure, say N. + config DAMON_KUNIT_TEST bool "Test for damon" if !KUNIT_ALL_TESTS depends on DAMON && KUNIT=y diff --git a/mm/damon/core.c b/mm/damon/core.c index 3e1890d64d06..db6c67e52d2b 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -109,6 +109,17 @@ int damon_select_ops(struct damon_ctx *ctx, enum damon_ops_id id) return err; } +#ifdef CONFIG_DAMON_DEBUG_SANITY +static void damon_verify_new_region(unsigned long start, unsigned long end) +{ + WARN_ONCE(start >= end, "start %lu >= end %lu\n", start, end); +} +#else +static void damon_verify_new_region(unsigned long start, unsigned long end) +{ +} +#endif + /* * Construct a damon_region struct * @@ -118,6 +129,7 @@ struct damon_region *damon_new_region(unsigned long start, unsigned long end) { struct damon_region *region; + damon_verify_new_region(start, end); region = kmem_cache_alloc(damon_region_cache, GFP_KERNEL); if (!region) return NULL; @@ -140,8 +152,21 @@ void damon_add_region(struct damon_region *r, struct damon_target *t) t->nr_regions++; } +#ifdef CONFIG_DAMON_DEBUG_SANITY +static 
void damon_verify_del_region(struct damon_target *t) +{ + WARN_ONCE(t->nr_regions == 0, "t->nr_regions == 0\n"); +} +#else +static void damon_verify_del_region(struct damon_target *t) +{ +} +#endif + static void damon_del_region(struct damon_region *r, struct damon_target *t) { + damon_verify_del_region(t); + list_del(&r->list); t->nr_regions--; } @@ -362,6 +387,11 @@ void damos_destroy_quota_goal(struct damos_quota_goal *g) damos_free_quota_goal(g); } +static bool damos_quota_goals_empty(struct damos_quota *q) +{ + return list_empty(&q->goals); +} + /* initialize fields of @quota that normally API users wouldn't set */ static struct damos_quota *damos_quota_init(struct damos_quota *quota) { @@ -520,8 +550,27 @@ void damon_destroy_target(struct damon_target *t, struct damon_ctx *ctx) damon_free_target(t); } +#ifdef CONFIG_DAMON_DEBUG_SANITY +static void damon_verify_nr_regions(struct damon_target *t) +{ + struct damon_region *r; + unsigned int count = 0; + + damon_for_each_region(r, t) + count++; + WARN_ONCE(count != t->nr_regions, "t->nr_regions (%u) != count (%u)\n", + t->nr_regions, count); +} +#else +static void damon_verify_nr_regions(struct damon_target *t) +{ +} +#endif + unsigned int damon_nr_regions(struct damon_target *t) { + damon_verify_nr_regions(t); + return t->nr_regions; } @@ -621,7 +670,7 @@ static unsigned int damon_accesses_bp_to_nr_accesses( static unsigned int damon_nr_accesses_to_accesses_bp( unsigned int nr_accesses, struct damon_attrs *attrs) { - return nr_accesses * 10000 / damon_max_nr_accesses(attrs); + return mult_frac(nr_accesses, 10000, damon_max_nr_accesses(attrs)); } static unsigned int damon_nr_accesses_for_new_attrs(unsigned int nr_accesses, @@ -707,8 +756,16 @@ static bool damon_valid_intervals_goal(struct damon_attrs *attrs) * @ctx: monitoring context * @attrs: monitoring attributes * - * This function should be called while the kdamond is not running, an access - * check results aggregation is not ongoing (e.g., from 
damon_call(). + * This function updates monitoring results and next monitoring/damos operation + * schedules. Because those are periodically updated by kdamond, this should + * be called from a safe contexts. Such contexts include damon_ctx setup time + * while the kdamond is not yet started, and inside of kdamond_fn(). + * + * In detail, all DAMON API callers directly call this function for initial + * setup of damon_ctx before calling damon_start(). Some of the API callers + * also indirectly call this function via damon_call() -> damon_commit() for + * online parameters updates. Finally, kdamond_fn() itself use this for + * applying auto-tuned monitoring intervals. * * Every time interval is in micro-seconds. * @@ -860,6 +917,7 @@ static int damos_commit_quota(struct damos_quota *dst, struct damos_quota *src) err = damos_commit_quota_goals(dst, src); if (err) return err; + dst->goal_tuner = src->goal_tuner; dst->weight_sz = src->weight_sz; dst->weight_nr_accesses = src->weight_nr_accesses; dst->weight_age = src->weight_age; @@ -1002,6 +1060,23 @@ static void damos_set_filters_default_reject(struct damos *s) damos_filters_default_reject(&s->ops_filters); } +/* + * damos_commit_dests() - Copy migration destinations from @src to @dst. + * @dst: Destination structure to update. + * @src: Source structure to copy from. + * + * If the number of destinations has changed, the old arrays in @dst are freed + * and new ones are allocated. On success, @dst contains a full copy of + * @src's arrays and count. + * + * On allocation failure, @dst is left in a partially torn-down state: its + * arrays may be NULL and @nr_dests may not reflect the actual allocation + * sizes. The structure remains safe to deallocate via damon_destroy_scheme(), + * but callers must not reuse @dst for further commits — it should be + * discarded. + * + * Return: 0 on success, -ENOMEM on allocation failure. 
+ */ static int damos_commit_dests(struct damos_migrate_dests *dst, struct damos_migrate_dests *src) { @@ -1316,6 +1391,40 @@ static unsigned long damon_region_sz_limit(struct damon_ctx *ctx) return sz; } +static void damon_split_region_at(struct damon_target *t, + struct damon_region *r, unsigned long sz_r); + +/* + * damon_apply_min_nr_regions() - Make effect of min_nr_regions parameter. + * @ctx: monitoring context. + * + * This function implement min_nr_regions (minimum number of damon_region + * objects in the given monitoring context) behavior. It first calculates + * maximum size of each region for enforcing the min_nr_regions as total size + * of the regions divided by the min_nr_regions. After that, this function + * splits regions to ensure all regions are equal to or smaller than the size + * limit. Finally, this function returns the maximum size limit. + * + * Returns: maximum size of each region for convincing min_nr_regions. + */ +static unsigned long damon_apply_min_nr_regions(struct damon_ctx *ctx) +{ + unsigned long max_region_sz = damon_region_sz_limit(ctx); + struct damon_target *t; + struct damon_region *r, *next; + + max_region_sz = ALIGN(max_region_sz, ctx->min_region_sz); + damon_for_each_target(t, ctx) { + damon_for_each_region_safe(r, next, t) { + while (damon_sz_region(r) > max_region_sz) { + damon_split_region_at(t, r, max_region_sz); + r = damon_next_region(r); + } + } + } + return max_region_sz; +} + static int kdamond_fn(void *data); /* @@ -1590,6 +1699,23 @@ static void damon_warn_fix_nr_accesses_corruption(struct damon_region *r) r->nr_accesses_bp = r->nr_accesses * 10000; } +#ifdef CONFIG_DAMON_DEBUG_SANITY +static void damon_verify_reset_aggregated(struct damon_region *r, + struct damon_ctx *c) +{ + WARN_ONCE(r->nr_accesses_bp != r->last_nr_accesses * 10000, + "nr_accesses_bp %u last_nr_accesses %u sis %lu %lu\n", + r->nr_accesses_bp, r->last_nr_accesses, + c->passed_sample_intervals, c->next_aggregation_sis); +} +#else +static 
void damon_verify_reset_aggregated(struct damon_region *r, + struct damon_ctx *c) +{ +} +#endif + + /* * Reset the aggregated monitoring results ('nr_accesses' of each region). */ @@ -1606,6 +1732,7 @@ static void kdamond_reset_aggregated(struct damon_ctx *c) damon_warn_fix_nr_accesses_corruption(r); r->last_nr_accesses = r->nr_accesses; r->nr_accesses = 0; + damon_verify_reset_aggregated(r, c); } ti++; } @@ -1628,7 +1755,7 @@ static unsigned long damon_get_intervals_score(struct damon_ctx *c) } target_access_events = max_access_events * goal_bp / 10000; target_access_events = target_access_events ? : 1; - return access_events * 10000 / target_access_events; + return mult_frac(access_events, 10000, target_access_events); } static unsigned long damon_feed_loop_next_input(unsigned long last_input, @@ -1672,9 +1799,6 @@ static void kdamond_tune_intervals(struct damon_ctx *c) damon_set_attrs(c, &new_attrs); } -static void damon_split_region_at(struct damon_target *t, - struct damon_region *r, unsigned long sz_r); - static bool __damos_valid_target(struct damon_region *r, struct damos *s) { unsigned long sz; @@ -1689,15 +1813,27 @@ static bool __damos_valid_target(struct damon_region *r, struct damos *s) r->age <= s->pattern.max_age_region; } -static bool damos_valid_target(struct damon_ctx *c, struct damon_target *t, - struct damon_region *r, struct damos *s) +/* + * damos_quota_is_set() - Return if the given quota is actually set. + * @quota: The quota to check. + * + * Returns true if the quota is set, false otherwise. 
+ */ +static bool damos_quota_is_set(struct damos_quota *quota) +{ + return quota->esz || quota->sz || quota->ms || + !damos_quota_goals_empty(quota); +} + +static bool damos_valid_target(struct damon_ctx *c, struct damon_region *r, + struct damos *s) { bool ret = __damos_valid_target(r, s); - if (!ret || !s->quota.esz || !c->ops.get_scheme_score) + if (!ret || !damos_quota_is_set(&s->quota) || !c->ops.get_scheme_score) return ret; - return c->ops.get_scheme_score(c, t, r, s) >= s->quota.min_score; + return c->ops.get_scheme_score(c, r, s) >= s->quota.min_score; } /* @@ -1717,17 +1853,18 @@ static bool damos_valid_target(struct damon_ctx *c, struct damon_target *t, * This function checks if a given region should be skipped or not for the * reason. If only the starting part of the region has previously charged, * this function splits the region into two so that the second one covers the - * area that not charged in the previous charge widnow and saves the second - * region in *rp and returns false, so that the caller can apply DAMON action - * to the second one. + * area that not charged in the previous charge widnow, and return true. The + * caller can see the second one on the next iteration of the region walk. + * Note that this means the caller should use damon_for_each_region() instead + * of damon_for_each_region_safe(). If damon_for_each_region_safe() is used, + * the second region will just be ignored. * - * Return: true if the region should be entirely skipped, false otherwise. + * Return: true if the region should be skipped, false otherwise. 
*/ static bool damos_skip_charged_region(struct damon_target *t, - struct damon_region **rp, struct damos *s, + struct damon_region *r, struct damos *s, unsigned long min_region_sz) { - struct damon_region *r = *rp; struct damos_quota *quota = &s->quota; unsigned long sz_to_skip; @@ -1754,8 +1891,7 @@ static bool damos_skip_charged_region(struct damon_target *t, sz_to_skip = min_region_sz; } damon_split_region_at(t, r, sz_to_skip); - r = damon_next_region(r); - *rp = r; + return true; } quota->charge_target_from = NULL; quota->charge_addr_from = 0; @@ -1964,7 +2100,8 @@ static void damos_apply_scheme(struct damon_ctx *c, struct damon_target *t, } if (c->ops.apply_scheme) { - if (quota->esz && quota->charged_sz + sz > quota->esz) { + if (damos_quota_is_set(quota) && + quota->charged_sz + sz > quota->esz) { sz = ALIGN_DOWN(quota->esz - quota->charged_sz, c->min_region_sz); if (!sz) @@ -1983,7 +2120,8 @@ static void damos_apply_scheme(struct damon_ctx *c, struct damon_target *t, quota->total_charged_ns += timespec64_to_ns(&end) - timespec64_to_ns(&begin); quota->charged_sz += sz; - if (quota->esz && quota->charged_sz >= quota->esz) { + if (damos_quota_is_set(quota) && + quota->charged_sz >= quota->esz) { quota->charge_target_from = t; quota->charge_addr_from = r->ar.end + 1; } @@ -2004,24 +2142,25 @@ static void damon_do_apply_schemes(struct damon_ctx *c, damon_for_each_scheme(s, c) { struct damos_quota *quota = &s->quota; - if (c->passed_sample_intervals < s->next_apply_sis) + if (time_before(c->passed_sample_intervals, s->next_apply_sis)) continue; if (!s->wmarks.activated) continue; /* Check the quota */ - if (quota->esz && quota->charged_sz >= quota->esz) + if (damos_quota_is_set(quota) && + quota->charged_sz >= quota->esz) continue; - if (damos_skip_charged_region(t, &r, s, c->min_region_sz)) + if (damos_skip_charged_region(t, r, s, c->min_region_sz)) continue; if (s->max_nr_snapshots && s->max_nr_snapshots <= s->stat.nr_snapshots) continue; - if 
(damos_valid_target(c, t, r, s)) + if (damos_valid_target(c, r, s)) damos_apply_scheme(c, t, r, s); if (damon_is_last_region(r, t)) @@ -2111,7 +2250,7 @@ static __kernel_ulong_t damos_get_node_mem_bp( numerator = i.totalram - i.freeram; else /* DAMOS_QUOTA_NODE_MEM_FREE_BP */ numerator = i.freeram; - return numerator * 10000 / i.totalram; + return mult_frac(numerator, 10000, i.totalram); } static unsigned long damos_get_node_memcg_used_bp( @@ -2144,7 +2283,7 @@ static unsigned long damos_get_node_memcg_used_bp( numerator = used_pages; else /* DAMOS_QUOTA_NODE_MEMCG_FREE_BP */ numerator = i.totalram - used_pages; - return numerator * 10000 / i.totalram; + return mult_frac(numerator, 10000, i.totalram); } #else static __kernel_ulong_t damos_get_node_mem_bp( @@ -2174,8 +2313,8 @@ static unsigned int damos_get_in_active_mem_bp(bool active_ratio) global_node_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE); total = active + inactive; if (active_ratio) - return active * 10000 / total; - return inactive * 10000 / total; + return mult_frac(active, 10000, total); + return mult_frac(inactive, 10000, total); } static void damos_set_quota_goal_current_value(struct damos_quota_goal *goal) @@ -2218,13 +2357,33 @@ static unsigned long damos_quota_score(struct damos_quota *quota) damos_for_each_quota_goal(goal, quota) { damos_set_quota_goal_current_value(goal); highest_score = max(highest_score, - goal->current_value * 10000 / - goal->target_value); + mult_frac(goal->current_value, 10000, + goal->target_value)); } return highest_score; } +static void damos_goal_tune_esz_bp_consist(struct damos_quota *quota) +{ + unsigned long score = damos_quota_score(quota); + + quota->esz_bp = damon_feed_loop_next_input( + max(quota->esz_bp, 10000UL), score); +} + +static void damos_goal_tune_esz_bp_temporal(struct damos_quota *quota) +{ + unsigned long score = damos_quota_score(quota); + + if (score >= 10000) + quota->esz_bp = 0; + else if (quota->sz) + quota->esz_bp = quota->sz * 10000; + else + 
quota->esz_bp = ULONG_MAX; +} + /* * Called only if quota->ms, or quota->sz are set, or quota->goals is not empty */ @@ -2239,18 +2398,17 @@ static void damos_set_effective_quota(struct damos_quota *quota) } if (!list_empty(&quota->goals)) { - unsigned long score = damos_quota_score(quota); - - quota->esz_bp = damon_feed_loop_next_input( - max(quota->esz_bp, 10000UL), - score); + if (quota->goal_tuner == DAMOS_QUOTA_GOAL_TUNER_CONSIST) + damos_goal_tune_esz_bp_consist(quota); + else if (quota->goal_tuner == DAMOS_QUOTA_GOAL_TUNER_TEMPORAL) + damos_goal_tune_esz_bp_temporal(quota); esz = quota->esz_bp / 10000; } if (quota->ms) { if (quota->total_charged_ns) - throughput = mult_frac(quota->total_charged_sz, 1000000, - quota->total_charged_ns); + throughput = mult_frac(quota->total_charged_sz, + 1000000, quota->total_charged_ns); else throughput = PAGE_SIZE * 1024; esz = min(throughput * quota->ms, esz); @@ -2296,7 +2454,8 @@ static void damos_adjust_quota(struct damon_ctx *c, struct damos *s) /* New charge window starts */ if (time_after_eq(jiffies, quota->charged_from + msecs_to_jiffies(quota->reset_interval))) { - if (quota->esz && quota->charged_sz >= quota->esz) + if (damos_quota_is_set(quota) && + quota->charged_sz >= quota->esz) s->stat.qt_exceeds++; quota->total_charged_sz += quota->charged_sz; quota->charged_from = jiffies; @@ -2319,7 +2478,9 @@ static void damos_adjust_quota(struct damon_ctx *c, struct damos *s) damon_for_each_region(r, t) { if (!__damos_valid_target(r, s)) continue; - score = c->ops.get_scheme_score(c, t, r, s); + if (damos_core_filter_out(c, t, r, s)) + continue; + score = c->ops.get_scheme_score(c, r, s); c->regions_score_histogram[score] += damon_sz_region(r); if (score > max_score) @@ -2355,14 +2516,12 @@ static void damos_trace_stat(struct damon_ctx *c, struct damos *s) static void kdamond_apply_schemes(struct damon_ctx *c) { struct damon_target *t; - struct damon_region *r, *next_r; + struct damon_region *r; struct damos *s; - unsigned 
long sample_interval = c->attrs.sample_interval ? - c->attrs.sample_interval : 1; bool has_schemes_to_apply = false; damon_for_each_scheme(s, c) { - if (c->passed_sample_intervals < s->next_apply_sis) + if (time_before(c->passed_sample_intervals, s->next_apply_sis)) continue; if (!s->wmarks.activated) @@ -2381,23 +2540,36 @@ static void kdamond_apply_schemes(struct damon_ctx *c) if (c->ops.target_valid && c->ops.target_valid(t) == false) continue; - damon_for_each_region_safe(r, next_r, t) + damon_for_each_region(r, t) damon_do_apply_schemes(c, t, r); } damon_for_each_scheme(s, c) { - if (c->passed_sample_intervals < s->next_apply_sis) + if (time_before(c->passed_sample_intervals, s->next_apply_sis)) continue; damos_walk_complete(c, s); - s->next_apply_sis = c->passed_sample_intervals + - (s->apply_interval_us ? s->apply_interval_us : - c->attrs.aggr_interval) / sample_interval; + damos_set_next_apply_sis(s, c); s->last_applied = NULL; damos_trace_stat(c, s); } mutex_unlock(&c->walk_control_lock); } +#ifdef CONFIG_DAMON_DEBUG_SANITY +static void damon_verify_merge_two_regions( + struct damon_region *l, struct damon_region *r) +{ + /* damon_merge_two_regions() may created incorrect left region */ + WARN_ONCE(l->ar.start >= l->ar.end, "l: %lu-%lu, r: %lu-%lu\n", + l->ar.start, l->ar.end, r->ar.start, r->ar.end); +} +#else +static void damon_verify_merge_two_regions( + struct damon_region *l, struct damon_region *r) +{ +} +#endif + /* * Merge two adjacent regions into one region */ @@ -2411,9 +2583,24 @@ static void damon_merge_two_regions(struct damon_target *t, l->nr_accesses_bp = l->nr_accesses * 10000; l->age = (l->age * sz_l + r->age * sz_r) / (sz_l + sz_r); l->ar.end = r->ar.end; + damon_verify_merge_two_regions(l, r); damon_destroy_region(r, t); } +#ifdef CONFIG_DAMON_DEBUG_SANITY +static void damon_verify_merge_regions_of(struct damon_region *r) +{ + WARN_ONCE(r->nr_accesses != r->nr_accesses_bp / 10000, + "nr_accesses (%u) != nr_accesses_bp (%u)\n", + 
r->nr_accesses, r->nr_accesses_bp); +} +#else +static void damon_verify_merge_regions_of(struct damon_region *r) +{ +} +#endif + + /* * Merge adjacent regions having similar access frequencies * @@ -2427,6 +2614,7 @@ static void damon_merge_regions_of(struct damon_target *t, unsigned int thres, struct damon_region *r, *prev = NULL, *next; damon_for_each_region_safe(r, next, t) { + damon_verify_merge_regions_of(r); if (abs(r->nr_accesses - r->last_nr_accesses) > thres) r->age = 0; else if ((r->nr_accesses == 0) != (r->last_nr_accesses == 0)) @@ -2480,6 +2668,21 @@ static void kdamond_merge_regions(struct damon_ctx *c, unsigned int threshold, threshold / 2 < max_thres); } +#ifdef CONFIG_DAMON_DEBUG_SANITY +static void damon_verify_split_region_at(struct damon_region *r, + unsigned long sz_r) +{ + WARN_ONCE(sz_r == 0 || sz_r >= damon_sz_region(r), + "sz_r: %lu r: %lu-%lu (%lu)\n", + sz_r, r->ar.start, r->ar.end, damon_sz_region(r)); +} +#else +static void damon_verify_split_region_at(struct damon_region *r, + unsigned long sz_r) +{ +} +#endif + /* * Split a region in two * @@ -2491,6 +2694,7 @@ static void damon_split_region_at(struct damon_target *t, { struct damon_region *new; + damon_verify_split_region_at(r, sz_r); new = damon_new_region(r->ar.start + sz_r, r->ar.end); if (!new) return; @@ -2722,7 +2926,6 @@ static void kdamond_init_ctx(struct damon_ctx *ctx) { unsigned long sample_interval = ctx->attrs.sample_interval ? ctx->attrs.sample_interval : 1; - unsigned long apply_interval; struct damos *scheme; ctx->passed_sample_intervals = 0; @@ -2733,9 +2936,7 @@ static void kdamond_init_ctx(struct damon_ctx *ctx) ctx->attrs.intervals_goal.aggrs; damon_for_each_scheme(scheme, ctx) { - apply_interval = scheme->apply_interval_us ? 
-			scheme->apply_interval_us : ctx->attrs.aggr_interval;
-		scheme->next_apply_sis = apply_interval / sample_interval;
+		damos_set_next_apply_sis(scheme, ctx);
 		damos_set_filters_default_reject(scheme);
 	}
 }
@@ -2761,7 +2962,7 @@ static int kdamond_fn(void *data)
 	if (!ctx->regions_score_histogram)
 		goto done;
-	sz_limit = damon_region_sz_limit(ctx);
+	sz_limit = damon_apply_min_nr_regions(ctx);
 	while (!kdamond_need_stop(ctx)) {
 		/*
@@ -2786,10 +2987,14 @@ static int kdamond_fn(void *data)
 		if (ctx->ops.check_accesses)
 			max_nr_accesses = ctx->ops.check_accesses(ctx);
-		if (ctx->passed_sample_intervals >= next_aggregation_sis)
+		if (time_after_eq(ctx->passed_sample_intervals,
+					next_aggregation_sis)) {
 			kdamond_merge_regions(ctx,
 					max_nr_accesses / 10,
 					sz_limit);
+			/* online updates might be made */
+			sz_limit = damon_apply_min_nr_regions(ctx);
+		}
 		/*
 		 * do kdamond_call() and kdamond_apply_schemes() after
@@ -2805,10 +3010,12 @@ static int kdamond_fn(void *data)
 		sample_interval = ctx->attrs.sample_interval ?
 			ctx->attrs.sample_interval : 1;
-		if (ctx->passed_sample_intervals >= next_aggregation_sis) {
+		if (time_after_eq(ctx->passed_sample_intervals,
+					next_aggregation_sis)) {
 			if (ctx->attrs.intervals_goal.aggrs &&
-					ctx->passed_sample_intervals >=
-					ctx->next_intervals_tune_sis) {
+					time_after_eq(
+						ctx->passed_sample_intervals,
+						ctx->next_intervals_tune_sis)) {
 				/*
 				 * ctx->next_aggregation_sis might be updated
 				 * from kdamond_call(). In the case,
In the case, @@ -2842,13 +3049,13 @@ static int kdamond_fn(void *data) kdamond_split_regions(ctx); } - if (ctx->passed_sample_intervals >= next_ops_update_sis) { + if (time_after_eq(ctx->passed_sample_intervals, + next_ops_update_sis)) { ctx->next_ops_update_sis = next_ops_update_sis + ctx->attrs.ops_update_interval / sample_interval; if (ctx->ops.update) ctx->ops.update(ctx); - sz_limit = damon_region_sz_limit(ctx); } } done: @@ -2874,31 +3081,43 @@ done: static int walk_system_ram(struct resource *res, void *arg) { - struct damon_addr_range *a = arg; + struct resource *a = arg; - if (a->end - a->start < resource_size(res)) { + if (resource_size(a) < resource_size(res)) { a->start = res->start; a->end = res->end; } return 0; } +static unsigned long damon_res_to_core_addr(resource_size_t ra, + unsigned long addr_unit) +{ + /* + * Use div_u64() for avoiding linking errors related with __udivdi3, + * __aeabi_uldivmod, or similar problems. This should also improve the + * performance optimization (read div_u64() comment for the detail). + */ + if (sizeof(ra) == 8 && sizeof(addr_unit) == 4) + return div_u64(ra, addr_unit); + return ra / addr_unit; +} + /* * Find biggest 'System RAM' resource and store its start and end address in * @start and @end, respectively. If no System RAM is found, returns false. 
 */
 static bool damon_find_biggest_system_ram(unsigned long *start,
-		unsigned long *end)
+		unsigned long *end, unsigned long addr_unit)
 {
-	struct damon_addr_range arg = {};
+	struct resource res = {};
-	walk_system_ram_res(0, ULONG_MAX, &arg, walk_system_ram);
-	if (arg.end <= arg.start)
+	walk_system_ram_res(0, -1, &res, walk_system_ram);
+	*start = damon_res_to_core_addr(res.start, addr_unit);
+	*end = damon_res_to_core_addr(res.end + 1, addr_unit);
+	if (*end <= *start)
 		return false;
-
-	*start = arg.start;
-	*end = arg.end;
 	return true;
 }
@@ -2908,6 +3127,7 @@ static bool damon_find_biggest_system_ram(unsigned long *start,
  * @t: The monitoring target to set the region.
  * @start: The pointer to the start address of the region.
  * @end: The pointer to the end address of the region.
+ * @addr_unit: The address unit for the damon_ctx of @t.
  * @min_region_sz: Minimum region size.
  *
  * This function sets the region of @t as requested by @start and @end. If the
@@ -2920,7 +3140,7 @@ static bool damon_find_biggest_system_ram(unsigned long *start,
  */
 int damon_set_region_biggest_system_ram_default(struct damon_target *t,
 		unsigned long *start, unsigned long *end,
-		unsigned long min_region_sz)
+		unsigned long addr_unit, unsigned long min_region_sz)
 {
 	struct damon_addr_range addr_range;
@@ -2928,7 +3148,7 @@ int damon_set_region_biggest_system_ram_default(struct damon_target *t,
 		return -EINVAL;
 	if (!*start && !*end &&
-		!damon_find_biggest_system_ram(start, end))
+		!damon_find_biggest_system_ram(start, end, addr_unit))
 		return -EINVAL;
 	addr_range.start = *start;
diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 7bc5c0b2aea3..554559d72976 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -291,12 +291,6 @@ static int damon_lru_sort_apply_parameters(void)
 	if (err)
 		return err;
-	/*
-	 * If monitor_region_start/end are unset, always silently
-	 * reset addr_unit to 1.
-	 */
-	if (!monitor_region_start && !monitor_region_end)
-		addr_unit = 1;
 	param_ctx->addr_unit = addr_unit;
 	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);
@@ -345,6 +339,7 @@ static int damon_lru_sort_apply_parameters(void)
 	err = damon_set_region_biggest_system_ram_default(param_target,
 					&monitor_region_start,
 					&monitor_region_end,
+					param_ctx->addr_unit,
 					param_ctx->min_region_sz);
 	if (err)
 		goto out;
diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index a218d9922234..8c6d613425c1 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -90,7 +90,7 @@ void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr
 		return;
 	if (likely(pmd_present(pmdval)))
-		young |= pmdp_clear_young_notify(vma, addr, pmd);
+		young |= pmdp_test_and_clear_young(vma, addr, pmd);
 	young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + HPAGE_PMD_SIZE);
 	if (young)
 		folio_set_young(folio);
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 9bfe48826840..5cdcc5037cbc 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -343,8 +343,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
 }
 static int damon_pa_scheme_score(struct damon_ctx *context,
-		struct damon_target *t, struct damon_region *r,
-		struct damos *scheme)
+		struct damon_region *r, struct damos *scheme)
 {
 	switch (scheme->action) {
 	case DAMOS_PAGEOUT:
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 43d76f5bed44..86da14778658 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -201,12 +201,6 @@ static int damon_reclaim_apply_parameters(void)
 	if (err)
 		return err;
-	/*
-	 * If monitor_region_start/end are unset, always silently
-	 * reset addr_unit to 1.
-	 */
-	if (!monitor_region_start && !monitor_region_end)
-		addr_unit = 1;
 	param_ctx->addr_unit = addr_unit;
 	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);
@@ -251,6 +245,7 @@ static int damon_reclaim_apply_parameters(void)
 	err = damon_set_region_biggest_system_ram_default(param_target,
 					&monitor_region_start,
 					&monitor_region_end,
+					param_ctx->addr_unit,
 					param_ctx->min_region_sz);
 	if (err)
 		goto out;
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 3a0782e576fa..5186966dafb3 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1488,6 +1488,7 @@ struct damon_sysfs_quotas {
 	unsigned long sz;
 	unsigned long reset_interval_ms;
 	unsigned long effective_sz; /* Effective size quota in bytes */
+	enum damos_quota_goal_tuner goal_tuner;
 };
 static struct damon_sysfs_quotas *damon_sysfs_quotas_alloc(void)
@@ -1610,6 +1611,58 @@ static ssize_t effective_bytes_show(struct kobject *kobj,
 	return sysfs_emit(buf, "%lu\n", quotas->effective_sz);
 }
+struct damos_sysfs_qgoal_tuner_name {
+	enum damos_quota_goal_tuner tuner;
+	char *name;
+};
+
+static struct damos_sysfs_qgoal_tuner_name damos_sysfs_qgoal_tuner_names[] = {
+	{
+		.tuner = DAMOS_QUOTA_GOAL_TUNER_CONSIST,
+		.name = "consist",
+	},
+	{
+		.tuner = DAMOS_QUOTA_GOAL_TUNER_TEMPORAL,
+		.name = "temporal",
+	},
+};
+
+static ssize_t goal_tuner_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(damos_sysfs_qgoal_tuner_names); i++) {
+		struct damos_sysfs_qgoal_tuner_name *tuner_name;
+
+		tuner_name = &damos_sysfs_qgoal_tuner_names[i];
+		if (tuner_name->tuner == quotas->goal_tuner)
+			return sysfs_emit(buf, "%s\n", tuner_name->name);
+	}
+	return -EINVAL;
+}
+
+static ssize_t goal_tuner_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damon_sysfs_quotas *quotas = container_of(kobj,
+			struct damon_sysfs_quotas, kobj);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(damos_sysfs_qgoal_tuner_names); i++) {
+		struct damos_sysfs_qgoal_tuner_name *tuner_name;
+
+		tuner_name = &damos_sysfs_qgoal_tuner_names[i];
+		if (sysfs_streq(buf, tuner_name->name)) {
+			quotas->goal_tuner = tuner_name->tuner;
+			return count;
+		}
+	}
+	return -EINVAL;
+}
+
 static void damon_sysfs_quotas_release(struct kobject *kobj)
 {
 	kfree(container_of(kobj, struct damon_sysfs_quotas, kobj));
@@ -1627,11 +1680,15 @@ static struct kobj_attribute damon_sysfs_quotas_reset_interval_ms_attr =
 static struct kobj_attribute damon_sysfs_quotas_effective_bytes_attr =
 		__ATTR_RO_MODE(effective_bytes, 0400);
+static struct kobj_attribute damon_sysfs_quotas_goal_tuner_attr =
+		__ATTR_RW_MODE(goal_tuner, 0600);
+
 static struct attribute *damon_sysfs_quotas_attrs[] = {
 	&damon_sysfs_quotas_ms_attr.attr,
 	&damon_sysfs_quotas_sz_attr.attr,
 	&damon_sysfs_quotas_reset_interval_ms_attr.attr,
 	&damon_sysfs_quotas_effective_bytes_attr.attr,
+	&damon_sysfs_quotas_goal_tuner_attr.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(damon_sysfs_quotas);
@@ -2718,6 +2775,7 @@ static struct damos *damon_sysfs_mk_scheme(
 		.weight_sz = sysfs_weights->sz,
 		.weight_nr_accesses = sysfs_weights->nr_accesses,
 		.weight_age = sysfs_weights->age,
+		.goal_tuner = sysfs_quotas->goal_tuner,
 	};
 	struct damos_watermarks wmarks = {
 		.metric = sysfs_wmarks->metric,
diff --git a/mm/damon/tests/.kunitconfig b/mm/damon/tests/.kunitconfig
index 36a450f57b58..144d27e6ecc5 100644
--- a/mm/damon/tests/.kunitconfig
+++ b/mm/damon/tests/.kunitconfig
@@ -13,3 +13,6 @@ CONFIG_DAMON_VADDR_KUNIT_TEST=y
 CONFIG_SYSFS=y
 CONFIG_DAMON_SYSFS=y
 CONFIG_DAMON_SYSFS_KUNIT_TEST=y
+
+# enable DAMON_DEBUG_SANITY to catch any bug
+CONFIG_DAMON_DEBUG_SANITY=y
diff --git a/mm/damon/tests/core-kunit.h b/mm/damon/tests/core-kunit.h
index 596f33ec2d81..9e5904c2beeb 100644
--- a/mm/damon/tests/core-kunit.h
+++ b/mm/damon/tests/core-kunit.h
@@ -693,6 +693,7 @@ static void damos_test_commit_quota(struct kunit *test)
 		.reset_interval = 1,
 		.ms = 2,
 		.sz = 3,
+		.goal_tuner = DAMOS_QUOTA_GOAL_TUNER_CONSIST,
 		.weight_sz = 4,
 		.weight_nr_accesses = 5,
 		.weight_age = 6,
@@ -701,6 +702,7 @@ static void damos_test_commit_quota(struct kunit *test)
 		.reset_interval = 7,
 		.ms = 8,
 		.sz = 9,
+		.goal_tuner = DAMOS_QUOTA_GOAL_TUNER_TEMPORAL,
 		.weight_sz = 10,
 		.weight_nr_accesses = 11,
 		.weight_age = 12,
@@ -714,6 +716,7 @@ static void damos_test_commit_quota(struct kunit *test)
 	KUNIT_EXPECT_EQ(test, dst.reset_interval, src.reset_interval);
 	KUNIT_EXPECT_EQ(test, dst.ms, src.ms);
 	KUNIT_EXPECT_EQ(test, dst.sz, src.sz);
+	KUNIT_EXPECT_EQ(test, dst.goal_tuner, src.goal_tuner);
 	KUNIT_EXPECT_EQ(test, dst.weight_sz, src.weight_sz);
 	KUNIT_EXPECT_EQ(test, dst.weight_nr_accesses, src.weight_nr_accesses);
 	KUNIT_EXPECT_EQ(test, dst.weight_age, src.weight_age);
@@ -1057,6 +1060,27 @@ static void damon_test_commit_target_regions(struct kunit *test)
 			(unsigned long[][2]) {{3, 8}, {8, 10}}, 2);
 }
+static void damon_test_commit_ctx(struct kunit *test)
+{
+	struct damon_ctx *src, *dst;
+
+	src = damon_new_ctx();
+	if (!src)
+		kunit_skip(test, "src alloc fail");
+	dst = damon_new_ctx();
+	if (!dst) {
+		damon_destroy_ctx(src);
+		kunit_skip(test, "dst alloc fail");
+	}
+	/* Only power of two min_region_sz is allowed. */
+	src->min_region_sz = 4096;
+	KUNIT_EXPECT_EQ(test, damon_commit_ctx(dst, src), 0);
+	src->min_region_sz = 4095;
+	KUNIT_EXPECT_EQ(test, damon_commit_ctx(dst, src), -EINVAL);
+	damon_destroy_ctx(src);
+	damon_destroy_ctx(dst);
+}
+
 static void damos_test_filter_out(struct kunit *test)
 {
 	struct damon_target *t;
@@ -1239,6 +1263,79 @@ static void damon_test_set_filters_default_reject(struct kunit *test)
 	damos_free_filter(target_filter);
 }
+static void damon_test_apply_min_nr_regions_for(struct kunit *test,
+		unsigned long sz_regions, unsigned long min_region_sz,
+		unsigned long min_nr_regions,
+		unsigned long max_region_sz_expect,
+		unsigned long nr_regions_expect)
+{
+	struct damon_ctx *ctx;
+	struct damon_target *t;
+	struct damon_region *r;
+	unsigned long max_region_size;
+
+	ctx = damon_new_ctx();
+	if (!ctx)
+		kunit_skip(test, "ctx alloc fail\n");
+	t = damon_new_target();
+	if (!t) {
+		damon_destroy_ctx(ctx);
+		kunit_skip(test, "target alloc fail\n");
+	}
+	damon_add_target(ctx, t);
+	r = damon_new_region(0, sz_regions);
+	if (!r) {
+		damon_destroy_ctx(ctx);
+		kunit_skip(test, "region alloc fail\n");
+	}
+	damon_add_region(r, t);
+
+	ctx->min_region_sz = min_region_sz;
+	ctx->attrs.min_nr_regions = min_nr_regions;
+	max_region_size = damon_apply_min_nr_regions(ctx);
+
+	KUNIT_EXPECT_EQ(test, max_region_size, max_region_sz_expect);
+	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), nr_regions_expect);
+
+	damon_destroy_ctx(ctx);
+}
+
+static void damon_test_apply_min_nr_regions(struct kunit *test)
+{
+	/* common, expected setup */
+	damon_test_apply_min_nr_regions_for(test, 10, 1, 10, 1, 10);
+	/* no zero size limit */
+	damon_test_apply_min_nr_regions_for(test, 10, 1, 15, 1, 10);
+	/* max size should be aligned by min_region_sz */
+	damon_test_apply_min_nr_regions_for(test, 10, 2, 2, 6, 2);
+	/*
+	 * when min_nr_regions and min_region_sz conflicts, min_region_sz wins.
+	 */
+	damon_test_apply_min_nr_regions_for(test, 10, 2, 10, 2, 5);
+}
+
+static void damon_test_is_last_region(struct kunit *test)
+{
+	struct damon_region *r;
+	struct damon_target *t;
+	int i;
+
+	t = damon_new_target();
+	if (!t)
+		kunit_skip(test, "target alloc fail\n");
+
+	for (i = 0; i < 4; i++) {
+		r = damon_new_region(i * 2, (i + 1) * 2);
+		if (!r) {
+			damon_free_target(t);
+			kunit_skip(test, "region alloc %d fail\n", i);
+		}
+		damon_add_region(r, t);
+		KUNIT_EXPECT_TRUE(test, damon_is_last_region(r, t));
+	}
+	damon_free_target(t);
+}
+
 static struct kunit_case damon_test_cases[] = {
 	KUNIT_CASE(damon_test_target),
 	KUNIT_CASE(damon_test_regions),
@@ -1262,9 +1359,12 @@ static struct kunit_case damon_test_cases[] = {
 	KUNIT_CASE(damos_test_commit_pageout),
 	KUNIT_CASE(damos_test_commit_migrate_hot),
 	KUNIT_CASE(damon_test_commit_target_regions),
+	KUNIT_CASE(damon_test_commit_ctx),
 	KUNIT_CASE(damos_test_filter_out),
 	KUNIT_CASE(damon_test_feed_loop_next_input),
 	KUNIT_CASE(damon_test_set_filters_default_reject),
+	KUNIT_CASE(damon_test_apply_min_nr_regions),
+	KUNIT_CASE(damon_test_is_last_region),
 	{},
 };
diff --git a/mm/damon/tests/vaddr-kunit.h b/mm/damon/tests/vaddr-kunit.h
index cfae870178bf..98e734d77d51 100644
--- a/mm/damon/tests/vaddr-kunit.h
+++ b/mm/damon/tests/vaddr-kunit.h
@@ -252,88 +252,12 @@ static void damon_test_apply_three_regions4(struct kunit *test)
 			new_three_regions, expected, ARRAY_SIZE(expected));
 }
-static void damon_test_split_evenly_fail(struct kunit *test,
-		unsigned long start, unsigned long end, unsigned int nr_pieces)
-{
-	struct damon_target *t = damon_new_target();
-	struct damon_region *r;
-
-	if (!t)
-		kunit_skip(test, "target alloc fail");
-
-	r = damon_new_region(start, end);
-	if (!r) {
-		damon_free_target(t);
-		kunit_skip(test, "region alloc fail");
-	}
-
-	damon_add_region(r, t);
-	KUNIT_EXPECT_EQ(test,
-			damon_va_evenly_split_region(t, r, nr_pieces), -EINVAL);
-	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 1u);
-
-	damon_for_each_region(r, t) {
-		KUNIT_EXPECT_EQ(test, r->ar.start, start);
-		KUNIT_EXPECT_EQ(test, r->ar.end, end);
-	}
-
-	damon_free_target(t);
-}
-
-static void damon_test_split_evenly_succ(struct kunit *test,
-		unsigned long start, unsigned long end, unsigned int nr_pieces)
-{
-	struct damon_target *t = damon_new_target();
-	struct damon_region *r;
-	unsigned long expected_width = (end - start) / nr_pieces;
-	unsigned long i = 0;
-
-	if (!t)
-		kunit_skip(test, "target alloc fail");
-	r = damon_new_region(start, end);
-	if (!r) {
-		damon_free_target(t);
-		kunit_skip(test, "region alloc fail");
-	}
-	damon_add_region(r, t);
-	KUNIT_EXPECT_EQ(test,
-			damon_va_evenly_split_region(t, r, nr_pieces), 0);
-	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), nr_pieces);
-
-	damon_for_each_region(r, t) {
-		if (i == nr_pieces - 1) {
-			KUNIT_EXPECT_EQ(test,
-				r->ar.start, start + i * expected_width);
-			KUNIT_EXPECT_EQ(test, r->ar.end, end);
-			break;
-		}
-		KUNIT_EXPECT_EQ(test,
-				r->ar.start, start + i++ * expected_width);
-		KUNIT_EXPECT_EQ(test, r->ar.end, start + i * expected_width);
-	}
-	damon_free_target(t);
-}
-
-static void damon_test_split_evenly(struct kunit *test)
-{
-	KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(NULL, NULL, 5),
-			-EINVAL);
-
-	damon_test_split_evenly_fail(test, 0, 100, 0);
-	damon_test_split_evenly_succ(test, 0, 100, 10);
-	damon_test_split_evenly_succ(test, 5, 59, 5);
-	damon_test_split_evenly_succ(test, 4, 6, 1);
-	damon_test_split_evenly_succ(test, 0, 3, 2);
-	damon_test_split_evenly_fail(test, 5, 6, 2);
-}
-
 static struct kunit_case damon_test_cases[] = {
 	KUNIT_CASE(damon_test_three_regions_in_vmas),
 	KUNIT_CASE(damon_test_apply_three_regions1),
 	KUNIT_CASE(damon_test_apply_three_regions2),
 	KUNIT_CASE(damon_test_apply_three_regions3),
 	KUNIT_CASE(damon_test_apply_three_regions4),
-	KUNIT_CASE(damon_test_split_evenly),
 	{},
 };
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 729b7ffd3565..b069dbc7e3d2 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -53,52 +53,6 @@ static struct mm_struct *damon_get_mm(struct damon_target *t)
 	return mm;
 }
-/*
- * Functions for the initial monitoring target regions construction
- */
-
-/*
- * Size-evenly split a region into 'nr_pieces' small regions
- *
- * Returns 0 on success, or negative error code otherwise.
- */
-static int damon_va_evenly_split_region(struct damon_target *t,
-		struct damon_region *r, unsigned int nr_pieces)
-{
-	unsigned long sz_orig, sz_piece, orig_end;
-	struct damon_region *n = NULL, *next;
-	unsigned long start;
-	unsigned int i;
-
-	if (!r || !nr_pieces)
-		return -EINVAL;
-
-	if (nr_pieces == 1)
-		return 0;
-
-	orig_end = r->ar.end;
-	sz_orig = damon_sz_region(r);
-	sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION_SZ);
-
-	if (!sz_piece)
-		return -EINVAL;
-
-	r->ar.end = r->ar.start + sz_piece;
-	next = damon_next_region(r);
-	for (start = r->ar.end, i = 1; i < nr_pieces; start += sz_piece, i++) {
-		n = damon_new_region(start, start + sz_piece);
-		if (!n)
-			return -ENOMEM;
-		damon_insert_region(n, r, next, t);
-		r = n;
-	}
-	/* complement last region for possible rounding error */
-	if (n)
-		n->ar.end = orig_end;
-
-	return 0;
-}
-
 static unsigned long sz_range(struct damon_addr_range *r)
 {
 	return r->end - r->start;
@@ -240,10 +194,8 @@ static void __damon_va_init_regions(struct damon_ctx *ctx,
 		struct damon_target *t)
 {
 	struct damon_target *ti;
-	struct damon_region *r;
 	struct damon_addr_range regions[3];
-	unsigned long sz = 0, nr_pieces;
-	int i, tidx = 0;
+	int tidx = 0;
 	if (damon_va_three_regions(t, regions)) {
 		damon_for_each_target(ti, ctx) {
@@ -255,25 +207,7 @@ static void __damon_va_init_regions(struct damon_ctx *ctx,
 		return;
 	}
-	for (i = 0; i < 3; i++)
-		sz += regions[i].end - regions[i].start;
-	if (ctx->attrs.min_nr_regions)
-		sz /= ctx->attrs.min_nr_regions;
-	if (sz < DAMON_MIN_REGION_SZ)
-		sz = DAMON_MIN_REGION_SZ;
-
-	/* Set the initial three regions of the target */
-	for (i = 0; i < 3; i++) {
-		r = damon_new_region(regions[i].start, regions[i].end);
-		if (!r) {
-			pr_err("%d'th init region creation failed\n", i);
-			return;
-		}
-		damon_add_region(r, t);
-
-		nr_pieces = (regions[i].end - regions[i].start) / sz;
-		damon_va_evenly_split_region(t, r, nr_pieces);
-	}
+	damon_set_regions(t, regions, 3, DAMON_MIN_REGION_SZ);
 }
 /* Initialize '->regions_list' of every target (task) */
@@ -985,8 +919,7 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 }
 static int damon_va_scheme_score(struct damon_ctx *context,
-		struct damon_target *t, struct damon_region *r,
-		struct damos *scheme)
+		struct damon_region *r, struct damos *scheme)
 {
 	switch (scheme->action) {
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 83cf07269f13..23dc3ee09561 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -445,7 +445,7 @@ static void __init pmd_huge_tests(struct pgtable_debug_args *args)
 	 * X86 defined pmd_set_huge() verifies that the given
 	 * PMD is not a populated non-leaf entry.
 	 */
-	WRITE_ONCE(*args->pmdp, __pmd(0));
+	pmd_clear(args->pmdp);
 	WARN_ON(!pmd_set_huge(args->pmdp, __pfn_to_phys(args->fixed_pmd_pfn), args->page_prot));
 	WARN_ON(!pmd_clear_huge(args->pmdp));
 	pmd = pmdp_get(args->pmdp);
@@ -465,7 +465,7 @@ static void __init pud_huge_tests(struct pgtable_debug_args *args)
 	 * X86 defined pud_set_huge() verifies that the given
 	 * PUD is not a populated non-leaf entry.
 	 */
-	WRITE_ONCE(*args->pudp, __pud(0));
+	pud_clear(args->pudp);
 	WARN_ON(!pud_set_huge(args->pudp, __pfn_to_phys(args->fixed_pud_pfn), args->page_prot));
 	WARN_ON(!pud_clear_huge(args->pudp));
 	pud = pudp_get(args->pudp);
diff --git a/mm/execmem.c b/mm/execmem.c
index 810a4ba9c924..084a207e4278 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -203,13 +203,6 @@ static int execmem_cache_add_locked(void *ptr, size_t size, gfp_t gfp_mask)
 	return mas_store_gfp(&mas, (void *)lower, gfp_mask);
 }
-static int execmem_cache_add(void *ptr, size_t size, gfp_t gfp_mask)
-{
-	guard(mutex)(&execmem_cache.mutex);
-
-	return execmem_cache_add_locked(ptr, size, gfp_mask);
-}
-
 static bool within_range(struct execmem_range *range, struct ma_state *mas,
 		size_t size)
 {
@@ -225,18 +218,16 @@ static bool within_range(struct execmem_range *range, struct ma_state *mas,
 	return false;
 }
-static void *__execmem_cache_alloc(struct execmem_range *range, size_t size)
+static void *execmem_cache_alloc_locked(struct execmem_range *range, size_t size)
 {
 	struct maple_tree *free_areas = &execmem_cache.free_areas;
 	struct maple_tree *busy_areas = &execmem_cache.busy_areas;
 	MA_STATE(mas_free, free_areas, 0, ULONG_MAX);
 	MA_STATE(mas_busy, busy_areas, 0, ULONG_MAX);
-	struct mutex *mutex = &execmem_cache.mutex;
 	unsigned long addr, last, area_size = 0;
 	void *area, *ptr = NULL;
 	int err;
-	mutex_lock(mutex);
 	mas_for_each(&mas_free, area, ULONG_MAX) {
 		area_size = mas_range_len(&mas_free);
@@ -245,7 +236,7 @@ static void *__execmem_cache_alloc(struct execmem_range *range, size_t size)
 	}
 	if (area_size < size)
-		goto out_unlock;
+		return NULL;
 	addr = mas_free.index;
 	last = mas_free.last;
@@ -254,7 +245,7 @@ static void *__execmem_cache_alloc(struct execmem_range *range, size_t size)
 	mas_set_range(&mas_busy, addr, addr + size - 1);
 	err = mas_store_gfp(&mas_busy, (void *)addr, GFP_KERNEL);
 	if (err)
-		goto out_unlock;
+		return NULL;
 	mas_store_gfp(&mas_free, NULL, GFP_KERNEL);
 	if (area_size > size) {
@@ -268,19 +259,25 @@ static void *__execmem_cache_alloc(struct execmem_range *range, size_t size)
 		err = mas_store_gfp(&mas_free, ptr, GFP_KERNEL);
 		if (err) {
 			mas_store_gfp(&mas_busy, NULL, GFP_KERNEL);
-			goto out_unlock;
+			return NULL;
 		}
 	}
 	ptr = (void *)addr;
-out_unlock:
-	mutex_unlock(mutex);
 	return ptr;
 }
-static int execmem_cache_populate(struct execmem_range *range, size_t size)
+static void *__execmem_cache_alloc(struct execmem_range *range, size_t size)
+{
+	guard(mutex)(&execmem_cache.mutex);
+
+	return execmem_cache_alloc_locked(range, size);
+}
+
+static void *execmem_cache_populate_alloc(struct execmem_range *range, size_t size)
 {
 	unsigned long vm_flags = VM_ALLOW_HUGE_VMAP;
+	struct mutex *mutex = &execmem_cache.mutex;
 	struct vm_struct *vm;
 	size_t alloc_size;
 	int err = -ENOMEM;
@@ -294,7 +291,7 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	}
 	if (!p)
-		return err;
+		return NULL;
 	vm = find_vm_area(p);
 	if (!vm)
@@ -307,33 +304,39 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	if (err)
 		goto err_free_mem;
-	err = execmem_cache_add(p, alloc_size, GFP_KERNEL);
+	/*
+	 * New memory blocks must be allocated and added to the cache
+	 * as an atomic operation, otherwise they may be consumed
+	 * by a parallel call to the execmem_cache_alloc function.
+	 */
+	mutex_lock(mutex);
+	err = execmem_cache_add_locked(p, alloc_size, GFP_KERNEL);
 	if (err)
 		goto err_reset_direct_map;
-	return 0;
+	p = execmem_cache_alloc_locked(range, size);
+
+	mutex_unlock(mutex);
+
+	return p;
 err_reset_direct_map:
+	mutex_unlock(mutex);
 	execmem_set_direct_map_valid(vm, true);
 err_free_mem:
 	vfree(p);
-	return err;
+	return NULL;
 }
 static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
 {
 	void *p;
-	int err;
 	p = __execmem_cache_alloc(range, size);
 	if (p)
 		return p;
-	err = execmem_cache_populate(range, size);
-	if (err)
-		return NULL;
-
-	return __execmem_cache_alloc(range, size);
+	return execmem_cache_populate_alloc(range, size);
 }
 static inline bool is_pending_free(void *ptr)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 67028e30aa91..b63fe21416ff 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -43,7 +43,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 		return -ESPIPE;
 	mapping = file->f_mapping;
-	if (!mapping || len < 0)
+	if (!mapping || len < 0 || offset < 0)
 		return -EINVAL;
 	bdi = inode_to_bdi(mapping->host);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3c1e785542dd..c568d9058ff8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -31,7 +31,7 @@
 #include <linux/hash.h>
 #include <linux/writeback.h>
 #include <linux/backing-dev.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
 #include <linux/security.h>
 #include <linux/cpuset.h>
 #include <linux/hugetlb.h>
@@ -18,7 +18,7 @@
 #include <linux/hugetlb.h>
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
 #include <linux/sched/mm.h>
 #include <linux/shmem_fs.h>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b298cba853ab..42c983821c03 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -100,6 +100,14 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
 }
+/* If returns true, we are unable to access the VMA's folios. */
+static bool vma_is_special_huge(const struct vm_area_struct *vma)
+{
+	if (vma_is_dax(vma))
+		return false;
+	return vma_test_any(vma, VMA_PFNMAP_BIT, VMA_MIXEDMAP_BIT);
+}
+
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 		vm_flags_t vm_flags,
 		enum tva_type type,
@@ -113,8 +121,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	/* Check the intersection of requested and supported orders. */
 	if (vma_is_anonymous(vma))
 		supported_orders = THP_ORDERS_ALL_ANON;
-	else if (vma_is_special_huge(vma))
-		supported_orders = THP_ORDERS_ALL_SPECIAL;
+	else if (vma_is_dax(vma) || vma_is_special_huge(vma))
+		supported_orders = THP_ORDERS_ALL_SPECIAL_DAX;
 	else
 		supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
@@ -316,30 +324,77 @@ static ssize_t enabled_show(struct kobject *kobj,
 	return sysfs_emit(buf, "%s\n", output);
 }
+enum anon_enabled_mode {
+	ANON_ENABLED_ALWAYS = 0,
+	ANON_ENABLED_INHERIT = 1,
+	ANON_ENABLED_MADVISE = 2,
+	ANON_ENABLED_NEVER = 3,
+};
+
+static const char * const anon_enabled_mode_strings[] = {
+	[ANON_ENABLED_ALWAYS] = "always",
+	[ANON_ENABLED_INHERIT] = "inherit",
+	[ANON_ENABLED_MADVISE] = "madvise",
+	[ANON_ENABLED_NEVER] = "never",
+};
+
+enum global_enabled_mode {
+	GLOBAL_ENABLED_ALWAYS = 0,
+	GLOBAL_ENABLED_MADVISE = 1,
+	GLOBAL_ENABLED_NEVER = 2,
+};
+
+static const char * const global_enabled_mode_strings[] = {
+	[GLOBAL_ENABLED_ALWAYS] = "always",
+	[GLOBAL_ENABLED_MADVISE] = "madvise",
+	[GLOBAL_ENABLED_NEVER] = "never",
+};
+
+static bool set_global_enabled_mode(enum global_enabled_mode mode)
+{
+	static const unsigned long thp_flags[] = {
+		TRANSPARENT_HUGEPAGE_FLAG,
+		TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	};
+	enum global_enabled_mode m;
+	bool changed = false;
+
+	for (m = 0; m < ARRAY_SIZE(thp_flags); m++) {
+		if (m == mode)
+			changed |= !test_and_set_bit(thp_flags[m],
+					&transparent_hugepage_flags);
+		else
+			changed |= test_and_clear_bit(thp_flags[m],
+					&transparent_hugepage_flags);
+	}
+
+	return changed;
+}
+
 static ssize_t enabled_store(struct kobject *kobj,
 		struct kobj_attribute *attr,
 		const char *buf, size_t count)
 {
-	ssize_t ret = count;
+	int mode;
-	if (sysfs_streq(buf, "always")) {
-		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-	} else if (sysfs_streq(buf, "madvise")) {
-		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (sysfs_streq(buf, "never")) {
-		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else
-		ret = -EINVAL;
+	mode = sysfs_match_string(global_enabled_mode_strings, buf);
+	if (mode < 0)
+		return -EINVAL;
-	if (ret > 0) {
+	if (set_global_enabled_mode(mode)) {
 		int err = start_stop_khugepaged();
+
 		if (err)
-			ret = err;
+			return err;
+	} else {
+		/*
+		 * Recalculate watermarks even when the mode didn't
+		 * change, as the previous code always called
+		 * start_stop_khugepaged() which does this internally.
+		 */
+		set_recommended_min_free_kbytes();
 	}
-	return ret;
+	return count;
 }
 static struct kobj_attribute enabled_attr = __ATTR_RW(enabled);
@@ -515,48 +570,54 @@ static ssize_t anon_enabled_show(struct kobject *kobj,
 	return sysfs_emit(buf, "%s\n", output);
 }
+static bool set_anon_enabled_mode(int order, enum anon_enabled_mode mode)
+{
+	static unsigned long *enabled_orders[] = {
+		&huge_anon_orders_always,
+		&huge_anon_orders_inherit,
+		&huge_anon_orders_madvise,
+	};
+	enum anon_enabled_mode m;
+	bool changed = false;
+
+	spin_lock(&huge_anon_orders_lock);
+	for (m = 0; m < ARRAY_SIZE(enabled_orders); m++) {
+		if (m == mode)
+			changed |= !__test_and_set_bit(order, enabled_orders[m]);
+		else
+			changed |= __test_and_clear_bit(order, enabled_orders[m]);
+	}
+	spin_unlock(&huge_anon_orders_lock);
+
+	return changed;
+}
+
 static ssize_t anon_enabled_store(struct kobject *kobj,
 		struct kobj_attribute *attr,
 		const char *buf, size_t count)
 {
 	int order = to_thpsize(kobj)->order;
-	ssize_t ret = count;
+	int mode;
-	if (sysfs_streq(buf, "always")) {
-		spin_lock(&huge_anon_orders_lock);
-		clear_bit(order, &huge_anon_orders_inherit);
-		clear_bit(order, &huge_anon_orders_madvise);
-		set_bit(order, &huge_anon_orders_always);
-		spin_unlock(&huge_anon_orders_lock);
-	} else if (sysfs_streq(buf, "inherit")) {
-		spin_lock(&huge_anon_orders_lock);
-		clear_bit(order, &huge_anon_orders_always);
-		clear_bit(order, &huge_anon_orders_madvise);
-		set_bit(order, &huge_anon_orders_inherit);
-		spin_unlock(&huge_anon_orders_lock);
-	} else if (sysfs_streq(buf, "madvise")) {
-		spin_lock(&huge_anon_orders_lock);
-		clear_bit(order, &huge_anon_orders_always);
-		clear_bit(order, &huge_anon_orders_inherit);
-		set_bit(order, &huge_anon_orders_madvise);
-		spin_unlock(&huge_anon_orders_lock);
-	} else if (sysfs_streq(buf, "never")) {
-		spin_lock(&huge_anon_orders_lock);
-		clear_bit(order, &huge_anon_orders_always);
-		clear_bit(order, &huge_anon_orders_inherit);
-		clear_bit(order, &huge_anon_orders_madvise);
-		spin_unlock(&huge_anon_orders_lock);
-	} else
-		ret = -EINVAL;
+	mode = sysfs_match_string(anon_enabled_mode_strings, buf);
+	if (mode < 0)
+		return -EINVAL;
-	if (ret > 0) {
-		int err;
+	if (set_anon_enabled_mode(order, mode)) {
+		int err = start_stop_khugepaged();
-		err = start_stop_khugepaged();
 		if (err)
-			ret = err;
+			return err;
+	} else {
+		/*
+		 * Recalculate watermarks even when the mode didn't
+		 * change, as the previous code always called
+		 * start_stop_khugepaged() which does this internally.
+		 */
+		set_recommended_min_free_kbytes();
 	}
-	return ret;
+
+	return count;
 }
 static struct kobj_attribute anon_enabled_attr =
@@ -2341,17 +2402,87 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 	mm_dec_nr_ptes(mm);
 }
-int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t pmdval, struct folio *folio, bool is_present)
+{
+	const bool is_device_private = folio_is_device_private(folio);
+
+	/* Present and device private folios are rmappable. */
+	if (is_present || is_device_private)
+		folio_remove_rmap_pmd(folio, &folio->page, vma);
+
+	if (folio_test_anon(folio)) {
+		add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	} else {
+		add_mm_counter(mm, mm_counter_file(folio),
+				-HPAGE_PMD_NR);
+
+		if (is_present && pmd_young(pmdval) &&
+				likely(vma_has_recency(vma)))
+			folio_mark_accessed(folio);
+	}
+
+	/* Device private folios are pinned. */
+	if (is_device_private)
+		folio_put(folio);
+}
+
+static struct folio *normal_or_softleaf_folio_pmd(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t pmdval, bool is_present)
+{
+	if (is_present)
+		return vm_normal_folio_pmd(vma, addr, pmdval);
+
+	if (!thp_migration_supported())
+		WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+	return pmd_to_softleaf_folio(pmdval);
+}
+
+static bool has_deposited_pgtable(struct vm_area_struct *vma, pmd_t pmdval,
+		struct folio *folio)
+{
+	/* Some architectures require unconditional depositing. */
+	if (arch_needs_pgtable_deposit())
+		return true;
+
+	/*
+	 * Huge zero always deposited except for DAX which handles itself, see
+	 * set_huge_zero_folio().
+	 */
+	if (is_huge_zero_pmd(pmdval))
+		return !vma_is_dax(vma);
+
+	/*
+	 * Otherwise, only anonymous folios are deposited, see
+	 * __do_huge_pmd_anonymous_page().
+	 */
+	return folio && folio_test_anon(folio);
+}
+
+/**
+ * zap_huge_pmd - Zap a huge THP which is of PMD size.
+ * @tlb: The MMU gather TLB state associated with the operation.
+ * @vma: The VMA containing the range to zap.
+ * @pmd: A pointer to the leaf PMD entry.
+ * @addr: The virtual address for the range to zap.
+ *
+ * Returns: %true on success, %false otherwise.
+ */
+bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		pmd_t *pmd, unsigned long addr)
 {
-	pmd_t orig_pmd;
+	struct mm_struct *mm = tlb->mm;
+	struct folio *folio = NULL;
+	bool is_present = false;
+	bool has_deposit;
 	spinlock_t *ptl;
+	pmd_t orig_pmd;
 	tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
-		return 0;
+		return false;
 	/*
 	 * For architectures like ppc64 we look at deposited pgtable
 	 * when calling pmdp_huge_get_and_clear. So do the
So do the @@ -2362,64 +2493,19 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, tlb->fullmm); arch_check_zapped_pmd(vma, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); - if (!vma_is_dax(vma) && vma_is_special_huge(vma)) { - if (arch_needs_pgtable_deposit()) - zap_deposited_table(tlb->mm, pmd); - spin_unlock(ptl); - } else if (is_huge_zero_pmd(orig_pmd)) { - if (!vma_is_dax(vma) || arch_needs_pgtable_deposit()) - zap_deposited_table(tlb->mm, pmd); - spin_unlock(ptl); - } else { - struct folio *folio = NULL; - int flush_needed = 1; - if (pmd_present(orig_pmd)) { - struct page *page = pmd_page(orig_pmd); + is_present = pmd_present(orig_pmd); + folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present); + has_deposit = has_deposited_pgtable(vma, orig_pmd, folio); + if (folio) + zap_huge_pmd_folio(mm, vma, orig_pmd, folio, is_present); + if (has_deposit) + zap_deposited_table(mm, pmd); - folio = page_folio(page); - folio_remove_rmap_pmd(folio, page, vma); - WARN_ON_ONCE(folio_mapcount(folio) < 0); - VM_BUG_ON_PAGE(!PageHead(page), page); - } else if (pmd_is_valid_softleaf(orig_pmd)) { - const softleaf_t entry = softleaf_from_pmd(orig_pmd); - - folio = softleaf_to_folio(entry); - flush_needed = 0; - - if (!thp_migration_supported()) - WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!"); - } - - if (folio_test_anon(folio)) { - zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); - } else { - if (arch_needs_pgtable_deposit()) - zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, mm_counter_file(folio), - -HPAGE_PMD_NR); - - /* - * Use flush_needed to indicate whether the PMD entry - * is present, instead of checking pmd_present() again. 
-			 */
-			if (flush_needed && pmd_young(orig_pmd) &&
-			    likely(vma_has_recency(vma)))
-				folio_mark_accessed(folio);
-		}
-
-		if (folio_is_device_private(folio)) {
-			folio_remove_rmap_pmd(folio, &folio->page, vma);
-			WARN_ON_ONCE(folio_mapcount(folio) < 0);
-			folio_put(folio);
-		}
-
-		spin_unlock(ptl);
-		if (flush_needed)
-			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
-	}
-	return 1;
+	spin_unlock(ptl);
+	if (is_present && folio)
+		tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
+	return true;
 }
 
 #ifndef pmd_move_must_withdraw
@@ -2864,7 +2950,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
 	arch_check_zapped_pud(vma, orig_pud);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
-	if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
+	if (vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
 		/* No zero page support yet */
 	} else {
@@ -2972,7 +3058,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
 		pte_t entry;
 
-		entry = pfn_pte(my_zero_pfn(addr), vma->vm_page_prot);
+		entry = pfn_pte(zero_pfn(addr), vma->vm_page_prot);
 		entry = pte_mkspecial(entry);
 		if (pmd_uffd_wp(old_pmd))
 			entry = pte_mkuffd_wp(entry);
@@ -3015,7 +3101,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
-		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
+		if (vma_is_special_huge(vma))
 			return;
 		if (unlikely(pmd_is_migration_entry(old_pmd))) {
 			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
@@ -4106,7 +4192,7 @@ out_unlock:
 		i_mmap_unlock_read(mapping);
 out:
 	xas_destroy(&xas);
-	if (old_order == HPAGE_PMD_ORDER)
+	if (is_pmd_order(old_order))
 		count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	count_mthp_stat(old_order, !ret ? MTHP_STAT_SPLIT : MTHP_STAT_SPLIT_FAILED);
 	return ret;
@@ -4456,7 +4542,7 @@ retry:
 				goto next;
 		}
 		if (!folio_trylock(folio))
-			goto next;
+			goto requeue;
 		if (!split_folio(folio)) {
 			did_split = true;
 			if (underused)
@@ -4465,13 +4551,18 @@ retry:
 		}
 		folio_unlock(folio);
 next:
+		/*
+		 * If thp_underused() returns false, or if split_folio()
+		 * succeeds, or if split_folio() fails in the case it was
+		 * underused, then consider it used and don't add it back to
+		 * split_queue.
+		 */
 		if (did_split || !folio_test_partially_mapped(folio))
 			continue;
+requeue:
 		/*
-		 * Only add back to the queue if folio is partially mapped.
-		 * If thp_underused returns false, or if split_folio fails
-		 * in the case it was underused, then consider it used and
-		 * don't add it back to split_queue.
+		 * Add back partially mapped folios, or underused folios that
+		 * we could not lock this round.
 		 */
 		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
 		if (list_empty(&folio->_deferred_list)) {
@@ -4576,8 +4667,16 @@ next:
 
 static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
 {
-	return vma_is_special_huge(vma) || (vma->vm_flags & VM_IO) ||
-		    is_vm_hugetlb_page(vma);
+	if (vma_is_dax(vma))
+		return true;
+	if (vma_is_special_huge(vma))
+		return true;
+	if (vma_test(vma, VMA_IO_BIT))
+		return true;
+	if (is_vm_hugetlb_page(vma))
+		return true;
+
+	return false;
 }
 
 static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2ced2c8633d8..9413ed497be5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1017,34 +1017,6 @@ static pgoff_t vma_hugecache_offset(struct hstate *h,
 			(vma->vm_pgoff >> huge_page_order(h));
 }
 
-/**
- * vma_kernel_pagesize - Page size granularity for this VMA.
- * @vma: The user mapping.
- *
- * Folios in this VMA will be aligned to, and at least the size of the
- * number of bytes returned by this function.
- *
- * Return: The default size of the folios allocated when backing a VMA.
- */
-unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
-{
-	if (vma->vm_ops && vma->vm_ops->pagesize)
-		return vma->vm_ops->pagesize(vma);
-	return PAGE_SIZE;
-}
-EXPORT_SYMBOL_GPL(vma_kernel_pagesize);
-
-/*
- * Return the page size being used by the MMU to back a VMA. In the majority
- * of cases, the page size used by the kernel matches the MMU size. On
- * architectures where it differs, an architecture-specific 'strong'
- * version of this symbol is required.
- */
-__weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
-{
-	return vma_kernel_pagesize(vma);
-}
-
 /*
  * Flags for MAP_PRIVATE reservations. These are stored in the bottom
  * bits of the reservation map pointer, which are always clear due to
@@ -1186,7 +1158,7 @@ static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
 static void set_vma_desc_resv_map(struct vm_area_desc *desc, struct resv_map *map)
 {
 	VM_WARN_ON_ONCE(!is_vma_hugetlb_flags(&desc->vma_flags));
-	VM_WARN_ON_ONCE(vma_desc_test_flags(desc, VMA_MAYSHARE_BIT));
+	VM_WARN_ON_ONCE(vma_desc_test(desc, VMA_MAYSHARE_BIT));
 
 	desc->private_data = map;
 }
@@ -1194,7 +1166,7 @@ static void set_vma_desc_resv_map(struct vm_area_desc *desc, struct resv_map *ma
 static void set_vma_desc_resv_flags(struct vm_area_desc *desc, unsigned long flags)
 {
 	VM_WARN_ON_ONCE(!is_vma_hugetlb_flags(&desc->vma_flags));
-	VM_WARN_ON_ONCE(vma_desc_test_flags(desc, VMA_MAYSHARE_BIT));
+	VM_WARN_ON_ONCE(vma_desc_test(desc, VMA_MAYSHARE_BIT));
 
 	desc->private_data = (void *)((unsigned long)desc->private_data | flags);
 }
@@ -3160,6 +3132,7 @@ found:
 
 /* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
 static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
+					struct hstate *h,
 					unsigned long start_page_number,
 					unsigned long end_page_number)
 {
@@ -3168,6 +3141,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	struct page *page = folio_page(folio, start_page_number);
 	unsigned long head_pfn = folio_pfn(folio);
 	unsigned long pfn, end_pfn = head_pfn + end_page_number;
+	unsigned int order = huge_page_order(h);
 
 	/*
 	 * As we marked all tail pages with memblock_reserved_mark_noinit(),
@@ -3175,7 +3149,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	 */
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
 		__init_single_page(page, pfn, zone, nid);
-		prep_compound_tail((struct page *)folio, pfn - head_pfn);
+		prep_compound_tail(page, &folio->page, order);
 		set_page_count(page, 0);
 	}
 }
@@ -3195,7 +3169,7 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
 	__folio_set_head(folio);
 	ret = folio_ref_freeze(folio, 1);
 	VM_BUG_ON(!ret);
-	hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
+	hugetlb_folio_init_tail_vmemmap(folio, h, 1, nr_pages);
 	prep_compound_head(&folio->page, huge_page_order(h));
 }
 
@@ -3252,7 +3226,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 			 * time as this is early in boot and there should
 			 * be no contention.
 			 */
-			hugetlb_folio_init_tail_vmemmap(folio,
+			hugetlb_folio_init_tail_vmemmap(folio, h,
 					HUGETLB_VMEMMAP_RESERVE_PAGES,
 					pages_per_huge_page(h));
 		}
@@ -6592,7 +6566,7 @@ long hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !desc is a shm mapping
 	 */
-	if (!desc || vma_desc_test_flags(desc, VMA_MAYSHARE_BIT)) {
+	if (!desc || vma_desc_test(desc, VMA_MAYSHARE_BIT)) {
 		/*
 		 * resv_map can not be NULL as hugetlb_reserve_pages is only
 		 * called for inodes for which resv_maps were created (see
@@ -6626,7 +6600,7 @@ long hugetlb_reserve_pages(struct inode *inode,
 	if (err < 0)
 		goto out_err;
 
-	if (desc && !vma_desc_test_flags(desc, VMA_MAYSHARE_BIT) && h_cg) {
+	if (desc && !vma_desc_test(desc, VMA_MAYSHARE_BIT) && h_cg) {
 		/* For private mappings, the hugetlb_cgroup uncharge info hangs
 		 * of the resv_map.
 		 */
@@ -6663,7 +6637,7 @@ long hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!desc || vma_desc_test_flags(desc, VMA_MAYSHARE_BIT)) {
+	if (!desc || vma_desc_test(desc, VMA_MAYSHARE_BIT)) {
 		add = region_add(resv_map, from, to, regions_needed, h, h_cg);
 
 		if (unlikely(add < 0)) {
@@ -6727,7 +6701,7 @@ out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
 					    chg * pages_per_huge_page(h), h_cg);
 out_err:
-	if (!desc || vma_desc_test_flags(desc, VMA_MAYSHARE_BIT))
+	if (!desc || vma_desc_test(desc, VMA_MAYSHARE_BIT))
 		/* Only call region_abort if the region_chg succeeded but the
 		 * region_add failed or didn't run.
 		 */
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a9280259e12a..4a077d231d3a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -19,14 +19,15 @@
 
 #include <asm/tlbflush.h>
 #include "hugetlb_vmemmap.h"
+#include "internal.h"
 
 /**
  * struct vmemmap_remap_walk - walk vmemmap page table
  *
  * @remap_pte: called for each lowest-level entry (PTE).
  * @nr_walked: the number of walked pte.
- * @reuse_page: the page which is reused for the tail vmemmap pages.
- * @reuse_addr: the virtual address of the @reuse_page page.
+ * @vmemmap_head: the page to be installed as first in the vmemmap range
+ * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
  * @vmemmap_pages: the list head of the vmemmap pages that can be freed
  * or is mapped from.
  * @flags: used to modify behavior in vmemmap page table walking
@@ -35,17 +36,17 @@ struct vmemmap_remap_walk {
 	void (*remap_pte)(pte_t *pte, unsigned long addr,
 			  struct vmemmap_remap_walk *walk);
+
 	unsigned long nr_walked;
-	struct page *reuse_page;
-	unsigned long reuse_addr;
+	struct page *vmemmap_head;
+	struct page *vmemmap_tail;
 	struct list_head *vmemmap_pages;
+
 /* Skip the TLB flush when we split the PMD */
 #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
 /* Skip the TLB flush when we remap the PTE */
 #define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
-/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
-#define VMEMMAP_SYNCHRONIZE_RCU BIT(2)
 	unsigned long flags;
 };
 
@@ -141,14 +142,7 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
 {
 	struct vmemmap_remap_walk *vmemmap_walk = walk->private;
 
-	/*
-	 * The reuse_page is found 'first' in page table walking before
-	 * starting remapping.
-	 */
-	if (!vmemmap_walk->reuse_page)
-		vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
-	else
-		vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
+	vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
 	vmemmap_walk->nr_walked++;
 
 	return 0;
@@ -208,18 +202,12 @@ static void free_vmemmap_page_list(struct list_head *list)
 static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 			      struct vmemmap_remap_walk *walk)
 {
-	/*
-	 * Remap the tail pages as read-only to catch illegal write operation
-	 * to the tail pages.
-	 */
-	pgprot_t pgprot = PAGE_KERNEL_RO;
 	struct page *page = pte_page(ptep_get(pte));
 	pte_t entry;
 
 	/* Remapping the head page requires r/w */
-	if (unlikely(addr == walk->reuse_addr)) {
-		pgprot = PAGE_KERNEL;
-		list_del(&walk->reuse_page->lru);
+	if (unlikely(walk->nr_walked == 0 && walk->vmemmap_head)) {
+		list_del(&walk->vmemmap_head->lru);
 
 		/*
 		 * Makes sure that preceding stores to the page contents from
@@ -227,53 +215,50 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 		 * write.
 		 */
 		smp_wmb();
+
+		entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
+	} else {
+		/*
+		 * Remap the tail pages as read-only to catch illegal write
+		 * operation to the tail pages.
+		 */
+		entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);
 	}
 
-	entry = mk_pte(walk->reuse_page, pgprot);
 	list_add(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
 
-/*
- * How many struct page structs need to be reset. When we reuse the head
- * struct page, the special metadata (e.g. page->flags or page->mapping)
- * cannot copy to the tail struct page structs. The invalid value will be
- * checked in the free_tail_page_prepare(). In order to avoid the message
- * of "corrupted mapping in tail page". We need to reset at least 4 (one
- * head struct page struct and three tail struct page structs) struct page
- * structs.
- */
-#define NR_RESET_STRUCT_PAGE 4
-
-static inline void reset_struct_pages(struct page *start)
-{
-	struct page *from = start + NR_RESET_STRUCT_PAGE;
-
-	BUILD_BUG_ON(NR_RESET_STRUCT_PAGE * 2 > PAGE_SIZE / sizeof(struct page));
-	memcpy(start, from, sizeof(*from) * NR_RESET_STRUCT_PAGE);
-}
-
 static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 				struct vmemmap_remap_walk *walk)
 {
-	pgprot_t pgprot = PAGE_KERNEL;
 	struct page *page;
-	void *to;
-
-	BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
+	struct page *from, *to;
 
 	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
 	list_del(&page->lru);
+
+	/*
+	 * Initialize tail pages in the newly allocated vmemmap page.
+	 *
+	 * There is folio-scope metadata that is encoded in the first few
+	 * tail pages.
+	 *
+	 * Use the value last tail page in the page with the head page
+	 * to initialize the rest of tail pages.
+	 */
+	from = compound_head((struct page *)addr) +
+		PAGE_SIZE / sizeof(struct page) - 1;
 	to = page_to_virt(page);
-	copy_page(to, (void *)walk->reuse_addr);
-	reset_struct_pages(to);
+	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++, to++)
+		*to = *from;
 
 	/*
 	 * Makes sure that preceding stores to the page contents become visible
 	 * before the set_pte_at() write.
 	 */
 	smp_wmb();
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
 }
 
 /**
@@ -283,33 +268,28 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
  * to remap.
  * @end: end address of the vmemmap virtual address range that we want to
  * remap.
- * @reuse: reuse address.
- *
  * Return: %0 on success, negative error code otherwise.
  */
-static int vmemmap_remap_split(unsigned long start, unsigned long end,
-			       unsigned long reuse)
+static int vmemmap_remap_split(unsigned long start, unsigned long end)
 {
 	struct vmemmap_remap_walk walk = {
 		.remap_pte = NULL,
 		.flags = VMEMMAP_SPLIT_NO_TLB_FLUSH,
 	};
 
-	/* See the comment in the vmemmap_remap_free(). */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
-	return vmemmap_remap_range(reuse, end, &walk);
+	return vmemmap_remap_range(start, end, &walk);
 }
 
 /**
  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
- * to the page which @reuse is mapped to, then free vmemmap
- * which the range are mapped to.
+ * to use @vmemmap_head/tail, then free vmemmap which
+ * the range are mapped to.
  * @start: start address of the vmemmap virtual address range that we want
  * to remap.
  * @end: end address of the vmemmap virtual address range that we want to
  * remap.
- * @reuse: reuse address.
+ * @vmemmap_head: the page to be installed as first in the vmemmap range
+ * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
  * @vmemmap_pages: list to deposit vmemmap pages to be freed. It is callers
  * responsibility to free pages.
  * @flags: modifications to vmemmap_remap_walk flags
@@ -317,69 +297,38 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_free(unsigned long start, unsigned long end,
-			      unsigned long reuse,
+			      struct page *vmemmap_head,
+			      struct page *vmemmap_tail,
 			      struct list_head *vmemmap_pages,
 			      unsigned long flags)
 {
 	int ret;
 	struct vmemmap_remap_walk walk = {
 		.remap_pte = vmemmap_remap_pte,
-		.reuse_addr = reuse,
+		.vmemmap_head = vmemmap_head,
+		.vmemmap_tail = vmemmap_tail,
 		.vmemmap_pages = vmemmap_pages,
 		.flags = flags,
 	};
-	int nid = page_to_nid((struct page *)reuse);
-	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 
-	/*
-	 * Allocate a new head vmemmap page to avoid breaking a contiguous
-	 * block of struct page memory when freeing it back to page allocator
-	 * in free_vmemmap_page_list(). This will allow the likely contiguous
-	 * struct page backing memory to be kept contiguous and allowing for
-	 * more allocations of hugepages. Fallback to the currently
-	 * mapped head page in case should it fail to allocate.
-	 */
-	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
-	if (walk.reuse_page) {
-		copy_page(page_to_virt(walk.reuse_page),
-			  (void *)walk.reuse_addr);
-		list_add(&walk.reuse_page->lru, vmemmap_pages);
-		memmap_pages_add(1);
-	}
+	ret = vmemmap_remap_range(start, end, &walk);
+	if (!ret || !walk.nr_walked)
+		return ret;
+
+	end = start + walk.nr_walked * PAGE_SIZE;
 
 	/*
-	 * In order to make remapping routine most efficient for the huge pages,
-	 * the routine of vmemmap page table walking has the following rules
-	 * (see more details from the vmemmap_pte_range()):
-	 *
-	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
-	 * should be continuous.
-	 * - The @reuse address is part of the range [@reuse, @end) that we are
-	 * walking which is passed to vmemmap_remap_range().
-	 * - The @reuse address is the first in the complete range.
-	 *
-	 * So we need to make sure that @start and @reuse meet the above rules.
+	 * vmemmap_pages contains pages from the previous vmemmap_remap_range()
+	 * call which failed. These are pages which were removed from
+	 * the vmemmap. They will be restored in the following call.
 	 */
-	BUG_ON(start - reuse != PAGE_SIZE);
+	walk = (struct vmemmap_remap_walk) {
+		.remap_pte = vmemmap_restore_pte,
+		.vmemmap_pages = vmemmap_pages,
+		.flags = 0,
+	};
 
-	ret = vmemmap_remap_range(reuse, end, &walk);
-	if (ret && walk.nr_walked) {
-		end = reuse + walk.nr_walked * PAGE_SIZE;
-		/*
-		 * vmemmap_pages contains pages from the previous
-		 * vmemmap_remap_range call which failed. These
-		 * are pages which were removed from the vmemmap.
-		 * They will be restored in the following call.
-		 */
-		walk = (struct vmemmap_remap_walk) {
-			.remap_pte = vmemmap_restore_pte,
-			.reuse_addr = reuse,
-			.vmemmap_pages = vmemmap_pages,
-			.flags = 0,
-		};
-
-		vmemmap_remap_range(reuse, end, &walk);
-	}
+	vmemmap_remap_range(start, end, &walk);
 
 	return ret;
 }
@@ -416,34 +365,26 @@ out:
  * to remap.
  * @end: end address of the vmemmap virtual address range that we want to
  * remap.
- * @reuse: reuse address.
  * @flags: modifications to vmemmap_remap_walk flags
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			       unsigned long reuse, unsigned long flags)
+			       unsigned long flags)
 {
 	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte = vmemmap_restore_pte,
-		.reuse_addr = reuse,
 		.vmemmap_pages = &vmemmap_pages,
 		.flags = flags,
 	};
 
-	/* See the comment in the vmemmap_remap_free(). */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
 	if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
 		return -ENOMEM;
 
-	return vmemmap_remap_range(reuse, end, &walk);
+	return vmemmap_remap_range(start, end, &walk);
 }
 
-DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
-
 static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
 static int __init hugetlb_vmemmap_optimize_param(char *buf)
 {
@@ -455,8 +396,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 					   struct folio *folio, unsigned long flags)
 {
 	int ret;
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -464,25 +404,20 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	if (!folio_test_hugetlb_vmemmap_optimized(folio))
 		return 0;
 
-	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
-		synchronize_rcu();
-
+	vmemmap_start = (unsigned long)&folio->page;
 	vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse = vmemmap_start;
+	vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
 	 * The pages which the vmemmap virtual address range [@vmemmap_start,
-	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
-	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
+	 * @vmemmap_end) are mapped to are freed to the buddy allocator.
 	 * When a HugeTLB page is freed to the buddy allocator, previously
 	 * discarded vmemmap pages must be allocated and remapping.
 	 */
-	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
-	if (!ret) {
+	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
+	if (!ret)
 		folio_clear_hugetlb_vmemmap_optimized(folio);
-		static_branch_dec(&hugetlb_optimize_vmemmap_key);
-	}
 
 	return ret;
 }
 
@@ -499,7 +434,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
-	return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
+	return __hugetlb_vmemmap_restore_folio(h, folio, 0);
 }
 
 /**
@@ -522,14 +457,11 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	struct folio *folio, *t_folio;
 	long restored = 0;
 	long ret = 0;
-	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
 			ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
-			/* only need to synchronize_rcu() once for each batch */
-			flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
-
 			if (ret)
 				break;
 			restored++;
@@ -561,14 +493,40 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
 	return true;
 }
 
+static struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
+{
+	const unsigned int idx = order - VMEMMAP_TAIL_MIN_ORDER;
+	struct page *tail, *p;
+	int node = zone_to_nid(zone);
+
+	tail = READ_ONCE(zone->vmemmap_tails[idx]);
+	if (likely(tail))
+		return tail;
+
+	tail = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+	if (!tail)
+		return NULL;
+
+	p = page_to_virt(tail);
+	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+		init_compound_tail(p + i, NULL, order, zone);
+
+	if (cmpxchg(&zone->vmemmap_tails[idx], NULL, tail)) {
+		__free_page(tail);
+		tail = READ_ONCE(zone->vmemmap_tails[idx]);
+	}
+
+	return tail;
+}
+
 static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 					    struct folio *folio,
 					    struct list_head *vmemmap_pages,
 					    unsigned long flags)
 {
-	int ret = 0;
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
+	struct page *vmemmap_head, *vmemmap_tail;
+	int nid, ret = 0;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -576,10 +534,11 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	if (!vmemmap_should_optimize_folio(h, folio))
 		return ret;
 
-	static_branch_inc(&hugetlb_optimize_vmemmap_key);
+	nid = folio_nid(folio);
+	vmemmap_tail = vmemmap_get_tail(h->order, folio_zone(folio));
+	if (!vmemmap_tail)
+		return -ENOMEM;
 
-	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
-		synchronize_rcu();
 	/*
 	 * Very Subtle
 	 * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -593,22 +552,30 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	 */
 	folio_set_hugetlb_vmemmap_optimized(folio);
 
+	vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
+	if (!vmemmap_head) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	copy_page(page_to_virt(vmemmap_head), folio);
+	list_add(&vmemmap_head->lru, vmemmap_pages);
+	memmap_pages_add(1);
+
+	vmemmap_start = (unsigned long)&folio->page;
 	vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse = vmemmap_start;
-	vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
-	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
-	 * to the page which @vmemmap_reuse is mapped to. Add pages previously
-	 * mapping the range to vmemmap_pages list so that they can be freed by
-	 * the caller.
+	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end).
+	 * Add pages previously mapping the range to vmemmap_pages list so that
+	 * they can be freed by the caller.
 	 */
-	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
+	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end,
+				 vmemmap_head, vmemmap_tail,
 				 vmemmap_pages, flags);
-	if (ret) {
-		static_branch_dec(&hugetlb_optimize_vmemmap_key);
+out:
+	if (ret)
 		folio_clear_hugetlb_vmemmap_optimized(folio);
-	}
 
 	return ret;
 }
@@ -627,27 +594,25 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
 	LIST_HEAD(vmemmap_pages);
 
-	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
+	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
 static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *folio)
 {
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
 
 	if (!vmemmap_should_optimize_folio(h, folio))
 		return 0;
 
+	vmemmap_start = (unsigned long)&folio->page;
 	vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse = vmemmap_start;
-	vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
 	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
 	 * @vmemmap_end]
 	 */
-	return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+	return vmemmap_remap_split(vmemmap_start, vmemmap_end);
 }
 
 static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
@@ -657,7 +622,7 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 	struct folio *folio;
 	int nr_to_optimize;
 	LIST_HEAD(vmemmap_pages);
-	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
 
 	nr_to_optimize = 0;
 	list_for_each_entry(folio, folio_list, lru) {
@@ -676,7 +641,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 			register_page_bootmem_memmap(pfn_to_section_nr(spfn),
 					&folio->page,
 					HUGETLB_VMEMMAP_RESERVE_SIZE);
-			static_branch_inc(&hugetlb_optimize_vmemmap_key);
 			continue;
 		}
 
@@ -710,8 +674,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 		int ret;
 
 		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
-		/* only need to synchronize_rcu() once for each batch */
-		flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
 
 		/*
 		 * Pages to be freed may have been accumulated. If we
@@ -790,7 +752,6 @@ void __init hugetlb_vmemmap_init_early(int nid)
 {
 	unsigned long psize, paddr, section_size;
 	unsigned long ns, i, pnum, pfn, nr_pages;
-	unsigned long start, end;
 	struct huge_bootmem_page *m = NULL;
 	void *map;
 
@@ -808,14 +769,6 @@ void __init hugetlb_vmemmap_init_early(int nid)
 		paddr = virt_to_phys(m);
 		pfn = PHYS_PFN(paddr);
 		map = pfn_to_page(pfn);
-		start = (unsigned long)map;
-		end = start + nr_pages * sizeof(struct page);
-
-		if (vmemmap_populate_hvo(start, end, nid,
-					HUGETLB_VMEMMAP_RESERVE_SIZE) < 0)
-			continue;
-
-		memmap_boot_pages_add(HUGETLB_VMEMMAP_RESERVE_SIZE / PAGE_SIZE);
 
 		pnum = pfn_to_section_nr(pfn);
 		ns = psize / section_size;
@@ -831,11 +784,26 @@ void __init hugetlb_vmemmap_init_early(int nid)
 	}
 }
 
+static struct zone *pfn_to_zone(unsigned nid, unsigned long pfn)
+{
+	struct zone *zone;
+	enum zone_type zone_type;
+
+	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+		zone = &NODE_DATA(nid)->node_zones[zone_type];
+		if (zone_spans_pfn(zone, pfn))
+			return zone;
+	}
+
+	return NULL;
+}
+
 void __init hugetlb_vmemmap_init_late(int nid)
 {
 	struct huge_bootmem_page *m, *tm;
 	unsigned long phys, nr_pages, start, end;
 	unsigned long pfn, nr_mmap;
+	struct zone *zone = NULL;
 	struct hstate *h;
 	void *map;
 
@@ -850,28 +818,41 @@ void __init hugetlb_vmemmap_init_late(int nid)
 		h = m->hstate;
 		pfn = PHYS_PFN(phys);
 		nr_pages = pages_per_huge_page(h);
+		map = pfn_to_page(pfn);
+		start = (unsigned long)map;
+		end = start + nr_pages * sizeof(struct page);
 
 		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
 			/*
 			 * Oops, the hugetlb page spans multiple zones.
-			 * Remove it from the list, and undo HVO.
+			 * Remove it from the list, and populate it normally.
 			 */
 			list_del(&m->list);
 
-			map = pfn_to_page(pfn);
-
-			start = (unsigned long)map;
-			end = start + nr_pages * sizeof(struct page);
-
-			vmemmap_undo_hvo(start, end, nid,
-					 HUGETLB_VMEMMAP_RESERVE_SIZE);
-			nr_mmap = end - start - HUGETLB_VMEMMAP_RESERVE_SIZE;
+			vmemmap_populate(start, end, nid, NULL);
+			nr_mmap = end - start;
 			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
 
 			memblock_phys_free(phys, huge_page_size(h));
 			continue;
-		} else
+		}
+
+		if (!zone || !zone_spans_pfn(zone, pfn))
+			zone = pfn_to_zone(nid, pfn);
+		if (WARN_ON_ONCE(!zone))
+			continue;
+
+		if (vmemmap_populate_hvo(start, end, huge_page_order(h), zone,
+					 HUGETLB_VMEMMAP_RESERVE_SIZE) < 0) {
+			/* Fallback if HVO population fails */
+			vmemmap_populate(start, end, nid, NULL);
+			nr_mmap = end - start;
+		} else {
 			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+			nr_mmap = HUGETLB_VMEMMAP_RESERVE_SIZE;
+		}
+
+		memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
 	}
 }
 #endif
@@ -889,10 +870,27 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
 static int __init hugetlb_vmemmap_init(void)
 {
 	const struct hstate *h;
+	struct zone *zone;
 
 	/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
 	BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
 
+	for_each_zone(zone) {
+		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
+			struct page *tail, *p;
+			unsigned int order;
+
+			tail = zone->vmemmap_tails[i];
+			if (!tail)
+				continue;
+
+			order = i + VMEMMAP_TAIL_MIN_ORDER;
+			p = page_to_virt(tail);
+			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
+				init_compound_tail(p + j, NULL, order, zone);
+		}
+	}
+
 	for_each_hstate(h) {
 		if (hugetlb_vmemmap_optimizable(h)) {
 			register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d7d9..c693646e5b3f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,7 @@
 #include <linux/khugepaged.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 #include <linux/pagemap.h>
 #include <linux/pagewalk.h>
 #include <linux/rmap.h>
@@ -516,14 +517,30 @@ void free_pgtables(struct mmu_gather *tlb, struct unmap_desc *desc);
 
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
+/**
+ * sync_with_folio_pmd_zap - sync with concurrent zapping of a folio PMD
+ * @mm: The mm_struct.
+ * @pmdp: Pointer to the pmd that was found to be pmd_none().
+ *
+ * When we find a pmd_none() while unmapping a folio without holding the PTL,
+ * zap_huge_pmd() may have cleared the PMD but not yet modified the folio to
+ * indicate that it's unmapped. Skipping the PMD without synchronization could
+ * make folio unmapping code assume that unmapping failed.
+ *
+ * Wait for concurrent zapping to complete by grabbing the PTL.
+ */
+static inline void sync_with_folio_pmd_zap(struct mm_struct *mm, pmd_t *pmdp)
+{
+	spinlock_t *ptl = pmd_lock(mm, pmdp);
+
+	spin_unlock(ptl);
+}
+
 struct zap_details;
-void unmap_page_range(struct mmu_gather *tlb,
-			     struct vm_area_struct *vma,
-			     unsigned long addr, unsigned long end,
-			     struct zap_details *details);
-void zap_page_range_single_batched(struct mmu_gather *tlb,
+void zap_vma_range_batched(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long addr,
 		unsigned long size, struct zap_details *details);
+int zap_vma_for_reaping(struct vm_area_struct *vma);
 int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
 			   gfp_t gfp);
 
@@ -624,6 +641,11 @@ int user_proactive_reclaim(char *buf,
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
+ * in mm/khugepaged.c
+ */
+void set_recommended_min_free_kbytes(void);
+
+/*
  * in mm/page_alloc.c
  */
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -878,13 +900,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
-static inline void prep_compound_tail(struct page *head, int tail_idx)
+static inline void prep_compound_tail(struct page *tail,
+		const struct page *head, unsigned int order)
 {
-	struct page *p = head + tail_idx;
+	tail->mapping = TAIL_MAPPING;
+	set_compound_head(tail, head, order);
+	set_page_private(tail, 0);
+}
 
-	p->mapping = TAIL_MAPPING;
-	set_compound_head(p, head);
-	set_page_private(p, 0);
+static inline void init_compound_tail(struct page *tail,
+		const struct page *head, unsigned int order, struct zone *zone)
+{
+	atomic_set(&tail->_mapcount, -1);
+	set_page_node(tail, zone_to_nid(zone));
+	set_page_zone(tail, zone_idx(zone));
+	prep_compound_tail(tail, head, order);
 }
 
 void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
@@ -929,12 +959,59 @@ void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
 		unsigned long, enum meminit_context, struct vmem_altmap *, int,
 		bool);
 
+/*
+ * mm/sparse.c
+ */
 #ifdef CONFIG_SPARSEMEM
 void sparse_init(void);
+int sparse_index_init(unsigned long section_nr, int nid);
+
+static inline void sparse_init_one_section(struct mem_section *ms,
+		unsigned long pnum, struct page *mem_map,
+		struct mem_section_usage *usage, unsigned long flags)
+{
+	unsigned long coded_mem_map;
+
+	BUILD_BUG_ON(SECTION_MAP_LAST_BIT > PFN_SECTION_SHIFT);
+
+	/*
+	 * We encode the start PFN of the section into the mem_map such that
+	 * page_to_pfn() on !CONFIG_SPARSEMEM_VMEMMAP can simply subtract it
+	 * from the page pointer to obtain the PFN.
+ */ + coded_mem_map = (unsigned long)(mem_map - section_nr_to_pfn(pnum)); + VM_WARN_ON_ONCE(coded_mem_map & ~SECTION_MAP_MASK); + + ms->section_mem_map &= ~SECTION_MAP_MASK; + ms->section_mem_map |= coded_mem_map; + ms->section_mem_map |= flags | SECTION_HAS_MEM_MAP; + ms->usage = usage; +} + +static inline void __section_mark_present(struct mem_section *ms, + unsigned long section_nr) +{ + if (section_nr > __highest_present_section_nr) + __highest_present_section_nr = section_nr; + + ms->section_mem_map |= SECTION_MARKED_PRESENT; +} #else static inline void sparse_init(void) {} #endif /* CONFIG_SPARSEMEM */ +/* + * mm/sparse-vmemmap.c + */ +#ifdef CONFIG_SPARSEMEM_VMEMMAP +void sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages); +#else +static inline void sparse_init_subsection_map(unsigned long pfn, + unsigned long nr_pages) +{ +} +#endif /* CONFIG_SPARSEMEM_VMEMMAP */ + #if defined CONFIG_COMPACTION || defined CONFIG_CMA /* @@ -1218,6 +1295,18 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, } return fpin; } + +static inline bool vma_supports_mlock(const struct vm_area_struct *vma) +{ + if (vma_test_any_mask(vma, VMA_SPECIAL_FLAGS)) + return false; + if (vma_test_single_mask(vma, VMA_DROPPABLE)) + return false; + if (vma_is_dax(vma) || is_vm_hugetlb_page(vma)) + return false; + return vma != get_gate_vma(current->mm); +} + #else /* !CONFIG_MMU */ static inline void unmap_mapping_folio(struct folio *folio) { } static inline void mlock_new_folio(struct folio *folio) { } @@ -1450,6 +1539,8 @@ int __must_check vmap_pages_range_noflush(unsigned long addr, unsigned long end, } #endif +void clear_vm_uninitialized_flag(struct vm_struct *vm); + int __must_check __vmap_pages_range_noflush(unsigned long addr, unsigned long end, pgprot_t prot, struct page **pages, unsigned int page_shift); @@ -1748,26 +1839,108 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start, void dup_mm_exe_file(struct mm_struct *mm, 
struct mm_struct *oldmm); int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm); -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn); -int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn, unsigned long size, pgprot_t pgprot); +int remap_pfn_range_prepare(struct vm_area_desc *desc); +int remap_pfn_range_complete(struct vm_area_struct *vma, + struct mmap_action *action); +int simple_ioremap_prepare(struct vm_area_desc *desc); -static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, - unsigned long orig_pfn, unsigned long size) +static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc) { + struct mmap_action *action = &desc->action; + const unsigned long orig_pfn = action->remap.start_pfn; + const pgprot_t orig_pgprot = action->remap.pgprot; + const unsigned long size = action->remap.size; const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size); + int err; + + action->remap.start_pfn = pfn; + action->remap.pgprot = pgprot_decrypted(orig_pgprot); + err = remap_pfn_range_prepare(desc); + if (err) + return err; + + /* Remap does the actual work. */ + action->type = MMAP_REMAP_PFN; + return 0; +} - return remap_pfn_range_prepare(desc, pfn); +/* + * When we succeed an mmap action or just before we unmap a VMA on error, we + * need to ensure any rmap lock held is released. On unmap it's required to + * avoid a deadlock. 
+ */ +static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma, + struct mmap_action *action) +{ + struct file *file; + + if (!action->hide_from_rmap_until_complete) + return; + + VM_WARN_ON_ONCE(vma_is_anonymous(vma)); + file = vma->vm_file; + i_mmap_unlock_write(file->f_mapping); + action->hide_from_rmap_until_complete = false; } -static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma, - unsigned long addr, unsigned long orig_pfn, unsigned long size, - pgprot_t orig_prot) +#ifdef CONFIG_MMU_NOTIFIER +static inline bool clear_flush_young_ptes_notify(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, unsigned int nr) { - const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size); - const pgprot_t prot = pgprot_decrypted(orig_prot); + bool young; - return remap_pfn_range_complete(vma, addr, pfn, size, prot); + young = clear_flush_young_ptes(vma, addr, ptep, nr); + young |= mmu_notifier_clear_flush_young(vma->vm_mm, addr, + addr + nr * PAGE_SIZE); + return young; } +static inline bool pmdp_clear_flush_young_notify(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp) +{ + bool young; + + young = pmdp_clear_flush_young(vma, addr, pmdp); + young |= mmu_notifier_clear_flush_young(vma->vm_mm, addr, addr + PMD_SIZE); + return young; +} + +static inline bool test_and_clear_young_ptes_notify(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, unsigned int nr) +{ + bool young; + + young = test_and_clear_young_ptes(vma, addr, ptep, nr); + young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + nr * PAGE_SIZE); + return young; +} + +static inline bool pmdp_test_and_clear_young_notify(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp) +{ + bool young; + + young = pmdp_test_and_clear_young(vma, addr, pmdp); + young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE); + return young; +} + +#else /* CONFIG_MMU_NOTIFIER */ + +#define clear_flush_young_ptes_notify clear_flush_young_ptes 
+#define pmdp_clear_flush_young_notify pmdp_clear_flush_young +#define test_and_clear_young_ptes_notify test_and_clear_young_ptes +#define pmdp_test_and_clear_young_notify pmdp_test_and_clear_young + +#endif /* CONFIG_MMU_NOTIFIER */ + +extern int sysctl_max_map_count; +static inline int get_sysctl_max_map_count(void) +{ + return READ_ONCE(sysctl_max_map_count); +} + +bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags, + unsigned long npages); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/interval_tree.c b/mm/interval_tree.c index 32e390c42c53..32bcfbfcf15f 100644 --- a/mm/interval_tree.c +++ b/mm/interval_tree.c @@ -15,11 +15,6 @@ static inline unsigned long vma_start_pgoff(struct vm_area_struct *v) return v->vm_pgoff; } -static inline unsigned long vma_last_pgoff(struct vm_area_struct *v) -{ - return v->vm_pgoff + vma_pages(v) - 1; -} - INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.rb, unsigned long, shared.rb_subtree_last, vma_start_pgoff, vma_last_pgoff, /* empty */, vma_interval_tree) diff --git a/mm/kasan/init.c b/mm/kasan/init.c index f084e7a5df1e..9c880f607c6a 100644 --- a/mm/kasan/init.c +++ b/mm/kasan/init.c @@ -292,7 +292,7 @@ static void kasan_free_pte(pte_t *pte_start, pmd_t *pmd) return; } - pte_free_kernel(&init_mm, (pte_t *)page_to_virt(pmd_page(*pmd))); + pte_free_kernel(&init_mm, pte_start); pmd_clear(pmd); } @@ -307,7 +307,7 @@ static void kasan_free_pmd(pmd_t *pmd_start, pud_t *pud) return; } - pmd_free(&init_mm, (pmd_t *)page_to_virt(pud_page(*pud))); + pmd_free(&init_mm, pmd_start); pud_clear(pud); } @@ -322,7 +322,7 @@ static void kasan_free_pud(pud_t *pud_start, p4d_t *p4d) return; } - pud_free(&init_mm, (pud_t *)page_to_virt(p4d_page(*p4d))); + pud_free(&init_mm, pud_start); p4d_clear(p4d); } @@ -337,7 +337,7 @@ static void kasan_free_p4d(p4d_t *p4d_start, pgd_t *pgd) return; } - p4d_free(&init_mm, (p4d_t *)page_to_virt(pgd_page(*pgd))); + p4d_free(&init_mm, p4d_start); pgd_clear(pgd); } diff --git 
a/mm/kasan/report.c b/mm/kasan/report.c index 27efb78eb32d..e804b1e1f886 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -638,7 +638,7 @@ void kasan_report_async(void) */ void kasan_non_canonical_hook(unsigned long addr) { - unsigned long orig_addr; + unsigned long orig_addr, user_orig_addr; const char *bug_type; /* @@ -650,6 +650,9 @@ void kasan_non_canonical_hook(unsigned long addr) orig_addr = (unsigned long)kasan_shadow_to_mem((void *)addr); + /* Strip pointer tag before comparing against userspace ranges */ + user_orig_addr = (unsigned long)set_tag((void *)orig_addr, 0); + /* * For faults near the shadow address for NULL, we can be fairly certain * that this is a KASAN shadow memory access. @@ -661,11 +664,13 @@ void kasan_non_canonical_hook(unsigned long addr) * address, but make it clear that this is not necessarily what's * actually going on. */ - if (orig_addr < PAGE_SIZE) + if (user_orig_addr < PAGE_SIZE) { bug_type = "null-ptr-deref"; - else if (orig_addr < TASK_SIZE) + orig_addr = user_orig_addr; + } else if (user_orig_addr < TASK_SIZE) { bug_type = "probably user-memory-access"; - else if (addr_in_shadow((void *)addr)) + orig_addr = user_orig_addr; + } else if (addr_in_shadow((void *)addr)) bug_type = "probably wild-memory-access"; else bug_type = "maybe wild-memory-access"; diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 7393957f9a20..9eba46212edf 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -51,7 +51,7 @@ /* === Data ================================================================= */ -static bool kfence_enabled __read_mostly; +bool kfence_enabled __read_mostly; static bool disabled_by_warn __read_mostly; unsigned long kfence_sample_interval __read_mostly = CONFIG_KFENCE_SAMPLE_INTERVAL; @@ -336,6 +336,7 @@ out: static check_canary_attributes bool check_canary_byte(u8 *addr) { struct kfence_metadata *meta; + enum kfence_fault fault; unsigned long flags; if (likely(*addr == KFENCE_CANARY_PATTERN_U8(addr))) @@ -345,8 +346,9 
@@ static check_canary_attributes bool check_canary_byte(u8 *addr) meta = addr_to_metadata((unsigned long)addr); raw_spin_lock_irqsave(&meta->lock, flags); - kfence_report_error((unsigned long)addr, false, NULL, meta, KFENCE_ERROR_CORRUPTION); + fault = kfence_report_error((unsigned long)addr, false, NULL, meta, KFENCE_ERROR_CORRUPTION); raw_spin_unlock_irqrestore(&meta->lock, flags); + kfence_handle_fault(fault); return false; } @@ -525,11 +527,14 @@ static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool z raw_spin_lock_irqsave(&meta->lock, flags); if (!kfence_obj_allocated(meta) || meta->addr != (unsigned long)addr) { + enum kfence_fault fault; + /* Invalid or double-free, bail out. */ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]); - kfence_report_error((unsigned long)addr, false, NULL, meta, - KFENCE_ERROR_INVALID_FREE); + fault = kfence_report_error((unsigned long)addr, false, NULL, meta, + KFENCE_ERROR_INVALID_FREE); raw_spin_unlock_irqrestore(&meta->lock, flags); + kfence_handle_fault(fault); return; } @@ -831,7 +836,8 @@ static void kfence_check_all_canary(void) static int kfence_check_canary_callback(struct notifier_block *nb, unsigned long reason, void *arg) { - kfence_check_all_canary(); + if (READ_ONCE(kfence_enabled)) + kfence_check_all_canary(); return NOTIFY_OK; } @@ -1266,6 +1272,7 @@ bool kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs struct kfence_metadata *to_report = NULL; unsigned long unprotected_page = 0; enum kfence_error_type error_type; + enum kfence_fault fault; unsigned long flags; if (!is_kfence_address((void *)addr)) @@ -1324,12 +1331,14 @@ out: if (to_report) { raw_spin_lock_irqsave(&to_report->lock, flags); to_report->unprotected_page = unprotected_page; - kfence_report_error(addr, is_write, regs, to_report, error_type); + fault = kfence_report_error(addr, is_write, regs, to_report, error_type); raw_spin_unlock_irqrestore(&to_report->lock, flags); } else { /* This may be a UAF or 
OOB access, but we can't be sure. */ - kfence_report_error(addr, is_write, regs, NULL, KFENCE_ERROR_INVALID); + fault = kfence_report_error(addr, is_write, regs, NULL, KFENCE_ERROR_INVALID); } + kfence_handle_fault(fault); + return kfence_unprotect(addr); /* Unprotect and let access proceed. */ } diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h index f9caea007246..1f618f9b0d12 100644 --- a/mm/kfence/kfence.h +++ b/mm/kfence/kfence.h @@ -16,6 +16,8 @@ #include "../slab.h" /* for struct kmem_cache */ +extern bool kfence_enabled; + /* * Get the canary byte pattern for @addr. Use a pattern that varies based on the * lower 3 bits of the address, to detect memory corruptions with higher @@ -140,8 +142,18 @@ enum kfence_error_type { KFENCE_ERROR_INVALID_FREE, /* Invalid free. */ }; -void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs, - const struct kfence_metadata *meta, enum kfence_error_type type); +enum kfence_fault { + KFENCE_FAULT_NONE, + KFENCE_FAULT_REPORT, + KFENCE_FAULT_OOPS, + KFENCE_FAULT_PANIC, +}; + +enum kfence_fault +kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs, + const struct kfence_metadata *meta, enum kfence_error_type type); + +void kfence_handle_fault(enum kfence_fault fault); void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta) __must_hold(&meta->lock); diff --git a/mm/kfence/report.c b/mm/kfence/report.c index 787e87c26926..d548536864b1 100644 --- a/mm/kfence/report.c +++ b/mm/kfence/report.c @@ -7,9 +7,12 @@ #include <linux/stdarg.h> +#include <linux/bug.h> +#include <linux/init.h> #include <linux/kernel.h> #include <linux/lockdep.h> #include <linux/math.h> +#include <linux/panic.h> #include <linux/printk.h> #include <linux/sched/debug.h> #include <linux/seq_file.h> @@ -29,6 +32,26 @@ #define ARCH_FUNC_PREFIX "" #endif +static enum kfence_fault kfence_fault __ro_after_init = KFENCE_FAULT_REPORT; + +static int __init early_kfence_fault(char *arg) +{ + 
if (!arg) + return -EINVAL; + + if (!strcmp(arg, "report")) + kfence_fault = KFENCE_FAULT_REPORT; + else if (!strcmp(arg, "oops")) + kfence_fault = KFENCE_FAULT_OOPS; + else if (!strcmp(arg, "panic")) + kfence_fault = KFENCE_FAULT_PANIC; + else + return -EINVAL; + + return 0; +} +early_param("kfence.fault", early_kfence_fault); + /* Helper function to either print to a seq_file or to console. */ __printf(2, 3) static void seq_con_printf(struct seq_file *seq, const char *fmt, ...) @@ -189,8 +212,9 @@ static const char *get_access_type(bool is_write) return str_write_read(is_write); } -void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs, - const struct kfence_metadata *meta, enum kfence_error_type type) +enum kfence_fault +kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs, + const struct kfence_metadata *meta, enum kfence_error_type type) { unsigned long stack_entries[KFENCE_STACK_DEPTH] = { 0 }; const ptrdiff_t object_index = meta ? meta - kfence_metadata : -1; @@ -206,7 +230,7 @@ void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *r /* Require non-NULL meta, except if KFENCE_ERROR_INVALID. */ if (WARN_ON(type != KFENCE_ERROR_INVALID && !meta)) - return; + return KFENCE_FAULT_NONE; /* * Because we may generate reports in printk-unfriendly parts of the @@ -282,6 +306,25 @@ void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *r /* We encountered a memory safety error, taint the kernel! */ add_taint(TAINT_BAD_PAGE, LOCKDEP_STILL_OK); + + return kfence_fault; +} + +void kfence_handle_fault(enum kfence_fault fault) +{ + switch (fault) { + case KFENCE_FAULT_NONE: + case KFENCE_FAULT_REPORT: + break; + case KFENCE_FAULT_OOPS: + BUG(); + break; + case KFENCE_FAULT_PANIC: + /* Disable KFENCE to avoid recursion if check_on_panic is set. 
*/ + WRITE_ONCE(kfence_enabled, false); + panic("kfence.fault=panic set ...\n"); + break; + } } #ifdef CONFIG_PRINTK diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 1dd3cfca610d..b8452dbdb043 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -46,6 +46,7 @@ enum scan_result { SCAN_PAGE_LRU, SCAN_PAGE_LOCK, SCAN_PAGE_ANON, + SCAN_PAGE_LAZYFREE, SCAN_PAGE_COMPOUND, SCAN_ANY_PROCESS, SCAN_VMA_NULL, @@ -68,7 +69,10 @@ enum scan_result { static struct task_struct *khugepaged_thread __read_mostly; static DEFINE_MUTEX(khugepaged_mutex); -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */ +/* + * default scan 8*HPAGE_PMD_NR ptes, pte_mapped_hugepage, pmd_mapped, + * no_pte_table or vmas every 10 second. + */ static unsigned int khugepaged_pages_to_scan __read_mostly; static unsigned int khugepaged_pages_collapsed; static unsigned int khugepaged_full_scans; @@ -85,6 +89,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait); * * Note that these are only respected if collapse was initiated by khugepaged. 
*/ +#define KHUGEPAGED_MAX_PTES_LIMIT (HPAGE_PMD_NR - 1) unsigned int khugepaged_max_ptes_none __read_mostly; static unsigned int khugepaged_max_ptes_swap __read_mostly; static unsigned int khugepaged_max_ptes_shared __read_mostly; @@ -100,6 +105,9 @@ struct collapse_control { /* Num pages scanned per node */ u32 node_load[MAX_NUMNODES]; + /* Num pages scanned (see khugepaged_pages_to_scan) */ + unsigned int progress; + /* nodemask for allocation fallback */ nodemask_t alloc_nmask; }; @@ -252,7 +260,7 @@ static ssize_t max_ptes_none_store(struct kobject *kobj, unsigned long max_ptes_none; err = kstrtoul(buf, 10, &max_ptes_none); - if (err || max_ptes_none > HPAGE_PMD_NR - 1) + if (err || max_ptes_none > KHUGEPAGED_MAX_PTES_LIMIT) return -EINVAL; khugepaged_max_ptes_none = max_ptes_none; @@ -277,7 +285,7 @@ static ssize_t max_ptes_swap_store(struct kobject *kobj, unsigned long max_ptes_swap; err = kstrtoul(buf, 10, &max_ptes_swap); - if (err || max_ptes_swap > HPAGE_PMD_NR - 1) + if (err || max_ptes_swap > KHUGEPAGED_MAX_PTES_LIMIT) return -EINVAL; khugepaged_max_ptes_swap = max_ptes_swap; @@ -303,7 +311,7 @@ static ssize_t max_ptes_shared_store(struct kobject *kobj, unsigned long max_ptes_shared; err = kstrtoul(buf, 10, &max_ptes_shared); - if (err || max_ptes_shared > HPAGE_PMD_NR - 1) + if (err || max_ptes_shared > KHUGEPAGED_MAX_PTES_LIMIT) return -EINVAL; khugepaged_max_ptes_shared = max_ptes_shared; @@ -375,7 +383,7 @@ int __init khugepaged_init(void) return -ENOMEM; khugepaged_pages_to_scan = HPAGE_PMD_NR * 8; - khugepaged_max_ptes_none = HPAGE_PMD_NR - 1; + khugepaged_max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT; khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8; khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2; @@ -387,14 +395,14 @@ void __init khugepaged_destroy(void) kmem_cache_destroy(mm_slot_cache); } -static inline int hpage_collapse_test_exit(struct mm_struct *mm) +static inline int collapse_test_exit(struct mm_struct *mm) { return atomic_read(&mm->mm_users) == 
0; } -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm) +static inline int collapse_test_exit_or_disable(struct mm_struct *mm) { - return hpage_collapse_test_exit(mm) || + return collapse_test_exit(mm) || mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm); } @@ -428,7 +436,7 @@ void __khugepaged_enter(struct mm_struct *mm) int wakeup; /* __khugepaged_exit() must not run from under us */ - VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm); + VM_BUG_ON_MM(collapse_test_exit(mm), mm); if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm))) return; @@ -482,7 +490,7 @@ void __khugepaged_exit(struct mm_struct *mm) } else if (slot) { /* * This is required to serialize against - * hpage_collapse_test_exit() (which is guaranteed to run + * collapse_test_exit() (which is guaranteed to run * under mmap sem read mode). Stop here (after we return all * pagetables will be destroyed) until khugepaged has finished * working on the pagetables under the mmap_lock. @@ -571,7 +579,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, folio = page_folio(page); VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio); - /* See hpage_collapse_scan_pmd(). */ + /* + * If the vma has the VM_DROPPABLE flag, the collapse will + * preserve the lazyfree property without needing to skip. + */ + if (cc->is_khugepaged && !(vma->vm_flags & VM_DROPPABLE) && + folio_test_lazyfree(folio) && !pte_dirty(pteval)) { + result = SCAN_PAGE_LAZYFREE; + goto out; + } + + /* See collapse_scan_pmd(). 
*/ if (folio_maybe_mapped_shared(folio)) { ++shared; if (cc->is_khugepaged && @@ -822,7 +840,7 @@ static struct collapse_control khugepaged_collapse_control = { .is_khugepaged = true, }; -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc) +static bool collapse_scan_abort(int nid, struct collapse_control *cc) { int i; @@ -857,7 +875,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void) } #ifdef CONFIG_NUMA -static int hpage_collapse_find_target_node(struct collapse_control *cc) +static int collapse_find_target_node(struct collapse_control *cc) { int nid, target_node = 0, max_value = 0; @@ -876,7 +894,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc) return target_node; } #else -static int hpage_collapse_find_target_node(struct collapse_control *cc) +static int collapse_find_target_node(struct collapse_control *cc) { return 0; } @@ -895,7 +913,7 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE; - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) + if (unlikely(collapse_test_exit_or_disable(mm))) return SCAN_ANY_PROCESS; *vmap = vma = find_vma(mm, address); @@ -966,7 +984,7 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm, /* * Bring missing pages in from swap, to complete THP collapse. - * Only done if hpage_collapse_scan_pmd believes it is worthwhile. + * Only done if khugepaged_scan_pmd believes it is worthwhile. * * Called and returns without pte mapped or spinlocks held. * Returns result: if not SCAN_SUCCEED, mmap_lock has been released. @@ -1052,7 +1070,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru { gfp_t gfp = (cc->is_khugepaged ? 
alloc_hugepage_khugepaged_gfpmask() : GFP_TRANSHUGE); - int node = hpage_collapse_find_target_node(cc); + int node = collapse_find_target_node(cc); struct folio *folio; folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask); @@ -1230,9 +1248,9 @@ out_nolock: return result; } -static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked, - struct collapse_control *cc) +static enum scan_result collapse_scan_pmd(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long start_addr, + bool *lock_dropped, struct collapse_control *cc) { pmd_t *pmd; pte_t *pte, *_pte; @@ -1247,19 +1265,24 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm, VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK); result = find_pmd_or_thp_or_none(mm, start_addr, &pmd); - if (result != SCAN_SUCCEED) + if (result != SCAN_SUCCEED) { + cc->progress++; goto out; + } memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl); if (!pte) { + cc->progress++; result = SCAN_NO_PTE_TABLE; goto out; } for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++, addr += PAGE_SIZE) { + cc->progress++; + pte_t pteval = ptep_get(_pte); if (pte_none_or_zero(pteval)) { ++none_or_zero; @@ -1314,6 +1337,16 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm, } folio = page_folio(page); + /* + * If the vma has the VM_DROPPABLE flag, the collapse will + * preserve the lazyfree property without needing to skip. + */ + if (cc->is_khugepaged && !(vma->vm_flags & VM_DROPPABLE) && + folio_test_lazyfree(folio) && !pte_dirty(pteval)) { + result = SCAN_PAGE_LAZYFREE; + goto out_unmap; + } + if (!folio_test_anon(folio)) { result = SCAN_PAGE_ANON; goto out_unmap; @@ -1340,7 +1373,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm, * hit record. 
*/ node = folio_nid(folio); - if (hpage_collapse_scan_abort(node, cc)) { + if (collapse_scan_abort(node, cc)) { result = SCAN_SCAN_ABORT; goto out_unmap; } @@ -1392,7 +1425,7 @@ out_unmap: result = collapse_huge_page(mm, start_addr, referenced, unmapped, cc); /* collapse_huge_page will return with the mmap_lock released */ - *mmap_locked = false; + *lock_dropped = true; } out: trace_mm_khugepaged_scan_pmd(mm, folio, referenced, @@ -1406,7 +1439,7 @@ static void collect_mm_slot(struct mm_slot *slot) lockdep_assert_held(&khugepaged_mm_lock); - if (hpage_collapse_test_exit(mm)) { + if (collapse_test_exit(mm)) { /* free mm_slot */ hash_del(&slot->hash); list_del(&slot->mm_node); @@ -1508,7 +1541,7 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign if (IS_ERR(folio)) return SCAN_PAGE_NULL; - if (folio_order(folio) != HPAGE_PMD_ORDER) { + if (!is_pmd_order(folio_order(folio))) { result = SCAN_PAGE_COMPOUND; goto drop_folio; } @@ -1761,7 +1794,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) continue; - if (hpage_collapse_test_exit(mm)) + if (collapse_test_exit(mm)) continue; if (!file_backed_vma_is_retractable(vma)) @@ -1991,9 +2024,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, * we locked the first folio, then a THP might be there already. * This will be discovered on the first iteration. 
*/ - if (folio_order(folio) == HPAGE_PMD_ORDER && - folio->index == start) { - /* Maybe PMD-mapped */ + if (is_pmd_order(folio_order(folio))) { result = SCAN_PTE_MAPPED_HUGEPAGE; goto out_unlock; } @@ -2279,8 +2310,9 @@ out: return result; } -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, - struct file *file, pgoff_t start, struct collapse_control *cc) +static enum scan_result collapse_scan_file(struct mm_struct *mm, + unsigned long addr, struct file *file, pgoff_t start, + struct collapse_control *cc) { struct folio *folio = NULL; struct address_space *mapping = file->f_mapping; @@ -2320,22 +2352,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned continue; } - if (folio_order(folio) == HPAGE_PMD_ORDER && - folio->index == start) { - /* Maybe PMD-mapped */ + if (is_pmd_order(folio_order(folio))) { result = SCAN_PTE_MAPPED_HUGEPAGE; /* - * For SCAN_PTE_MAPPED_HUGEPAGE, further processing - * by the caller won't touch the page cache, and so - * it's safe to skip LRU and refcount checks before - * returning. + * PMD-sized THP implies that we can only try + * retracting the PTE table. */ folio_put(folio); break; } node = folio_nid(folio); - if (hpage_collapse_scan_abort(node, cc)) { + if (collapse_scan_abort(node, cc)) { result = SCAN_SCAN_ABORT; folio_put(folio); break; @@ -2370,6 +2398,10 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned } } rcu_read_unlock(); + if (result == SCAN_PTE_MAPPED_HUGEPAGE) + cc->progress++; + else + cc->progress += HPAGE_PMD_NR; if (result == SCAN_SUCCEED) { if (cc->is_khugepaged && @@ -2385,8 +2417,69 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned return result; } -static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result *result, - struct collapse_control *cc) +/* + * Try to collapse a single PMD starting at a PMD aligned addr, and return + * the results. 
+ */ +static enum scan_result collapse_single_pmd(unsigned long addr, + struct vm_area_struct *vma, bool *lock_dropped, + struct collapse_control *cc) +{ + struct mm_struct *mm = vma->vm_mm; + bool triggered_wb = false; + enum scan_result result; + struct file *file; + pgoff_t pgoff; + + mmap_assert_locked(mm); + + if (vma_is_anonymous(vma)) { + result = collapse_scan_pmd(mm, vma, addr, lock_dropped, cc); + goto end; + } + + file = get_file(vma->vm_file); + pgoff = linear_page_index(vma, addr); + + mmap_read_unlock(mm); + *lock_dropped = true; +retry: + result = collapse_scan_file(mm, addr, file, pgoff, cc); + + /* + * For MADV_COLLAPSE, when encountering dirty pages, try to writeback, + * then retry the collapse one time. + */ + if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK && + !triggered_wb && mapping_can_writeback(file->f_mapping)) { + const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT; + const loff_t lend = lstart + HPAGE_PMD_SIZE - 1; + + filemap_write_and_wait_range(file->f_mapping, lstart, lend); + triggered_wb = true; + goto retry; + } + fput(file); + + if (result == SCAN_PTE_MAPPED_HUGEPAGE) { + mmap_read_lock(mm); + if (collapse_test_exit_or_disable(mm)) + result = SCAN_ANY_PROCESS; + else + result = try_collapse_pte_mapped_thp(mm, addr, + !cc->is_khugepaged); + if (result == SCAN_PMD_MAPPED) + result = SCAN_SUCCEED; + mmap_read_unlock(mm); + } +end: + if (cc->is_khugepaged && result == SCAN_SUCCEED) + ++khugepaged_pages_collapsed; + return result; +} + +static void collapse_scan_mm_slot(unsigned int progress_max, + enum scan_result *result, struct collapse_control *cc) __releases(&khugepaged_mm_lock) __acquires(&khugepaged_mm_lock) { @@ -2394,9 +2487,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result struct mm_slot *slot; struct mm_struct *mm; struct vm_area_struct *vma; - int progress = 0; + unsigned int progress_prev = cc->progress; - VM_BUG_ON(!pages); lockdep_assert_held(&khugepaged_mm_lock); 
 	*result = SCAN_FAIL;
 
@@ -2419,8 +2511,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
 
-	progress++;
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	cc->progress++;
+	if (unlikely(collapse_test_exit_or_disable(mm)))
 		goto breakouterloop;
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2428,18 +2520,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
-			progress++;
+		if (unlikely(collapse_test_exit_or_disable(mm))) {
+			cc->progress++;
 			break;
 		}
 		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
-			progress++;
+			cc->progress++;
 			continue;
 		}
 		hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
 		hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
 		if (khugepaged_scan.address > hend) {
-			progress++;
+			cc->progress++;
 			continue;
 		}
 		if (khugepaged_scan.address < hstart)
@@ -2447,47 +2539,21 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
 
 		while (khugepaged_scan.address < hend) {
-			bool mmap_locked = true;
+			bool lock_dropped = false;
 
 			cond_resched();
-			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+			if (unlikely(collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
-			VM_BUG_ON(khugepaged_scan.address < hstart ||
+			VM_WARN_ON_ONCE(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (!vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
-
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (hpage_collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = try_collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
 
+			*result = collapse_single_pmd(khugepaged_scan.address,
+						      vma, &lock_dropped, cc);
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			progress += HPAGE_PMD_NR;
-			if (!mmap_locked)
+			if (lock_dropped)
 				/*
 				 * We released mmap_lock so break loop. Note
 				 * that we drop mmap_lock before all hugepage
@@ -2496,7 +2562,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 				 * correct result back to caller.
 				 */
 				goto breakouterloop_mmap_lock;
-			if (progress >= pages)
+			if (cc->progress >= progress_max)
 				goto breakouterloop;
 		}
 	}
@@ -2508,9 +2574,9 @@ breakouterloop_mmap_lock:
 	VM_BUG_ON(khugepaged_scan.mm_slot != slot);
 	/*
 	 * Release the current mm_slot if this mm is about to die, or
-	 * if we scanned all vmas of this mm.
+	 * if we scanned all vmas of this mm, or THP got disabled.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (collapse_test_exit_or_disable(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2527,7 +2593,8 @@ breakouterloop_mmap_lock:
 		collect_mm_slot(slot);
 	}
 
-	return progress;
+	trace_mm_khugepaged_scan(mm, cc->progress - progress_prev,
+				 khugepaged_scan.mm_slot == NULL);
 }
 
 static int khugepaged_has_work(void)
@@ -2543,13 +2610,14 @@ static int khugepaged_wait_event(void)
 
 static void khugepaged_do_scan(struct collapse_control *cc)
 {
-	unsigned int progress = 0, pass_through_head = 0;
-	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
+	const unsigned int progress_max = READ_ONCE(khugepaged_pages_to_scan);
+	unsigned int pass_through_head = 0;
 	bool wait = true;
 	enum scan_result result = SCAN_SUCCEED;
 
 	lru_add_drain_all();
+	cc->progress = 0;
 
 	while (true) {
 		cond_resched();
@@ -2561,13 +2629,12 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 			pass_through_head++;
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
-			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &result, cc);
+			collapse_scan_mm_slot(progress_max, &result, cc);
 		else
-			progress = pages;
+			cc->progress = progress_max;
 		spin_unlock(&khugepaged_mm_lock);
 
-		if (progress >= pages)
+		if (cc->progress >= progress_max)
 			break;
 
 		if (result == SCAN_ALLOC_HUGE_PAGE_FAIL) {
@@ -2630,7 +2697,7 @@ static int khugepaged(void *none)
 	return 0;
 }
 
-static void set_recommended_min_free_kbytes(void)
+void set_recommended_min_free_kbytes(void)
 {
 	struct zone *zone;
 	int nr_zones = 0;
@@ -2671,8 +2738,8 @@ static void set_recommended_min_free_kbytes(void)
 
 	if (recommended_min > min_free_kbytes) {
 		if (user_min_free_kbytes >= 0)
-			pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
-				min_free_kbytes, recommended_min);
+			pr_info_ratelimited("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
+					    min_free_kbytes, recommended_min);
 
 		min_free_kbytes = recommended_min;
 	}
@@ -2761,7 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	unsigned long hstart, hend, addr;
 	enum scan_result last_fail = SCAN_FAIL;
 	int thps = 0;
-	bool mmap_locked = true;
+	bool mmap_unlocked = false;
 
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
@@ -2773,6 +2840,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	if (!cc)
 		return -ENOMEM;
 	cc->is_khugepaged = false;
+	cc->progress = 0;
 
 	mmgrab(mm);
 	lru_add_drain_all();
@@ -2782,13 +2850,12 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;
-		bool triggered_wb = false;
 
-retry:
-		if (!mmap_locked) {
+		if (mmap_unlocked) {
 			cond_resched();
 			mmap_read_lock(mm);
-			mmap_locked = true;
+			mmap_unlocked = false;
+			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
 							 cc);
 			if (result != SCAN_SUCCEED) {
@@ -2798,47 +2865,14 @@ retry:
 
 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}
-		mmap_assert_locked(mm);
-		if (!vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			*lock_dropped = true;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
-
-			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
-			    mapping_can_writeback(file->f_mapping)) {
-				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
-				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
-
-				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
-				triggered_wb = true;
-				fput(file);
-				goto retry;
-			}
-			fput(file);
-		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
-		}
-		if (!mmap_locked)
-			*lock_dropped = true;
+		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
-		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			mmap_read_lock(mm);
-			result = try_collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_NO_PTE_TABLE:
 		case SCAN_PTE_NON_PRESENT:
@@ -2861,8 +2895,10 @@ handle_result:
 
 out_maybelock:
 	/* Caller expects us to hold mmap_lock on return */
-	if (!mmap_locked)
+	if (mmap_unlocked) {
+		*lock_dropped = true;
 		mmap_read_lock(mm);
+	}
 out_nolock:
 	mmap_assert_locked(mm);
 	mmdrop(mm);
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index d79acf5c5100..fa8201e23222 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1505,12 +1505,10 @@ static int scan_should_stop(void)
 	 * This function may be called from either process or kthread context,
 	 * hence the need to check for both stop conditions.
 	 */
-	if (current->mm)
-		return signal_pending(current);
-	else
+	if (current->flags & PF_KTHREAD)
 		return kthread_should_stop();
 
-	return 0;
+	return signal_pending(current);
 }
 
 /*
@@ -735,21 +735,24 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr,
 	return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
 }
 
-static bool ksm_compatible(const struct file *file, vm_flags_t vm_flags)
+static bool ksm_compatible(const struct file *file, vma_flags_t vma_flags)
 {
-	if (vm_flags & (VM_SHARED | VM_MAYSHARE | VM_SPECIAL |
-			VM_HUGETLB | VM_DROPPABLE))
-		return false;		/* just ignore the advice */
-
+	/* Just ignore the advice. */
+	if (vma_flags_test_any(&vma_flags, VMA_SHARED_BIT, VMA_MAYSHARE_BIT,
+			       VMA_HUGETLB_BIT))
+		return false;
+	if (vma_flags_test_single_mask(&vma_flags, VMA_DROPPABLE))
+		return false;
+	if (vma_flags_test_any_mask(&vma_flags, VMA_SPECIAL_FLAGS))
+		return false;
 	if (file_is_dax(file))
 		return false;
-
 #ifdef VM_SAO
-	if (vm_flags & VM_SAO)
+	if (vma_flags_test(&vma_flags, VMA_SAO_BIT))
 		return false;
 #endif
 #ifdef VM_SPARC_ADI
-	if (vm_flags & VM_SPARC_ADI)
+	if (vma_flags_test(&vma_flags, VMA_SPARC_ADI_BIT))
 		return false;
 #endif
 
@@ -758,7 +761,7 @@ static bool ksm_compatible(const struct file *file, vm_flags_t vm_flags)
 
 static bool vma_ksm_compatible(struct vm_area_struct *vma)
 {
-	return ksm_compatible(vma->vm_file, vma->vm_flags);
+	return ksm_compatible(vma->vm_file, vma->flags);
 }
 
 static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
@@ -2825,17 +2828,17 @@ static int ksm_scan_thread(void *nothing)
 	return 0;
 }
 
-static bool __ksm_should_add_vma(const struct file *file, vm_flags_t vm_flags)
+static bool __ksm_should_add_vma(const struct file *file, vma_flags_t vma_flags)
 {
-	if (vm_flags & VM_MERGEABLE)
+	if (vma_flags_test(&vma_flags, VMA_MERGEABLE_BIT))
 		return false;
 
-	return ksm_compatible(file, vm_flags);
+	return ksm_compatible(file, vma_flags);
 }
 
 static void __ksm_add_vma(struct vm_area_struct *vma)
 {
-	if (__ksm_should_add_vma(vma->vm_file, vma->vm_flags))
+	if (__ksm_should_add_vma(vma->vm_file, vma->flags))
 		vm_flags_set(vma, VM_MERGEABLE);
 }
 
@@ -2860,16 +2863,16 @@ static int __ksm_del_vma(struct vm_area_struct *vma)
  *
  * @mm: Proposed VMA's mm_struct
  * @file: Proposed VMA's file-backed mapping, if any.
- * @vm_flags: Proposed VMA"s flags.
+ * @vma_flags: Proposed VMA"s flags.
  *
- * Returns: @vm_flags possibly updated to mark mergeable.
+ * Returns: @vma_flags possibly updated to mark mergeable.
  */
-vm_flags_t ksm_vma_flags(struct mm_struct *mm, const struct file *file,
-			 vm_flags_t vm_flags)
+vma_flags_t ksm_vma_flags(struct mm_struct *mm, const struct file *file,
+			  vma_flags_t vma_flags)
 {
 	if (mm_flags_test(MMF_VM_MERGE_ANY, mm) &&
-	    __ksm_should_add_vma(file, vm_flags)) {
-		vm_flags |= VM_MERGEABLE;
+	    __ksm_should_add_vma(file, vma_flags)) {
+		vma_flags_set(&vma_flags, VMA_MERGEABLE_BIT);
 		/*
 		 * Generally, the flags here always include MMF_VM_MERGEABLE.
 		 * However, in rare cases, this flag may be cleared by ksmd who
@@ -2879,7 +2882,7 @@ vm_flags_t ksm_vma_flags(struct mm_struct *mm, const struct file *file,
 			__ksm_enter(mm);
 	}
 
-	return vm_flags;
+	return vma_flags;
 }
 
 static void ksm_add_vmas(struct mm_struct *mm)
@@ -3168,6 +3171,8 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		return;
 again:
 	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
+		/* Ignore the stable/unstable/sqnr flags */
+		const unsigned long addr = rmap_item->address & PAGE_MASK;
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
@@ -3180,16 +3185,13 @@ again:
 			}
 			anon_vma_lock_read(anon_vma);
 		}
+
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
-			unsigned long addr;
 
 			cond_resched();
 			vma = vmac->vma;
 
-			/* Ignore the stable/unstable/sqnr flags */
-			addr = rmap_item->address & PAGE_MASK;
-
 			if (addr < vma->vm_start || addr >= vma->vm_end)
 				continue;
 			/*
diff --git a/mm/madvise.c b/mm/madvise.c
index dbb69400786d..69708e953cf5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -151,13 +151,15 @@ static int madvise_update_vma(vm_flags_t new_flags,
 		struct madvise_behavior *madv_behavior)
 {
 	struct vm_area_struct *vma = madv_behavior->vma;
+	vma_flags_t new_vma_flags = legacy_to_vma_flags(new_flags);
 	struct madvise_behavior_range *range = &madv_behavior->range;
 	struct anon_vma_name *anon_name = madv_behavior->anon_name;
 	bool set_new_anon_name = madv_behavior->behavior == __MADV_SET_ANON_VMA_NAME;
 	VMA_ITERATOR(vmi, madv_behavior->mm, range->start);
 
-	if (new_flags == vma->vm_flags && (!set_new_anon_name ||
-			anon_vma_name_eq(anon_vma_name(vma), anon_name)))
+	if (vma_flags_same_mask(&vma->flags, new_vma_flags) &&
+	    (!set_new_anon_name ||
+	     anon_vma_name_eq(anon_vma_name(vma), anon_name)))
 		return 0;
 
 	if (set_new_anon_name)
@@ -165,7 +167,7 @@ static int madvise_update_vma(vm_flags_t new_flags,
 			range->start, range->end, anon_name);
 	else
 		vma = vma_modify_flags(&vmi, madv_behavior->prev, vma,
-			range->start, range->end, &new_flags);
+			range->start, range->end, &new_vma_flags);
 
 	if (IS_ERR(vma))
 		return PTR_ERR(vma);
@@ -174,7 +176,7 @@ static int madvise_update_vma(vm_flags_t new_flags,
 
 	/* vm_flags is protected by the mmap_lock held in write mode. */
 	vma_start_write(vma);
-	vm_flags_reset(vma, new_flags);
+	vma->flags = new_vma_flags;
 	if (set_new_anon_name)
 		return replace_anon_vma_name(vma, anon_name);
 
@@ -799,9 +801,10 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
 {
 	struct mm_struct *mm = madv_behavior->mm;
 	struct vm_area_struct *vma = madv_behavior->vma;
-	unsigned long start_addr = madv_behavior->range.start;
-	unsigned long end_addr = madv_behavior->range.end;
-	struct mmu_notifier_range range;
+	struct mmu_notifier_range range = {
+		.start = madv_behavior->range.start,
+		.end = madv_behavior->range.end,
+	};
 	struct mmu_gather *tlb = madv_behavior->tlb;
 	struct mm_walk_ops walk_ops = {
 		.pmd_entry		= madvise_free_pte_range,
@@ -811,12 +814,6 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
 	if (!vma_is_anonymous(vma))
 		return -EINVAL;
 
-	range.start = max(vma->vm_start, start_addr);
-	if (range.start >= vma->vm_end)
-		return -EINVAL;
-	range.end = min(vma->vm_end, end_addr);
-	if (range.end <= vma->vm_start)
-		return -EINVAL;
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
 				range.start, range.end);
 
@@ -837,7 +834,7 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
  * data it wants to keep.  Be sure to free swap resources too.  The
- * zap_page_range_single call sets things up for shrink_active_list to actually
+ * zap_vma_range call sets things up for shrink_active_list to actually
  * free these pages later if no one else has touched them in the meantime,
  * although we could add these pages to a global reuse list for
  * shrink_active_list to pick up before reclaiming other pages.
@@ -858,12 +855,10 @@ static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior)
 	struct madvise_behavior_range *range = &madv_behavior->range;
 	struct zap_details details = {
 		.reclaim_pt = true,
-		.even_cows = true,
 	};
 
-	zap_page_range_single_batched(
-			madv_behavior->tlb, madv_behavior->vma, range->start,
-			range->end - range->start, &details);
+	zap_vma_range_batched(madv_behavior->tlb, madv_behavior->vma,
+			      range->start, range->end - range->start, &details);
 	return 0;
 }
 
@@ -1198,8 +1193,7 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		 * OK some of the range have non-guard pages mapped, zap
 		 * them. This leaves existing guard pages in place.
 		 */
-		zap_page_range_single(vma, range->start,
-				      range->end - range->start, NULL);
+		zap_vma_range(vma, range->start, range->end - range->start);
 	}
 
 	/*
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 597af8a80163..437cd25784fe 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -635,11 +635,8 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 	 * have an ID allocated to it anymore, charge the closest online
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
-	swap_memcg = mem_cgroup_private_id_get_online(memcg);
 	nr_entries = folio_nr_pages(folio);
-	/* Get references for the tail pages, too */
-	if (nr_entries > 1)
-		mem_cgroup_private_id_get_many(swap_memcg, nr_entries - 1);
+	swap_memcg = mem_cgroup_private_id_get_online(memcg, nr_entries);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
 	swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), entry);
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index eb3c3c105657..1b969294ea6a 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -27,8 +27,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
 unsigned long memcg_events(struct mem_cgroup *memcg, int event);
 int memory_stat_show(struct seq_file *m, void *v);
 
-void mem_cgroup_private_id_get_many(struct mem_cgroup *memcg, unsigned int n);
-struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg);
+struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg,
+						    unsigned int n);
 
 /* Cgroup v1-specific declarations */
 #ifdef CONFIG_MEMCG_V1
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..051b82ebf371 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -34,7 +34,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
 #include <linux/vm_event_item.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
@@ -317,6 +317,7 @@ static const unsigned int memcg_node_stat_items[] = {
 	NR_SHMEM_THPS,
 	NR_FILE_THPS,
 	NR_ANON_THPS,
+	NR_VMALLOC,
 	NR_KERNEL_STACK_KB,
 	NR_PAGETABLE,
 	NR_SECONDARY_PAGETABLE,
@@ -330,6 +331,19 @@ static const unsigned int memcg_node_stat_items[] = {
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
 	PGDEMOTE_PROACTIVE,
+	PGSTEAL_KSWAPD,
+	PGSTEAL_DIRECT,
+	PGSTEAL_KHUGEPAGED,
+	PGSTEAL_PROACTIVE,
+	PGSTEAL_ANON,
+	PGSTEAL_FILE,
+	PGSCAN_KSWAPD,
+	PGSCAN_DIRECT,
+	PGSCAN_KHUGEPAGED,
+	PGSCAN_PROACTIVE,
+	PGSCAN_ANON,
+	PGSCAN_FILE,
+	PGREFILL,
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
@@ -339,10 +353,10 @@ static const unsigned int memcg_stat_items[] = {
 	MEMCG_SWAP,
 	MEMCG_SOCK,
 	MEMCG_PERCPU_B,
-	MEMCG_VMALLOC,
 	MEMCG_KMEM,
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
+	MEMCG_ZSWAP_INCOMP,
 };
 
 #define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
@@ -443,17 +457,8 @@ static const unsigned int memcg_vm_event_stat[] = {
 #endif
 	PSWPIN,
 	PSWPOUT,
-	PGSCAN_KSWAPD,
-	PGSCAN_DIRECT,
-	PGSCAN_KHUGEPAGED,
-	PGSCAN_PROACTIVE,
-	PGSTEAL_KSWAPD,
-	PGSTEAL_DIRECT,
-	PGSTEAL_KHUGEPAGED,
-	PGSTEAL_PROACTIVE,
 	PGFAULT,
 	PGMAJFAULT,
-	PGREFILL,
 	PGACTIVATE,
 	PGDEACTIVATE,
 	PGLAZYFREE,
@@ -1359,11 +1364,12 @@ static const struct memory_stat memory_stats[] = {
 	{ "sec_pagetables", NR_SECONDARY_PAGETABLE },
 	{ "percpu", MEMCG_PERCPU_B },
 	{ "sock", MEMCG_SOCK },
-	{ "vmalloc", MEMCG_VMALLOC },
+	{ "vmalloc", NR_VMALLOC },
 	{ "shmem", NR_SHMEM },
 #ifdef CONFIG_ZSWAP
 	{ "zswap", MEMCG_ZSWAP_B },
 	{ "zswapped", MEMCG_ZSWAPPED },
+	{ "zswap_incomp", MEMCG_ZSWAP_INCOMP },
 #endif
 	{ "file_mapped", NR_FILE_MAPPED },
 	{ "file_dirty", NR_FILE_DIRTY },
@@ -1400,6 +1406,15 @@ static const struct memory_stat memory_stats[] = {
 	{ "pgdemote_direct", PGDEMOTE_DIRECT },
 	{ "pgdemote_khugepaged", PGDEMOTE_KHUGEPAGED },
 	{ "pgdemote_proactive", PGDEMOTE_PROACTIVE },
+	{ "pgsteal_kswapd", PGSTEAL_KSWAPD },
+	{ "pgsteal_direct", PGSTEAL_DIRECT },
+	{ "pgsteal_khugepaged", PGSTEAL_KHUGEPAGED },
+	{ "pgsteal_proactive", PGSTEAL_PROACTIVE },
+	{ "pgscan_kswapd", PGSCAN_KSWAPD },
+	{ "pgscan_direct", PGSCAN_DIRECT },
+	{ "pgscan_khugepaged", PGSCAN_KHUGEPAGED },
+	{ "pgscan_proactive", PGSCAN_PROACTIVE },
+	{ "pgrefill", PGREFILL },
 #ifdef CONFIG_NUMA_BALANCING
 	{ "pgpromote_success", PGPROMOTE_SUCCESS },
 #endif
@@ -1443,6 +1458,15 @@ static int memcg_page_state_output_unit(int item)
 	case PGDEMOTE_DIRECT:
 	case PGDEMOTE_KHUGEPAGED:
 	case PGDEMOTE_PROACTIVE:
+	case PGSTEAL_KSWAPD:
+	case PGSTEAL_DIRECT:
+	case PGSTEAL_KHUGEPAGED:
+	case PGSTEAL_PROACTIVE:
+	case PGSCAN_KSWAPD:
+	case PGSCAN_DIRECT:
+	case PGSCAN_KHUGEPAGED:
+	case PGSCAN_PROACTIVE:
+	case PGREFILL:
 #ifdef CONFIG_NUMA_BALANCING
 	case PGPROMOTE_SUCCESS:
 #endif
@@ -1514,15 +1538,15 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	/* Accumulated memory events */
 	seq_buf_printf(s, "pgscan %lu\n",
-		       memcg_events(memcg, PGSCAN_KSWAPD) +
-		       memcg_events(memcg, PGSCAN_DIRECT) +
-		       memcg_events(memcg, PGSCAN_PROACTIVE) +
-		       memcg_events(memcg, PGSCAN_KHUGEPAGED));
+		       memcg_page_state(memcg, PGSCAN_KSWAPD) +
+		       memcg_page_state(memcg, PGSCAN_DIRECT) +
+		       memcg_page_state(memcg, PGSCAN_PROACTIVE) +
+		       memcg_page_state(memcg, PGSCAN_KHUGEPAGED));
 	seq_buf_printf(s, "pgsteal %lu\n",
-		       memcg_events(memcg, PGSTEAL_KSWAPD) +
-		       memcg_events(memcg, PGSTEAL_DIRECT) +
-		       memcg_events(memcg, PGSTEAL_PROACTIVE) +
-		       memcg_events(memcg, PGSTEAL_KHUGEPAGED));
+		       memcg_page_state(memcg, PGSTEAL_KSWAPD) +
+		       memcg_page_state(memcg, PGSTEAL_DIRECT) +
+		       memcg_page_state(memcg, PGSTEAL_PROACTIVE) +
+		       memcg_page_state(memcg, PGSTEAL_KHUGEPAGED));
 
 	for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) {
 #ifdef CONFIG_MEMCG_V1
@@ -2361,7 +2385,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	unsigned long nr_reclaimed;
 	bool passed_oom = false;
-	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
+	unsigned int reclaim_options;
 	bool drained = false;
 	bool raised_max_event = false;
 	unsigned long pflags;
@@ -2375,6 +2399,7 @@ retry:
 		/* Avoid the refill and flush of the older stock */
 		batch = nr_pages;
 
+	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
 	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 		if (page_counter_try_charge(&memcg->memory, batch, &counter))
@@ -2926,12 +2951,30 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	obj_cgroup_put(objcg);
 }
 
+static struct obj_stock_pcp *trylock_stock(void)
+{
+	if (local_trylock(&obj_stock.lock))
+		return this_cpu_ptr(&obj_stock);
+
+	return NULL;
+}
+
+static void unlock_stock(struct obj_stock_pcp *stock)
+{
+	if (stock)
+		local_unlock(&obj_stock.lock);
+}
+
+/* Call after __refill_obj_stock() to ensure stock->cached_objg == objcg */
 static void __account_obj_stock(struct obj_cgroup *objcg,
 				struct obj_stock_pcp *stock, int nr,
 				struct pglist_data *pgdat, enum node_stat_item idx)
 {
 	int *bytes;
 
+	if (!stock || READ_ONCE(stock->cached_objcg) != objcg)
+		goto direct;
+
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
 	 * accumulating over a page of vmstat data or when pgdat changes.
@@ -2971,29 +3014,35 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 			nr = 0;
 		}
 	}
+direct:
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 }
 
-static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
-			      struct pglist_data *pgdat, enum node_stat_item idx)
+static bool __consume_obj_stock(struct obj_cgroup *objcg,
+				struct obj_stock_pcp *stock,
+				unsigned int nr_bytes)
+{
+	if (objcg == READ_ONCE(stock->cached_objcg) &&
+	    stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		return true;
+	}
+
+	return false;
+}
+
+static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct obj_stock_pcp *stock;
 	bool ret = false;
 
-	if (!local_trylock(&obj_stock.lock))
+	stock = trylock_stock();
+	if (!stock)
 		return ret;
 
-	stock = this_cpu_ptr(&obj_stock);
-	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
-		stock->nr_bytes -= nr_bytes;
-		ret = true;
-
-		if (pgdat)
-			__account_obj_stock(objcg, stock, nr_bytes, pgdat, idx);
-	}
-
-	local_unlock(&obj_stock.lock);
+	ret = __consume_obj_stock(objcg, stock, nr_bytes);
+	unlock_stock(stock);
 
 	return ret;
 }
@@ -3077,23 +3126,20 @@ static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
 	return flush;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
-		bool allow_uncharge, int nr_acct, struct pglist_data *pgdat,
-		enum node_stat_item idx)
+static void __refill_obj_stock(struct obj_cgroup *objcg,
+			       struct obj_stock_pcp *stock,
+			       unsigned int nr_bytes,
+			       bool allow_uncharge)
 {
-	struct obj_stock_pcp *stock;
 	unsigned int nr_pages = 0;
 
-	if (!local_trylock(&obj_stock.lock)) {
-		if (pgdat)
-			mod_objcg_mlstate(objcg, pgdat, idx, nr_acct);
+	if (!stock) {
 		nr_pages = nr_bytes >> PAGE_SHIFT;
 		nr_bytes = nr_bytes & (PAGE_SIZE - 1);
 		atomic_add(nr_bytes, &objcg->nr_charged_bytes);
 		goto out;
 	}
 
-	stock = this_cpu_ptr(&obj_stock);
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3105,27 +3151,45 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	}
 	stock->nr_bytes += nr_bytes;
 
-	if (pgdat)
-		__account_obj_stock(objcg, stock, nr_acct, pgdat, idx);
-
 	if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) {
 		nr_pages = stock->nr_bytes >> PAGE_SHIFT;
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_unlock(&obj_stock.lock);
 out:
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
 }
 
-static int obj_cgroup_charge_account(struct obj_cgroup *objcg, gfp_t gfp, size_t size,
-				     struct pglist_data *pgdat, enum node_stat_item idx)
+static void refill_obj_stock(struct obj_cgroup *objcg,
+			     unsigned int nr_bytes,
+			     bool allow_uncharge)
+{
+	struct obj_stock_pcp *stock = trylock_stock();
+	__refill_obj_stock(objcg, stock, nr_bytes, allow_uncharge);
+	unlock_stock(stock);
+}
+
+static int __obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp,
+			       size_t size, size_t *remainder)
+{
+	size_t charge_size;
+	int ret;
+
+	charge_size = PAGE_ALIGN(size);
+	ret = obj_cgroup_charge_pages(objcg, gfp, charge_size >> PAGE_SHIFT);
+	if (!ret)
+		*remainder = charge_size - size;
+
+	return ret;
+}
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
-	unsigned int nr_pages, nr_bytes;
+	size_t remainder;
 	int ret;
 
-	if (likely(consume_obj_stock(objcg, size, pgdat, idx)))
+	if (likely(consume_obj_stock(objcg, size)))
 		return 0;
 
 	/*
@@ -3151,28 +3215,16 @@ static int obj_cgroup_charge_account(struct obj_cgroup *objcg, gfp_t gfp, size_t
 	 * bytes is (sizeof(object) + PAGE_SIZE - 2) if there is no data
 	 * race.
 	 */
-	nr_pages = size >> PAGE_SHIFT;
-	nr_bytes = size & (PAGE_SIZE - 1);
-
-	if (nr_bytes)
-		nr_pages += 1;
-
-	ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages);
-	if (!ret && (nr_bytes || pgdat))
-		refill_obj_stock(objcg, nr_bytes ? PAGE_SIZE - nr_bytes : 0,
-				 false, size, pgdat, idx);
+	ret = __obj_cgroup_charge(objcg, gfp, size, &remainder);
+	if (!ret && remainder)
+		refill_obj_stock(objcg, remainder, false);
 
 	return ret;
 }
 
-int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
-{
-	return obj_cgroup_charge_account(objcg, gfp, size, NULL, 0);
-}
-
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 {
-	refill_obj_stock(objcg, size, true, 0, NULL, 0);
+	refill_obj_stock(objcg, size, true);
 }
 
 static inline size_t obj_full_size(struct kmem_cache *s)
@@ -3187,6 +3239,7 @@ static inline size_t obj_full_size(struct kmem_cache *s)
 bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 				  gfp_t flags, size_t size, void **p)
 {
+	size_t obj_size = obj_full_size(s);
 	struct obj_cgroup *objcg;
 	struct slab *slab;
 	unsigned long off;
@@ -3227,6 +3280,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	for (i = 0; i < size; i++) {
 		unsigned long obj_exts;
 		struct slabobj_ext *obj_ext;
+		struct obj_stock_pcp *stock;
 
 		slab = virt_to_slab(p[i]);
 
@@ -3246,9 +3300,20 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 		 * TODO: we could batch this until slab_pgdat(slab) changes
 		 * between iterations, with a more complicated undo
 		 */
-		if (obj_cgroup_charge_account(objcg, flags, obj_full_size(s),
-					slab_pgdat(slab), cache_vmstat_idx(s)))
-			return false;
+		stock = trylock_stock();
+		if (!stock || !__consume_obj_stock(objcg, stock, obj_size)) {
+			size_t remainder;
+
+			unlock_stock(stock);
+			if (__obj_cgroup_charge(objcg, flags, obj_size, &remainder))
+				return false;
+			stock = trylock_stock();
+			if (remainder)
+				__refill_obj_stock(objcg, stock, remainder, false);
+		}
+		__account_obj_stock(objcg, stock, obj_size,
+				    slab_pgdat(slab), cache_vmstat_idx(s));
+		unlock_stock(stock);
 
 		obj_exts = slab_obj_exts(slab);
 		get_slab_obj_exts(obj_exts);
@@ -3270,6 +3335,7 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 	for (int i = 0; i < objects; i++) {
 		struct obj_cgroup *objcg;
 		struct slabobj_ext *obj_ext;
+		struct obj_stock_pcp *stock;
 		unsigned int off;
 
 		off = obj_to_index(s, slab, p[i]);
@@ -3279,8 +3345,13 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			continue;
 
 		obj_ext->objcg = NULL;
-		refill_obj_stock(objcg, obj_size, true, -obj_size,
-				 slab_pgdat(slab), cache_vmstat_idx(s));
+
+		stock = trylock_stock();
+		__refill_obj_stock(objcg, stock, obj_size, true);
+		__account_obj_stock(objcg, stock, -obj_size,
+				    slab_pgdat(slab), cache_vmstat_idx(s));
+		unlock_stock(stock);
+
 		obj_cgroup_put(objcg);
 	}
 }
@@ -3612,13 +3683,7 @@ static void mem_cgroup_private_id_remove(struct mem_cgroup *memcg)
 	}
 }
 
-void __maybe_unused mem_cgroup_private_id_get_many(struct mem_cgroup *memcg,
-						    unsigned int n)
-{
-	refcount_add(n, &memcg->id.ref);
-}
-
-static void mem_cgroup_private_id_put_many(struct mem_cgroup *memcg, unsigned int n)
+static inline void mem_cgroup_private_id_put(struct mem_cgroup *memcg, unsigned int n)
 {
 	if (refcount_sub_and_test(n, &memcg->id.ref)) {
 		mem_cgroup_private_id_remove(memcg);
@@ -3628,14 +3693,9 @@ static void mem_cgroup_private_id_put_many(struct mem_cgroup *memcg, unsigned in
 	}
 }
 
-static inline void mem_cgroup_private_id_put(struct mem_cgroup *memcg)
+struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg, unsigned int n)
 {
-	mem_cgroup_private_id_put_many(memcg, 1);
-}
-
-struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg)
-{
-	while (!refcount_inc_not_zero(&memcg->id.ref)) {
+	while (!refcount_add_not_zero(n, &memcg->id.ref)) {
 		/*
 		 * The root cgroup cannot be destroyed, so it's refcount must
 		 * always be >= 1.
@@ -3935,7 +3995,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)

 	drain_all_stock(memcg);

-	mem_cgroup_private_id_put(memcg);
+	mem_cgroup_private_id_put(memcg, 1);
 }

 static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -5225,19 +5285,15 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 		return 0;
 	}

-	memcg = mem_cgroup_private_id_get_online(memcg);
+	memcg = mem_cgroup_private_id_get_online(memcg, nr_pages);

 	if (!mem_cgroup_is_root(memcg) &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
 		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
-		mem_cgroup_private_id_put(memcg);
+		mem_cgroup_private_id_put(memcg, nr_pages);
 		return -ENOMEM;
 	}
-
-	/* Get references for the tail pages, too */
-	if (nr_pages > 1)
-		mem_cgroup_private_id_get_many(memcg, nr_pages - 1);
 	mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);

 	swap_cgroup_record(folio, mem_cgroup_private_id(memcg), entry);
@@ -5266,7 +5322,7 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 				page_counter_uncharge(&memcg->swap, nr_pages);
 		}
 		mod_memcg_state(memcg, MEMCG_SWAP, -nr_pages);
-		mem_cgroup_private_id_put_many(memcg, nr_pages);
+		mem_cgroup_private_id_put(memcg, nr_pages);
 	}
 	rcu_read_unlock();
 }
@@ -5513,6 +5569,8 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
 	memcg = obj_cgroup_memcg(objcg);
 	mod_memcg_state(memcg, MEMCG_ZSWAP_B, size);
 	mod_memcg_state(memcg, MEMCG_ZSWAPPED, 1);
+	if (size == PAGE_SIZE)
+		mod_memcg_state(memcg, MEMCG_ZSWAP_INCOMP, 1);
 	rcu_read_unlock();
 }

@@ -5536,6 +5594,8 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 	memcg = obj_cgroup_memcg(objcg);
 	mod_memcg_state(memcg, MEMCG_ZSWAP_B, -size);
 	mod_memcg_state(memcg, MEMCG_ZSWAPPED, -1);
+	if (size == PAGE_SIZE)
+		mod_memcg_state(memcg, MEMCG_ZSWAP_INCOMP, -1);
 	rcu_read_unlock();
 }

diff --git a/mm/memfd.c b/mm/memfd.c
index 919c2a53eb96..fb425f4e315f 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -227,7 +227,7 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 		     F_SEAL_WRITE | \
 		     F_SEAL_FUTURE_WRITE)

-static int memfd_add_seals(struct file *file, unsigned int seals)
+int memfd_add_seals(struct file *file, unsigned int seals)
 {
 	struct inode *inode = file_inode(file);
 	unsigned int *file_seals;
@@ -309,7 +309,7 @@ unlock:
 	return error;
 }

-static int memfd_get_seals(struct file *file)
+int memfd_get_seals(struct file *file)
 {
 	unsigned int *seals = memfd_file_seals_ptr(file);

diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c
index b8edb9f981d7..bc7f4f045edf 100644
--- a/mm/memfd_luo.c
+++ b/mm/memfd_luo.c
@@ -79,6 +79,8 @@
 #include <linux/shmem_fs.h>
 #include <linux/vmalloc.h>
 #include <linux/memfd.h>
+#include <uapi/linux/memfd.h>
+
 #include "internal.h"

 static int memfd_luo_preserve_folios(struct file *file,
@@ -259,7 +261,7 @@ static int memfd_luo_preserve(struct liveupdate_file_op_args *args)
 	struct memfd_luo_folio_ser *folios_ser;
 	struct memfd_luo_ser *ser;
 	u64 nr_folios;
-	int err = 0;
+	int err = 0, seals;

 	inode_lock(inode);
 	shmem_freeze(inode, true);
@@ -271,8 +273,21 @@ static int memfd_luo_preserve(struct liveupdate_file_op_args *args)
 		goto err_unlock;
 	}

+	seals = memfd_get_seals(args->file);
+	if (seals < 0) {
+		err = seals;
+		goto err_free_ser;
+	}
+
+	/* Make sure the file only has the seals supported by this version. */
+	if (seals & ~MEMFD_LUO_ALL_SEALS) {
+		err = -EOPNOTSUPP;
+		goto err_free_ser;
+	}
+
 	ser->pos = args->file->f_pos;
 	ser->size = i_size_read(inode);
+	ser->seals = seals;

 	err = memfd_luo_preserve_folios(args->file, &ser->folios,
 					&folios_ser, &nr_folios);
@@ -486,13 +501,29 @@ static int memfd_luo_retrieve(struct liveupdate_file_op_args *args)
 	if (!ser)
 		return -EINVAL;

-	file = memfd_alloc_file("", 0);
+	/* Make sure the file only has seals supported by this version. */
+	if (ser->seals & ~MEMFD_LUO_ALL_SEALS) {
+		err = -EOPNOTSUPP;
+		goto free_ser;
+	}
+
+	/*
+	 * The seals are preserved. Allow sealing here so they can be added
+	 * later.
+	 */
+	file = memfd_alloc_file("", MFD_ALLOW_SEALING);
 	if (IS_ERR(file)) {
 		pr_err("failed to setup file: %pe\n", file);
 		err = PTR_ERR(file);
 		goto free_ser;
 	}

+	err = memfd_add_seals(file, ser->seals);
+	if (err) {
+		pr_err("failed to add seals: %pe\n", ERR_PTR(err));
+		goto put_file;
+	}
+
 	vfs_setpos(file, ser->pos, MAX_LFS_FILESIZE);
 	file->f_inode->i_size = ser->size;

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 986f809376eb..54851d8a195b 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -69,7 +69,7 @@ bool folio_use_access_time(struct folio *folio)
 }
 #endif

-#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_NUMA_MIGRATION
 static int top_tier_adistance;
 /*
  * node_demotion[] examples:
@@ -129,7 +129,7 @@ static int top_tier_adistance;
  *
  */
 static struct demotion_nodes *node_demotion __read_mostly;
-#endif /* CONFIG_MIGRATION */
+#endif /* CONFIG_NUMA_MIGRATION */

 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);

@@ -273,7 +273,7 @@ static struct memory_tier *__node_get_memory_tier(int node)
 				     lockdep_is_held(&memory_tier_lock));
 }

-#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_NUMA_MIGRATION
 bool node_is_toptier(int node)
 {
 	bool toptier;
@@ -519,7 +519,7 @@ static void establish_demotion_targets(void)

 #else
 static inline void establish_demotion_targets(void) {}
-#endif /* CONFIG_MIGRATION */
+#endif /* CONFIG_NUMA_MIGRATION */

 static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype)
 {
@@ -911,7 +911,7 @@ static int __init memory_tier_init(void)
 	if (ret)
 		panic("%s() failed to register memory tier subsystem\n", __func__);

-#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_NUMA_MIGRATION
 	node_demotion = kzalloc_objs(struct demotion_nodes, nr_node_ids);
 	WARN_ON(!node_demotion);
 #endif
@@ -938,7 +938,7 @@ subsys_initcall(memory_tier_init);

 bool numa_demotion_enabled = false;

-#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_NUMA_MIGRATION
 #ifdef CONFIG_SYSFS
 static ssize_t demotion_enabled_show(struct kobject *kobj,
 				     struct kobj_attribute *attr, char *buf)
diff --git a/mm/memory.c b/mm/memory.c
index c65e82c86fed..ea6568571131 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -162,21 +162,8 @@ static int __init disable_randmaps(char *s)
 }
 __setup("norandmaps", disable_randmaps);

-unsigned long zero_pfn __read_mostly;
-EXPORT_SYMBOL(zero_pfn);
-
 unsigned long highest_memmap_pfn __read_mostly;

-/*
- * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
- */
-static int __init init_zero_pfn(void)
-{
-	zero_pfn = page_to_pfn(ZERO_PAGE(0));
-	return 0;
-}
-early_initcall(init_zero_pfn);
-
 void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
 	trace_rss_stat(mm, member);
@@ -1346,7 +1333,7 @@ again:

 	if (ret == -EIO) {
 		VM_WARN_ON_ONCE(!entry.val);
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+		if (swap_retry_table_alloc(entry, GFP_KERNEL) < 0) {
 			ret = -ENOMEM;
 			goto out;
 		}
@@ -1567,11 +1554,13 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 static inline bool should_zap_cows(struct zap_details *details)
 {
 	/* By default, zap all pages */
-	if (!details || details->reclaim_pt)
+	if (!details)
 		return true;

+	VM_WARN_ON_ONCE(details->skip_cows && details->reclaim_pt);
+
 	/* Or, we zap COWed pages only if the caller wants to */
-	return details->even_cows;
+	return !details->skip_cows;
 }

 /* Decides whether we should zap this folio
with the folio pointer specified */ @@ -2006,13 +1995,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, } else if (details && details->single_folio && folio_test_pmd_mappable(details->single_folio) && next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) { - spinlock_t *ptl = pmd_lock(tlb->mm, pmd); - /* - * Take and drop THP pmd lock so that we cannot return - * prematurely, while zap_huge_pmd() has cleared *pmd, - * but not yet decremented compound_mapcount(). - */ - spin_unlock(ptl); + sync_with_folio_pmd_zap(tlb->mm, pmd); } if (pmd_none(*pmd)) { addr = next; @@ -2073,65 +2056,74 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb, return addr; } -void unmap_page_range(struct mmu_gather *tlb, - struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - struct zap_details *details) +static void __zap_vma_range(struct mmu_gather *tlb, struct vm_area_struct *vma, + unsigned long start, unsigned long end, + struct zap_details *details) { - pgd_t *pgd; - unsigned long next; + const bool reaping = details && details->reaping; - BUG_ON(addr >= end); - tlb_start_vma(tlb, vma); - pgd = pgd_offset(vma->vm_mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - continue; - next = zap_p4d_range(tlb, vma, pgd, addr, next, details); - } while (pgd++, addr = next, addr != end); - tlb_end_vma(tlb, vma); -} + VM_WARN_ON_ONCE(start >= end || !range_in_vma(vma, start, end)); + /* uprobe_munmap() might sleep, so skip it when reaping. */ + if (vma->vm_file && !reaping) + uprobe_munmap(vma, start, end); -static void unmap_single_vma(struct mmu_gather *tlb, - struct vm_area_struct *vma, unsigned long start_addr, - unsigned long end_addr, struct zap_details *details) -{ - unsigned long start = max(vma->vm_start, start_addr); - unsigned long end; + if (unlikely(is_vm_hugetlb_page(vma))) { + zap_flags_t zap_flags = details ? 
details->zap_flags : 0; - if (start >= vma->vm_end) - return; - end = min(vma->vm_end, end_addr); - if (end <= vma->vm_start) - return; + VM_WARN_ON_ONCE(reaping); + /* + * vm_file will be NULL when we fail early while instantiating + * a new mapping. In this case, no pages were mapped yet and + * there is nothing to do. + */ + if (!vma->vm_file) + return; + __unmap_hugepage_range(tlb, vma, start, end, NULL, zap_flags); + } else { + unsigned long next, addr = start; + pgd_t *pgd; - if (vma->vm_file) - uprobe_munmap(vma, start, end); + tlb_start_vma(tlb, vma); + pgd = pgd_offset(vma->vm_mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) + continue; + next = zap_p4d_range(tlb, vma, pgd, addr, next, details); + } while (pgd++, addr = next, addr != end); + tlb_end_vma(tlb, vma); + } +} - if (start != end) { - if (unlikely(is_vm_hugetlb_page(vma))) { - /* - * It is undesirable to test vma->vm_file as it - * should be non-null for valid hugetlb area. - * However, vm_file will be NULL in the error - * cleanup path of mmap_region. When - * hugetlbfs ->mmap method fails, - * mmap_region() nullifies vma->vm_file - * before calling this function to clean up. - * Since no pte has actually been setup, it is - * safe to do nothing in this case. - */ - if (vma->vm_file) { - zap_flags_t zap_flags = details ? - details->zap_flags : 0; - __unmap_hugepage_range(tlb, vma, start, end, - NULL, zap_flags); - } - } else - unmap_page_range(tlb, vma, start, end, details); +/** + * zap_vma_for_reaping - zap all page table entries in the vma without blocking + * @vma: The vma to zap. + * + * Zap all page table entries in the vma without blocking for use by the oom + * killer. Hugetlb vmas are not supported. + * + * Returns: 0 on success, -EBUSY if we would have to block. 
+ */ +int zap_vma_for_reaping(struct vm_area_struct *vma) +{ + struct zap_details details = { + .reaping = true, + }; + struct mmu_notifier_range range; + struct mmu_gather tlb; + + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, + vma->vm_start, vma->vm_end); + tlb_gather_mmu(&tlb, vma->vm_mm); + if (mmu_notifier_invalidate_range_start_nonblock(&range)) { + tlb_finish_mmu(&tlb); + return -EBUSY; } + __zap_vma_range(&tlb, vma, range.start, range.end, &details); + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb); + return 0; } /** @@ -2156,8 +2148,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct unmap_desc *unmap) struct mmu_notifier_range range; struct zap_details details = { .zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP, - /* Careful - we need to zap private pages too! */ - .even_cows = true, }; vma = unmap->first; @@ -2165,10 +2155,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct unmap_desc *unmap) unmap->vma_start, unmap->vma_end); mmu_notifier_invalidate_range_start(&range); do { - unsigned long start = unmap->vma_start; - unsigned long end = unmap->vma_end; + unsigned long start = max(vma->vm_start, unmap->vma_start); + unsigned long end = min(vma->vm_end, unmap->vma_end); + hugetlb_zap_begin(vma, &start, &end); - unmap_single_vma(tlb, vma, start, end, &details); + __zap_vma_range(tlb, vma, start, end, &details); hugetlb_zap_end(vma, &details); vma = mas_find(unmap->mas, unmap->tree_end - 1); } while (vma); @@ -2176,17 +2167,20 @@ void unmap_vmas(struct mmu_gather *tlb, struct unmap_desc *unmap) } /** - * zap_page_range_single_batched - remove user pages in a given range + * zap_vma_range_batched - zap page table entries in a vma range * @tlb: pointer to the caller's struct mmu_gather - * @vma: vm_area_struct holding the applicable pages - * @address: starting address of pages to remove - * @size: number of bytes to remove - * @details: details of shared cache invalidation + * @vma: the vma covering the range to 
zap + * @address: starting address of the range to zap + * @size: number of bytes to zap + * @details: details specifying zapping behavior + * + * @tlb must not be NULL. The provided address range must be fully + * contained within @vma. If @vma is for hugetlb, @tlb is flushed and + * re-initialized by this function. * - * @tlb shouldn't be NULL. The range must fit into one VMA. If @vma is for - * hugetlb, @tlb is flushed and re-initialized by this function. + * If @details is NULL, this function will zap all page table entries. */ -void zap_page_range_single_batched(struct mmu_gather *tlb, +void zap_vma_range_batched(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { @@ -2195,6 +2189,9 @@ void zap_page_range_single_batched(struct mmu_gather *tlb, VM_WARN_ON_ONCE(!tlb || tlb->mm != vma->vm_mm); + if (unlikely(!size)) + return; + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, address, end); hugetlb_zap_begin(vma, &range.start, &range.end); @@ -2204,7 +2201,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb, * unmap 'address-end' not 'range.start-range.end' as range * could have been expanded for hugetlb pmd sharing. */ - unmap_single_vma(tlb, vma, address, end, details); + __zap_vma_range(tlb, vma, address, end, details); mmu_notifier_invalidate_range_end(&range); if (is_vm_hugetlb_page(vma)) { /* @@ -2218,45 +2215,42 @@ void zap_page_range_single_batched(struct mmu_gather *tlb, } /** - * zap_page_range_single - remove user pages in a given range - * @vma: vm_area_struct holding the applicable pages - * @address: starting address of pages to zap + * zap_vma_range - zap all page table entries in a vma range + * @vma: the vma covering the range to zap + * @address: starting address of the range to zap * @size: number of bytes to zap - * @details: details of shared cache invalidation * - * The range must fit into one VMA. 
+ * The provided address range must be fully contained within @vma. */ -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address, - unsigned long size, struct zap_details *details) +void zap_vma_range(struct vm_area_struct *vma, unsigned long address, + unsigned long size) { struct mmu_gather tlb; tlb_gather_mmu(&tlb, vma->vm_mm); - zap_page_range_single_batched(&tlb, vma, address, size, details); + zap_vma_range_batched(&tlb, vma, address, size, NULL); tlb_finish_mmu(&tlb); } /** - * zap_vma_ptes - remove ptes mapping the vma - * @vma: vm_area_struct holding ptes to be zapped - * @address: starting address of pages to zap + * zap_special_vma_range - zap all page table entries in a special vma range + * @vma: the vma covering the range to zap + * @address: starting address of the range to zap * @size: number of bytes to zap * - * This function only unmaps ptes assigned to VM_PFNMAP vmas. - * - * The entire address range must be fully contained within the vma. - * + * This function does nothing when the provided address range is not fully + * contained in @vma, or when the @vma is not VM_PFNMAP or VM_MIXEDMAP. 
*/ -void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, +void zap_special_vma_range(struct vm_area_struct *vma, unsigned long address, unsigned long size) { if (!range_in_vma(vma, address, address + size) || - !(vma->vm_flags & VM_PFNMAP)) + !(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) return; - zap_page_range_single(vma, address, size, NULL); + zap_vma_range(vma, address, size); } -EXPORT_SYMBOL_GPL(zap_vma_ptes); +EXPORT_SYMBOL_GPL(zap_special_vma_range); static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long addr) { @@ -2490,13 +2484,14 @@ out: int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, struct page **pages, unsigned long *num) { - const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1; + const unsigned long nr_pages = *num; + const unsigned long end = addr + PAGE_SIZE * nr_pages; - if (addr < vma->vm_start || end_addr >= vma->vm_end) + if (!range_in_vma(vma, addr, end)) return -EFAULT; if (!(vma->vm_flags & VM_MIXEDMAP)) { - BUG_ON(mmap_read_trylock(vma->vm_mm)); - BUG_ON(vma->vm_flags & VM_PFNMAP); + VM_WARN_ON_ONCE(mmap_read_trylock(vma->vm_mm)); + VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP); vm_flags_set(vma, VM_MIXEDMAP); } /* Defer page refcount checking till we're about to map that page. 
*/ @@ -2504,6 +2499,39 @@ int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vm_insert_pages); +int map_kernel_pages_prepare(struct vm_area_desc *desc) +{ + const struct mmap_action *action = &desc->action; + const unsigned long addr = action->map_kernel.start; + unsigned long nr_pages, end; + + if (!vma_desc_test(desc, VMA_MIXEDMAP_BIT)) { + VM_WARN_ON_ONCE(mmap_read_trylock(desc->mm)); + VM_WARN_ON_ONCE(vma_desc_test(desc, VMA_PFNMAP_BIT)); + vma_desc_set_flags(desc, VMA_MIXEDMAP_BIT); + } + + nr_pages = action->map_kernel.nr_pages; + end = addr + PAGE_SIZE * nr_pages; + if (!range_in_vma_desc(desc, addr, end)) + return -EFAULT; + + return 0; +} +EXPORT_SYMBOL(map_kernel_pages_prepare); + +int map_kernel_pages_complete(struct vm_area_struct *vma, + struct mmap_action *action) +{ + unsigned long nr_pages; + + nr_pages = action->map_kernel.nr_pages; + return insert_pages(vma, action->map_kernel.start, + action->map_kernel.pages, + &nr_pages, vma->vm_page_prot); +} +EXPORT_SYMBOL(map_kernel_pages_complete); + /** * vm_insert_page - insert single page into user vma * @vma: user vma to map to @@ -2988,7 +3016,7 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad if (WARN_ON_ONCE(!PAGE_ALIGNED(addr))) return -EINVAL; - VM_WARN_ON_ONCE(!vma_test_all_flags_mask(vma, VMA_REMAP_FLAGS)); + VM_WARN_ON_ONCE(!vma_test_all_mask(vma, VMA_REMAP_FLAGS)); BUG_ON(addr >= end); pfn -= addr >> PAGE_SHIFT; @@ -3022,7 +3050,7 @@ static int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long add * maintain page reference counts, and callers may free * pages due to the error. So zap it early. 
*/ - zap_page_range_single(vma, addr, size, NULL); + zap_vma_range(vma, addr, size); return error; } @@ -3105,26 +3133,37 @@ static int do_remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, } #endif -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn) +int remap_pfn_range_prepare(struct vm_area_desc *desc) { - /* - * We set addr=VMA start, end=VMA end here, so this won't fail, but we - * check it again on complete and will fail there if specified addr is - * invalid. - */ - get_remap_pgoff(vma_desc_is_cow_mapping(desc), desc->start, desc->end, - desc->start, desc->end, pfn, &desc->pgoff); + const struct mmap_action *action = &desc->action; + const unsigned long start = action->remap.start; + const unsigned long end = start + action->remap.size; + const unsigned long pfn = action->remap.start_pfn; + const bool is_cow = vma_desc_is_cow_mapping(desc); + int err; + + if (!range_in_vma_desc(desc, start, end)) + return -EFAULT; + + err = get_remap_pgoff(is_cow, start, end, desc->start, desc->end, pfn, + &desc->pgoff); + if (err) + return err; + vma_desc_set_flags_mask(desc, VMA_REMAP_FLAGS); + return 0; } -static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, unsigned long addr, - unsigned long pfn, unsigned long size) +static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, + unsigned long addr, unsigned long pfn, + unsigned long size) { - unsigned long end = addr + PAGE_ALIGN(size); + const unsigned long end = addr + PAGE_ALIGN(size); + const bool is_cow = is_cow_mapping(vma->vm_flags); int err; - err = get_remap_pgoff(is_cow_mapping(vma->vm_flags), addr, end, - vma->vm_start, vma->vm_end, pfn, &vma->vm_pgoff); + err = get_remap_pgoff(is_cow, addr, end, vma->vm_start, vma->vm_end, + pfn, &vma->vm_pgoff); if (err) return err; @@ -3157,10 +3196,67 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(remap_pfn_range); -int remap_pfn_range_complete(struct vm_area_struct *vma, 
unsigned long addr, - unsigned long pfn, unsigned long size, pgprot_t prot) +int remap_pfn_range_complete(struct vm_area_struct *vma, + struct mmap_action *action) { - return do_remap_pfn_range(vma, addr, pfn, size, prot); + const unsigned long start = action->remap.start; + const unsigned long pfn = action->remap.start_pfn; + const unsigned long size = action->remap.size; + const pgprot_t prot = action->remap.pgprot; + + return do_remap_pfn_range(vma, start, pfn, size, prot); +} + +static int __simple_ioremap_prep(unsigned long vm_len, pgoff_t vm_pgoff, + phys_addr_t start_phys, unsigned long size, + unsigned long *pfnp) +{ + unsigned long pfn, pages; + + /* Check that the physical memory area passed in looks valid */ + if (start_phys + size < start_phys) + return -EINVAL; + /* + * You *really* shouldn't map things that aren't page-aligned, + * but we've historically allowed it because IO memory might + * just have smaller alignment. + */ + size += start_phys & ~PAGE_MASK; + pfn = start_phys >> PAGE_SHIFT; + pages = (size + ~PAGE_MASK) >> PAGE_SHIFT; + if (pfn + pages < pfn) + return -EINVAL; + + /* We start the mapping 'vm_pgoff' pages into the area */ + if (vm_pgoff > pages) + return -EINVAL; + pfn += vm_pgoff; + pages -= vm_pgoff; + + /* Can we fit all of the mapping? */ + if ((vm_len >> PAGE_SHIFT) > pages) + return -EINVAL; + + *pfnp = pfn; + return 0; +} + +int simple_ioremap_prepare(struct vm_area_desc *desc) +{ + struct mmap_action *action = &desc->action; + const phys_addr_t start = action->simple_ioremap.start_phys_addr; + const unsigned long size = action->simple_ioremap.size; + unsigned long pfn; + int err; + + err = __simple_ioremap_prep(vma_desc_size(desc), desc->pgoff, + start, size, &pfn); + if (err) + return err; + + /* The I/O remap logic does the heavy lifting. 
*/ + mmap_action_ioremap_full(desc, pfn); + return io_remap_pfn_range_prepare(desc); } /** @@ -3180,32 +3276,15 @@ int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr, */ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len) { - unsigned long vm_len, pfn, pages; - - /* Check that the physical memory area passed in looks valid */ - if (start + len < start) - return -EINVAL; - /* - * You *really* shouldn't map things that aren't page-aligned, - * but we've historically allowed it because IO memory might - * just have smaller alignment. - */ - len += start & ~PAGE_MASK; - pfn = start >> PAGE_SHIFT; - pages = (len + ~PAGE_MASK) >> PAGE_SHIFT; - if (pfn + pages < pfn) - return -EINVAL; - - /* We start the mapping 'vm_pgoff' pages into the area */ - if (vma->vm_pgoff > pages) - return -EINVAL; - pfn += vma->vm_pgoff; - pages -= vma->vm_pgoff; + const unsigned long vm_start = vma->vm_start; + const unsigned long vm_end = vma->vm_end; + const unsigned long vm_len = vm_end - vm_start; + unsigned long pfn; + int err; - /* Can we fit all of the mapping? 
*/ - vm_len = vma->vm_end - vma->vm_start; - if (vm_len >> PAGE_SHIFT > pages) - return -EINVAL; + err = __simple_ioremap_prep(vm_len, vma->vm_pgoff, start, len, &pfn); + if (err) + return err; /* Ok, let it rip */ return io_remap_pfn_range(vma, vma->vm_start, pfn, vm_len, vma->vm_page_prot); @@ -4241,31 +4320,25 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) return wp_page_copy(vmf); } -static void unmap_mapping_range_vma(struct vm_area_struct *vma, - unsigned long start_addr, unsigned long end_addr, - struct zap_details *details) -{ - zap_page_range_single(vma, start_addr, end_addr - start_addr, details); -} - static inline void unmap_mapping_range_tree(struct rb_root_cached *root, pgoff_t first_index, pgoff_t last_index, struct zap_details *details) { struct vm_area_struct *vma; - pgoff_t vba, vea, zba, zea; + unsigned long start, size; + struct mmu_gather tlb; vma_interval_tree_foreach(vma, root, first_index, last_index) { - vba = vma->vm_pgoff; - vea = vba + vma_pages(vma) - 1; - zba = max(first_index, vba); - zea = min(last_index, vea); + const pgoff_t start_idx = max(first_index, vma->vm_pgoff); + const pgoff_t end_idx = min(last_index, vma_last_pgoff(vma)) + 1; + + start = vma->vm_start + ((start_idx - vma->vm_pgoff) << PAGE_SHIFT); + size = (end_idx - start_idx) << PAGE_SHIFT; - unmap_mapping_range_vma(vma, - ((zba - vba) << PAGE_SHIFT) + vma->vm_start, - ((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start, - details); + tlb_gather_mmu(&tlb, vma->vm_mm); + zap_vma_range_batched(&tlb, vma, start, size, details); + tlb_finish_mmu(&tlb); } } @@ -4292,7 +4365,7 @@ void unmap_mapping_folio(struct folio *folio) first_index = folio->index; last_index = folio_next_index(folio) - 1; - details.even_cows = false; + details.skip_cows = true; details.single_folio = folio; details.zap_flags = ZAP_FLAG_DROP_MARKER; @@ -4322,7 +4395,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t first_index = start; pgoff_t last_index = start + nr - 
1; - details.even_cows = even_cows; + details.skip_cows = !even_cows; if (last_index < first_index) last_index = ULONG_MAX; @@ -5209,6 +5282,37 @@ fallback: return folio_prealloc(vma->vm_mm, vma, vmf->address, true); } +void map_anon_folio_pte_nopf(struct folio *folio, pte_t *pte, + struct vm_area_struct *vma, unsigned long addr, + bool uffd_wp) +{ + const unsigned int nr_pages = folio_nr_pages(folio); + pte_t entry = folio_mk_pte(folio, vma->vm_page_prot); + + entry = pte_sw_mkyoung(entry); + + if (vma->vm_flags & VM_WRITE) + entry = pte_mkwrite(pte_mkdirty(entry), vma); + if (uffd_wp) + entry = pte_mkuffd_wp(entry); + + folio_ref_add(folio, nr_pages - 1); + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); + folio_add_lru_vma(folio, vma); + set_ptes(vma->vm_mm, addr, pte, entry, nr_pages); + update_mmu_cache_range(NULL, vma, addr, pte, nr_pages); +} + +static void map_anon_folio_pte_pf(struct folio *folio, pte_t *pte, + struct vm_area_struct *vma, unsigned long addr, bool uffd_wp) +{ + const unsigned int order = folio_order(folio); + + map_anon_folio_pte_nopf(folio, pte, vma, addr, uffd_wp); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1L << order); + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_ALLOC); +} + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -5220,7 +5324,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) unsigned long addr = vmf->address; struct folio *folio; vm_fault_t ret = 0; - int nr_pages = 1; + int nr_pages; pte_t entry; /* File mapping without ->vm_ops ? 
*/ @@ -5237,7 +5341,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Use the zero-page for reads */ if (!(vmf->flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(vma->vm_mm)) { - entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), + entry = pte_mkspecial(pfn_pte(zero_pfn(vmf->address), vma->vm_page_prot)); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); @@ -5255,7 +5359,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) pte_unmap_unlock(vmf->pte, vmf->ptl); return handle_userfault(vmf, VM_UFFD_MISSING); } - goto setpte; + if (vmf_orig_pte_uffd_wp(vmf)) + entry = pte_mkuffd_wp(entry); + set_pte_at(vma->vm_mm, addr, vmf->pte, entry); + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, addr, vmf->pte); + goto unlock; } /* Allocate our own private page. */ @@ -5279,11 +5389,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) */ __folio_mark_uptodate(folio); - entry = folio_mk_pte(folio, vma->vm_page_prot); - entry = pte_sw_mkyoung(entry); - if (vma->vm_flags & VM_WRITE) - entry = pte_mkwrite(pte_mkdirty(entry), vma); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) goto release; @@ -5305,19 +5410,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) folio_put(folio); return handle_userfault(vmf, VM_UFFD_MISSING); } - - folio_ref_add(folio, nr_pages - 1); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); - count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC); - folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); - folio_add_lru_vma(folio, vma); -setpte: - if (vmf_orig_pte_uffd_wp(vmf)) - entry = pte_mkuffd_wp(entry); - set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages); - - /* No need to invalidate - it was non-present before */ - update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages); + map_anon_folio_pte_pf(folio, vmf->pte, vma, addr, + vmf_orig_pte_uffd_wp(vmf)); unlock: if (vmf->pte) 
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -5426,7 +5520,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;

-	if (folio_order(folio) != HPAGE_PMD_ORDER)
+	if (!is_pmd_order(folio_order(folio)))
 		return ret;
 	page = &folio->page;

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 05a47953ef21..2a943ec57c85 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -221,7 +221,7 @@ void put_online_mems(void)
 bool movable_node_enabled = false;

 static int mhp_default_online_type = -1;
-int mhp_get_default_online_type(void)
+enum mmop mhp_get_default_online_type(void)
 {
 	if (mhp_default_online_type >= 0)
 		return mhp_default_online_type;
@@ -240,7 +240,7 @@ int mhp_get_default_online_type(void)
 	return mhp_default_online_type;
 }

-void mhp_set_default_online_type(int online_type)
+void mhp_set_default_online_type(enum mmop online_type)
 {
 	mhp_default_online_type = online_type;
 }
@@ -319,21 +319,13 @@ static void release_memory_resource(struct resource *res)
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages)
 {
 	/*
-	 * Disallow all operations smaller than a sub-section and only
-	 * allow operations smaller than a section for
-	 * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range()
-	 * enforces a larger memory_block_size_bytes() granularity for
-	 * memory that will be marked online, so this check should only
-	 * fire for direct arch_{add,remove}_memory() users outside of
-	 * add_memory_resource().
+	 * Disallow all operations smaller than a sub-section.
+	 * Note that check_hotplug_memory_range() enforces a larger
+	 * memory_block_size_bytes() granularity for memory that will be marked
+	 * online, so this check should only fire for direct
+	 * arch_{add,remove}_memory() users outside of add_memory_resource().
*/ - unsigned long min_align; - - if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) - min_align = PAGES_PER_SUBSECTION; - else - min_align = PAGES_PER_SECTION; - if (!IS_ALIGNED(pfn | nr_pages, min_align)) + if (!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION)) return -EINVAL; return 0; } @@ -1046,7 +1038,7 @@ static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn return movable_node_enabled ? movable_zone : kernel_zone; } -struct zone *zone_for_pfn_range(int online_type, int nid, +struct zone *zone_for_pfn_range(enum mmop online_type, int nid, struct memory_group *group, unsigned long start_pfn, unsigned long nr_pages) { @@ -1752,7 +1744,8 @@ static int scan_movable_pages(unsigned long start, unsigned long end, { unsigned long pfn; - for_each_valid_pfn(pfn, start, end) { + for (pfn = start; pfn < end; pfn++) { + unsigned long nr_pages; struct page *page; struct folio *folio; @@ -1769,9 +1762,9 @@ static int scan_movable_pages(unsigned long start, unsigned long end, if (PageOffline(page) && page_count(page)) return -EBUSY; - if (!PageHuge(page)) - continue; folio = page_folio(page); + if (!folio_test_hugetlb(folio)) + continue; /* * This test is racy as we hold no reference or lock. 
The * hugetlb page could have been free'ed and head is no longer @@ -1781,7 +1774,11 @@ static int scan_movable_pages(unsigned long start, unsigned long end, */ if (folio_test_hugetlb_migratable(folio)) goto found; - pfn |= folio_nr_pages(folio) - 1; + nr_pages = folio_nr_pages(folio); + if (unlikely(nr_pages < 1 || nr_pages > MAX_FOLIO_NR_PAGES || + !is_power_of_2(nr_pages))) + continue; + pfn |= nr_pages - 1; } return -ENOENT; found: @@ -1797,7 +1794,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); - for_each_valid_pfn(pfn, start_pfn, end_pfn) { + for (pfn = start_pfn; pfn < end_pfn; pfn++) { struct page *page; page = pfn_to_page(pfn); @@ -2325,7 +2322,7 @@ EXPORT_SYMBOL_GPL(remove_memory); static int try_offline_memory_block(struct memory_block *mem, void *arg) { - uint8_t online_type = MMOP_ONLINE_KERNEL; + enum mmop online_type = MMOP_ONLINE_KERNEL; uint8_t **online_types = arg; struct page *page; int rc; @@ -2358,7 +2355,7 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg) int rc; if (**online_types != MMOP_OFFLINE) { - mem->online_type = **online_types; + mem->online_type = (enum mmop)**online_types; rc = device_online(&mem->dev); if (rc < 0) pr_warn("%s: Failed to re-online memory: %d", diff --git a/mm/mempolicy.c b/mm/mempolicy.c index cf92bd6a8226..2e136b738889 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1245,7 +1245,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, return err; } -#ifdef CONFIG_MIGRATION +#ifdef CONFIG_NUMA_MIGRATION static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, unsigned long flags) { @@ -2455,7 +2455,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && /* filter "hugepage" allocation, unless from alloc_pages() */ - order == HPAGE_PMD_ORDER && ilx != 
NO_INTERLEAVE_INDEX) { + is_pmd_order(order) && ilx != NO_INTERLEAVE_INDEX) { /* * For hugepage allocation and non-interleave policy which * allows the current node (or other explicitly preferred diff --git a/mm/migrate.c b/mm/migrate.c index 2c3d489ecf51..76142a02192b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -321,7 +321,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw, if (!pages_identical(page, ZERO_PAGE(0))) return false; - newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address), + newpte = pte_mkspecial(pfn_pte(zero_pfn(pvmw->address), pvmw->vma->vm_page_prot)); if (pte_swp_soft_dirty(old_pte)) @@ -1358,6 +1358,8 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private, int rc; int old_page_state = 0; struct anon_vma *anon_vma = NULL; + bool src_deferred_split = false; + bool src_partially_mapped = false; struct list_head *prev; __migrate_folio_extract(dst, &old_page_state, &anon_vma); @@ -1371,11 +1373,26 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private, goto out_unlock_both; } + if (folio_order(src) > 1 && + !data_race(list_empty(&src->_deferred_list))) { + src_deferred_split = true; + src_partially_mapped = folio_test_partially_mapped(src); + } + rc = move_to_new_folio(dst, src, mode); if (rc) goto out; /* + * Requeue the destination folio on the deferred split queue if + * the source was on the queue. The source is unqueued in + * __folio_migrate_mapping(), so we recorded the state from + * before move_to_new_folio(). + */ + if (src_deferred_split) + deferred_split_folio(dst, src_partially_mapped); + + /* * When successful, push dst to LRU immediately: so that if it * turns out to be an mlocked page, remove_migration_ptes() will * automatically build up the correct dst->mlock_count for it. 
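The migrate_folio_move() hunk above records the source folio's deferred-split state before move_to_new_folio(), because the source is unqueued during the move. The ordering can be sketched in userspace as follows. This is a minimal model, not kernel code: `toy_folio`, `toy_move()`, `toy_deferred_split()` and `toy_migrate()` are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio; not the kernel's struct folio. */
struct toy_folio {
	unsigned int order;
	bool on_deferred_list;   /* models !list_empty(&folio->_deferred_list) */
	bool partially_mapped;
};

/* Models the move step: as a side effect the source leaves the queue. */
static int toy_move(struct toy_folio *dst, struct toy_folio *src)
{
	dst->order = src->order;
	src->on_deferred_list = false;
	return 0;
}

/* Models deferred_split_folio(): queue the folio and remember the flag. */
static void toy_deferred_split(struct toy_folio *folio, bool partially_mapped)
{
	folio->on_deferred_list = true;
	folio->partially_mapped = partially_mapped;
}

static int toy_migrate(struct toy_folio *dst, struct toy_folio *src)
{
	bool src_deferred_split = false;
	bool src_partially_mapped = false;
	int rc;

	/* Snapshot first: toy_move() removes the source from the queue. */
	if (src->order > 1 && src->on_deferred_list) {
		src_deferred_split = true;
		src_partially_mapped = src->partially_mapped;
	}

	rc = toy_move(dst, src);
	if (rc)
		return rc;

	/* Requeue the destination only if the source had been queued. */
	if (src_deferred_split)
		toy_deferred_split(dst, src_partially_mapped);
	return 0;
}
```

Checking `src->on_deferred_list` after `toy_move()` instead would always read false, which is why the snapshot has to come first.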
@@ -2205,8 +2222,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private) return __folio_alloc(gfp_mask, order, nid, mtc->nmask); } -#ifdef CONFIG_NUMA - +#ifdef CONFIG_NUMA_MIGRATION static int store_status(int __user *status, int start, int value, int nr) { while (nr-- > 0) { @@ -2605,6 +2621,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, { return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags); } +#endif /* CONFIG_NUMA_MIGRATION */ #ifdef CONFIG_NUMA_BALANCING /* @@ -2747,4 +2764,3 @@ int migrate_misplaced_folio(struct folio *folio, int node) return nr_remaining ? -EAGAIN : 0; } #endif /* CONFIG_NUMA_BALANCING */ -#endif /* CONFIG_NUMA */ diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 8079676c8f1f..2912eba575d5 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -914,6 +914,10 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate, unsigned long flags; int ret = 0; + /* + * take a reference, since split_huge_pmd_address() with freeze = true + * drops a reference at the end. + */ folio_get(folio); split_huge_pmd_address(migrate->vma, addr, true); ret = folio_split_unmapped(folio, 0); diff --git a/mm/mlock.c b/mm/mlock.c index 2f699c3497a5..fdbd1434a35f 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -13,7 +13,7 @@ #include <linux/swap.h> #include <linux/swapops.h> #include <linux/pagemap.h> -#include <linux/pagevec.h> +#include <linux/folio_batch.h> #include <linux/pagewalk.h> #include <linux/mempolicy.h> #include <linux/syscalls.h> @@ -415,13 +415,14 @@ out: * @vma - vma containing range to be mlock()ed or munlock()ed * @start - start address in @vma of the range * @end - end of range in @vma - * @newflags - the new set of flags for @vma. + * @new_vma_flags - the new set of flags for @vma. * * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED; * called for munlock() and munlockall(), to clear VM_LOCKED from @vma. 
*/ static void mlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end, vm_flags_t newflags) + unsigned long start, unsigned long end, + vma_flags_t *new_vma_flags) { static const struct mm_walk_ops mlock_walk_ops = { .pmd_entry = mlock_pte_range, @@ -439,18 +440,18 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma, * combination should not be visible to other mmap_lock users; * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED. */ - if (newflags & VM_LOCKED) - newflags |= VM_IO; + if (vma_flags_test(new_vma_flags, VMA_LOCKED_BIT)) + vma_flags_set(new_vma_flags, VMA_IO_BIT); vma_start_write(vma); - vm_flags_reset_once(vma, newflags); + vma_flags_reset_once(vma, new_vma_flags); lru_add_drain(); walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL); lru_add_drain(); - if (newflags & VM_IO) { - newflags &= ~VM_IO; - vm_flags_reset_once(vma, newflags); + if (vma_flags_test(new_vma_flags, VMA_IO_BIT)) { + vma_flags_clear(new_vma_flags, VMA_IO_BIT); + vma_flags_reset_once(vma, new_vma_flags); } } @@ -467,18 +468,22 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, vm_flags_t newflags) { + vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags); + const vma_flags_t old_vma_flags = vma->flags; struct mm_struct *mm = vma->vm_mm; int nr_pages; int ret = 0; - vm_flags_t oldflags = vma->vm_flags; - if (newflags == oldflags || (oldflags & VM_SPECIAL) || - is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || - vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE)) - /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */ + if (vma_flags_same_pair(&old_vma_flags, &new_vma_flags) || + vma_is_secretmem(vma) || !vma_supports_mlock(vma)) { + /* + * Don't set VM_LOCKED or VM_LOCKONFAULT and don't count. + * For secretmem, don't allow the memory to be unlocked. 
+ */ goto out; + } - vma = vma_modify_flags(vmi, *prev, vma, start, end, &newflags); + vma = vma_modify_flags(vmi, *prev, vma, start, end, &new_vma_flags); if (IS_ERR(vma)) { ret = PTR_ERR(vma); goto out; @@ -488,9 +493,9 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, * Keep track of amount of locked VM. */ nr_pages = (end - start) >> PAGE_SHIFT; - if (!(newflags & VM_LOCKED)) + if (!vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT)) nr_pages = -nr_pages; - else if (oldflags & VM_LOCKED) + else if (vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) nr_pages = 0; mm->locked_vm += nr_pages; @@ -499,12 +504,13 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, * It's okay if try_to_unmap_one unmaps a page just after we * set VM_LOCKED, populate_vma_page_range will bring it back. */ - if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) { + if (vma_flags_test(&new_vma_flags, VMA_LOCKED_BIT) && + vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT)) { /* No work to do, and mlocking twice would be wrong */ vma_start_write(vma); - vm_flags_reset(vma, newflags); + vma->flags = new_vma_flags; } else { - mlock_vma_pages_range(vma, start, end, newflags); + mlock_vma_pages_range(vma, start, end, &new_vma_flags); } out: *prev = vma; diff --git a/mm/mm_init.c b/mm/mm_init.c index df34797691bd..79f93f2a90cf 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -53,6 +53,17 @@ EXPORT_SYMBOL(mem_map); void *high_memory; EXPORT_SYMBOL(high_memory); +unsigned long zero_page_pfn __ro_after_init; +EXPORT_SYMBOL(zero_page_pfn); + +#ifndef __HAVE_COLOR_ZERO_PAGE +uint8_t empty_zero_page[PAGE_SIZE] __page_aligned_bss; +EXPORT_SYMBOL(empty_zero_page); + +struct page *__zero_page __ro_after_init; +EXPORT_SYMBOL(__zero_page); +#endif /* __HAVE_COLOR_ZERO_PAGE */ + #ifdef CONFIG_DEBUG_MEMORY_INIT int __meminitdata mminit_loglevel; @@ -801,7 +812,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start, static bool __meminit 
overlap_memmap_init(unsigned long zone, unsigned long *pfn) { - static struct memblock_region *r; + static struct memblock_region *r __meminitdata; if (mirrored_kernelcore && zone == ZONE_MOVABLE) { if (!r || *pfn >= memblock_region_memory_end_pfn(r)) { @@ -1099,7 +1110,7 @@ static void __ref memmap_init_compound(struct page *head, struct page *page = pfn_to_page(pfn); __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); - prep_compound_tail(head, pfn - head_pfn); + prep_compound_tail(page, head, order); set_page_count(page, 0); } prep_compound_head(head, order); @@ -1885,7 +1896,7 @@ static void __init free_area_init(void) pr_info(" node %3d: [mem %#018Lx-%#018Lx]\n", nid, (u64)start_pfn << PAGE_SHIFT, ((u64)end_pfn << PAGE_SHIFT) - 1); - subsection_map_init(start_pfn, end_pfn - start_pfn); + sparse_init_subsection_map(start_pfn, end_pfn - start_pfn); } /* Initialise every node */ @@ -2672,6 +2683,22 @@ static void __init mem_init_print_info(void) ); } +#ifndef __HAVE_COLOR_ZERO_PAGE +/* + * architectures that __HAVE_COLOR_ZERO_PAGE must define this function + */ +void __init __weak arch_setup_zero_pages(void) +{ + __zero_page = virt_to_page(empty_zero_page); +} +#endif + +static void __init init_zero_page_pfn(void) +{ + arch_setup_zero_pages(); + zero_page_pfn = page_to_pfn(ZERO_PAGE(0)); +} + void __init __weak arch_mm_preinit(void) { } @@ -2694,6 +2721,7 @@ void __init mm_core_init_early(void) void __init mm_core_init(void) { arch_mm_preinit(); + init_zero_page_pfn(); /* Initializations relying on SMP setup */ BUILD_BUG_ON(MAX_ZONELISTS > 2); diff --git a/mm/mmap.c b/mm/mmap.c index 843160946aa5..5754d1c36462 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -192,7 +192,8 @@ SYSCALL_DEFINE1(brk, unsigned long, brk) brkvma = vma_prev_limit(&vmi, mm->start_brk); /* Ok, looks good - let it rip. 
*/ - if (do_brk_flags(&vmi, brkvma, oldbrk, newbrk - oldbrk, 0) < 0) + if (do_brk_flags(&vmi, brkvma, oldbrk, newbrk - oldbrk, + EMPTY_VMA_FLAGS) < 0) goto out; mm->brk = brk; @@ -375,7 +376,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EOVERFLOW; /* Too many mappings? */ - if (mm->map_count > sysctl_max_map_count) + if (mm->map_count > get_sysctl_max_map_count()) return -ENOMEM; /* @@ -1201,8 +1202,10 @@ out: return ret; } -int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags) +int vm_brk_flags(unsigned long addr, unsigned long request, bool is_exec) { + const vma_flags_t vma_flags = is_exec ? + mk_vma_flags(VMA_EXEC_BIT) : EMPTY_VMA_FLAGS; struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; unsigned long len; @@ -1217,10 +1220,6 @@ int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags) if (!len) return 0; - /* Until we need other flags, refuse anything except VM_EXEC. */ - if ((vm_flags & (~VM_EXEC)) != 0) - return -EINVAL; - if (mmap_write_lock_killable(mm)) return -EINTR; @@ -1233,7 +1232,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t vm_flags) goto munmap_failed; vma = vma_prev(&vmi); - ret = do_brk_flags(&vmi, vma, addr, len, vm_flags); + ret = do_brk_flags(&vmi, vma, addr, len, vma_flags); populate = ((mm->def_flags & VM_LOCKED) != 0); mmap_write_unlock(mm); userfaultfd_unmap_complete(mm, &uf); @@ -1246,7 +1245,6 @@ limits_failed: mmap_write_unlock(mm); return ret; } -EXPORT_SYMBOL(vm_brk_flags); static unsigned long tear_down_vmas(struct mm_struct *mm, struct vma_iterator *vmi, @@ -1332,12 +1330,13 @@ destroy: * Return true if the calling process may expand its vm space by the passed * number of pages */ -bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages) +bool may_expand_vm(struct mm_struct *mm, const vma_flags_t *vma_flags, + unsigned long npages) { if (mm->total_vm + npages > rlimit(RLIMIT_AS) >> 
PAGE_SHIFT) return false; - if (is_data_mapping(flags) && + if (is_data_mapping_vma_flags(vma_flags) && mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT) { /* Workaround for Valgrind */ if (rlimit(RLIMIT_DATA) == 0 && diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index fe5b6a031717..3985d856de7f 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -296,6 +296,25 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch) call_rcu(&batch->rcu, tlb_remove_table_rcu); } +/** + * tlb_remove_table_sync_rcu - synchronize with software page-table walkers + * + * Like tlb_remove_table_sync_one() but uses RCU grace period instead of IPI + * broadcast. Use in slow paths where sleeping is acceptable. + * + * Software/Lockless page-table walkers use local_irq_disable(), which is also + * an RCU read-side critical section. synchronize_rcu() waits for all such + * sections, providing the same guarantee as tlb_remove_table_sync_one() but + * without disrupting all CPUs with IPIs. + * + * Do not use for freeing memory. Use RCU callbacks instead to avoid latency + * spikes. + */ +void tlb_remove_table_sync_rcu(void) +{ + synchronize_rcu(); +} + #else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */ static void tlb_remove_table_free(struct mmu_table_batch *batch) @@ -339,7 +358,7 @@ static inline void __tlb_remove_table_one(void *table) #else static inline void __tlb_remove_table_one(void *table) { - tlb_remove_table_sync_one(); + tlb_remove_table_sync_rcu(); __tlb_remove_table(table); } #endif /* CONFIG_PT_RECLAIM */ diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 4d8a64ce8eda..245b74f39f91 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -335,7 +335,7 @@ static void mn_hlist_release(struct mmu_notifier_subscriptions *subscriptions, * ->release returns. 
*/ id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist, + hlist_for_each_entry_srcu(subscription, &subscriptions->list, hlist, srcu_read_lock_held(&srcu)) /* * If ->release runs before mmu_notifier_unregister it must be @@ -390,15 +390,15 @@ void __mmu_notifier_release(struct mm_struct *mm) * unmap the address and return 1 or 0 depending if the mapping previously * existed or not. */ -int __mmu_notifier_clear_flush_young(struct mm_struct *mm, - unsigned long start, - unsigned long end) +bool __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long start, unsigned long end) { struct mmu_notifier *subscription; - int young = 0, id; + bool young = false; + int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_srcu(subscription, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { if (subscription->ops->clear_flush_young) @@ -410,15 +410,15 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm, return young; } -int __mmu_notifier_clear_young(struct mm_struct *mm, - unsigned long start, - unsigned long end) +bool __mmu_notifier_clear_young(struct mm_struct *mm, + unsigned long start, unsigned long end) { struct mmu_notifier *subscription; - int young = 0, id; + bool young = false; + int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_srcu(subscription, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { if (subscription->ops->clear_young) @@ -430,14 +430,15 @@ int __mmu_notifier_clear_young(struct mm_struct *mm, return young; } -int __mmu_notifier_test_young(struct mm_struct *mm, - unsigned long address) +bool __mmu_notifier_test_young(struct mm_struct *mm, + unsigned long address) { struct mmu_notifier *subscription; - int young = 0, id; + bool young = false; + int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + 
hlist_for_each_entry_srcu(subscription, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { if (subscription->ops->test_young) { @@ -512,7 +513,7 @@ static int mn_hlist_invalidate_range_start( int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist, + hlist_for_each_entry_srcu(subscription, &subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { const struct mmu_notifier_ops *ops = subscription->ops; @@ -550,7 +551,7 @@ static int mn_hlist_invalidate_range_start( * notifiers and one or more failed start, any that succeeded * start are expecting their end to be called. Do so now. */ - hlist_for_each_entry_rcu(subscription, &subscriptions->list, + hlist_for_each_entry_srcu(subscription, &subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { if (!subscription->ops->invalidate_range_end) continue; @@ -588,7 +589,7 @@ mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions, int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist, + hlist_for_each_entry_srcu(subscription, &subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { if (subscription->ops->invalidate_range_end) { if (!mmu_notifier_range_blockable(range)) @@ -623,7 +624,7 @@ void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm, int id; id = srcu_read_lock(&srcu); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_srcu(subscription, &mm->notifier_subscriptions->list, hlist, srcu_read_lock_held(&srcu)) { if (subscription->ops->arch_invalidate_secondary_tlbs) @@ -759,7 +760,7 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops) struct mmu_notifier *subscription; spin_lock(&mm->notifier_subscriptions->lock); - hlist_for_each_entry_rcu(subscription, + hlist_for_each_entry_srcu(subscription, &mm->notifier_subscriptions->list, hlist, lockdep_is_held(&mm->notifier_subscriptions->lock)) { if (subscription->ops != ops) 
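The mm/mmu_notifier.c hunks above change the "young" accumulators from int to bool. The following userspace sketch shows one way such an accumulator can behave. It assumes the loop bodies, which the hunks do not show, OR each callback's result into the accumulator. `toy_subscriber` and the `toy_*` functions are hypothetical, and a plain singly linked list stands in for the SRCU-protected hlist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subscriber; the real code walks an hlist under SRCU. */
struct toy_subscriber {
	bool (*clear_young)(struct toy_subscriber *self);  /* may be NULL */
	bool was_young;
	int calls;
	struct toy_subscriber *next;
};

/* Reports whether the subscriber was young and clears that state. */
static bool toy_clear_young_cb(struct toy_subscriber *self)
{
	bool young = self->was_young;

	self->was_young = false;
	self->calls++;
	return young;
}

/* Calls every subscriber that has a callback and ORs the results. */
static bool toy_notify_clear_young(struct toy_subscriber *head)
{
	bool young = false;
	struct toy_subscriber *s;

	for (s = head; s; s = s->next) {
		if (s->clear_young)
			young |= s->clear_young(s);  /* |= does not short-circuit */
	}
	return young;
}
```

Using `|=` rather than `||` matters in this sketch: every subscriber's callback runs even after one of them has already reported the range as young.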
diff --git a/mm/mprotect.c b/mm/mprotect.c index c0571445bef7..110d47a36d4b 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -697,7 +697,8 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, unsigned long start, unsigned long end, vm_flags_t newflags) { struct mm_struct *mm = vma->vm_mm; - vm_flags_t oldflags = READ_ONCE(vma->vm_flags); + const vma_flags_t old_vma_flags = READ_ONCE(vma->flags); + vma_flags_t new_vma_flags = legacy_to_vma_flags(newflags); long nrpages = (end - start) >> PAGE_SHIFT; unsigned int mm_cp_flags = 0; unsigned long charged = 0; @@ -706,7 +707,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, if (vma_is_sealed(vma)) return -EPERM; - if (newflags == oldflags) { + if (vma_flags_same_pair(&old_vma_flags, &new_vma_flags)) { *pprev = vma; return 0; } @@ -717,8 +718,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, * uncommon case, so doesn't need to be very optimized. */ if (arch_has_pfn_modify_check() && - (oldflags & (VM_PFNMAP|VM_MIXEDMAP)) && - (newflags & VM_ACCESS_FLAGS) == 0) { + vma_flags_test_any(&old_vma_flags, VMA_PFNMAP_BIT, + VMA_MIXEDMAP_BIT) && + !vma_flags_test_any_mask(&new_vma_flags, VMA_ACCESS_FLAGS)) { pgprot_t new_pgprot = vm_get_page_prot(newflags); error = walk_page_range(current->mm, start, end, @@ -736,24 +738,25 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, * hugetlb mapping were accounted for even if read-only so there is * no need to account for them here. */ - if (newflags & VM_WRITE) { + if (vma_flags_test(&new_vma_flags, VMA_WRITE_BIT)) { /* Check space limits when area turns into data. 
*/ - if (!may_expand_vm(mm, newflags, nrpages) && - may_expand_vm(mm, oldflags, nrpages)) + if (!may_expand_vm(mm, &new_vma_flags, nrpages) && + may_expand_vm(mm, &old_vma_flags, nrpages)) return -ENOMEM; - if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB| - VM_SHARED|VM_NORESERVE))) { + if (!vma_flags_test_any(&old_vma_flags, + VMA_ACCOUNT_BIT, VMA_WRITE_BIT, VMA_HUGETLB_BIT, + VMA_SHARED_BIT, VMA_NORESERVE_BIT)) { charged = nrpages; if (security_vm_enough_memory_mm(mm, charged)) return -ENOMEM; - newflags |= VM_ACCOUNT; + vma_flags_set(&new_vma_flags, VMA_ACCOUNT_BIT); } - } else if ((oldflags & VM_ACCOUNT) && vma_is_anonymous(vma) && - !vma->anon_vma) { - newflags &= ~VM_ACCOUNT; + } else if (vma_flags_test(&old_vma_flags, VMA_ACCOUNT_BIT) && + vma_is_anonymous(vma) && !vma->anon_vma) { + vma_flags_clear(&new_vma_flags, VMA_ACCOUNT_BIT); } - vma = vma_modify_flags(vmi, *pprev, vma, start, end, &newflags); + vma = vma_modify_flags(vmi, *pprev, vma, start, end, &new_vma_flags); if (IS_ERR(vma)) { error = PTR_ERR(vma); goto fail; @@ -766,26 +769,28 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, * held in write mode. */ vma_start_write(vma); - vm_flags_reset_once(vma, newflags); + vma_flags_reset_once(vma, &new_vma_flags); if (vma_wants_manual_pte_write_upgrade(vma)) mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE; vma_set_page_prot(vma); change_protection(tlb, vma, start, end, mm_cp_flags); - if ((oldflags & VM_ACCOUNT) && !(newflags & VM_ACCOUNT)) + if (vma_flags_test(&old_vma_flags, VMA_ACCOUNT_BIT) && + !vma_flags_test(&new_vma_flags, VMA_ACCOUNT_BIT)) vm_unacct_memory(nrpages); /* * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major * fault on access. 
*/ - if ((oldflags & (VM_WRITE | VM_SHARED | VM_LOCKED)) == VM_LOCKED && - (newflags & VM_WRITE)) { + if (vma_flags_test(&new_vma_flags, VMA_WRITE_BIT) && + vma_flags_test(&old_vma_flags, VMA_LOCKED_BIT) && + !vma_flags_test_any(&old_vma_flags, VMA_WRITE_BIT, VMA_SHARED_BIT)) populate_vma_page_range(vma, start, end, NULL); - } - vm_stat_account(mm, oldflags, -nrpages); + vm_stat_account(mm, vma_flags_to_legacy(old_vma_flags), -nrpages); + newflags = vma_flags_to_legacy(new_vma_flags); vm_stat_account(mm, newflags, nrpages); perf_event_mmap(vma); return 0; @@ -873,6 +878,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, tmp = vma->vm_start; for_each_vma_range(vmi, vma, end) { vm_flags_t mask_off_old_flags; + vma_flags_t new_vma_flags; vm_flags_t newflags; int new_vma_pkey; @@ -895,6 +901,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey); newflags = calc_vm_prot_bits(prot, new_vma_pkey); newflags |= (vma->vm_flags & ~mask_off_old_flags); + new_vma_flags = legacy_to_vma_flags(newflags); /* newflags >> 4 shift VM_MAY% in place of VM_% */ if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) { @@ -902,7 +909,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, break; } - if (map_deny_write_exec(vma->vm_flags, newflags)) { + if (map_deny_write_exec(&vma->flags, &new_vma_flags)) { error = -EACCES; break; } @@ -978,7 +985,7 @@ SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val) if (pkey == -1) goto out; - ret = arch_set_user_pkey_access(current, pkey, init_val); + ret = arch_set_user_pkey_access(pkey, init_val); if (ret) { mm_pkey_free(current->mm, pkey); goto out; diff --git a/mm/mremap.c b/mm/mremap.c index 2be876a70cc0..e9c8b1d05832 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -244,7 +244,7 @@ static int move_ptes(struct pagetable_move_control *pmc, goto out; } /* - * Now new_pte is none, so hpage_collapse_scan_file() path can not find + * Now 
new_pte is none, so collapse_scan_file() path can not find * this by traversing file->f_mapping, so there is no concurrency with * retract_page_tables(). In addition, we already hold the exclusive * mmap_lock, so this new_pte page is stable, so there is no need to get @@ -1028,6 +1028,75 @@ static void vrm_stat_account(struct vma_remap_struct *vrm, mm->locked_vm += pages; } +static bool __check_map_count_against_split(struct mm_struct *mm, + bool before_unmaps) +{ + const int sys_map_count = get_sysctl_max_map_count(); + int map_count = mm->map_count; + + mmap_assert_write_locked(mm); + + /* + * At the point of shrinking the VMA, if new_len < old_len, we unmap + * thusly in the worst case: + * + * old_addr+old_len old_addr+old_len + * |---------------.----.---------| |---------------| |---------| + * | . . | -> | +1 | -1 | +1 | + * |---------------.----.---------| |---------------| |---------| + * old_addr+new_len old_addr+new_len + * + * At the point of removing the portion of an existing VMA to make space + * for the moved VMA if MREMAP_FIXED, we unmap thusly in the worst case: + * + * new_addr new_addr+new_len new_addr new_addr+new_len + * |----.---------------.---------| |----| |---------| + * | . . | -> | +1 | -1 | +1 | + * |----.---------------.---------| |----| |---------| + * + * Therefore, before we consider the move anything, we have to account + * for 2 additional VMAs possibly being created upon these unmappings. + */ + if (before_unmaps) + map_count += 2; + + /* + * At the point of MOVING the VMA: + * + * We start by copying a VMA, which creates an additional VMA if no + * merge occurs, then if not MREMAP_DONTUNMAP, we unmap the source VMA. 
+ * In the worst case we might then observe: + * + * new_addr new_addr+new_len new_addr new_addr+new_len + * |----| |---------| |----|---------------|---------| + * | | | | -> | | +1 | | + * |----| |---------| |----|---------------|---------| + * + * old_addr old_addr+old_len old_addr old_addr+old_len + * |----.---------------.---------| |----| |---------| + * | . . | -> | +1 | -1 | +1 | + * |----.---------------.---------| |----| |---------| + * + * Therefore we must check to ensure we have headroom of 2 additional + * VMAs. + */ + return map_count + 2 <= sys_map_count; +} + +/* Do we violate the map count limit if we split VMAs when moving the VMA? */ +static bool check_map_count_against_split(void) +{ + return __check_map_count_against_split(current->mm, + /*before_unmaps=*/false); +} + +/* Do we violate the map count limit if we split VMAs prior to early unmaps? */ +static bool check_map_count_against_split_early(void) +{ + return __check_map_count_against_split(current->mm, + /*before_unmaps=*/true); +} + /* * Perform checks before attempting to write a VMA prior to it being * moved. @@ -1041,10 +1110,11 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm) vm_flags_t dummy = vma->vm_flags; /* - * We'd prefer to avoid failure later on in do_munmap: - * which may split one vma into three before unmapping. + * We'd prefer to avoid failure later on in do_munmap: we copy a VMA, + * which may not merge, then (if MREMAP_DONTUNMAP is not set) unmap the + * source, which may split, causing a net increase of 2 mappings. 
*/ - if (current->mm->map_count >= sysctl_max_map_count - 3) + if (!check_map_count_against_split()) return -ENOMEM; if (vma->vm_ops && vma->vm_ops->may_split) { @@ -1402,10 +1472,10 @@ static unsigned long mremap_to(struct vma_remap_struct *vrm) /* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */ if (vrm->flags & MREMAP_DONTUNMAP) { - vm_flags_t vm_flags = vrm->vma->vm_flags; + vma_flags_t vma_flags = vrm->vma->flags; unsigned long pages = vrm->old_len >> PAGE_SHIFT; - if (!may_expand_vm(mm, vm_flags, pages)) + if (!may_expand_vm(mm, &vma_flags, pages)) return -ENOMEM; } @@ -1743,7 +1813,7 @@ static int check_prep_vma(struct vma_remap_struct *vrm) if (!mlock_future_ok(mm, vma->vm_flags & VM_LOCKED, vrm->delta)) return -EAGAIN; - if (!may_expand_vm(mm, vma->vm_flags, vrm->delta >> PAGE_SHIFT)) + if (!may_expand_vm(mm, &vma->flags, vrm->delta >> PAGE_SHIFT)) return -ENOMEM; return 0; @@ -1803,23 +1873,6 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) if (vrm_overlaps(vrm)) return -EINVAL; - /* - * move_vma() need us to stay 4 maps below the threshold, otherwise - * it will bail out at the very beginning. - * That is a problem if we have already unmapped the regions here - * (new_addr, and old_addr), because userspace will not know the - * state of the vma's after it gets -ENOMEM. - * So, to avoid such scenario we can pre-compute if the whole - * operation has high chances to success map-wise. - * Worst-scenario case is when both vma's (new_addr and old_addr) get - * split in 3 before unmapping it. - * That means 2 more maps (1 for each) to the ones we already hold. - * Check whether current map count plus 2 still leads us to 4 maps below - * the threshold, otherwise return -ENOMEM here to be more safe. 
- */ - if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3) - return -ENOMEM; - return 0; } @@ -1929,6 +1982,11 @@ static unsigned long do_mremap(struct vma_remap_struct *vrm) return -EINTR; vrm->mmap_locked = true; + if (!check_map_count_against_split_early()) { + mmap_write_unlock(mm); + return -ENOMEM; + } + if (vrm_move_only(vrm)) { res = remap_move(vrm); } else { diff --git a/mm/mseal.c b/mm/mseal.c index ac58643181f7..e2093ae3d25c 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -68,14 +68,17 @@ static int mseal_apply(struct mm_struct *mm, const unsigned long curr_start = MAX(vma->vm_start, start); const unsigned long curr_end = MIN(vma->vm_end, end); - if (!(vma->vm_flags & VM_SEALED)) { - vm_flags_t vm_flags = vma->vm_flags | VM_SEALED; + if (!vma_test(vma, VMA_SEALED_BIT)) { + vma_flags_t vma_flags = vma->flags; + + vma_flags_set(&vma_flags, VMA_SEALED_BIT); vma = vma_modify_flags(&vmi, prev, vma, curr_start, - curr_end, &vm_flags); + curr_end, &vma_flags); if (IS_ERR(vma)) return PTR_ERR(vma); - vm_flags_set(vma, VM_SEALED); + vma_start_write(vma); + vma_set_flags(vma, VMA_SEALED_BIT); } prev = vma; diff --git a/mm/nommu.c b/mm/nommu.c index c3a23b082adb..ed3934bc2de4 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1317,7 +1317,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return -ENOMEM; mm = vma->vm_mm; - if (mm->map_count >= sysctl_max_map_count) + if (mm->map_count >= get_sysctl_max_map_count()) return -ENOMEM; region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5c6c95c169ee..5f372f6e26fa 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -135,19 +135,16 @@ struct task_struct *find_lock_task_mm(struct task_struct *p) { struct task_struct *t; - rcu_read_lock(); + guard(rcu)(); for_each_thread(p, t) { task_lock(t); if (likely(t->mm)) - goto found; + return t; task_unlock(t); } - t = NULL; -found: - rcu_read_unlock(); - return t; + return NULL; } /* @@ -548,21 +545,8 @@ 
static bool __oom_reap_task_mm(struct mm_struct *mm) * count elevated without a good reason. */ if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) { - struct mmu_notifier_range range; - struct mmu_gather tlb; - - mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, - mm, vma->vm_start, - vma->vm_end); - tlb_gather_mmu(&tlb, mm); - if (mmu_notifier_invalidate_range_start_nonblock(&range)) { - tlb_finish_mmu(&tlb); + if (zap_vma_for_reaping(vma)) ret = false; - continue; - } - unmap_page_range(&tlb, vma, range.start, range.end, NULL); - mmu_notifier_invalidate_range_end(&range); - tlb_finish_mmu(&tlb); } } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index c1a4b32af1a7..88cd53d4ba09 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -33,7 +33,7 @@ #include <linux/sysctl.h> #include <linux/cpu.h> #include <linux/syscalls.h> -#include <linux/pagevec.h> +#include <linux/folio_batch.h> #include <linux/timer.h> #include <linux/sched/rt.h> #include <linux/sched/signal.h> @@ -2666,7 +2666,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb) * while this function is in progress, although it may have been truncated * before this function is called. Most callers have the folio locked. * A few have the folio blocked from truncation through other means (e.g. - * zap_vma_pages() has it mapped and is holding the page table lock). + * zap_vma() has it mapped and is holding the page table lock). * When called from mark_buffer_dirty(), the filesystem should hold a * reference to the buffer_head that is being marked dirty, which causes * try_to_free_buffers() to fail. 
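The find_lock_task_mm() hunk in mm/oom_kill.c above replaces an explicit rcu_read_lock()/rcu_read_unlock() pair and a `found:` label with `guard(rcu)()`. The userspace sketch below shows the scope-based cleanup idea behind that pattern, using the GCC/Clang `cleanup` variable attribute. `TOY_GUARD()`, `toy_read_lock()` and the other `toy_*` names are hypothetical, and a counter stands in for the RCU read-side section.

```c
#include <assert.h>
#include <stddef.h>

static int toy_lock_depth;  /* stands in for RCU read-side nesting */

static void toy_read_lock(void)   { toy_lock_depth++; }
static void toy_read_unlock(void) { toy_lock_depth--; }

/* Cleanup handler: the compiler calls it when the guard leaves scope. */
static void toy_guard_exit(int *unused)
{
	(void)unused;
	toy_read_unlock();
}

/* Hypothetical guard macro; requires the GCC/Clang cleanup attribute. */
#define TOY_GUARD() \
	int toy_guard_var __attribute__((cleanup(toy_guard_exit), unused)) = \
		(toy_read_lock(), 0)

struct toy_task {
	int has_mm;
};

static struct toy_task *toy_find_task_with_mm(struct toy_task *tasks, size_t n)
{
	size_t i;

	TOY_GUARD();
	for (i = 0; i < n; i++) {
		if (tasks[i].has_mm)
			return &tasks[i];  /* unlock runs automatically here */
	}
	return NULL;                   /* and here */
}
```

Because the unlock is tied to scope exit, both return paths release the lock without a shared exit label, which is the simplification the hunk makes.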
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2d4b6f1a554e..111b54df8a3c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -31,7 +31,7 @@ #include <linux/sysctl.h> #include <linux/cpu.h> #include <linux/cpuset.h> -#include <linux/pagevec.h> +#include <linux/folio_batch.h> #include <linux/memory_hotplug.h> #include <linux/nodemask.h> #include <linux/vmstat.h> @@ -94,23 +94,6 @@ typedef int __bitwise fpi_t; static DEFINE_MUTEX(pcp_batch_high_lock); #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) -#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) -/* - * On SMP, spin_trylock is sufficient protection. - * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP. - * Pass flags to a no-op inline function to typecheck and silence the unused - * variable warning. - */ -static inline void __pcp_trylock_noop(unsigned long *flags) { } -#define pcp_trylock_prepare(flags) __pcp_trylock_noop(&(flags)) -#define pcp_trylock_finish(flags) __pcp_trylock_noop(&(flags)) -#else - -/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */ -#define pcp_trylock_prepare(flags) local_irq_save(flags) -#define pcp_trylock_finish(flags) local_irq_restore(flags) -#endif - /* * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid * a migration causing the wrong PCP to be locked and remote memory being @@ -128,71 +111,52 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { } #endif /* - * Generic helper to lookup and a per-cpu variable with an embedded spinlock. - * Return value should be used with equivalent unlock helper. + * A helper to lookup and trylock pcp with embedded spinlock. + * The return value should be used with the unlock helper. + * NULL return value means the trylock failed. 
*/ -#define pcpu_spin_trylock(type, member, ptr) \ +#ifdef CONFIG_SMP +#define pcp_spin_trylock(ptr) \ ({ \ - type *_ret; \ + struct per_cpu_pages *_ret; \ pcpu_task_pin(); \ _ret = this_cpu_ptr(ptr); \ - if (!spin_trylock(&_ret->member)) { \ + if (!spin_trylock(&_ret->lock)) { \ pcpu_task_unpin(); \ _ret = NULL; \ } \ _ret; \ }) -#define pcpu_spin_unlock(member, ptr) \ +#define pcp_spin_unlock(ptr) \ ({ \ - spin_unlock(&ptr->member); \ + spin_unlock(&ptr->lock); \ pcpu_task_unpin(); \ }) -/* struct per_cpu_pages specific helpers. */ -#define pcp_spin_trylock(ptr, UP_flags) \ -({ \ - struct per_cpu_pages *__ret; \ - pcp_trylock_prepare(UP_flags); \ - __ret = pcpu_spin_trylock(struct per_cpu_pages, lock, ptr); \ - if (!__ret) \ - pcp_trylock_finish(UP_flags); \ - __ret; \ -}) - -#define pcp_spin_unlock(ptr, UP_flags) \ -({ \ - pcpu_spin_unlock(lock, ptr); \ - pcp_trylock_finish(UP_flags); \ -}) - /* - * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e. - * a potentially remote cpu drain) and get interrupted by an operation that - * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP - * spinlock assumptions making the trylock a no-op. So we have to turn that - * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no - * remote cpu's so we can only be locking the only existing local one. + * On CONFIG_SMP=n the UP implementation of spin_trylock() never fails and thus + * is not compatible with our locking scheme. However we do not need pcp for + * scalability in the first place, so just make all the trylocks fail and take + * the slow path unconditionally. 
*/ -#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) -static inline void __flags_noop(unsigned long *flags) { } -#define pcp_spin_lock_maybe_irqsave(ptr, flags) \ -({ \ - __flags_noop(&(flags)); \ - spin_lock(&(ptr)->lock); \ -}) -#define pcp_spin_unlock_maybe_irqrestore(ptr, flags) \ -({ \ - spin_unlock(&(ptr)->lock); \ - __flags_noop(&(flags)); \ -}) #else -#define pcp_spin_lock_maybe_irqsave(ptr, flags) \ - spin_lock_irqsave(&(ptr)->lock, flags) -#define pcp_spin_unlock_maybe_irqrestore(ptr, flags) \ - spin_unlock_irqrestore(&(ptr)->lock, flags) +#define pcp_spin_trylock(ptr) \ + NULL + +#define pcp_spin_unlock(ptr) \ + BUG_ON(1) #endif +/* + * In some cases we do not need to pin the task to the CPU because we are + * already given a specific cpu's pcp pointer. + */ +#define pcp_spin_lock_nopin(ptr) \ + spin_lock(&(ptr)->lock) +#define pcp_spin_unlock_nopin(ptr) \ + spin_unlock(&(ptr)->lock) + #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); @@ -243,6 +207,8 @@ unsigned int pageblock_order __read_mostly; static void __free_pages_ok(struct page *page, unsigned int order, fpi_t fpi_flags); +static void reserve_highatomic_pageblock(struct page *page, int order, + struct zone *zone); /* * results with 256, 32 in the lowmem_reserve sysctl: @@ -687,7 +653,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order) #ifdef CONFIG_TRANSPARENT_HUGEPAGE bool movable; if (order > PAGE_ALLOC_COSTLY_ORDER) { - VM_BUG_ON(order != HPAGE_PMD_ORDER); + VM_BUG_ON(!is_pmd_order(order)); movable = migratetype == MIGRATE_MOVABLE; @@ -719,7 +685,7 @@ static inline bool pcp_allowed_order(unsigned int order) if (order <= PAGE_ALLOC_COSTLY_ORDER) return true; #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (order == HPAGE_PMD_ORDER) + if (is_pmd_order(order)) return true; #endif return false; @@ -731,7 +697,7 @@ static inline bool pcp_allowed_order(unsigned int order) * The first PAGE_SIZE page is called the "head page" 
and have PG_head set. * * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded - * in bit 0 of page->compound_head. The rest of bits is pointer to head page. + * in bit 0 of page->compound_info. The rest of bits is pointer to head page. * * The first tail page's ->compound_order holds the order of allocation. * This usage means that zero-order pages may not be compound. @@ -744,7 +710,7 @@ void prep_compound_page(struct page *page, unsigned int order) __SetPageHead(page); for (i = 1; i < nr_pages; i++) - prep_compound_tail(page, i); + prep_compound_tail(page + i, page, order); prep_compound_head(page, order); } @@ -1079,7 +1045,6 @@ static inline bool page_expected_state(struct page *page, #ifdef CONFIG_MEMCG page->memcg_data | #endif - page_pool_page_is_pp(page) | (page->flags.f & check_flags))) return false; @@ -1106,8 +1071,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags) if (unlikely(page->memcg_data)) bad_reason = "page still charged to cgroup"; #endif - if (unlikely(page_pool_page_is_pp(page))) - bad_reason = "page_pool leak"; return bad_reason; } @@ -1416,9 +1379,17 @@ __always_inline bool __free_pages_prepare(struct page *page, mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); folio->mapping = NULL; } - if (unlikely(page_has_type(page))) + if (unlikely(page_has_type(page))) { + /* networking expects to clear its page type before releasing */ + if (is_check_pages_enabled()) { + if (unlikely(PageNetpp(page))) { + bad_page(page, "page_pool leak"); + return false; + } + } /* Reset the page_type (which overlays _mapcount) */ page->page_type = UINT_MAX; + } if (is_check_pages_enabled()) { if (free_page_is_bad(page)) @@ -2588,7 +2559,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp) { int high_min, to_drain, to_drain_batched, batch; - unsigned long UP_flags; bool todo = false; high_min = READ_ONCE(pcp->high_min); @@ -2608,9 +2578,9 
@@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp) to_drain = pcp->count - pcp->high; while (to_drain > 0) { to_drain_batched = min(to_drain, batch); - pcp_spin_lock_maybe_irqsave(pcp, UP_flags); + pcp_spin_lock_nopin(pcp); free_pcppages_bulk(zone, to_drain_batched, pcp, 0); - pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags); + pcp_spin_unlock_nopin(pcp); todo = true; to_drain -= to_drain_batched; @@ -2627,15 +2597,14 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp) */ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) { - unsigned long UP_flags; int to_drain, batch; batch = READ_ONCE(pcp->batch); to_drain = min(pcp->count, batch); if (to_drain > 0) { - pcp_spin_lock_maybe_irqsave(pcp, UP_flags); + pcp_spin_lock_nopin(pcp); free_pcppages_bulk(zone, to_drain, pcp, 0); - pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags); + pcp_spin_unlock_nopin(pcp); } } #endif @@ -2646,11 +2615,10 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) static void drain_pages_zone(unsigned int cpu, struct zone *zone) { struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); - unsigned long UP_flags; int count; do { - pcp_spin_lock_maybe_irqsave(pcp, UP_flags); + pcp_spin_lock_nopin(pcp); count = pcp->count; if (count) { int to_drain = min(count, @@ -2659,7 +2627,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) free_pcppages_bulk(zone, to_drain, pcp, 0); count -= to_drain; } - pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags); + pcp_spin_unlock_nopin(pcp); } while (count); } @@ -2858,7 +2826,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, */ static bool free_frozen_page_commit(struct zone *zone, struct per_cpu_pages *pcp, struct page *page, int migratetype, - unsigned int order, fpi_t fpi_flags, unsigned long *UP_flags) + unsigned int order, fpi_t fpi_flags) { int high, batch; int to_free, to_free_batched; @@ -2918,9 +2886,9 @@ static bool 
free_frozen_page_commit(struct zone *zone, if (to_free == 0 || pcp->count == 0) break; - pcp_spin_unlock(pcp, *UP_flags); + pcp_spin_unlock(pcp); - pcp = pcp_spin_trylock(zone->per_cpu_pageset, *UP_flags); + pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (!pcp) { ret = false; break; @@ -2932,7 +2900,7 @@ static bool free_frozen_page_commit(struct zone *zone, * returned in an unlocked state. */ if (smp_processor_id() != cpu) { - pcp_spin_unlock(pcp, *UP_flags); + pcp_spin_unlock(pcp); ret = false; break; } @@ -2964,7 +2932,6 @@ static bool free_frozen_page_commit(struct zone *zone, static void __free_frozen_pages(struct page *page, unsigned int order, fpi_t fpi_flags) { - unsigned long UP_flags; struct per_cpu_pages *pcp; struct zone *zone; unsigned long pfn = page_to_pfn(page); @@ -3000,12 +2967,12 @@ static void __free_frozen_pages(struct page *page, unsigned int order, add_page_to_zone_llist(zone, page, order); return; } - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); + pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (pcp) { if (!free_frozen_page_commit(zone, pcp, page, migratetype, - order, fpi_flags, &UP_flags)) + order, fpi_flags)) return; - pcp_spin_unlock(pcp, UP_flags); + pcp_spin_unlock(pcp); } else { free_one_page(zone, page, pfn, order, fpi_flags); } @@ -3026,7 +2993,6 @@ void free_frozen_pages_nolock(struct page *page, unsigned int order) */ void free_unref_folios(struct folio_batch *folios) { - unsigned long UP_flags; struct per_cpu_pages *pcp = NULL; struct zone *locked_zone = NULL; int i, j; @@ -3069,7 +3035,7 @@ void free_unref_folios(struct folio_batch *folios) if (zone != locked_zone || is_migrate_isolate(migratetype)) { if (pcp) { - pcp_spin_unlock(pcp, UP_flags); + pcp_spin_unlock(pcp); locked_zone = NULL; pcp = NULL; } @@ -3088,7 +3054,7 @@ void free_unref_folios(struct folio_batch *folios) * trylock is necessary as folios may be getting freed * from IRQ or SoftIRQ context after an IO completion. 
*/ - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); + pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (unlikely(!pcp)) { free_one_page(zone, &folio->page, pfn, order, FPI_NONE); @@ -3106,14 +3072,14 @@ void free_unref_folios(struct folio_batch *folios) trace_mm_page_free_batched(&folio->page); if (!free_frozen_page_commit(zone, pcp, &folio->page, - migratetype, order, FPI_NONE, &UP_flags)) { + migratetype, order, FPI_NONE)) { pcp = NULL; locked_zone = NULL; } } if (pcp) - pcp_spin_unlock(pcp, UP_flags); + pcp_spin_unlock(pcp); folio_batch_reinit(folios); } @@ -3275,6 +3241,13 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, spin_unlock_irqrestore(&zone->lock, flags); } while (check_new_pages(page, order)); + /* + * If this is a high-order atomic allocation then check + * if the pageblock should be reserved for the future + */ + if (unlikely(alloc_flags & ALLOC_HIGHATOMIC)) + reserve_highatomic_pageblock(page, order, zone); + __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); zone_statistics(preferred_zone, zone, 1); @@ -3346,6 +3319,20 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, int batch = nr_pcp_alloc(pcp, zone, order); int alloced; + /* + * Don't refill the list for a higher order atomic + * allocation under memory pressure, as this would + * not build up any HIGHATOMIC reserves, which + * might be needed soon. + * + * Instead, direct it towards the reserves by + * returning NULL, which will make the caller fall + * back to rmqueue_buddy. This will try to use the + * reserves first and grow them if needed. 
+ */ + if (alloc_flags & ALLOC_HIGHATOMIC) + return NULL; + alloced = rmqueue_bulk(zone, order, batch, list, migratetype, alloc_flags); @@ -3371,10 +3358,9 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, struct per_cpu_pages *pcp; struct list_head *list; struct page *page; - unsigned long UP_flags; /* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */ - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); + pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (!pcp) return NULL; @@ -3386,7 +3372,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, pcp->free_count >>= 1; list = &pcp->lists[order_to_pindex(migratetype, order)]; page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); - pcp_spin_unlock(pcp, UP_flags); + pcp_spin_unlock(pcp); if (page) { __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); zone_statistics(preferred_zone, zone, 1); @@ -3961,13 +3947,6 @@ try_this_zone: if (page) { prep_new_page(page, order, gfp_mask, alloc_flags); - /* - * If this is a high-order atomic allocation then check - * if the pageblock should be reserved for the future - */ - if (unlikely(alloc_flags & ALLOC_HIGHATOMIC)) - reserve_highatomic_pageblock(page, order, zone); - return page; } else { if (cond_accept_memory(zone, order, alloc_flags)) @@ -5067,7 +5046,6 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, struct page **page_array) { struct page *page; - unsigned long UP_flags; struct zone *zone; struct zoneref *z; struct per_cpu_pages *pcp; @@ -5136,7 +5114,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, cond_accept_memory(zone, 0, alloc_flags); retry_this_zone: - mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages; + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages - nr_populated; if (zone_watermark_fast(zone, 0, mark, zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp)) { @@ -5161,7 +5139,7 @@ 
retry_this_zone: goto failed; /* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */ - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); + pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (!pcp) goto failed; @@ -5180,7 +5158,7 @@ retry_this_zone: if (unlikely(!page)) { /* Try and allocate at least one page */ if (!nr_account) { - pcp_spin_unlock(pcp, UP_flags); + pcp_spin_unlock(pcp); goto failed; } break; @@ -5192,7 +5170,7 @@ retry_this_zone: page_array[nr_populated++] = page; } - pcp_spin_unlock(pcp, UP_flags); + pcp_spin_unlock(pcp); __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account); zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account); @@ -6147,7 +6125,6 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu) { struct per_cpu_pages *pcp; struct cpu_cacheinfo *cci; - unsigned long UP_flags; pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); cci = get_cpu_cacheinfo(cpu); @@ -6158,12 +6135,12 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu) * This can reduce zone lock contention without hurting * cache-hot pages sharing. 
*/ - pcp_spin_lock_maybe_irqsave(pcp, UP_flags); + pcp_spin_lock_nopin(pcp); if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch) pcp->flags |= PCPF_FREE_HIGH_BATCH; else pcp->flags &= ~PCPF_FREE_HIGH_BATCH; - pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags); + pcp_spin_unlock_nopin(pcp); } void setup_pcp_cacheinfo(unsigned int cpu) @@ -6553,8 +6530,8 @@ void calculate_min_free_kbytes(void) if (new_min_free_kbytes > user_min_free_kbytes) min_free_kbytes = clamp(new_min_free_kbytes, 128, 262144); else - pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n", - new_min_free_kbytes, user_min_free_kbytes); + pr_warn_ratelimited("min_free_kbytes is not updated to %d because user defined value %d is preferred\n", + new_min_free_kbytes, user_min_free_kbytes); } diff --git a/mm/page_idle.c b/mm/page_idle.c index 96bb94c7b6c3..9c67cbac2965 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -74,7 +74,7 @@ static bool page_idle_clear_pte_refs_one(struct folio *folio, pmd_t pmdval = pmdp_get(pvmw.pmd); if (likely(pmd_present(pmdval))) - referenced |= pmdp_clear_young_notify(vma, addr, pvmw.pmd); + referenced |= pmdp_test_and_clear_young(vma, addr, pvmw.pmd); referenced |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE); } else { /* unexpected pmd-mapped page? */ diff --git a/mm/page_io.c b/mm/page_io.c index a2c034660c80..330abc5ab7b4 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -450,14 +450,14 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug) VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); /* - * ->flags can be updated non-atomically (scan_swap_map_slots), + * ->flags can be updated non-atomically, * but that will never affect SWP_FS_OPS, so the data_race * is safe. 
*/ if (data_race(sis->flags & SWP_FS_OPS)) swap_writepage_fs(folio, swap_plug); /* - * ->flags can be updated non-atomically (scan_swap_map_slots), + * ->flags can be updated non-atomically, * but that will never affect SWP_SYNCHRONOUS_IO, so the data_race * is safe. */ diff --git a/mm/page_reporting.c b/mm/page_reporting.c index f0042d5743af..7418f2e500bb 100644 --- a/mm/page_reporting.c +++ b/mm/page_reporting.c @@ -12,7 +12,7 @@ #include "internal.h" /* Initialize to an unsupported value */ -unsigned int page_reporting_order = -1; +unsigned int page_reporting_order = PAGE_REPORTING_ORDER_UNSPECIFIED; static int page_order_update_notify(const char *val, const struct kernel_param *kp) { @@ -369,8 +369,9 @@ int page_reporting_register(struct page_reporting_dev_info *prdev) * pageblock_order. */ - if (page_reporting_order == -1) { - if (prdev->order > 0 && prdev->order <= MAX_PAGE_ORDER) + if (page_reporting_order == PAGE_REPORTING_ORDER_UNSPECIFIED) { + if (prdev->order != PAGE_REPORTING_ORDER_UNSPECIFIED && + prdev->order <= MAX_PAGE_ORDER) page_reporting_order = prdev->order; else page_reporting_order = pageblock_order; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index b38a1d00c971..a4d52fdb3056 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -269,11 +269,6 @@ restart: spin_unlock(pvmw->ptl); pvmw->ptl = NULL; } else if (!pmd_present(pmde)) { - /* - * If PVMW_SYNC, take and drop THP pmd lock so that we - * cannot return prematurely, while zap_huge_pmd() has - * cleared *pmd but not decremented compound_mapcount(). 
- */ const softleaf_t entry = softleaf_from_pmd(pmde); if (softleaf_is_device_private(entry)) { @@ -284,11 +279,9 @@ restart: if ((pvmw->flags & PVMW_SYNC) && thp_vma_suitable_order(vma, pvmw->address, PMD_ORDER) && - (pvmw->nr_pages >= HPAGE_PMD_NR)) { - spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); + (pvmw->nr_pages >= HPAGE_PMD_NR)) + sync_with_folio_pmd_zap(mm, pvmw->pmd); - spin_unlock(ptl); - } step_forward(pvmw, PMD_SIZE); continue; } diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 4e7bcd975c54..3ae2586ff45b 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -5,7 +5,6 @@ #include <linux/hugetlb.h> #include <linux/mmu_context.h> #include <linux/swap.h> -#include <linux/leafops.h> #include <asm/tlbflush.h> @@ -860,9 +859,6 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, * VM as documented by vm_normal_page(). If requested, zeropages will be * returned as well. * - * As default, this function only considers present page table entries. - * If requested, it will also consider migration entries. - * * If this function returns NULL it might either indicate "there is nothing" or * "there is nothing suitable". * @@ -873,11 +869,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, * that call. * * @fw->page will correspond to the page that is effectively referenced by - * @addr. However, for migration entries and shared zeropages @fw->page is - * set to NULL. Note that large folios might be mapped by multiple page table - * entries, and this function will always only lookup a single entry as - * specified by @addr, which might or might not cover more than a single page of - * the returned folio. + * @addr. However, for shared zeropages @fw->page is set to NULL. Note that + * large folios might be mapped by multiple page table entries, and this + * function will always only lookup a single entry as specified by @addr, which + * might or might not cover more than a single page of the returned folio. 
* * This function must *not* be used as a naive replacement for * get_user_pages() / pin_user_pages(), especially not to perform DMA or @@ -904,7 +899,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, folio_walk_flags_t flags) { unsigned long entry_size; - bool expose_page = true; + bool zeropage = false; struct page *page; pud_t *pudp, pud; pmd_t *pmdp, pmd; @@ -952,10 +947,6 @@ struct folio *folio_walk_start(struct folio_walk *fw, if (page) goto found; } - /* - * TODO: FW_MIGRATION support for PUD migration entries - * once there are relevant users. - */ spin_unlock(ptl); goto not_found; } @@ -989,16 +980,9 @@ pmd_table: } else if ((flags & FW_ZEROPAGE) && is_huge_zero_pmd(pmd)) { page = pfn_to_page(pmd_pfn(pmd)); - expose_page = false; + zeropage = true; goto found; } - } else if ((flags & FW_MIGRATION) && - pmd_is_migration_entry(pmd)) { - const softleaf_t entry = softleaf_from_pmd(pmd); - - page = softleaf_to_page(entry); - expose_page = false; - goto found; } spin_unlock(ptl); goto not_found; @@ -1023,15 +1007,7 @@ pte_table: if ((flags & FW_ZEROPAGE) && is_zero_pfn(pte_pfn(pte))) { page = pfn_to_page(pte_pfn(pte)); - expose_page = false; - goto found; - } - } else if (!pte_none(pte)) { - const softleaf_t entry = softleaf_from_pte(pte); - - if ((flags & FW_MIGRATION) && softleaf_is_migration(entry)) { - page = softleaf_to_page(entry); - expose_page = false; + zeropage = true; goto found; } } @@ -1040,7 +1016,7 @@ not_found: vma_pgtable_walk_end(vma); return NULL; found: - if (expose_page) + if (!zeropage) /* Note: Offset from the mapped page, not the folio start. 
*/ fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT); else diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index af7966169d69..b91b1a98029c 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -81,10 +81,11 @@ int ptep_set_access_flags(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH -int ptep_clear_flush_young(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep) +bool ptep_clear_flush_young(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep) { - int young; + bool young; + young = ptep_test_and_clear_young(vma, address, ptep); if (young) flush_tlb_page(vma, address); @@ -123,10 +124,11 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH -int pmdp_clear_flush_young(struct vm_area_struct *vma, - unsigned long address, pmd_t *pmdp) +bool pmdp_clear_flush_young(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp) { - int young; + bool young; + VM_BUG_ON(address & ~HPAGE_PMD_MASK); young = pmdp_test_and_clear_young(vma, address, pmdp); if (young) diff --git a/mm/rmap.c b/mm/rmap.c index 8f08090d7eb9..78b7fb5f367c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -965,25 +965,25 @@ static bool folio_referenced_one(struct folio *folio, return false; } - if (lru_gen_enabled() && pvmw.pte) { - if (lru_gen_look_around(&pvmw)) - referenced++; - } else if (pvmw.pte) { - if (folio_test_large(folio)) { - unsigned long end_addr = pmd_addr_end(address, vma->vm_end); - unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT; - pte_t pteval = ptep_get(pvmw.pte); + if (pvmw.pte && folio_test_large(folio)) { + const unsigned long end_addr = pmd_addr_end(address, vma->vm_end); + const unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT; + pte_t pteval = ptep_get(pvmw.pte); - nr = folio_pte_batch(folio, pvmw.pte, - pteval, max_nr); - } + nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr); + } - ptes += nr; + /* + * When LRU is 
switching, we don’t know where the surrounding folios + * are. —they could be on active/inactive lists or on MGLRU. So the + * simplest approach is to disable this look-around optimization. + */ + if (lru_gen_enabled() && !lru_gen_switching() && pvmw.pte) { + if (lru_gen_look_around(&pvmw, nr)) + referenced++; + } else if (pvmw.pte) { if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr)) referenced++; - /* Skip the batched PTEs */ - pvmw.pte += nr - 1; - pvmw.address += (nr - 1) * PAGE_SIZE; } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { if (pmdp_clear_flush_young_notify(vma, address, pvmw.pmd)) @@ -993,6 +993,7 @@ static bool folio_referenced_one(struct folio *folio, WARN_ON_ONCE(1); } + ptes += nr; pra->mapcount -= nr; /* * If we are sure that we batched the entire folio, @@ -1002,6 +1003,10 @@ static bool folio_referenced_one(struct folio *folio, page_vma_mapped_walk_done(&pvmw); break; } + + /* Skip the batched PTEs */ + pvmw.pte += nr - 1; + pvmw.address += (nr - 1) * PAGE_SIZE; } if (referenced) @@ -1072,6 +1077,7 @@ int folio_referenced(struct folio *folio, int is_locked, .invalid_vma = invalid_folio_referenced_vma, }; + VM_WARN_ON_ONCE_FOLIO(folio_is_zone_device(folio), folio); *vm_flags = 0; if (!pra.mapcount) return 0; @@ -2060,7 +2066,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, } if (!pvmw.pte) { - if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) { + if (folio_test_lazyfree(folio)) { if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio)) goto walk_done; /* diff --git a/mm/secretmem.c b/mm/secretmem.c index 11a779c812a7..5f57ac4720d3 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -122,7 +122,7 @@ static int secretmem_mmap_prepare(struct vm_area_desc *desc) { const unsigned long len = vma_desc_size(desc); - if (!vma_desc_test_flags(desc, VMA_SHARED_BIT, VMA_MAYSHARE_BIT)) + if (!vma_desc_test_any(desc, VMA_SHARED_BIT, VMA_MAYSHARE_BIT)) return -EINVAL; vma_desc_set_flags(desc, 
VMA_LOCKED_BIT, VMA_DONTDUMP_BIT); diff --git a/mm/shmem.c b/mm/shmem.c index 0b0e577e880a..19bf77925fa1 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -61,7 +61,7 @@ static struct vfsmount *shm_mnt __ro_after_init; #include <linux/slab.h> #include <linux/backing-dev.h> #include <linux/writeback.h> -#include <linux/pagevec.h> +#include <linux/folio_batch.h> #include <linux/percpu_counter.h> #include <linux/falloc.h> #include <linux/splice.h> @@ -1113,7 +1113,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, uoff_t lend, pgoff_t start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT; pgoff_t end = (lend + 1) >> PAGE_SHIFT; struct folio_batch fbatch; - pgoff_t indices[PAGEVEC_SIZE]; + pgoff_t indices[FOLIO_BATCH_SIZE]; struct folio *folio; bool same_folio; long nr_swaps_freed = 0; @@ -1513,7 +1513,7 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type) struct address_space *mapping = inode->i_mapping; pgoff_t start = 0; struct folio_batch fbatch; - pgoff_t indices[PAGEVEC_SIZE]; + pgoff_t indices[FOLIO_BATCH_SIZE]; int ret = 0; do { @@ -2047,14 +2047,8 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode, struct shmem_inode_info *info = SHMEM_I(inode); struct folio *new, *swapcache; int nr_pages = 1 << order; - gfp_t alloc_gfp; + gfp_t alloc_gfp = gfp; - /* - * We have arrived here because our zones are constrained, so don't - * limit chance of success with further cpuset and node constraints. 
- */ - gfp &= ~GFP_CONSTRAINT_MASK; - alloc_gfp = gfp; if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { if (WARN_ON_ONCE(order)) return ERR_PTR(-EINVAL); @@ -5582,8 +5576,7 @@ static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj, spin_unlock(&huge_shmem_orders_lock); } else if (sysfs_streq(buf, "inherit")) { /* Do not override huge allocation policy with non-PMD sized mTHP */ - if (shmem_huge == SHMEM_HUGE_FORCE && - order != HPAGE_PMD_ORDER) + if (shmem_huge == SHMEM_HUGE_FORCE && !is_pmd_order(order)) return -EINVAL; spin_lock(&huge_shmem_orders_lock); diff --git a/mm/shrinker.c b/mm/shrinker.c index 7b61fc0ee78f..c23086bccf4d 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -219,6 +219,8 @@ static int shrinker_memcg_alloc(struct shrinker *shrinker) if (mem_cgroup_disabled()) return -ENOSYS; + if (mem_cgroup_kmem_disabled() && !(shrinker->flags & SHRINKER_NONSLAB)) + return -ENOSYS; mutex_lock(&shrinker_mutex); id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL); @@ -410,7 +412,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, total_scan = min(total_scan, (2 * freeable)); trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, - freeable, delta, total_scan, priority); + freeable, delta, total_scan, priority, + shrinkctl->memcg); /* * Normally, we should not scan less than batch_size objects in one @@ -461,7 +464,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, */ new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl); - trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan); + trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan, + shrinkctl->memcg); return freed; } @@ -544,8 +548,11 @@ again: /* Call non-slab shrinkers even though kmem is disabled */ if (!memcg_kmem_online() && - !(shrinker->flags & SHRINKER_NONSLAB)) + !(shrinker->flags & SHRINKER_NONSLAB)) { + clear_bit(offset, unit->map); + shrinker_put(shrinker); continue; + } ret = 
do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) { @@ -716,6 +723,7 @@ non_memcg: * - non-memcg-aware shrinkers * - !CONFIG_MEMCG * - memcg is disabled by kernel command line + * - non-slab shrinkers: when memcg kmem is disabled */ size = sizeof(*shrinker->nr_deferred); if (flags & SHRINKER_NUMA_AWARE) diff --git a/mm/slab.h b/mm/slab.h index c735e6b4dddb..bf2f87acf5e3 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -94,7 +94,7 @@ struct slab { #define SLAB_MATCH(pg, sl) \ static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl)) SLAB_MATCH(flags, flags); -SLAB_MATCH(compound_head, slab_cache); /* Ensure bit 0 is clear */ +SLAB_MATCH(compound_info, slab_cache); /* Ensure bit 0 is clear */ SLAB_MATCH(_refcount, __page_refcount); #ifdef CONFIG_MEMCG SLAB_MATCH(memcg_data, obj_exts); @@ -131,11 +131,7 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(struct freelist */ static inline struct slab *page_slab(const struct page *page) { - unsigned long head; - - head = READ_ONCE(page->compound_head); - if (head & 1) - page = (struct page *)(head - 1); + page = compound_head(page); if (data_race(page->page_type >> 24) != PGTY_slab) page = NULL; diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 37522d6cb398..6eadb9d116e4 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -62,7 +62,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node) if (slab_is_available()) { gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN; int order = get_order(size); - static bool warned; + static bool warned __meminitdata; struct page *page; page = alloc_pages_node(node, gfp_mask, order); @@ -303,59 +303,6 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end, } /* - * Undo populate_hvo, and replace it with a normal base page mapping. - * Used in memory init in case a HVO mapping needs to be undone. 
- * - * This can happen when it is discovered that a memblock allocated - * hugetlb page spans multiple zones, which can only be verified - * after zones have been initialized. - * - * We know that: - * 1) The first @headsize / PAGE_SIZE vmemmap pages were individually - * allocated through memblock, and mapped. - * - * 2) The rest of the vmemmap pages are mirrors of the last head page. - */ -int __meminit vmemmap_undo_hvo(unsigned long addr, unsigned long end, - int node, unsigned long headsize) -{ - unsigned long maddr, pfn; - pte_t *pte; - int headpages; - - /* - * Should only be called early in boot, so nothing will - * be accessing these page structures. - */ - WARN_ON(!early_boot_irqs_disabled); - - headpages = headsize >> PAGE_SHIFT; - - /* - * Clear mirrored mappings for tail page structs. - */ - for (maddr = addr + headsize; maddr < end; maddr += PAGE_SIZE) { - pte = virt_to_kpte(maddr); - pte_clear(&init_mm, maddr, pte); - } - - /* - * Clear and free mappings for head page and first tail page - * structs. - */ - for (maddr = addr; headpages-- > 0; maddr += PAGE_SIZE) { - pte = virt_to_kpte(maddr); - pfn = pte_pfn(ptep_get(pte)); - pte_clear(&init_mm, maddr, pte); - memblock_phys_free(PFN_PHYS(pfn), PAGE_SIZE); - } - - flush_tlb_kernel_range(addr, end); - - return vmemmap_populate(addr, end, node, NULL); -} - -/* * Write protect the mirrored tail page structs for HVO. This will be * called from the hugetlb code when gathering and initializing the * memblock allocated gigantic pages. The write protect can't be @@ -378,16 +325,54 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end, } } -/* - * Populate vmemmap pages HVO-style. The first page contains the head - * page and needed tail pages, the other ones are mirrors of the first - * page. 
- */
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+static __meminit struct page *vmemmap_get_tail(unsigned int order, struct zone *zone)
+{
+	struct page *p, *tail;
+	unsigned int idx;
+	int node = zone_to_nid(zone);
+
+	if (WARN_ON_ONCE(order < VMEMMAP_TAIL_MIN_ORDER))
+		return NULL;
+	if (WARN_ON_ONCE(order > MAX_FOLIO_ORDER))
+		return NULL;
+
+	idx = order - VMEMMAP_TAIL_MIN_ORDER;
+	tail = zone->vmemmap_tails[idx];
+	if (tail)
+		return tail;
+
+	/*
+	 * Only allocate the page, but do not initialize it.
+	 *
+	 * Any initialization done here will be overwritten by memmap_init().
+	 *
+	 * hugetlb_vmemmap_init() will take care of initialization after
+	 * memmap_init().
+	 */
+
+	p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+	if (!p)
+		return NULL;
+
+	tail = virt_to_page(p);
+	zone->vmemmap_tails[idx] = tail;
+
+	return tail;
+}
+
 int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
-				       int node, unsigned long headsize)
+				       unsigned int order, struct zone *zone,
+				       unsigned long headsize)
 {
-	pte_t *pte;
 	unsigned long maddr;
+	struct page *tail;
+	pte_t *pte;
+	int node = zone_to_nid(zone);
+
+	tail = vmemmap_get_tail(order, zone);
+	if (!tail)
+		return -ENOMEM;

 	for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
 		pte = vmemmap_populate_address(maddr, node, NULL, -1, 0);
@@ -399,8 +384,9 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
 	 * Reuse the last page struct page mapped above for the rest.
 	 */
 	return vmemmap_populate_range(maddr, end, node, NULL,
-					pte_pfn(ptep_get(pte)), 0);
+					page_to_pfn(tail), 0);
 }
+#endif

 void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
 				      unsigned long addr, unsigned long next)
@@ -605,3 +591,307 @@ void __init sparse_vmemmap_init_nid_late(int nid)
 	hugetlb_vmemmap_init_late(nid);
 }
 #endif
+
+static void subsection_mask_set(unsigned long *map, unsigned long pfn,
+		unsigned long nr_pages)
+{
+	int idx = subsection_map_index(pfn);
+	int end = subsection_map_index(pfn + nr_pages - 1);
+
+	bitmap_set(map, idx, end - idx + 1);
+}
+
+void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages)
+{
+	int end_sec_nr = pfn_to_section_nr(pfn + nr_pages - 1);
+	unsigned long nr, start_sec_nr = pfn_to_section_nr(pfn);
+
+	for (nr = start_sec_nr; nr <= end_sec_nr; nr++) {
+		struct mem_section *ms;
+		unsigned long pfns;
+
+		pfns = min(nr_pages, PAGES_PER_SECTION
+				- (pfn & ~PAGE_SECTION_MASK));
+		ms = __nr_to_section(nr);
+		subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
+
+		pr_debug("%s: sec: %lu pfns: %lu set(%d, %d)\n", __func__, nr,
+				pfns, subsection_map_index(pfn),
+				subsection_map_index(pfn + pfns - 1));
+
+		pfn += pfns;
+		nr_pages -= pfns;
+	}
+}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+/* Mark all memory sections within the pfn range as online */
+void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
+{
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		unsigned long section_nr = pfn_to_section_nr(pfn);
+		struct mem_section *ms = __nr_to_section(section_nr);
+
+		ms->section_mem_map |= SECTION_IS_ONLINE;
+	}
+}
+
+/* Mark all memory sections within the pfn range as offline */
+void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
+{
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		unsigned long section_nr = pfn_to_section_nr(pfn);
+		struct mem_section *ms = __nr_to_section(section_nr);
+
+		ms->section_mem_map &= ~SECTION_IS_ONLINE;
+	}
+}
+
+static struct page * __meminit populate_section_memmap(unsigned long pfn,
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
+{
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
+}
+
+static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
+		struct vmem_altmap *altmap)
+{
+	unsigned long start = (unsigned long) pfn_to_page(pfn);
+	unsigned long end = start + nr_pages * sizeof(struct page);
+
+	vmemmap_free(start, end, altmap);
+}
+static void free_map_bootmem(struct page *memmap)
+{
+	unsigned long start = (unsigned long)memmap;
+	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
+
+	vmemmap_free(start, end, NULL);
+}
+
+static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
+{
+	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
+	DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 };
+	struct mem_section *ms = __pfn_to_section(pfn);
+	unsigned long *subsection_map = ms->usage
+		? &ms->usage->subsection_map[0] : NULL;
+
+	subsection_mask_set(map, pfn, nr_pages);
+	if (subsection_map)
+		bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION);
+
+	if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION),
+				"section already deactivated (%#lx + %ld)\n",
+				pfn, nr_pages))
+		return -EINVAL;
+
+	bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION);
+	return 0;
+}
+
+static bool is_subsection_map_empty(struct mem_section *ms)
+{
+	return bitmap_empty(&ms->usage->subsection_map[0],
+			    SUBSECTIONS_PER_SECTION);
+}
+
+static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
+{
+	struct mem_section *ms = __pfn_to_section(pfn);
+	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
+	unsigned long *subsection_map;
+	int rc = 0;
+
+	subsection_mask_set(map, pfn, nr_pages);
+
+	subsection_map = &ms->usage->subsection_map[0];
+
+	if (bitmap_empty(map, SUBSECTIONS_PER_SECTION))
+		rc = -EINVAL;
+	else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION))
+		rc = -EEXIST;
+	else
+		bitmap_or(subsection_map, map, subsection_map,
+				SUBSECTIONS_PER_SECTION);
+
+	return rc;
+}
+
+/*
+ * To deactivate a memory region, there are 3 cases to handle:
+ *
+ * 1. deactivation of a partial hot-added section:
+ *      a) section was present at memory init.
+ *      b) section was hot-added post memory init.
+ * 2. deactivation of a complete hot-added section.
+ * 3. deactivation of a complete section from memory init.
+ *
+ * For 1, when subsection_map does not empty we will not be freeing the
+ * usage map, but still need to free the vmemmap range.
+ */
+static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
+		struct vmem_altmap *altmap)
+{
+	struct mem_section *ms = __pfn_to_section(pfn);
+	bool section_is_early = early_section(ms);
+	struct page *memmap = NULL;
+	bool empty;
+
+	if (clear_subsection_map(pfn, nr_pages))
+		return;
+
+	empty = is_subsection_map_empty(ms);
+	if (empty) {
+		/*
+		 * Mark the section invalid so that valid_section()
+		 * return false. This prevents code from dereferencing
+		 * ms->usage array.
+		 */
+		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
+
+		/*
+		 * When removing an early section, the usage map is kept (as the
+		 * usage maps of other sections fall into the same page). It
+		 * will be re-used when re-adding the section - which is then no
+		 * longer an early section. If the usage map is PageReserved, it
+		 * was allocated during boot.
+		 */
+		if (!PageReserved(virt_to_page(ms->usage))) {
+			kfree_rcu(ms->usage, rcu);
+			WRITE_ONCE(ms->usage, NULL);
+		}
+		memmap = pfn_to_page(SECTION_ALIGN_DOWN(pfn));
+	}
+
+	/*
+	 * The memmap of early sections is always fully populated. See
+	 * section_activate() and pfn_valid() .
+	 */
+	if (!section_is_early) {
+		memmap_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE)));
+		depopulate_section_memmap(pfn, nr_pages, altmap);
+	} else if (memmap) {
+		memmap_boot_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page),
+							  PAGE_SIZE)));
+		free_map_bootmem(memmap);
+	}
+
+	if (empty)
+		ms->section_mem_map = (unsigned long)NULL;
+}
+
+static struct page * __meminit section_activate(int nid, unsigned long pfn,
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
+{
+	struct mem_section *ms = __pfn_to_section(pfn);
+	struct mem_section_usage *usage = NULL;
+	struct page *memmap;
+	int rc;
+
+	if (!ms->usage) {
+		usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
+		if (!usage)
+			return ERR_PTR(-ENOMEM);
+		ms->usage = usage;
+	}
+
+	rc = fill_subsection_map(pfn, nr_pages);
+	if (rc) {
+		if (usage)
+			ms->usage = NULL;
+		kfree(usage);
+		return ERR_PTR(rc);
+	}
+
+	/*
+	 * The early init code does not consider partially populated
+	 * initial sections, it simply assumes that memory will never be
+	 * referenced. If we hot-add memory into such a section then we
+	 * do not need to populate the memmap and can simply reuse what
+	 * is already there.
+	 */
+	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
+		return pfn_to_page(pfn);
+
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
+	if (!memmap) {
+		section_deactivate(pfn, nr_pages, altmap);
+		return ERR_PTR(-ENOMEM);
+	}
+	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
+
+	return memmap;
+}
+
+/**
+ * sparse_add_section - add a memory section, or populate an existing one
+ * @nid: The node to add section on
+ * @start_pfn: start pfn of the memory range
+ * @nr_pages: number of pfns to add in the section
+ * @altmap: alternate pfns to allocate the memmap backing store
+ * @pgmap: alternate compound page geometry for devmap mappings
+ *
+ * This is only intended for hotplug.
+ *
+ * Note that only VMEMMAP supports sub-section aligned hotplug,
+ * the proper alignment and size are gated by check_pfn_span().
+ *
+ *
+ * Return:
+ * * 0		- On success.
+ * * -EEXIST	- Section has been present.
+ * * -ENOMEM	- Out of memory.
+ */
+int __meminit sparse_add_section(int nid, unsigned long start_pfn,
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
+{
+	unsigned long section_nr = pfn_to_section_nr(start_pfn);
+	struct mem_section *ms;
+	struct page *memmap;
+	int ret;
+
+	ret = sparse_index_init(section_nr, nid);
+	if (ret < 0)
+		return ret;
+
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
+	if (IS_ERR(memmap))
+		return PTR_ERR(memmap);
+
+	/*
+	 * Poison uninitialized struct pages in order to catch invalid flags
+	 * combinations.
+	 */
+	page_init_poison(memmap, sizeof(struct page) * nr_pages);
+
+	ms = __nr_to_section(section_nr);
+	__section_mark_present(ms, section_nr);
+
+	/* Align memmap to section boundary in the subsection case */
+	if (section_nr_to_pfn(section_nr) != start_pfn)
+		memmap = pfn_to_page(section_nr_to_pfn(section_nr));
+	sparse_init_one_section(ms, section_nr, memmap, ms->usage, 0);
+
+	return 0;
+}
+
+void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
+			   struct vmem_altmap *altmap)
+{
+	struct mem_section *ms = __pfn_to_section(pfn);
+
+	if (WARN_ON_ONCE(!valid_section(ms)))
+		return;
+
+	section_deactivate(pfn, nr_pages, altmap);
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/mm/sparse.c b/mm/sparse.c
index b5b2b6f7041b..007fd52c621e 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -79,7 +79,7 @@ static noinline struct mem_section __ref *sparse_index_alloc(int nid)
 	return section;
 }

-static int __meminit sparse_index_init(unsigned long section_nr, int nid)
+int __meminit sparse_index_init(unsigned long section_nr, int nid)
 {
 	unsigned long root = SECTION_NR_TO_ROOT(section_nr);
 	struct mem_section *section;
@@ -103,7 +103,7 @@ static int __meminit sparse_index_init(unsigned long section_nr, int nid)
 	return 0;
 }
 #else /* !SPARSEMEM_EXTREME */
-static inline int sparse_index_init(unsigned long section_nr, int nid)
+int sparse_index_init(unsigned long section_nr, int nid)
 {
 	return 0;
 }
@@ -161,58 +161,12 @@ static void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
  * those loops early.
  */
 unsigned long __highest_present_section_nr;
-static void __section_mark_present(struct mem_section *ms,
-		unsigned long section_nr)
-{
-	if (section_nr > __highest_present_section_nr)
-		__highest_present_section_nr = section_nr;
-
-	ms->section_mem_map |= SECTION_MARKED_PRESENT;
-}

 static inline unsigned long first_present_section_nr(void)
 {
 	return next_present_section_nr(-1);
 }

-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static void subsection_mask_set(unsigned long *map, unsigned long pfn,
-		unsigned long nr_pages)
-{
-	int idx = subsection_map_index(pfn);
-	int end = subsection_map_index(pfn + nr_pages - 1);
-
-	bitmap_set(map, idx, end - idx + 1);
-}
-
-void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages)
-{
-	int end_sec_nr = pfn_to_section_nr(pfn + nr_pages - 1);
-	unsigned long nr, start_sec_nr = pfn_to_section_nr(pfn);
-
-	for (nr = start_sec_nr; nr <= end_sec_nr; nr++) {
-		struct mem_section *ms;
-		unsigned long pfns;
-
-		pfns = min(nr_pages, PAGES_PER_SECTION
-				- (pfn & ~PAGE_SECTION_MASK));
-		ms = __nr_to_section(nr);
-		subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
-
-		pr_debug("%s: sec: %lu pfns: %lu set(%d, %d)\n", __func__, nr,
-				pfns, subsection_map_index(pfn),
-				subsection_map_index(pfn + pfns - 1));
-
-		pfn += pfns;
-		nr_pages -= pfns;
-	}
-}
-#else
-void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages)
-{
-}
-#endif
-
 /* Record a memory area against a node. */
 static void __init memory_present(int nid, unsigned long start, unsigned long end)
 {
@@ -260,42 +214,6 @@ static void __init memblocks_present(void)
 		memory_present(nid, start, end);
 }

-/*
- * Subtle, we encode the real pfn into the mem_map such that
- * the identity pfn - section_mem_map will return the actual
- * physical page frame number.
- */
-static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long pnum)
-{
-	unsigned long coded_mem_map =
-		(unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
-	BUILD_BUG_ON(SECTION_MAP_LAST_BIT > PFN_SECTION_SHIFT);
-	BUG_ON(coded_mem_map & ~SECTION_MAP_MASK);
-	return coded_mem_map;
-}
-
-#ifdef CONFIG_MEMORY_HOTPLUG
-/*
- * Decode mem_map from the coded memmap
- */
-struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum)
-{
-	/* mask off the extra low bits of information */
-	coded_mem_map &= SECTION_MAP_MASK;
-	return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum);
-}
-#endif /* CONFIG_MEMORY_HOTPLUG */
-
-static void __meminit sparse_init_one_section(struct mem_section *ms,
-		unsigned long pnum, struct page *mem_map,
-		struct mem_section_usage *usage, unsigned long flags)
-{
-	ms->section_mem_map &= ~SECTION_MAP_MASK;
-	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum)
-		| SECTION_HAS_MEM_MAP | flags;
-	ms->usage = usage;
-}
-
 static unsigned long usemap_size(void)
 {
 	return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
@@ -306,102 +224,6 @@ size_t mem_section_usage_size(void)
 	return sizeof(struct mem_section_usage) + usemap_size();
 }

-#ifdef CONFIG_MEMORY_HOTREMOVE
-static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
-{
-#ifndef CONFIG_NUMA
-	VM_BUG_ON(pgdat != &contig_page_data);
-	return __pa_symbol(&contig_page_data);
-#else
-	return __pa(pgdat);
-#endif
-}
-
-static struct mem_section_usage * __init
-sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
-					 unsigned long size)
-{
-	struct mem_section_usage *usage;
-	unsigned long goal, limit;
-	int nid;
-	/*
-	 * A page may contain usemaps for other sections preventing the
-	 * page being freed and making a section unremovable while
-	 * other sections referencing the usemap remain active. Similarly,
-	 * a pgdat can prevent a section being removed. If section A
-	 * contains a pgdat and section B contains the usemap, both
-	 * sections become inter-dependent. This allocates usemaps
-	 * from the same section as the pgdat where possible to avoid
-	 * this problem.
-	 */
-	goal = pgdat_to_phys(pgdat) & (PAGE_SECTION_MASK << PAGE_SHIFT);
-	limit = goal + (1UL << PA_SECTION_SHIFT);
-	nid = early_pfn_to_nid(goal >> PAGE_SHIFT);
-again:
-	usage = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
-	if (!usage && limit) {
-		limit = MEMBLOCK_ALLOC_ACCESSIBLE;
-		goto again;
-	}
-	return usage;
-}
-
-static void __init check_usemap_section_nr(int nid,
-		struct mem_section_usage *usage)
-{
-	unsigned long usemap_snr, pgdat_snr;
-	static unsigned long old_usemap_snr;
-	static unsigned long old_pgdat_snr;
-	struct pglist_data *pgdat = NODE_DATA(nid);
-	int usemap_nid;
-
-	/* First call */
-	if (!old_usemap_snr) {
-		old_usemap_snr = NR_MEM_SECTIONS;
-		old_pgdat_snr = NR_MEM_SECTIONS;
-	}
-
-	usemap_snr = pfn_to_section_nr(__pa(usage) >> PAGE_SHIFT);
-	pgdat_snr = pfn_to_section_nr(pgdat_to_phys(pgdat) >> PAGE_SHIFT);
-	if (usemap_snr == pgdat_snr)
-		return;
-
-	if (old_usemap_snr == usemap_snr && old_pgdat_snr == pgdat_snr)
-		/* skip redundant message */
-		return;
-
-	old_usemap_snr = usemap_snr;
-	old_pgdat_snr = pgdat_snr;
-
-	usemap_nid = sparse_early_nid(__nr_to_section(usemap_snr));
-	if (usemap_nid != nid) {
-		pr_info("node %d must be removed before remove section %ld\n",
-			nid, usemap_snr);
-		return;
-	}
-	/*
-	 * There is a circular dependency.
-	 * Some platforms allow un-removable section because they will just
-	 * gather other removable sections for dynamic partitioning.
-	 * Just notify un-removable section's number here.
-	 */
-	pr_info("Section %ld and %ld (node %d) have a circular dependency on usemap and pgdat allocations\n",
-		usemap_snr, pgdat_snr, nid);
-}
-#else
-static struct mem_section_usage * __init
-sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
-					 unsigned long size)
-{
-	return memblock_alloc_node(size, SMP_CACHE_BYTES, pgdat->node_id);
-}
-
-static void __init check_usemap_section_nr(int nid,
-		struct mem_section_usage *usage)
-{
-}
-#endif /* CONFIG_MEMORY_HOTREMOVE */
-
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 unsigned long __init section_map_size(void)
 {
@@ -498,7 +320,6 @@ void __init sparse_init_early_section(int nid, struct page *map,
 				      unsigned long pnum, unsigned long flags)
 {
 	BUG_ON(!sparse_usagebuf || sparse_usagebuf >= sparse_usagebuf_end);
-	check_usemap_section_nr(nid, sparse_usagebuf);
 	sparse_init_one_section(__nr_to_section(pnum), pnum, map,
 			sparse_usagebuf, SECTION_IS_EARLY | flags);
 	sparse_usagebuf = (void *)sparse_usagebuf + mem_section_usage_size();
@@ -509,8 +330,7 @@ static int __init sparse_usage_init(int nid, unsigned long map_count)
 	unsigned long size;

 	size = mem_section_usage_size() * map_count;
-	sparse_usagebuf = sparse_early_usemaps_alloc_pgdat_section(
-				NODE_DATA(nid), size);
+	sparse_usagebuf = memblock_alloc_node(size, SMP_CACHE_BYTES, nid);
 	if (!sparse_usagebuf) {
 		sparse_usagebuf_end = NULL;
 		return -ENOMEM;
@@ -600,6 +420,11 @@ void __init sparse_init(void)
 	BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
 	memblocks_present();

+	if (compound_info_has_mask()) {
+		VM_WARN_ON_ONCE(!IS_ALIGNED((unsigned long) pfn_to_page(0),
+				    MAX_FOLIO_VMEMMAP_ALIGN));
+	}
+
 	pnum_begin = first_present_section_nr();
 	nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));

@@ -623,356 +448,3 @@ void __init sparse_init(void)
 	sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count);
 	vmemmap_populate_print_last();
 }
-
-#ifdef CONFIG_MEMORY_HOTPLUG
-
-/* Mark all memory sections within the pfn range as online */
-void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long pfn;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		unsigned long section_nr = pfn_to_section_nr(pfn);
-		struct mem_section *ms;
-
-		/* onlining code should never touch invalid ranges */
-		if (WARN_ON(!valid_section_nr(section_nr)))
-			continue;
-
-		ms = __nr_to_section(section_nr);
-		ms->section_mem_map |= SECTION_IS_ONLINE;
-	}
-}
-
-/* Mark all memory sections within the pfn range as offline */
-void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long pfn;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		unsigned long section_nr = pfn_to_section_nr(pfn);
-		struct mem_section *ms;
-
-		/*
-		 * TODO this needs some double checking. Offlining code makes
-		 * sure to check pfn_valid but those checks might be just bogus
-		 */
-		if (WARN_ON(!valid_section_nr(section_nr)))
-			continue;
-
-		ms = __nr_to_section(section_nr);
-		ms->section_mem_map &= ~SECTION_IS_ONLINE;
-	}
-}
-
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
-{
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
-}
-
-static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap)
-{
-	unsigned long start = (unsigned long) pfn_to_page(pfn);
-	unsigned long end = start + nr_pages * sizeof(struct page);
-
-	vmemmap_free(start, end, altmap);
-}
-static void free_map_bootmem(struct page *memmap)
-{
-	unsigned long start = (unsigned long)memmap;
-	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
-
-	vmemmap_free(start, end, NULL);
-}
-
-static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
-{
-	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
-	DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 };
-	struct mem_section *ms = __pfn_to_section(pfn);
-	unsigned long *subsection_map = ms->usage
-		? &ms->usage->subsection_map[0] : NULL;
-
-	subsection_mask_set(map, pfn, nr_pages);
-	if (subsection_map)
-		bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION);
-
-	if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION),
-				"section already deactivated (%#lx + %ld)\n",
-				pfn, nr_pages))
-		return -EINVAL;
-
-	bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION);
-	return 0;
-}
-
-static bool is_subsection_map_empty(struct mem_section *ms)
-{
-	return bitmap_empty(&ms->usage->subsection_map[0],
-			    SUBSECTIONS_PER_SECTION);
-}
-
-static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
-{
-	struct mem_section *ms = __pfn_to_section(pfn);
-	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
-	unsigned long *subsection_map;
-	int rc = 0;
-
-	subsection_mask_set(map, pfn, nr_pages);
-
-	subsection_map = &ms->usage->subsection_map[0];
-
-	if (bitmap_empty(map, SUBSECTIONS_PER_SECTION))
-		rc = -EINVAL;
-	else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION))
-		rc = -EEXIST;
-	else
-		bitmap_or(subsection_map, map, subsection_map,
-				SUBSECTIONS_PER_SECTION);
-
-	return rc;
-}
-#else
-static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
-{
-	return kvmalloc_node(array_size(sizeof(struct page),
-					PAGES_PER_SECTION), GFP_KERNEL, nid);
-}
-
-static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap)
-{
-	kvfree(pfn_to_page(pfn));
-}
-
-static void free_map_bootmem(struct page *memmap)
-{
-	unsigned long maps_section_nr, removing_section_nr, i;
-	unsigned long type, nr_pages;
-	struct page *page = virt_to_page(memmap);
-
-	nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
-		>> PAGE_SHIFT;
-
-	for (i = 0; i < nr_pages; i++, page++) {
-		type = bootmem_type(page);
-
-		BUG_ON(type == NODE_INFO);
-
-		maps_section_nr = pfn_to_section_nr(page_to_pfn(page));
-		removing_section_nr = bootmem_info(page);
-
-		/*
-		 * When this function is called, the removing section is
-		 * logical offlined state. This means all pages are isolated
-		 * from page allocator. If removing section's memmap is placed
-		 * on the same section, it must not be freed.
-		 * If it is freed, page allocator may allocate it which will
-		 * be removed physically soon.
-		 */
-		if (maps_section_nr != removing_section_nr)
-			put_page_bootmem(page);
-	}
-}
-
-static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
-{
-	return 0;
-}
-
-static bool is_subsection_map_empty(struct mem_section *ms)
-{
-	return true;
-}
-
-static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
-{
-	return 0;
-}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
-
-/*
- * To deactivate a memory region, there are 3 cases to handle across
- * two configurations (SPARSEMEM_VMEMMAP={y,n}):
- *
- * 1. deactivation of a partial hot-added section (only possible in
- *    the SPARSEMEM_VMEMMAP=y case).
- *      a) section was present at memory init.
- *      b) section was hot-added post memory init.
- * 2. deactivation of a complete hot-added section.
- * 3. deactivation of a complete section from memory init.
- *
- * For 1, when subsection_map does not empty we will not be freeing the
- * usage map, but still need to free the vmemmap range.
- *
- * For 2 and 3, the SPARSEMEM_VMEMMAP={y,n} cases are unified
- */
-static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap)
-{
-	struct mem_section *ms = __pfn_to_section(pfn);
-	bool section_is_early = early_section(ms);
-	struct page *memmap = NULL;
-	bool empty;
-
-	if (clear_subsection_map(pfn, nr_pages))
-		return;
-
-	empty = is_subsection_map_empty(ms);
-	if (empty) {
-		unsigned long section_nr = pfn_to_section_nr(pfn);
-
-		/*
-		 * Mark the section invalid so that valid_section()
-		 * return false. This prevents code from dereferencing
-		 * ms->usage array.
-		 */
-		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
-
-		/*
-		 * When removing an early section, the usage map is kept (as the
-		 * usage maps of other sections fall into the same page). It
-		 * will be re-used when re-adding the section - which is then no
-		 * longer an early section. If the usage map is PageReserved, it
-		 * was allocated during boot.
-		 */
-		if (!PageReserved(virt_to_page(ms->usage))) {
-			kfree_rcu(ms->usage, rcu);
-			WRITE_ONCE(ms->usage, NULL);
-		}
-		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-	}
-
-	/*
-	 * The memmap of early sections is always fully populated. See
-	 * section_activate() and pfn_valid() .
-	 */
-	if (!section_is_early) {
-		memmap_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE)));
-		depopulate_section_memmap(pfn, nr_pages, altmap);
-	} else if (memmap) {
-		memmap_boot_pages_add(-1L * (DIV_ROUND_UP(nr_pages * sizeof(struct page),
-							  PAGE_SIZE)));
-		free_map_bootmem(memmap);
-	}
-
-	if (empty)
-		ms->section_mem_map = (unsigned long)NULL;
-}
-
-static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
-{
-	struct mem_section *ms = __pfn_to_section(pfn);
-	struct mem_section_usage *usage = NULL;
-	struct page *memmap;
-	int rc;
-
-	if (!ms->usage) {
-		usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
-		if (!usage)
-			return ERR_PTR(-ENOMEM);
-		ms->usage = usage;
-	}
-
-	rc = fill_subsection_map(pfn, nr_pages);
-	if (rc) {
-		if (usage)
-			ms->usage = NULL;
-		kfree(usage);
-		return ERR_PTR(rc);
-	}
-
-	/*
-	 * The early init code does not consider partially populated
-	 * initial sections, it simply assumes that memory will never be
-	 * referenced. If we hot-add memory into such a section then we
-	 * do not need to populate the memmap and can simply reuse what
-	 * is already there.
-	 */
-	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
-		return pfn_to_page(pfn);
-
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
-	if (!memmap) {
-		section_deactivate(pfn, nr_pages, altmap);
-		return ERR_PTR(-ENOMEM);
-	}
-	memmap_pages_add(DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE));
-
-	return memmap;
-}
-
-/**
- * sparse_add_section - add a memory section, or populate an existing one
- * @nid: The node to add section on
- * @start_pfn: start pfn of the memory range
- * @nr_pages: number of pfns to add in the section
- * @altmap: alternate pfns to allocate the memmap backing store
- * @pgmap: alternate compound page geometry for devmap mappings
- *
- * This is only intended for hotplug.
- *
- * Note that only VMEMMAP supports sub-section aligned hotplug,
- * the proper alignment and size are gated by check_pfn_span().
- *
- *
- * Return:
- * * 0		- On success.
- * * -EEXIST	- Section has been present.
- * * -ENOMEM	- Out of memory.
- */
-int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap,
-		struct dev_pagemap *pgmap)
-{
-	unsigned long section_nr = pfn_to_section_nr(start_pfn);
-	struct mem_section *ms;
-	struct page *memmap;
-	int ret;
-
-	ret = sparse_index_init(section_nr, nid);
-	if (ret < 0)
-		return ret;
-
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
-	if (IS_ERR(memmap))
-		return PTR_ERR(memmap);
-
-	/*
-	 * Poison uninitialized struct pages in order to catch invalid flags
-	 * combinations.
-	 */
-	page_init_poison(memmap, sizeof(struct page) * nr_pages);
-
-	ms = __nr_to_section(section_nr);
-	set_section_nid(section_nr, nid);
-	__section_mark_present(ms, section_nr);
-
-	/* Align memmap to section boundary in the subsection case */
-	if (section_nr_to_pfn(section_nr) != start_pfn)
-		memmap = pfn_to_page(section_nr_to_pfn(section_nr));
-	sparse_init_one_section(ms, section_nr, memmap, ms->usage, 0);
-
-	return 0;
-}
-
-void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
-			   struct vmem_altmap *altmap)
-{
-	struct mem_section *ms = __pfn_to_section(pfn);
-
-	if (WARN_ON_ONCE(!valid_section(ms)))
-		return;
-
-	section_deactivate(pfn, nr_pages, altmap);
-}
-#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/mm/swap.c b/mm/swap.c
index bb19ccbece46..78b4aa811fc6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -20,7 +20,7 @@
 #include <linux/swap.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
 #include <linux/init.h>
 #include <linux/export.h>
 #include <linux/mm_inline.h>
@@ -1018,7 +1018,7 @@ EXPORT_SYMBOL(folios_put_refs);
 void release_pages(release_pages_arg arg, int nr)
 {
 	struct folio_batch fbatch;
-	int refs[PAGEVEC_SIZE];
+	int refs[FOLIO_BATCH_SIZE];
 	struct encoded_page **encoded = arg.encoded_pages;
 	int i;

diff --git a/mm/swap.h b/mm/swap.h
index bfafa637c458..a77016f2423b 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -37,6 +37,7 @@ struct swap_cluster_info {
 	u8 flags;
 	u8 order;
 	atomic_long_t __rcu *table;	/* Swap table entries, see mm/swap_table.h */
+	unsigned int *extend_table;	/* For large swap count, protected by ci->lock */
 	struct list_head list;
 };

@@ -84,7 +85,7 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
 		struct swap_info_struct *si, pgoff_t offset)
 {
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-	VM_WARN_ON_ONCE(offset >= si->max);
+	VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
 	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }

@@ -183,6 +184,8 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
 	spin_unlock_irq(&ci->lock);
 }

+extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
+
 /*
  * Below are the core routines for doing swap for a folio.
  * All helpers requires the folio to be locked, and a locked folio
@@ -192,12 +195,13 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
  *
  * folio_alloc_swap(): the entry point for a folio to be swapped
  * out. It allocates swap slots and pins the slots with swap cache.
- * The slots start with a swap count of zero.
+ * The slots start with a swap count of zero. The slots are pinned
+ * by swap cache reference which doesn't contribute to swap count.
  *
  * folio_dup_swap(): increases the swap count of a folio, usually
  * during it gets unmapped and a swap entry is installed to replace
  * it (e.g., swap entry in page table). A swap slot with swap
- * count == 0 should only be increasd by this helper.
+ * count == 0 can only be increased by this helper.
  *
  * folio_put_swap(): does the opposite thing of folio_dup_swap().
  */
@@ -206,9 +210,9 @@ int folio_dup_swap(struct folio *folio, struct page *subpage);
 void folio_put_swap(struct folio *folio, struct page *subpage);

 /* For internal use */
-extern void swap_entries_free(struct swap_info_struct *si,
-			      struct swap_cluster_info *ci,
-			      unsigned long offset, unsigned int nr_pages);
+extern void __swap_cluster_free_entries(struct swap_info_struct *si,
+					struct swap_cluster_info *ci,
+					unsigned int ci_off, unsigned int nr_pages);

 /* linux/mm/page_io.c */
 int sio_pool_init(void);
@@ -286,7 +290,6 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci,
 			    struct folio *folio, swp_entry_t entry, void *shadow);
 void __swap_cache_replace_folio(struct swap_cluster_info *ci,
 				struct folio *old, struct folio *new);
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);

 void show_swap_cache_info(void);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
@@ -446,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
 	return 0;
 }

+static inline int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
+{
+	return -EINVAL;
+}
+
 static inline bool swap_cache_has_folio(swp_entry_t entry)
 {
 	return false;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 48aff2c917c0..1415a5c54a43 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -15,7 +15,7 @@
 #include <linux/leafops.h>
 #include <linux/init.h>
 #include <linux/pagemap.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
 #include <linux/backing-dev.h>
 #include <linux/blkdev.h>
 #include <linux/migrate.h>
@@ -140,21 +140,20 @@ void *swap_cache_get_shadow(swp_entry_t entry)
 void __swap_cache_add_folio(struct swap_cluster_info *ci,
 			    struct folio *folio, swp_entry_t entry)
 {
-	unsigned long new_tb;
-	unsigned int ci_start, ci_off, ci_end;
+	unsigned int ci_off = swp_cluster_offset(entry), ci_end;
 	unsigned long nr_pages = folio_nr_pages(folio);
+	unsigned long pfn = folio_pfn(folio);
+	unsigned long old_tb;
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio); VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio); - new_tb = folio_to_swp_tb(folio); - ci_start = swp_cluster_offset(entry); - ci_off = ci_start; - ci_end = ci_start + nr_pages; + ci_end = ci_off + nr_pages; do { - VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off))); - __swap_table_set(ci, ci_off, new_tb); + old_tb = __swap_table_get(ci, ci_off); + VM_WARN_ON_ONCE(swp_tb_is_folio(old_tb)); + __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb))); } while (++ci_off < ci_end); folio_ref_add(folio, nr_pages); @@ -183,14 +182,13 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, unsigned long old_tb; struct swap_info_struct *si; struct swap_cluster_info *ci; - unsigned int ci_start, ci_off, ci_end, offset; + unsigned int ci_start, ci_off, ci_end; unsigned long nr_pages = folio_nr_pages(folio); si = __swap_entry_to_info(entry); ci_start = swp_cluster_offset(entry); ci_end = ci_start + nr_pages; ci_off = ci_start; - offset = swp_offset(entry); ci = swap_cluster_lock(si, swp_offset(entry)); if (unlikely(!ci->table)) { err = -ENOENT; @@ -202,13 +200,12 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, err = -EEXIST; goto failed; } - if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) { + if (unlikely(!__swp_tb_get_count(old_tb))) { err = -ENOENT; goto failed; } if (swp_tb_is_shadow(old_tb)) shadow = swp_tb_to_shadow(old_tb); - offset++; } while (++ci_off < ci_end); __swap_cache_add_folio(ci, folio, entry); swap_cluster_unlock(ci); @@ -237,8 +234,9 @@ failed: void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, swp_entry_t entry, void *shadow) { + int count; + unsigned long old_tb; struct swap_info_struct *si; - unsigned long old_tb, new_tb; unsigned int ci_start, ci_off, ci_end; bool folio_swapped = false, need_free = false; 
unsigned long nr_pages = folio_nr_pages(folio); @@ -249,20 +247,20 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio); si = __swap_entry_to_info(entry); - new_tb = shadow_swp_to_tb(shadow); ci_start = swp_cluster_offset(entry); ci_end = ci_start + nr_pages; ci_off = ci_start; do { - /* If shadow is NULL, we sets an empty shadow */ - old_tb = __swap_table_xchg(ci, ci_off, new_tb); + old_tb = __swap_table_get(ci, ci_off); WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != folio); - if (__swap_count(swp_entry(si->type, - swp_offset(entry) + ci_off - ci_start))) + count = __swp_tb_get_count(old_tb); + if (count) folio_swapped = true; else need_free = true; + /* If shadow is NULL, we sets an empty shadow. */ + __swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow, count)); } while (++ci_off < ci_end); folio->swap.val = 0; @@ -271,13 +269,13 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio, lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages); if (!folio_swapped) { - swap_entries_free(si, ci, swp_offset(entry), nr_pages); + __swap_cluster_free_entries(si, ci, ci_start, nr_pages); } else if (need_free) { + ci_off = ci_start; do { - if (!__swap_count(entry)) - swap_entries_free(si, ci, swp_offset(entry), 1); - entry.val++; - } while (--nr_pages); + if (!__swp_tb_get_count(__swap_table_get(ci, ci_off))) + __swap_cluster_free_entries(si, ci, ci_off, 1); + } while (++ci_off < ci_end); } } @@ -324,17 +322,18 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci, unsigned long nr_pages = folio_nr_pages(new); unsigned int ci_off = swp_cluster_offset(entry); unsigned int ci_end = ci_off + nr_pages; - unsigned long old_tb, new_tb; + unsigned long pfn = folio_pfn(new); + unsigned long old_tb; VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new)); VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new)); 
VM_WARN_ON_ONCE(!entry.val); /* Swap cache still stores N entries instead of a high-order entry */ - new_tb = folio_to_swp_tb(new); do { - old_tb = __swap_table_xchg(ci, ci_off, new_tb); + old_tb = __swap_table_get(ci, ci_off); WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old); + __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb))); } while (++ci_off < ci_end); /* @@ -351,27 +350,6 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci, } } -/** - * __swap_cache_clear_shadow - Clears a set of shadows in the swap cache. - * @entry: The starting index entry. - * @nr_ents: How many slots need to be cleared. - * - * Context: Caller must ensure the range is valid, all in one single cluster, - * not occupied by any folio, and lock the cluster. - */ -void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents) -{ - struct swap_cluster_info *ci = __swap_entry_to_cluster(entry); - unsigned int ci_off = swp_cluster_offset(entry), ci_end; - unsigned long old; - - ci_end = ci_off + nr_ents; - do { - old = __swap_table_xchg(ci, ci_off, null_to_swp_tb()); - WARN_ON_ONCE(swp_tb_is_folio(old)); - } while (++ci_off < ci_end); -} - /* * If we are the only user, then try to free up the swap cache. * @@ -407,7 +385,7 @@ void free_folio_and_swap_cache(struct folio *folio) void free_pages_and_swap_cache(struct encoded_page **pages, int nr) { struct folio_batch folios; - unsigned int refs[PAGEVEC_SIZE]; + unsigned int refs[FOLIO_BATCH_SIZE]; folio_batch_init(&folios); for (int i = 0; i < nr; i++) { diff --git a/mm/swap_table.h b/mm/swap_table.h index ea244a57a5b7..8415ffbe2b9c 100644 --- a/mm/swap_table.h +++ b/mm/swap_table.h @@ -18,10 +18,69 @@ struct swap_table { * (physical or virtual) device. The swap table in each cluster is a * 1:1 map of the swap slots in this cluster. * - * Each swap table entry could be a pointer (folio), a XA_VALUE - * (shadow), or NULL. 
+ * Swap table entry type and bits layouts: + * + * NULL: |---------------- 0 ---------------| - Free slot + * Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1| - Swapped out slot + * PFN: | SWAP_COUNT |------ PFN -------|10| - Cached slot + * Pointer: |----------- Pointer ----------|100| - (Unused) + * Bad: |------------- 1 -------------|1000| - Bad slot + * + * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long. + * + * Usages: + * + * - NULL: Swap slot is unused, could be allocated. + * + * - Shadow: Swap slot is used and not cached (usually swapped out). It reuses + * the XA_VALUE format to be compatible with working set shadows. SHADOW_VAL + * part might be all 0 if the working shadow info is absent. In such a case, + * we still want to keep the shadow format as a placeholder. + * + * Memcg ID is embedded in SHADOW_VAL. + * + * - PFN: Swap slot is in use, and cached. Memcg info is recorded on the page + * struct. + * + * - Pointer: Unused yet. `0b100` is reserved for potential pointer usage + * because only the lower three bits can be used as a marker for 8 bytes + * aligned pointers. + * + * - Bad: Swap slot is reserved, protects swap header or holes on swap devices. 
*/ +#if defined(MAX_POSSIBLE_PHYSMEM_BITS) +#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) +#elif defined(MAX_PHYSMEM_BITS) +#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) +#else +#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT) +#endif + +/* NULL Entry, all 0 */ +#define SWP_TB_NULL 0UL + +/* Swapped out: shadow */ +#define SWP_TB_SHADOW_MARK 0b1UL + +/* Cached: PFN */ +#define SWP_TB_PFN_BITS (SWAP_CACHE_PFN_BITS + SWP_TB_PFN_MARK_BITS) +#define SWP_TB_PFN_MARK 0b10UL +#define SWP_TB_PFN_MARK_BITS 2 +#define SWP_TB_PFN_MARK_MASK (BIT(SWP_TB_PFN_MARK_BITS) - 1) + +/* SWAP_COUNT part for PFN or shadow, the width can be shrunk or extended */ +#define SWP_TB_COUNT_BITS min(4, BITS_PER_LONG - SWP_TB_PFN_BITS) +#define SWP_TB_COUNT_MASK (~((~0UL) >> SWP_TB_COUNT_BITS)) +#define SWP_TB_COUNT_SHIFT (BITS_PER_LONG - SWP_TB_COUNT_BITS) +#define SWP_TB_COUNT_MAX ((1 << SWP_TB_COUNT_BITS) - 1) + +/* Bad slot: ends with 0b1000 and rests of bits are all 1 */ +#define SWP_TB_BAD ((~0UL) << 3) + +/* Macro for shadow offset calculation */ +#define SWAP_COUNT_SHIFT SWP_TB_COUNT_BITS + /* * Helpers for casting one type of info into a swap table entry. */ @@ -31,18 +90,47 @@ static inline unsigned long null_to_swp_tb(void) return 0; } -static inline unsigned long folio_to_swp_tb(struct folio *folio) +static inline unsigned long __count_to_swp_tb(unsigned char count) { + /* + * At least three values are needed to distinguish free (0), + * used (count > 0 && count < SWP_TB_COUNT_MAX), and + * overflow (count == SWP_TB_COUNT_MAX). 
+ */ + BUILD_BUG_ON(SWP_TB_COUNT_MAX < 2 || SWP_TB_COUNT_BITS < 2); + VM_WARN_ON(count > SWP_TB_COUNT_MAX); + return ((unsigned long)count) << SWP_TB_COUNT_SHIFT; +} + +static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned int count) +{ + unsigned long swp_tb; + BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *)); - return (unsigned long)folio; + BUILD_BUG_ON(SWAP_CACHE_PFN_BITS > + (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_COUNT_BITS)); + + swp_tb = (pfn << SWP_TB_PFN_MARK_BITS) | SWP_TB_PFN_MARK; + VM_WARN_ON_ONCE(swp_tb & SWP_TB_COUNT_MASK); + + return swp_tb | __count_to_swp_tb(count); +} + +static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned int count) +{ + return pfn_to_swp_tb(folio_pfn(folio), count); } -static inline unsigned long shadow_swp_to_tb(void *shadow) +static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned int count) { BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) != BITS_PER_BYTE * sizeof(unsigned long)); + BUILD_BUG_ON((unsigned long)xa_mk_value(0) != SWP_TB_SHADOW_MARK); + VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow)); - return (unsigned long)shadow; + VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_COUNT_MASK)); + + return (unsigned long)shadow | __count_to_swp_tb(count) | SWP_TB_SHADOW_MARK; } /* @@ -55,7 +143,7 @@ static inline bool swp_tb_is_null(unsigned long swp_tb) static inline bool swp_tb_is_folio(unsigned long swp_tb) { - return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb); + return ((swp_tb & SWP_TB_PFN_MARK_MASK) == SWP_TB_PFN_MARK); } static inline bool swp_tb_is_shadow(unsigned long swp_tb) @@ -63,19 +151,49 @@ static inline bool swp_tb_is_shadow(unsigned long swp_tb) return xa_is_value((void *)swp_tb); } +static inline bool swp_tb_is_bad(unsigned long swp_tb) +{ + return swp_tb == SWP_TB_BAD; +} + +static inline bool swp_tb_is_countable(unsigned long swp_tb) +{ + return (swp_tb_is_shadow(swp_tb) || swp_tb_is_folio(swp_tb) || + swp_tb_is_null(swp_tb)); +} + /* 
* Helpers for retrieving info from swap table. */ static inline struct folio *swp_tb_to_folio(unsigned long swp_tb) { VM_WARN_ON(!swp_tb_is_folio(swp_tb)); - return (void *)swp_tb; + return pfn_folio((swp_tb & ~SWP_TB_COUNT_MASK) >> SWP_TB_PFN_MARK_BITS); } static inline void *swp_tb_to_shadow(unsigned long swp_tb) { VM_WARN_ON(!swp_tb_is_shadow(swp_tb)); - return (void *)swp_tb; + /* No shift needed, xa_value is stored as it is in the lower bits. */ + return (void *)(swp_tb & ~SWP_TB_COUNT_MASK); +} + +static inline unsigned char __swp_tb_get_count(unsigned long swp_tb) +{ + VM_WARN_ON(!swp_tb_is_countable(swp_tb)); + return ((swp_tb & SWP_TB_COUNT_MASK) >> SWP_TB_COUNT_SHIFT); +} + +static inline int swp_tb_get_count(unsigned long swp_tb) +{ + if (swp_tb_is_countable(swp_tb)) + return __swp_tb_get_count(swp_tb); + return -EINVAL; +} + +static inline unsigned long __swp_tb_mk_count(unsigned long swp_tb, int count) +{ + return ((swp_tb & ~SWP_TB_COUNT_MASK) | __count_to_swp_tb(count)); } /* @@ -120,6 +238,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci, atomic_long_t *table; unsigned long swp_tb; + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER); + rcu_read_lock(); table = rcu_dereference(ci->table); swp_tb = table ? 
atomic_long_read(&table[off]) : null_to_swp_tb(); diff --git a/mm/swapfile.c b/mm/swapfile.c index 60e21414624b..9174f1eeffb0 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -48,23 +48,22 @@ #include <linux/swap_cgroup.h> #include "swap_table.h" #include "internal.h" -#include "swap_table.h" #include "swap.h" -static bool swap_count_continued(struct swap_info_struct *, pgoff_t, - unsigned char); -static void free_swap_count_continuations(struct swap_info_struct *); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); -static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr); -static void swap_put_entry_locked(struct swap_info_struct *si, - struct swap_cluster_info *ci, - unsigned long offset); static bool folio_swapcache_freeable(struct folio *folio); static void move_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci, struct list_head *list, enum swap_cluster_flags new_flags); +/* + * Protects the swap_info array, and the SWP_USED flag. swap_info contains + * lazily allocated & freed swap device info struts, and SWP_USED indicates + * which device is used, ~SWP_USED devices and can be reused. + * + * Also protects swap_active_head total_swap_pages, and the SWP_WRITEOK flag. 
+ */ static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; atomic_long_t nr_swap_pages; @@ -110,6 +109,7 @@ struct swap_info_struct *swap_info[MAX_SWAPFILES]; static struct kmem_cache *swap_table_cachep; +/* Protects si->swap_file for /proc/swaps usage */ static DEFINE_MUTEX(swapon_mutex); static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait); @@ -174,22 +174,19 @@ static long swap_usage_in_pages(struct swap_info_struct *si) /* Reclaim the swap entry if swap is getting full */ #define TTRS_FULL 0x4 -static bool swap_only_has_cache(struct swap_info_struct *si, - struct swap_cluster_info *ci, +static bool swap_only_has_cache(struct swap_cluster_info *ci, unsigned long offset, int nr_pages) { unsigned int ci_off = offset % SWAPFILE_CLUSTER; - unsigned char *map = si->swap_map + offset; - unsigned char *map_end = map + nr_pages; + unsigned int ci_end = ci_off + nr_pages; unsigned long swp_tb; do { swp_tb = __swap_table_get(ci, ci_off); VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb)); - if (*map) + if (swp_tb_get_count(swp_tb)) return false; - ++ci_off; - } while (++map < map_end); + } while (++ci_off < ci_end); return true; } @@ -248,7 +245,7 @@ again: * reference or pending writeback, and can't be allocated to others. */ ci = swap_cluster_lock(si, offset); - need_reclaim = swap_only_has_cache(si, ci, offset, nr_pages); + need_reclaim = swap_only_has_cache(ci, offset, nr_pages); swap_cluster_unlock(ci); if (!need_reclaim) goto out_unlock; @@ -446,16 +443,40 @@ static void swap_table_free(struct swap_table *table) swap_table_free_folio_rcu_cb); } +/* + * Sanity check to ensure nothing leaked, and the specified range is empty. + * One special case is that bad slots can't be freed, so check the number of + * bad slots for swapoff, and non-swapoff path must never free bad slots. 
+ */ +static void swap_cluster_assert_empty(struct swap_cluster_info *ci, + unsigned int ci_off, unsigned int nr, + bool swapoff) +{ + unsigned int ci_end = ci_off + nr; + unsigned long swp_tb; + int bad_slots = 0; + + if (!IS_ENABLED(CONFIG_DEBUG_VM) && !swapoff) + return; + + do { + swp_tb = __swap_table_get(ci, ci_off); + if (swp_tb_is_bad(swp_tb)) + bad_slots++; + else + WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); + } while (++ci_off < ci_end); + + WARN_ON_ONCE(bad_slots != (swapoff ? ci->count : 0)); + WARN_ON_ONCE(nr == SWAPFILE_CLUSTER && ci->extend_table); +} + static void swap_cluster_free_table(struct swap_cluster_info *ci) { - unsigned int ci_off; struct swap_table *table; /* Only empty cluster's table is allow to be freed */ lockdep_assert_held(&ci->lock); - VM_WARN_ON_ONCE(!cluster_is_empty(ci)); - for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) - VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off))); table = (void *)rcu_dereference_protected(ci->table, true); rcu_assign_pointer(ci->table, NULL); @@ -476,8 +497,10 @@ swap_cluster_alloc_table(struct swap_info_struct *si, * Only cluster isolation from the allocator does table allocation. * Swap allocator uses percpu clusters and holds the local lock. */ - lockdep_assert_held(&ci->lock); lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock); + if (!(si->flags & SWP_SOLIDSTATE)) + lockdep_assert_held(&si->global_cluster_lock); + lockdep_assert_held(&ci->lock); /* The cluster must be free and was just isolated from the free list. 
*/ VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); @@ -559,6 +582,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { + swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, false); swap_cluster_free_table(ci); move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); ci->order = 0; @@ -577,6 +601,7 @@ static struct swap_cluster_info *isolate_lock_cluster( struct swap_info_struct *si, struct list_head *list) { struct swap_cluster_info *ci, *found = NULL; + u8 flags = CLUSTER_FLAG_NONE; spin_lock(&si->lock); list_for_each_entry(ci, list, list) { @@ -589,6 +614,7 @@ static struct swap_cluster_info *isolate_lock_cluster( ci->flags != CLUSTER_FLAG_FULL); list_del(&ci->list); + flags = ci->flags; ci->flags = CLUSTER_FLAG_NONE; found = ci; break; @@ -597,6 +623,7 @@ static struct swap_cluster_info *isolate_lock_cluster( if (found && !cluster_table_is_alloced(found)) { /* Only an empty free cluster's swap table can be freed. */ + VM_WARN_ON_ONCE(flags != CLUSTER_FLAG_FREE); VM_WARN_ON_ONCE(list != &si->free_clusters); VM_WARN_ON_ONCE(!cluster_is_empty(found)); return swap_cluster_alloc_table(si, found); @@ -735,12 +762,32 @@ static void relocate_cluster(struct swap_info_struct *si, * slot. The cluster will not be added to the free cluster list, and its * usage counter will be increased by 1. Only used for initialization. 
*/ -static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info, - unsigned long offset) +static int swap_cluster_setup_bad_slot(struct swap_info_struct *si, + struct swap_cluster_info *cluster_info, + unsigned int offset, bool mask) { + unsigned int ci_off = offset % SWAPFILE_CLUSTER; unsigned long idx = offset / SWAPFILE_CLUSTER; - struct swap_table *table; struct swap_cluster_info *ci; + struct swap_table *table; + int ret = 0; + + /* si->max may got shrunk by swap swap_activate() */ + if (offset >= si->max && !mask) { + pr_debug("Ignoring bad slot %u (max: %u)\n", offset, si->max); + return 0; + } + /* + * Account it, skip header slot: si->pages is initiated as + * si->max - 1. Also skip the masking of last cluster, + * si->pages doesn't include that part. + */ + if (offset && !mask) + si->pages -= 1; + if (!si->pages) { + pr_warn("Empty swap-file\n"); + return -EINVAL; + } ci = cluster_info + idx; if (!ci->table) { @@ -749,13 +796,20 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info, return -ENOMEM; rcu_assign_pointer(ci->table, table); } - - ci->count++; + spin_lock(&ci->lock); + /* Check for duplicated bad swap slots. 
*/ + if (__swap_table_xchg(ci, ci_off, SWP_TB_BAD) != SWP_TB_NULL) { + pr_warn("Duplicated bad slot offset %d\n", offset); + ret = -EINVAL; + } else { + ci->count++; + } + spin_unlock(&ci->lock); WARN_ON(ci->count > SWAPFILE_CLUSTER); WARN_ON(ci->flags); - return 0; + return ret; } /* @@ -769,18 +823,16 @@ static bool cluster_reclaim_range(struct swap_info_struct *si, { unsigned int nr_pages = 1 << order; unsigned long offset = start, end = start + nr_pages; - unsigned char *map = si->swap_map; unsigned long swp_tb; spin_unlock(&ci->lock); do { - if (READ_ONCE(map[offset])) - break; swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER); - if (swp_tb_is_folio(swp_tb)) { + if (swp_tb_get_count(swp_tb)) + break; + if (swp_tb_is_folio(swp_tb)) if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0) break; - } } while (++offset < end); spin_lock(&ci->lock); @@ -804,7 +856,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si, */ for (offset = start; offset < end; offset++) { swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER); - if (map[offset] || !swp_tb_is_null(swp_tb)) + if (!swp_tb_is_null(swp_tb)) return false; } @@ -816,57 +868,35 @@ static bool cluster_scan_range(struct swap_info_struct *si, unsigned long offset, unsigned int nr_pages, bool *need_reclaim) { - unsigned long end = offset + nr_pages; - unsigned char *map = si->swap_map; + unsigned int ci_off = offset % SWAPFILE_CLUSTER; + unsigned int ci_end = ci_off + nr_pages; unsigned long swp_tb; - if (cluster_is_empty(ci)) - return true; - do { - if (map[offset]) - return false; - swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER); - if (swp_tb_is_folio(swp_tb)) { + swp_tb = __swap_table_get(ci, ci_off); + if (swp_tb_is_null(swp_tb)) + continue; + if (swp_tb_is_folio(swp_tb) && !__swp_tb_get_count(swp_tb)) { if (!vm_swap_full()) return false; *need_reclaim = true; - } else { - /* A entry with no count and no cache must be null */ - VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); + continue; } 
- } while (++offset < end); + /* Slot with zero count can only be NULL or folio */ + VM_WARN_ON(!swp_tb_get_count(swp_tb)); + return false; + } while (++ci_off < ci_end); return true; } -/* - * Currently, the swap table is not used for count tracking, just - * do a sanity check here to ensure nothing leaked, so the swap - * table should be empty upon freeing. - */ -static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci, - unsigned int start, unsigned int nr) -{ - unsigned int ci_off = start % SWAPFILE_CLUSTER; - unsigned int ci_end = ci_off + nr; - unsigned long swp_tb; - - if (IS_ENABLED(CONFIG_DEBUG_VM)) { - do { - swp_tb = __swap_table_get(ci, ci_off); - VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); - } while (++ci_off < ci_end); - } -} - -static bool cluster_alloc_range(struct swap_info_struct *si, - struct swap_cluster_info *ci, - struct folio *folio, - unsigned int offset) +static bool __swap_cluster_alloc_entries(struct swap_info_struct *si, + struct swap_cluster_info *ci, + struct folio *folio, + unsigned int ci_off) { - unsigned long nr_pages; unsigned int order; + unsigned long nr_pages; lockdep_assert_held(&ci->lock); @@ -885,13 +915,15 @@ static bool cluster_alloc_range(struct swap_info_struct *si, if (likely(folio)) { order = folio_order(folio); nr_pages = 1 << order; - __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset)); + swap_cluster_assert_empty(ci, ci_off, nr_pages, false); + __swap_cache_add_folio(ci, folio, swp_entry(si->type, + ci_off + cluster_offset(si, ci))); } else if (IS_ENABLED(CONFIG_HIBERNATION)) { order = 0; nr_pages = 1; - WARN_ON_ONCE(si->swap_map[offset]); - si->swap_map[offset] = 1; - swap_cluster_assert_table_empty(ci, offset, 1); + swap_cluster_assert_empty(ci, ci_off, 1, false); + /* Sets a fake shadow as placeholder */ + __swap_table_set(ci, ci_off, shadow_to_swp_tb(NULL, 1)); } else { /* Allocation without folio is only possible with hibernation */ WARN_ON_ONCE(1); @@ -917,8 +949,8 @@ static unsigned 
int alloc_swap_scan_cluster(struct swap_info_struct *si, { unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID; unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER); - unsigned long end = min(start + SWAPFILE_CLUSTER, si->max); unsigned int order = likely(folio) ? folio_order(folio) : 0; + unsigned long end = start + SWAPFILE_CLUSTER; unsigned int nr_pages = 1 << order; bool need_reclaim, ret, usable; @@ -942,7 +974,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, if (!ret) continue; } - if (!cluster_alloc_range(si, ci, folio, offset)) + if (!__swap_cluster_alloc_entries(si, ci, folio, offset % SWAPFILE_CLUSTER)) break; found = offset; offset += nr_pages; @@ -989,7 +1021,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) long to_scan = 1; unsigned long offset, end; struct swap_cluster_info *ci; - unsigned char *map = si->swap_map; + unsigned long swp_tb; int nr_reclaim; if (force) @@ -1001,8 +1033,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) to_scan--; while (offset < end) { - if (!READ_ONCE(map[offset]) && - swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) { + swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER); + if (swp_tb_is_folio(swp_tb) && !__swp_tb_get_count(swp_tb)) { spin_unlock(&ci->lock); nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -1259,7 +1291,6 @@ static void swap_range_alloc(struct swap_info_struct *si, static void swap_range_free(struct swap_info_struct *si, unsigned long offset, unsigned int nr_entries) { - unsigned long begin = offset; unsigned long end = offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); unsigned int i; @@ -1284,7 +1315,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, swap_slot_free_notify(si->bdev, offset); offset++; } - __swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries); /* * 
Make sure that try_to_unuse() observes si->inuse_pages reaching 0 @@ -1411,40 +1441,127 @@ start_over: return false; } +static int swap_extend_table_alloc(struct swap_info_struct *si, + struct swap_cluster_info *ci, gfp_t gfp) +{ + void *table; + + table = kzalloc(sizeof(ci->extend_table[0]) * SWAPFILE_CLUSTER, gfp); + if (!table) + return -ENOMEM; + + spin_lock(&ci->lock); + if (!ci->extend_table) + ci->extend_table = table; + else + kfree(table); + spin_unlock(&ci->lock); + return 0; +} + +int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp) +{ + int ret; + struct swap_info_struct *si; + struct swap_cluster_info *ci; + unsigned long offset = swp_offset(entry); + + si = get_swap_device(entry); + if (!si) + return 0; + + ci = __swap_offset_to_cluster(si, offset); + ret = swap_extend_table_alloc(si, ci, gfp); + + put_swap_device(si); + return ret; +} + +static void swap_extend_table_try_free(struct swap_cluster_info *ci) +{ + unsigned long i; + bool can_free = true; + + if (!ci->extend_table) + return; + + for (i = 0; i < SWAPFILE_CLUSTER; i++) { + if (ci->extend_table[i]) + can_free = false; + } + + if (can_free) { + kfree(ci->extend_table); + ci->extend_table = NULL; + } +} + +/* Decrease the swap count of one slot, without freeing it */ +static void __swap_cluster_put_entry(struct swap_cluster_info *ci, + unsigned int ci_off) +{ + int count; + unsigned long swp_tb; + + lockdep_assert_held(&ci->lock); + swp_tb = __swap_table_get(ci, ci_off); + count = __swp_tb_get_count(swp_tb); + + VM_WARN_ON_ONCE(count <= 0); + VM_WARN_ON_ONCE(count > SWP_TB_COUNT_MAX); + + if (count == SWP_TB_COUNT_MAX) { + count = ci->extend_table[ci_off]; + /* Overflow starts with SWP_TB_COUNT_MAX */ + VM_WARN_ON_ONCE(count < SWP_TB_COUNT_MAX); + count--; + if (count == (SWP_TB_COUNT_MAX - 1)) { + ci->extend_table[ci_off] = 0; + __swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, count)); + swap_extend_table_try_free(ci); + } else { + ci->extend_table[ci_off] = count; + } + } else { + 
__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, --count)); + } +} + /** - * swap_put_entries_cluster - Decrease the swap count of a set of slots. + * swap_put_entries_cluster - Decrease the swap count of slots within one cluster * @si: The swap device. - * @start: start offset of slots. + * @offset: start offset of slots. * @nr: number of slots. - * @reclaim_cache: if true, also reclaim the swap cache. + * @reclaim_cache: if true, also reclaim the swap cache if slots are freed. * * This helper decreases the swap count of a set of slots and tries to * batch free them. Also reclaims the swap cache if @reclaim_cache is true. - * Context: The caller must ensure that all slots belong to the same - * cluster and their swap count doesn't go underflow. + * + * Context: The specified slots must be pinned by existing swap count or swap + * cache reference, so they won't be released until this helper returns. */ static void swap_put_entries_cluster(struct swap_info_struct *si, - unsigned long start, int nr, + pgoff_t offset, int nr, bool reclaim_cache) { - unsigned long offset = start, end = start + nr; - unsigned long batch_start = SWAP_ENTRY_INVALID; struct swap_cluster_info *ci; + unsigned int ci_off, ci_end; + pgoff_t end = offset + nr; bool need_reclaim = false; unsigned int nr_reclaimed; unsigned long swp_tb; - unsigned int count; + int ci_batch = -1; ci = swap_cluster_lock(si, offset); + ci_off = offset % SWAPFILE_CLUSTER; + ci_end = ci_off + nr; do { - swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER); - count = si->swap_map[offset]; - VM_WARN_ON(count < 1 || count == SWAP_MAP_BAD); - if (count == 1) { + swp_tb = __swap_table_get(ci, ci_off); + if (swp_tb_get_count(swp_tb) == 1) { /* count == 1 and non-cached slots will be batch freed. 
*/ if (!swp_tb_is_folio(swp_tb)) { - if (!batch_start) - batch_start = offset; + if (ci_batch == -1) + ci_batch = ci_off; continue; } /* count will be 0 after put, slot can be reclaimed */ @@ -1456,21 +1573,20 @@ static void swap_put_entries_cluster(struct swap_info_struct *si, * slots will be freed when folio is removed from swap cache * (__swap_cache_del_folio). */ - swap_put_entry_locked(si, ci, offset); - if (batch_start) { - swap_entries_free(si, ci, batch_start, offset - batch_start); - batch_start = SWAP_ENTRY_INVALID; + __swap_cluster_put_entry(ci, ci_off); + if (ci_batch != -1) { + __swap_cluster_free_entries(si, ci, ci_batch, ci_off - ci_batch); + ci_batch = -1; } - } while (++offset < end); + } while (++ci_off < ci_end); - if (batch_start) - swap_entries_free(si, ci, batch_start, offset - batch_start); + if (ci_batch != -1) + __swap_cluster_free_entries(si, ci, ci_batch, ci_off - ci_batch); swap_cluster_unlock(ci); if (!need_reclaim || !reclaim_cache) return; - offset = start; do { nr_reclaimed = __try_to_reclaim_swap(si, offset, TTRS_UNMAPPED | TTRS_FULL); @@ -1480,6 +1596,92 @@ static void swap_put_entries_cluster(struct swap_info_struct *si, } while (offset < end); } +/* Increase the swap count of one slot. 
*/ +static int __swap_cluster_dup_entry(struct swap_cluster_info *ci, + unsigned int ci_off) +{ + int count; + unsigned long swp_tb; + + lockdep_assert_held(&ci->lock); + swp_tb = __swap_table_get(ci, ci_off); + /* Bad or special slots can't be handled */ + if (WARN_ON_ONCE(swp_tb_is_bad(swp_tb))) + return -EINVAL; + count = __swp_tb_get_count(swp_tb); + /* Must be either cached or have a count already */ + if (WARN_ON_ONCE(!count && !swp_tb_is_folio(swp_tb))) + return -ENOENT; + + if (likely(count < (SWP_TB_COUNT_MAX - 1))) { + __swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, count + 1)); + VM_WARN_ON_ONCE(ci->extend_table && ci->extend_table[ci_off]); + } else if (count == (SWP_TB_COUNT_MAX - 1)) { + if (ci->extend_table) { + VM_WARN_ON_ONCE(ci->extend_table[ci_off]); + ci->extend_table[ci_off] = SWP_TB_COUNT_MAX; + __swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, SWP_TB_COUNT_MAX)); + } else { + return -ENOMEM; + } + } else if (count == SWP_TB_COUNT_MAX) { + VM_WARN_ON_ONCE(ci->extend_table[ci_off] >= + type_max(typeof(ci->extend_table[0]))); + ++ci->extend_table[ci_off]; + } else { + /* Never happens unless counting went wrong */ + WARN_ON_ONCE(1); + } + + return 0; +} + +/** + * swap_dup_entries_cluster: Increase the swap count of slots within one cluster. + * @si: The swap device. + * @offset: start offset of slots. + * @nr: number of slots. + * + * Context: The specified slots must be pinned by existing swap count or swap + * cache reference, so they won't be released until this helper returns. + * Return: 0 on success. -ENOMEM if the swap count maxed out (SWP_TB_COUNT_MAX) + * and failed to allocate an extended table, -EINVAL if any entry is bad entry. 
+ */ +static int swap_dup_entries_cluster(struct swap_info_struct *si, + pgoff_t offset, int nr) +{ + int err; + struct swap_cluster_info *ci; + unsigned int ci_start, ci_off, ci_end; + + ci_start = offset % SWAPFILE_CLUSTER; + ci_end = ci_start + nr; + ci_off = ci_start; + ci = swap_cluster_lock(si, offset); +restart: + do { + err = __swap_cluster_dup_entry(ci, ci_off); + if (unlikely(err)) { + if (err == -ENOMEM) { + spin_unlock(&ci->lock); + err = swap_extend_table_alloc(si, ci, GFP_ATOMIC); + spin_lock(&ci->lock); + if (!err) + goto restart; + } + goto failed; + } + } while (++ci_off < ci_end); + swap_cluster_unlock(ci); + return 0; +failed: + while (ci_off-- > ci_start) + __swap_cluster_put_entry(ci, ci_off); + swap_extend_table_try_free(ci); + swap_cluster_unlock(ci); + return err; +} + /** * folio_alloc_swap - allocate swap space for a folio * @folio: folio we want to move to swap @@ -1543,18 +1745,19 @@ again: * @subpage: if not NULL, only increase the swap count of this subpage. * * Typically called when the folio is unmapped and have its swap entry to - * take its palce. + * take its place: Swap entries allocated to a folio has count == 0 and pinned + * by swap cache. The swap cache pin doesn't increase the swap count. This + * helper sets the initial count == 1 and increases the count as the folio is + * unmapped and swap entries referencing the slots are generated to replace + * the folio. * * Context: Caller must ensure the folio is locked and in the swap cache. * NOTE: The caller also has to ensure there is no raced call to * swap_put_entries_direct on its swap entry before this helper returns, or - * the swap map may underflow. Currently, we only accept @subpage == NULL - * for shmem due to the limitation of swap continuation: shmem always - * duplicates the swap entry only once, so there is no such issue for it. + * the swap count may underflow. 
  */
 int folio_dup_swap(struct folio *folio, struct page *subpage)
 {
-	int err = 0;
 	swp_entry_t entry = folio->swap;
 	unsigned long nr_pages = folio_nr_pages(folio);
 
@@ -1566,10 +1769,8 @@ int folio_dup_swap(struct folio *folio, struct page *subpage)
 		nr_pages = 1;
 	}
 
-	while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
-		err = add_swap_count_continuation(entry, GFP_ATOMIC);
-
-	return err;
+	return swap_dup_entries_cluster(swap_entry_to_info(entry),
+					swp_offset(entry), nr_pages);
 }
 
 /**
@@ -1598,28 +1799,6 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
 	swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
 }
 
-static void swap_put_entry_locked(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci,
-				  unsigned long offset)
-{
-	unsigned char count;
-
-	count = si->swap_map[offset];
-	if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
-		if (count == COUNT_CONTINUED) {
-			if (swap_count_continued(si, offset, count))
-				count = SWAP_MAP_MAX | COUNT_CONTINUED;
-			else
-				count = SWAP_MAP_MAX;
-		} else
-			count--;
-	}
-
-	WRITE_ONCE(si->swap_map[offset], count);
-	if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)))
-		swap_entries_free(si, ci, offset, 1);
-}
-
 /*
  * When we get a swap entry, if there aren't some other ways to
  * prevent swapoff, such as the folio in swap cache is locked, RCU
@@ -1686,31 +1865,30 @@ put_out:
 }
 
 /*
- * Drop the last ref of swap entries, caller have to ensure all entries
- * belong to the same cgroup and cluster.
+ * Free a set of swap slots after their swap count dropped to zero, or will be
+ * zero after putting the last ref (saves one __swap_cluster_put_entry call).
  */
-void swap_entries_free(struct swap_info_struct *si,
-		       struct swap_cluster_info *ci,
-		       unsigned long offset, unsigned int nr_pages)
+void __swap_cluster_free_entries(struct swap_info_struct *si,
+				 struct swap_cluster_info *ci,
+				 unsigned int ci_start, unsigned int nr_pages)
 {
-	swp_entry_t entry = swp_entry(si->type, offset);
-	unsigned char *map = si->swap_map + offset;
-	unsigned char *map_end = map + nr_pages;
+	unsigned long old_tb;
+	unsigned int ci_off = ci_start, ci_end = ci_start + nr_pages;
+	unsigned long offset = cluster_offset(si, ci) + ci_start;
 
-	/* It should never free entries across different clusters */
-	VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
-	VM_BUG_ON(cluster_is_empty(ci));
-	VM_BUG_ON(ci->count < nr_pages);
+	VM_WARN_ON(ci->count < nr_pages);
 
 	ci->count -= nr_pages;
 	do {
-		VM_WARN_ON(*map > 1);
-		*map = 0;
-	} while (++map < map_end);
+		old_tb = __swap_table_get(ci, ci_off);
+		/* Release the last ref, or after swap cache is dropped */
+		VM_WARN_ON(!swp_tb_is_shadow(old_tb) || __swp_tb_get_count(old_tb) > 1);
+		__swap_table_set(ci, ci_off, null_to_swp_tb());
+	} while (++ci_off < ci_end);
 
-	mem_cgroup_uncharge_swap(entry, nr_pages);
+	mem_cgroup_uncharge_swap(swp_entry(si->type, offset), nr_pages);
 	swap_range_free(si, offset, nr_pages);
-	swap_cluster_assert_table_empty(ci, offset, nr_pages);
+	swap_cluster_assert_empty(ci, ci_start, nr_pages, false);
 
 	if (!ci->count)
 		free_cluster(si, ci);
@@ -1720,10 +1898,10 @@ void swap_entries_free(struct swap_info_struct *si,
 
 int __swap_count(swp_entry_t entry)
 {
-	struct swap_info_struct *si = __swap_entry_to_info(entry);
-	pgoff_t offset = swp_offset(entry);
+	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
+	unsigned int ci_off = swp_cluster_offset(entry);
 
-	return si->swap_map[offset];
+	return swp_tb_get_count(__swap_table_get(ci, ci_off));
 }
 
 /**
@@ -1735,103 +1913,79 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
 {
 	pgoff_t offset = swp_offset(entry);
 	struct swap_cluster_info *ci;
-	int count;
+	unsigned long swp_tb;
 
 	ci = swap_cluster_lock(si, offset);
-	count = si->swap_map[offset];
+	swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
 	swap_cluster_unlock(ci);
 
-	return count && count != SWAP_MAP_BAD;
+	return swp_tb_get_count(swp_tb) > 0;
 }
 
 /*
  * How many references to @entry are currently swapped out?
- * This considers COUNT_CONTINUED so it returns exact answer.
+ * This returns exact answer.
  */
 int swp_swapcount(swp_entry_t entry)
 {
-	int count, tmp_count, n;
 	struct swap_info_struct *si;
 	struct swap_cluster_info *ci;
-	struct page *page;
-	pgoff_t offset;
-	unsigned char *map;
+	unsigned long swp_tb;
+	int count;
 
 	si = get_swap_device(entry);
 	if (!si)
 		return 0;
 
-	offset = swp_offset(entry);
-
-	ci = swap_cluster_lock(si, offset);
-
-	count = si->swap_map[offset];
-	if (!(count & COUNT_CONTINUED))
-		goto out;
-
-	count &= ~COUNT_CONTINUED;
-	n = SWAP_MAP_MAX + 1;
-
-	page = vmalloc_to_page(si->swap_map + offset);
-	offset &= ~PAGE_MASK;
-	VM_BUG_ON(page_private(page) != SWP_CONTINUED);
-
-	do {
-		page = list_next_entry(page, lru);
-		map = kmap_local_page(page);
-		tmp_count = map[offset];
-		kunmap_local(map);
-
-		count += (tmp_count & ~COUNT_CONTINUED) * n;
-		n *= (SWAP_CONT_MAX + 1);
-	} while (tmp_count & COUNT_CONTINUED);
-out:
+	ci = swap_cluster_lock(si, swp_offset(entry));
+	swp_tb = __swap_table_get(ci, swp_cluster_offset(entry));
+	count = swp_tb_get_count(swp_tb);
+	if (count == SWP_TB_COUNT_MAX)
+		count = ci->extend_table[swp_cluster_offset(entry)];
 	swap_cluster_unlock(ci);
 	put_swap_device(si);
-	return count;
-}
 
-static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry, int order)
-{
-	struct swap_cluster_info *ci;
-	unsigned char *map = si->swap_map;
-	unsigned int nr_pages = 1 << order;
-	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, nr_pages);
-	int i;
-	bool ret = false;
-
-	ci = swap_cluster_lock(si, offset);
-	if (nr_pages == 1) {
-		if (map[roffset])
-			ret = true;
-		goto unlock_out;
-	}
-	for (i = 0; i < nr_pages; i++) {
-		if (map[offset + i]) {
-			ret = true;
-			break;
-		}
-	}
-unlock_out:
-	swap_cluster_unlock(ci);
-	return ret;
+	return count < 0 ? 0 : count;
 }
 
-static bool folio_swapped(struct folio *folio)
+/*
+ * folio_maybe_swapped - Test if a folio covers any swap slot with count > 0.
+ *
+ * Check if a folio is swapped. Holding the folio lock ensures the folio won't
+ * go from not-swapped to swapped because the initial swap count increment can
+ * only be done by folio_dup_swap, which also locks the folio. But a concurrent
+ * decrease of swap count is possible through swap_put_entries_direct, so this
+ * may return a false positive.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ */
+static bool folio_maybe_swapped(struct folio *folio)
 {
 	swp_entry_t entry = folio->swap;
-	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned int ci_off, ci_end;
+	bool ret = false;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 
-	si = __swap_entry_to_info(entry);
-	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
-		return swap_entry_swapped(si, entry);
+	ci = __swap_entry_to_cluster(entry);
+	ci_off = swp_cluster_offset(entry);
+	ci_end = ci_off + folio_nr_pages(folio);
+	/*
+	 * Extra locking not needed, folio lock ensures its swap entries
+	 * won't be released, the backing data won't be gone either.
+	 */
+	rcu_read_lock();
+	do {
+		if (__swp_tb_get_count(__swap_table_get(ci, ci_off))) {
+			ret = true;
+			break;
+		}
+	} while (++ci_off < ci_end);
+	rcu_read_unlock();
 
-	return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
+	return ret;
 }
 
 static bool folio_swapcache_freeable(struct folio *folio)
@@ -1877,7 +2031,7 @@ bool folio_free_swap(struct folio *folio)
 {
 	if (!folio_swapcache_freeable(folio))
 		return false;
-	if (folio_swapped(folio))
+	if (folio_maybe_swapped(folio))
 		return false;
 
 	swap_cache_del_folio(folio);
@@ -1926,8 +2080,9 @@ out:
 /* Allocate a slot for hibernation */
 swp_entry_t swap_alloc_hibernation_slot(int type)
 {
-	struct swap_info_struct *si = swap_type_to_info(type);
-	unsigned long offset;
+	struct swap_info_struct *pcp_si, *si = swap_type_to_info(type);
+	unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID;
+	struct swap_cluster_info *ci;
 	swp_entry_t entry = {0};
 
 	if (!si)
@@ -1937,11 +2092,21 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 	if (get_swap_device_info(si)) {
 		if (si->flags & SWP_WRITEOK) {
 			/*
-			 * Grab the local lock to be compliant
-			 * with swap table allocation.
+			 * Try the local cluster first if it matches the device. If
+			 * not, try grab a new cluster and override local cluster.
 			 */
 			local_lock(&percpu_swap_cluster.lock);
-			offset = cluster_alloc_swap_entry(si, NULL);
+			pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
+			pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
+			if (pcp_si == si && pcp_offset) {
+				ci = swap_cluster_lock(si, pcp_offset);
+				if (cluster_is_usable(ci, 0))
+					offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
+				else
+					swap_cluster_unlock(ci);
+			}
+			if (!offset)
+				offset = cluster_alloc_swap_entry(si, NULL);
 			local_unlock(&percpu_swap_cluster.lock);
 			if (offset)
 				entry = swp_entry(si->type, offset);
@@ -1964,7 +2129,8 @@ void swap_free_hibernation_slot(swp_entry_t entry)
 		return;
 
 	ci = swap_cluster_lock(si, offset);
-	swap_put_entry_locked(si, ci, offset);
+	__swap_cluster_put_entry(ci, offset % SWAPFILE_CLUSTER);
+	__swap_cluster_free_entries(si, ci, offset % SWAPFILE_CLUSTER, 1);
 	swap_cluster_unlock(ci);
 
 	/* In theory readahead might add it to the swap cache by accident */
@@ -2190,13 +2356,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned int type)
 {
 	pte_t *pte = NULL;
-	struct swap_info_struct *si;
 
-	si = swap_info[type];
 	do {
 		struct folio *folio;
-		unsigned long offset;
-		unsigned char swp_count;
+		unsigned long swp_tb;
 		softleaf_t entry;
 		int ret;
 		pte_t ptent;
@@ -2215,7 +2378,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		if (swp_type(entry) != type)
 			continue;
 
-		offset = swp_offset(entry);
 		pte_unmap(pte);
 		pte = NULL;
 
@@ -2232,8 +2394,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 						&vmf);
 		}
 		if (!folio) {
-			swp_count = READ_ONCE(si->swap_map[offset]);
-			if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
+			swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+						swp_cluster_offset(entry));
+			if (swp_tb_get_count(swp_tb) <= 0)
 				continue;
 			return -ENOMEM;
 		}
@@ -2361,7 +2524,7 @@ unlock:
 }
 
 /*
- * Scan swap_map from current position to next entry still in use.
+ * Scan swap table from current position to next entry still in use.
  * Return 0 if there are no inuse entries after prev till end of
  * the map.
  */
@@ -2370,7 +2533,6 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 {
 	unsigned int i;
 	unsigned long swp_tb;
-	unsigned char count;
 
 	/*
 	 * No need for swap_lock here: we're just looking
@@ -2379,12 +2541,9 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 	 * allocations from this area (while holding swap_lock).
 	 */
 	for (i = prev + 1; i < si->max; i++) {
-		count = READ_ONCE(si->swap_map[i]);
 		swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
 					i % SWAPFILE_CLUSTER);
-		if (count == SWAP_MAP_BAD)
-			continue;
-		if (count || swp_tb_is_folio(swp_tb))
+		if (!swp_tb_is_null(swp_tb) && !swp_tb_is_bad(swp_tb))
 			break;
 		if ((i % LATENCY_LIMIT) == 0)
 			cond_resched();
@@ -2521,7 +2680,8 @@ static void drain_mmlist(void)
 /*
  * Free all of a swapdev's extent information
  */
-static void destroy_swap_extents(struct swap_info_struct *sis)
+static void destroy_swap_extents(struct swap_info_struct *sis,
+				 struct file *swap_file)
 {
 	while (!RB_EMPTY_ROOT(&sis->swap_extent_root)) {
 		struct rb_node *rb = sis->swap_extent_root.rb_node;
@@ -2532,7 +2692,6 @@ static void destroy_swap_extents(struct swap_info_struct *sis)
 	}
 
 	if (sis->flags & SWP_ACTIVATED) {
-		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 
 		sis->flags &= ~SWP_ACTIVATED;
@@ -2615,9 +2774,9 @@ EXPORT_SYMBOL_GPL(add_swap_extent);
  * Typically it is in the 1-4 megabyte range. So we can have hundreds of
  * extents in the rbtree. - akpm.
  */
-static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
+static int setup_swap_extents(struct swap_info_struct *sis,
+			      struct file *swap_file, sector_t *span)
 {
-	struct file *swap_file = sis->swap_file;
 	struct address_space *mapping = swap_file->f_mapping;
 	struct inode *inode = mapping->host;
 	int ret;
@@ -2635,7 +2794,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 		sis->flags |= SWP_ACTIVATED;
 		if ((sis->flags & SWP_FS_OPS) &&
 		    sio_pool_init() != 0) {
-			destroy_swap_extents(sis);
+			destroy_swap_extents(sis, swap_file);
 			return -ENOMEM;
 		}
 		return ret;
@@ -2644,23 +2803,6 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 	return generic_swapfile_activate(sis, swap_file, span);
 }
 
-static void setup_swap_info(struct swap_info_struct *si, int prio,
-			    unsigned char *swap_map,
-			    struct swap_cluster_info *cluster_info,
-			    unsigned long *zeromap)
-{
-	si->prio = prio;
-	/*
-	 * the plist prio is negated because plist ordering is
-	 * low-to-high, while swap ordering is high-to-low
-	 */
-	si->list.prio = -si->prio;
-	si->avail_list.prio = -si->prio;
-	si->swap_map = swap_map;
-	si->cluster_info = cluster_info;
-	si->zeromap = zeromap;
-}
-
 static void _enable_swap_info(struct swap_info_struct *si)
 {
 	atomic_long_add(si->pages, &nr_swap_pages);
@@ -2674,19 +2816,12 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	add_to_avail_list(si, true);
 }
 
-static void enable_swap_info(struct swap_info_struct *si, int prio,
-			     unsigned char *swap_map,
-			     struct swap_cluster_info *cluster_info,
-			     unsigned long *zeromap)
+/*
+ * Called after the swap device is ready, resurrect its percpu ref, it's now
+ * safe to reference it. Add it to the list to expose it to the allocator.
+ */
+static void enable_swap_info(struct swap_info_struct *si)
 {
-	spin_lock(&swap_lock);
-	spin_lock(&si->lock);
-	setup_swap_info(si, prio, swap_map, cluster_info, zeromap);
-	spin_unlock(&si->lock);
-	spin_unlock(&swap_lock);
-	/*
-	 * Finished initializing swap device, now it's safe to reference it.
-	 */
 	percpu_ref_resurrect(&si->users);
 	spin_lock(&swap_lock);
 	spin_lock(&si->lock);
@@ -2699,7 +2834,6 @@ static void reinsert_swap_info(struct swap_info_struct *si)
 {
 	spin_lock(&swap_lock);
 	spin_lock(&si->lock);
-	setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap);
 	_enable_swap_info(si);
 	spin_unlock(&si->lock);
 	spin_unlock(&swap_lock);
@@ -2723,8 +2857,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	}
 }
 
-static void free_cluster_info(struct swap_cluster_info *cluster_info,
-			      unsigned long maxpages)
+static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
+				   unsigned long maxpages)
 {
 	struct swap_cluster_info *ci;
 	int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
@@ -2736,7 +2870,7 @@ static void free_cluster_info(struct swap_cluster_info *cluster_info,
 		/* Cluster with bad marks count will have a remaining table */
 		spin_lock(&ci->lock);
 		if (rcu_dereference_protected(ci->table, true)) {
-			ci->count = 0;
+			swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, true);
 			swap_cluster_free_table(ci);
 		}
 		spin_unlock(&ci->lock);
@@ -2769,7 +2903,6 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
-	unsigned char *swap_map;
 	unsigned long *zeromap;
 	struct swap_cluster_info *cluster_info;
 	struct file *swap_file, *victim;
@@ -2846,9 +2979,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	flush_work(&p->reclaim_work);
 	flush_percpu_swap_cluster(p);
 
-	destroy_swap_extents(p);
-	if (p->flags & SWP_CONTINUED)
-		free_swap_count_continuations(p);
+	destroy_swap_extents(p, p->swap_file);
 	if (!(p->flags & SWP_SOLIDSTATE))
 		atomic_dec(&nr_rotate_swap);
 
@@ -2860,8 +2991,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
-	swap_map = p->swap_map;
-	p->swap_map = NULL;
 	zeromap = p->zeromap;
 	p->zeromap = NULL;
 	maxpages = p->max;
@@ -2875,9 +3004,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
-	vfree(swap_map);
 	kvfree(zeromap);
-	free_cluster_info(cluster_info, maxpages);
+	free_swap_cluster_info(cluster_info, maxpages);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
 
@@ -2934,7 +3062,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
 		return SEQ_START_TOKEN;
 
 	for (type = 0; (si = swap_type_to_info(type)); type++) {
-		if (!(si->flags & SWP_USED) || !si->swap_map)
+		if (!(si->swap_file))
 			continue;
 		if (!--l)
 			return si;
@@ -2955,7 +3083,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
 
 	++(*pos);
 	for (; (si = swap_type_to_info(type)); type++) {
-		if (!(si->flags & SWP_USED) || !si->swap_map)
+		if (!(si->swap_file))
 			continue;
 		return si;
 	}
@@ -3095,7 +3223,6 @@ static struct swap_info_struct *alloc_swap_info(void)
 		kvfree(defer);
 	}
 	spin_lock_init(&p->lock);
-	spin_lock_init(&p->cont_lock);
 	atomic_long_set(&p->inuse_pages, SWAP_USAGE_OFFLIST_BIT);
 	init_completion(&p->comp);
 
@@ -3222,35 +3349,9 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 	return maxpages;
 }
 
-static int setup_swap_map(struct swap_info_struct *si,
-			  union swap_header *swap_header,
-			  unsigned char *swap_map,
-			  unsigned long maxpages)
-{
-	unsigned long i;
-
-	swap_map[0] = SWAP_MAP_BAD; /* omit header page */
-	for (i = 0; i < swap_header->info.nr_badpages; i++) {
-		unsigned int page_nr = swap_header->info.badpages[i];
-		if (page_nr == 0 || page_nr > swap_header->info.last_page)
-			return -EINVAL;
-		if (page_nr < maxpages) {
-			swap_map[page_nr] = SWAP_MAP_BAD;
-			si->pages--;
-		}
-	}
-
-	if (!si->pages) {
-		pr_warn("Empty swap-file\n");
-		return -EINVAL;
-	}
-
-	return 0;
-}
-
-static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
-						union swap_header *swap_header,
-						unsigned long maxpages)
+static int setup_swap_clusters_info(struct swap_info_struct *si,
+				    union swap_header *swap_header,
+				    unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
@@ -3274,26 +3375,28 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	}
 
 	/*
-	 * Mark unusable pages as unavailable. The clusters aren't
-	 * marked free yet, so no list operations are involved yet.
-	 *
-	 * See setup_swap_map(): header page, bad pages,
-	 * and the EOF part of the last cluster.
+	 * Mark unusable pages (header page, bad pages, and the EOF part of
+	 * the last cluster) as unavailable. The clusters aren't marked free
+	 * yet, so no list operations are involved yet.
 	 */
-	err = swap_cluster_setup_bad_slot(cluster_info, 0);
+	err = swap_cluster_setup_bad_slot(si, cluster_info, 0, false);
 	if (err)
 		goto err;
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 
-		if (page_nr >= maxpages)
-			continue;
-		err = swap_cluster_setup_bad_slot(cluster_info, page_nr);
+		if (!page_nr || page_nr > swap_header->info.last_page) {
+			pr_warn("Bad slot offset is out of border: %d (last_page: %d)\n",
+				page_nr, swap_header->info.last_page);
+			err = -EINVAL;
+			goto err;
+		}
+		err = swap_cluster_setup_bad_slot(si, cluster_info, page_nr, false);
 		if (err)
 			goto err;
 	}
 	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
-		err = swap_cluster_setup_bad_slot(cluster_info, i);
+		err = swap_cluster_setup_bad_slot(si, cluster_info, i, true);
 		if (err)
 			goto err;
 	}
@@ -3319,10 +3422,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		}
 	}
 
-	return cluster_info;
+	si->cluster_info = cluster_info;
+	return 0;
 err:
-	free_cluster_info(cluster_info, maxpages);
-	return ERR_PTR(err);
+	free_swap_cluster_info(cluster_info, maxpages);
+	return err;
 }
 
 SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
@@ -3337,9 +3441,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	int nr_extents;
 	sector_t span;
 	unsigned long maxpages;
-	unsigned char *swap_map = NULL;
-	unsigned long *zeromap = NULL;
-	struct swap_cluster_info *cluster_info = NULL;
 	struct folio *folio = NULL;
 	struct inode *inode = NULL;
 	bool inced_nr_rotate_swap = false;
@@ -3350,6 +3451,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	/*
+	 * Allocate or reuse existing !SWP_USED swap_info. The returned
+	 * si will stay in a dying status, so nothing will access its content
+	 * until enable_swap_info resurrects its percpu ref and expose it.
+	 */
 	si = alloc_swap_info();
 	if (IS_ERR(si))
 		return PTR_ERR(si);
@@ -3365,7 +3471,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap;
 	}
 
-	si->swap_file = swap_file;
 	mapping = swap_file->f_mapping;
 	dentry = swap_file->f_path.dentry;
 	inode = mapping->host;
@@ -3415,7 +3520,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	si->max = maxpages;
 	si->pages = maxpages - 1;
-	nr_extents = setup_swap_extents(si, &span);
+	nr_extents = setup_swap_extents(si, swap_file, &span);
 	if (nr_extents < 0) {
 		error = nr_extents;
 		goto bad_swap_unlock_inode;
@@ -3428,18 +3533,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	maxpages = si->max;
 
-	/* OK, set up the swap map and apply the bad block list */
-	swap_map = vzalloc(maxpages);
-	if (!swap_map) {
-		error = -ENOMEM;
-		goto bad_swap_unlock_inode;
-	}
-
-	error = swap_cgroup_swapon(si->type, maxpages);
+	/* Set up the swap cluster info */
+	error = setup_swap_clusters_info(si, swap_header, maxpages);
 	if (error)
 		goto bad_swap_unlock_inode;
 
-	error = setup_swap_map(si, swap_header, swap_map, maxpages);
+	error = swap_cgroup_swapon(si->type, maxpages);
 	if (error)
 		goto bad_swap_unlock_inode;
 
@@ -3447,9 +3546,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	 * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
 	 * be above MAX_PAGE_ORDER incase of a large swap file.
 	 */
-	zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long),
-				 GFP_KERNEL | __GFP_ZERO);
-	if (!zeromap) {
+	si->zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!si->zeromap) {
 		error = -ENOMEM;
 		goto bad_swap_unlock_inode;
 	}
@@ -3467,13 +3566,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		inced_nr_rotate_swap = true;
 	}
 
-	cluster_info = setup_clusters(si, swap_header, maxpages);
-	if (IS_ERR(cluster_info)) {
-		error = PTR_ERR(cluster_info);
-		cluster_info = NULL;
-		goto bad_swap_unlock_inode;
-	}
-
 	if ((swap_flags & SWAP_FLAG_DISCARD) &&
 	    si->bdev && bdev_max_discard_sectors(si->bdev)) {
 		/*
@@ -3524,7 +3616,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	prio = DEF_SWAP_PRIO;
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio = swap_flags & SWAP_FLAG_PRIO_MASK;
-	enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
+
+	/*
+	 * The plist prio is negated because plist ordering is
+	 * low-to-high, while swap ordering is high-to-low
+	 */
+	si->prio = prio;
+	si->list.prio = -si->prio;
+	si->avail_list.prio = -si->prio;
+	si->swap_file = swap_file;
+
+	/* Sets SWP_WRITEOK, resurrect the percpu ref, expose the swap device */
+	enable_swap_info(si);
 
 	pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n",
 		K(si->pages), name->name, si->prio, nr_extents,
@@ -3548,16 +3651,19 @@ bad_swap:
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-	destroy_swap_extents(si);
+	destroy_swap_extents(si, swap_file);
 	swap_cgroup_swapoff(si->type);
+	free_swap_cluster_info(si->cluster_info, si->max);
+	si->cluster_info = NULL;
+	kvfree(si->zeromap);
+	si->zeromap = NULL;
+	/*
+	 * Clear the SWP_USED flag after all resources are freed so
+	 * alloc_swap_info can reuse this si safely.
+	 */
 	spin_lock(&swap_lock);
-	si->swap_file = NULL;
 	si->flags = 0;
 	spin_unlock(&swap_lock);
-	vfree(swap_map);
-	kvfree(zeromap);
-	if (cluster_info)
-		free_cluster_info(cluster_info, maxpages);
 	if (inced_nr_rotate_swap)
 		atomic_dec(&nr_rotate_swap);
 	if (swap_file)
@@ -3588,321 +3694,37 @@ void si_swapinfo(struct sysinfo *val)
 }
 
 /*
- * Verify that nr swap entries are valid and increment their swap map counts.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swap-mapped reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int swap_dup_entries(struct swap_info_struct *si,
-			    struct swap_cluster_info *ci,
-			    unsigned long offset,
-			    unsigned char usage, int nr)
-{
-	int i;
-	unsigned char count;
-
-	for (i = 0; i < nr; i++) {
-		count = si->swap_map[offset + i];
-		/*
-		 * For swapin out, allocator never allocates bad slots. for
-		 * swapin, readahead is guarded by swap_entry_swapped.
-		 */
-		if (WARN_ON(count == SWAP_MAP_BAD))
-			return -ENOENT;
-		/*
-		 * Swap count duplication must be guarded by either swap cache folio (from
-		 * folio_dup_swap) or external lock of existing entry (from swap_dup_entry_direct).
-		 */
-		if (WARN_ON(!count &&
-			    !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))))
-			return -ENOENT;
-		if (WARN_ON((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX))
-			return -EINVAL;
-	}
-
-	for (i = 0; i < nr; i++) {
-		count = si->swap_map[offset + i];
-		if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
-			count += usage;
-		else if (swap_count_continued(si, offset + i, count))
-			count = COUNT_CONTINUED;
-		else {
-			/*
-			 * Don't need to rollback changes, because if
-			 * usage == 1, there must be nr == 1.
-			 */
-			return -ENOMEM;
-		}
-
-		WRITE_ONCE(si->swap_map[offset + i], count);
-	}
-
-	return 0;
-}
-
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
-{
-	int err;
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = swap_entry_to_info(entry);
-	if (WARN_ON_ONCE(!si)) {
-		pr_err("%s%08lx\n", Bad_file, entry.val);
-		return -EINVAL;
-	}
-
-	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-	ci = swap_cluster_lock(si, offset);
-	err = swap_dup_entries(si, ci, offset, usage, nr);
-	swap_cluster_unlock(ci);
-	return err;
-}
-
-/*
  * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
  * @entry: first swap entry from which we want to increase the refcount.
  *
- * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
- * but could not be atomically allocated. Returns 0, just as if it succeeded,
- * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
- * might occur if a page table entry has got corrupted.
+ * Returns 0 for success, or -ENOMEM if the extend table is required
+ * but could not be atomically allocated. Returns -EINVAL if the swap
+ * entry is invalid, which might occur if a page table entry has got
+ * corrupted.
  *
  * Context: Caller must ensure there is no race condition on the reference
  * owner. e.g., locking the PTL of a PTE containing the entry being increased.
+ * Also the swap entry must have a count >= 1. Otherwise folio_dup_swap should
+ * be used.
  */
 int swap_dup_entry_direct(swp_entry_t entry)
 {
-	int err = 0;
-	while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
-		err = add_swap_count_continuation(entry, GFP_ATOMIC);
-	return err;
-}
-
-/*
- * add_swap_count_continuation - called when a swap count is duplicated
- * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
- * page of the original vmalloc'ed swap_map, to hold the continuation count
- * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
- * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
- *
- * These continuation pages are seldom referenced: the common paths all work
- * on the original swap_map, only referring to a continuation page when the
- * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
- *
- * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
- * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
- * can be called after dropping locks.
- */
-int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
-{
 	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	struct page *head;
-	struct page *page;
-	struct page *list_page;
-	pgoff_t offset;
-	unsigned char count;
-	int ret = 0;
-
-	/*
-	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
-	 * for latency not to zero a page while GFP_ATOMIC and holding locks.
-	 */
-	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
-
-	si = get_swap_device(entry);
-	if (!si) {
-		/*
-		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap device may be swapoff
-		 */
-		goto outer;
-	}
-
-	offset = swp_offset(entry);
-	ci = swap_cluster_lock(si, offset);
-
-	count = si->swap_map[offset];
-
-	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
-		/*
-		 * The higher the swap count, the more likely it is that tasks
-		 * will race to add swap count continuation: we need to avoid
-		 * over-provisioning.
-		 */
-		goto out;
-	}
-
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
+	si = swap_entry_to_info(entry);
+	if (WARN_ON_ONCE(!si)) {
+		pr_err("%s%08lx\n", Bad_file, entry.val);
+		return -EINVAL;
 	}
 
-	head = vmalloc_to_page(si->swap_map + offset);
-	offset &= ~PAGE_MASK;
-
-	spin_lock(&si->cont_lock);
 	/*
-	 * Page allocation does not initialize the page's lru field,
-	 * but it does always reset its private field.
+	 * The caller must be increasing the swap count from a direct
+	 * reference of the swap slot (e.g. a swap entry in page table).
+	 * So the swap count must be >= 1.
 	 */
-	if (!page_private(head)) {
-		BUG_ON(count & COUNT_CONTINUED);
-		INIT_LIST_HEAD(&head->lru);
-		set_page_private(head, SWP_CONTINUED);
-		si->flags |= SWP_CONTINUED;
-	}
-
-	list_for_each_entry(list_page, &head->lru, lru) {
-		unsigned char *map;
-
-		/*
-		 * If the previous map said no continuation, but we've found
-		 * a continuation page, free our allocation and use this one.
-		 */
-		if (!(count & COUNT_CONTINUED))
-			goto out_unlock_cont;
-
-		map = kmap_local_page(list_page) + offset;
-		count = *map;
-		kunmap_local(map);
-
-		/*
-		 * If this continuation count now has some space in it,
-		 * free our allocation and use this one.
-		 */
-		if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
-			goto out_unlock_cont;
-	}
-
-	list_add_tail(&page->lru, &head->lru);
-	page = NULL;	/* now it's attached, don't free it */
-out_unlock_cont:
-	spin_unlock(&si->cont_lock);
-out:
-	swap_cluster_unlock(ci);
-	put_swap_device(si);
-outer:
-	if (page)
-		__free_page(page);
-	return ret;
-}
-
-/*
- * swap_count_continued - when the original swap_map count is incremented
- * from SWAP_MAP_MAX, check if there is already a continuation page to carry
- * into, carry if so, or else fail until a new continuation page is allocated;
- * when the original swap_map count is decremented from 0 with continuation,
- * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or caller of swap_put_entry_locked()
- * holds cluster lock.
- */
-static bool swap_count_continued(struct swap_info_struct *si,
-				 pgoff_t offset, unsigned char count)
-{
-	struct page *head;
-	struct page *page;
-	unsigned char *map;
-	bool ret;
-
-	head = vmalloc_to_page(si->swap_map + offset);
-	if (page_private(head) != SWP_CONTINUED) {
-		BUG_ON(count & COUNT_CONTINUED);
-		return false;	/* need to add count continuation */
-	}
-
-	spin_lock(&si->cont_lock);
-	offset &= ~PAGE_MASK;
-	page = list_next_entry(head, lru);
-	map = kmap_local_page(page) + offset;
-
-	if (count == SWAP_MAP_MAX)	/* initial increment from swap_map */
-		goto init_map;	/* jump over SWAP_CONT_MAX checks */
+	VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry));
 
-	if (count == (SWAP_MAP_MAX | COUNT_CONTINUED)) { /* incrementing */
-		/*
-		 * Think of how you add 1 to 999
-		 */
-		while (*map == (SWAP_CONT_MAX | COUNT_CONTINUED)) {
-			kunmap_local(map);
-			page = list_next_entry(page, lru);
-			BUG_ON(page == head);
-			map = kmap_local_page(page) + offset;
-		}
-		if (*map == SWAP_CONT_MAX) {
-			kunmap_local(map);
-			page = list_next_entry(page, lru);
-			if (page == head) {
-				ret = false;	/* add count continuation */
-				goto out;
-			}
-			map = kmap_local_page(page) + offset;
-init_map:		*map = 0;	/* we didn't zero the page */
-		}
-		*map += 1;
-		kunmap_local(map);
-		while ((page = list_prev_entry(page, lru)) != head) {
-			map = kmap_local_page(page) + offset;
-			*map = COUNT_CONTINUED;
-			kunmap_local(map);
-		}
-		ret = true;	/* incremented */
-
-	} else {	/* decrementing */
-		/*
-		 * Think of how you subtract 1 from 1000
-		 */
-		BUG_ON(count != COUNT_CONTINUED);
-		while (*map == COUNT_CONTINUED) {
-			kunmap_local(map);
-			page = list_next_entry(page, lru);
-			BUG_ON(page == head);
-			map = kmap_local_page(page) + offset;
-		}
-		BUG_ON(*map == 0);
-		*map -= 1;
-		if (*map == 0)
-			count = 0;
-		kunmap_local(map);
-		while ((page = list_prev_entry(page, lru)) != head) {
-			map = kmap_local_page(page) + offset;
-			*map = SWAP_CONT_MAX | count;
-			count = COUNT_CONTINUED;
-			kunmap_local(map);
-		}
-		ret = count == COUNT_CONTINUED;
-	}
-out:
-	spin_unlock(&si->cont_lock);
-	return ret;
-}
-
-/*
- * free_swap_count_continuations - swapoff free all the continuation pages
- * appended to the swap_map, after swap_map is quiesced, before vfree'ing it.
- */
-static void free_swap_count_continuations(struct swap_info_struct *si)
-{
-	pgoff_t offset;
-
-	for (offset = 0; offset < si->max; offset += PAGE_SIZE) {
-		struct page *head;
-		head = vmalloc_to_page(si->swap_map + offset);
-		if (page_private(head)) {
-			struct page *page, *next;
-
-			list_for_each_entry_safe(page, next, &head->lru, lru) {
-				list_del(&page->lru);
-				__free_page(page);
-			}
-		}
-	}
+	return swap_dup_entries_cluster(si, swp_offset(entry), 1);
 }
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
diff --git a/mm/truncate.c b/mm/truncate.c
index 12467c1bd711..2931d66c16d0 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -17,7 +17,7 @@
 #include <linux/export.h>
 #include <linux/pagemap.h>
 #include <linux/highmem.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/shmem_fs.h>
 #include <linux/rmap.h>
@@ -369,7 +369,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	pgoff_t		start;		/* inclusive */
 	pgoff_t		end;		/* exclusive */
 	struct folio_batch fbatch;
-	pgoff_t		indices[PAGEVEC_SIZE];
+	pgoff_t		indices[FOLIO_BATCH_SIZE];
 	pgoff_t		index;
 	int		i;
 	struct folio	*folio;
@@ -534,7 +534,7 @@ EXPORT_SYMBOL(truncate_inode_pages_final);
 unsigned long mapping_try_invalidate(struct address_space *mapping,
 		pgoff_t start, pgoff_t end, unsigned long *nr_failed)
 {
-	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t indices[FOLIO_BATCH_SIZE];
 	struct folio_batch fbatch;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -672,7 +672,7 @@ failed:
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
 {
-
pgoff_t indices[PAGEVEC_SIZE]; + pgoff_t indices[FOLIO_BATCH_SIZE]; struct folio_batch fbatch; pgoff_t index; int i; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 927086bb4a3c..89879c3ba344 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -357,7 +357,7 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd, if (mm_forbids_zeropage(dst_vma->vm_mm)) return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr); - _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr), + _dst_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr), dst_vma->vm_page_prot)); ret = -EAGAIN; dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl); @@ -573,7 +573,7 @@ retry: * in the case of shared pmds. fault mutex prevents * races with other faulting threads. */ - idx = linear_page_index(dst_vma, dst_addr); + idx = hugetlb_linear_page_index(dst_vma, dst_addr); mapping = dst_vma->vm_file->f_mapping; hash = hugetlb_fault_mutex_hash(mapping, idx); mutex_lock(&hugetlb_fault_mutex_table[hash]); @@ -1229,7 +1229,7 @@ static int move_zeropage_pte(struct mm_struct *mm, return -EAGAIN; } - zero_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr), + zero_pte = pte_mkspecial(pfn_pte(zero_pfn(dst_addr), dst_vma->vm_page_prot)); ptep_clear_flush(src_vma, src_addr, src_pte); set_pte_at(mm, dst_addr, dst_pte, zero_pte); @@ -1976,6 +1976,9 @@ struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi, { struct vm_area_struct *ret; bool give_up_on_oom = false; + vma_flags_t new_vma_flags = vma->flags; + + vma_flags_clear_mask(&new_vma_flags, __VMA_UFFD_FLAGS); /* * If we are modifying only and not splitting, just give up on the merge @@ -1989,8 +1992,8 @@ struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi, uffd_wp_range(vma, start, end - start, false); ret = vma_modify_flags_uffd(vmi, prev, vma, start, end, - vma->vm_flags & ~__VM_UFFD_FLAGS, - NULL_VM_UFFD_CTX, give_up_on_oom); + &new_vma_flags, NULL_VM_UFFD_CTX, + give_up_on_oom); /* * In the 
vma_merge() successful mprotect-like case 8: @@ -2010,10 +2013,11 @@ int userfaultfd_register_range(struct userfaultfd_ctx *ctx, unsigned long start, unsigned long end, bool wp_async) { + vma_flags_t vma_flags = legacy_to_vma_flags(vm_flags); VMA_ITERATOR(vmi, ctx->mm, start); struct vm_area_struct *prev = vma_prev(&vmi); unsigned long vma_end; - vm_flags_t new_flags; + vma_flags_t new_vma_flags; if (vma->vm_start < start) prev = vma; @@ -2024,23 +2028,26 @@ int userfaultfd_register_range(struct userfaultfd_ctx *ctx, VM_WARN_ON_ONCE(!vma_can_userfault(vma, vm_flags, wp_async)); VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx && vma->vm_userfaultfd_ctx.ctx != ctx); - VM_WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)); + VM_WARN_ON_ONCE(!vma_test(vma, VMA_MAYWRITE_BIT)); /* * Nothing to do: this vma is already registered into this * userfaultfd and with the right tracking mode too. */ if (vma->vm_userfaultfd_ctx.ctx == ctx && - (vma->vm_flags & vm_flags) == vm_flags) + vma_test_all_mask(vma, vma_flags)) goto skip; if (vma->vm_start > start) start = vma->vm_start; vma_end = min(end, vma->vm_end); - new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; + new_vma_flags = vma->flags; + vma_flags_clear_mask(&new_vma_flags, __VMA_UFFD_FLAGS); + vma_flags_set_mask(&new_vma_flags, vma_flags); + vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end, - new_flags, + &new_vma_flags, (struct vm_userfaultfd_ctx){ctx}, /* give_up_on_oom = */false); if (IS_ERR(vma)) diff --git a/mm/util.c b/mm/util.c index b05ab6f97e11..f063fd4de1e8 100644 --- a/mm/util.c +++ b/mm/util.c @@ -618,6 +618,35 @@ unsigned long vm_mmap(struct file *file, unsigned long addr, } EXPORT_SYMBOL(vm_mmap); +#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK +/* + * Perform a userland memory mapping for a shadow stack into the current + * process address space. This is intended to be used by architectures that + * support user shadow stacks. 
+ */
+unsigned long vm_mmap_shadow_stack(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long ret, unused;
+	vm_flags_t vm_flags = VM_SHADOW_STACK;
+
+	flags |= MAP_ANONYMOUS | MAP_PRIVATE;
+	if (addr)
+		flags |= MAP_FIXED_NOREPLACE;
+
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		vm_flags |= VM_NOHUGEPAGE;
+
+	mmap_write_lock(mm);
+	ret = do_mmap(NULL, addr, len, PROT_READ | PROT_WRITE, flags,
+		      vm_flags, 0, &unused, NULL);
+	mmap_write_unlock(mm);
+
+	return ret;
+}
+#endif /* CONFIG_ARCH_HAS_USER_SHADOW_STACK */
+
 /**
  * __vmalloc_array - allocate memory for a virtually contiguous array.
  * @n: number of elements.
@@ -1135,39 +1164,75 @@ EXPORT_SYMBOL(flush_dcache_folio);
 #endif

 /**
- * __compat_vma_mmap() - See description for compat_vma_mmap()
- * for details. This is the same operation, only with a specific file operations
- * struct which may or may not be the same as vma->vm_file->f_op.
- * @f_op: The file operations whose .mmap_prepare() hook is specified.
- * @file: The file which backs or will back the mapping.
- * @vma: The VMA to apply the .mmap_prepare() hook to.
+ * compat_set_desc_from_vma() - assigns VMA descriptor @desc fields from a VMA.
+ * @desc: A VMA descriptor whose fields need to be set.
+ * @file: The file object describing the file being mmap()'d.
+ * @vma: The VMA whose fields we wish to assign to @desc.
+ *
+ * This is a compatibility function to allow an mmap() hook to call
+ * mmap_prepare() hooks when drivers nest these. This function specifically
+ * allows the construction of a vm_area_desc value, @desc, from a VMA @vma for
+ * the purposes of doing this.
+ *
+ * Once the conversion of drivers is complete this function will no longer be
+ * required and will be removed.
+ */ +void compat_set_desc_from_vma(struct vm_area_desc *desc, + const struct file *file, + const struct vm_area_struct *vma) +{ + memset(desc, 0, sizeof(*desc)); + + desc->mm = vma->vm_mm; + desc->file = (struct file *)file; + desc->start = vma->vm_start; + desc->end = vma->vm_end; + + desc->pgoff = vma->vm_pgoff; + desc->vm_file = vma->vm_file; + desc->vma_flags = vma->flags; + desc->page_prot = vma->vm_page_prot; + + /* Default. */ + desc->action.type = MMAP_NOTHING; +} +EXPORT_SYMBOL(compat_set_desc_from_vma); + +/** + * __compat_vma_mmap() - Similar to compat_vma_mmap(), only it allows + * flexibility as to how the mmap_prepare callback is invoked, which is useful + * for drivers which invoke nested mmap_prepare callbacks in an mmap() hook. + * @desc: A VMA descriptor upon which an mmap_prepare() hook has already been + * executed. + * @vma: The VMA to which @desc should be applied. + * + * The function assumes that you have obtained a VMA descriptor @desc from + * compat_set_desc_from_vma(), and already executed the mmap_prepare() hook upon + * it. + * + * It then performs any specified mmap actions, and invokes the vm_ops->mapped() + * hook if one is present. + * + * See the description of compat_vma_mmap() for more details. + * + * Once the conversion of drivers is complete this function will no longer be + * required and will be removed. + * * Returns: 0 on success or error. */ -int __compat_vma_mmap(const struct file_operations *f_op, - struct file *file, struct vm_area_struct *vma) -{ - struct vm_area_desc desc = { - .mm = vma->vm_mm, - .file = file, - .start = vma->vm_start, - .end = vma->vm_end, - - .pgoff = vma->vm_pgoff, - .vm_file = vma->vm_file, - .vma_flags = vma->flags, - .page_prot = vma->vm_page_prot, - - .action.type = MMAP_NOTHING, /* Default */ - }; +int __compat_vma_mmap(struct vm_area_desc *desc, + struct vm_area_struct *vma) +{ int err; - err = f_op->mmap_prepare(&desc); + /* Perform any preparatory tasks for mmap action. 
*/ + err = mmap_action_prepare(desc); if (err) return err; - - mmap_action_prepare(&desc.action, &desc); - set_vma_from_desc(vma, &desc); - return mmap_action_complete(&desc.action, vma); + /* Update the VMA from the descriptor. */ + compat_set_vma_from_desc(vma, desc); + /* Complete any specified mmap actions. */ + return mmap_action_complete(vma, &desc->action); } EXPORT_SYMBOL(__compat_vma_mmap); @@ -1178,10 +1243,10 @@ EXPORT_SYMBOL(__compat_vma_mmap); * @vma: The VMA to apply the .mmap_prepare() hook to. * * Ordinarily, .mmap_prepare() is invoked directly upon mmap(). However, certain - * stacked filesystems invoke a nested mmap hook of an underlying file. + * stacked drivers invoke a nested mmap hook of an underlying file. * - * Until all filesystems are converted to use .mmap_prepare(), we must be - * conservative and continue to invoke these stacked filesystems using the + * Until all drivers are converted to use .mmap_prepare(), we must be + * conservative and continue to invoke these stacked drivers using the * deprecated .mmap() hook. * * However we have a problem if the underlying file system possesses an @@ -1192,17 +1257,40 @@ EXPORT_SYMBOL(__compat_vma_mmap); * establishes a struct vm_area_desc descriptor, passes to the underlying * .mmap_prepare() hook and applies any changes performed by it. * - * Once the conversion of filesystems is complete this function will no longer - * be required and will be removed. + * Once the conversion of drivers is complete this function will no longer be + * required and will be removed. * * Returns: 0 on success or error. */ int compat_vma_mmap(struct file *file, struct vm_area_struct *vma) { - return __compat_vma_mmap(file->f_op, file, vma); + struct vm_area_desc desc; + struct mmap_action *action; + int err; + + compat_set_desc_from_vma(&desc, file, vma); + err = vfs_mmap_prepare(file, &desc); + if (err) + return err; + action = &desc.action; + + /* being invoked from .mmmap means we don't have to enforce this. 
*/ + action->hide_from_rmap_until_complete = false; + + return __compat_vma_mmap(&desc, vma); } EXPORT_SYMBOL(compat_vma_mmap); +int __vma_check_mmap_hook(struct vm_area_struct *vma) +{ + /* vm_ops->mapped is not valid if mmap() is specified. */ + if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped)) + return -EINVAL; + + return 0; +} +EXPORT_SYMBOL(__vma_check_mmap_hook); + static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio, const struct page *page) { @@ -1237,7 +1325,7 @@ static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio, */ void snapshot_page(struct page_snapshot *ps, const struct page *page) { - unsigned long head, nr_pages = 1; + unsigned long info, nr_pages = 1; struct folio *foliop; int loops = 5; @@ -1247,8 +1335,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page) again: memset(&ps->folio_snapshot, 0, sizeof(struct folio)); memcpy(&ps->page_snapshot, page, sizeof(*page)); - head = ps->page_snapshot.compound_head; - if ((head & 1) == 0) { + info = ps->page_snapshot.compound_info; + if (!(info & 1)) { ps->idx = 0; foliop = (struct folio *)&ps->page_snapshot; if (!folio_test_large(foliop)) { @@ -1259,7 +1347,15 @@ again: } foliop = (struct folio *)page; } else { - foliop = (struct folio *)(head - 1); + /* See compound_head() */ + if (compound_info_has_mask()) { + unsigned long p = (unsigned long)page; + + foliop = (struct folio *)(p & info); + } else { + foliop = (struct folio *)(info - 1); + } + ps->idx = folio_page_idx(foliop, page); } @@ -1283,70 +1379,95 @@ again: } } -static int mmap_action_finish(struct mmap_action *action, - const struct vm_area_struct *vma, int err) +static int call_vma_mapped(struct vm_area_struct *vma) { + const struct vm_operations_struct *vm_ops = vma->vm_ops; + void *vm_private_data = vma->vm_private_data; + int err; + + if (!vm_ops || !vm_ops->mapped) + return 0; + + err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff, + vma->vm_file, 
&vm_private_data); + if (err) + return err; + + if (vm_private_data != vma->vm_private_data) + vma->vm_private_data = vm_private_data; + return 0; +} + +static int mmap_action_finish(struct vm_area_struct *vma, + struct mmap_action *action, int err) +{ + size_t len; + + if (!err) + err = call_vma_mapped(vma); + if (!err && action->success_hook) + err = action->success_hook(vma); + + /* do_munmap() might take rmap lock, so release if held. */ + maybe_rmap_unlock_action(vma, action); + if (!err) + return 0; + /* * If an error occurs, unmap the VMA altogether and return an error. We * only clear the newly allocated VMA, since this function is only * invoked if we do NOT merge, so we only clean up the VMA we created. */ - if (err) { - const size_t len = vma_pages(vma) << PAGE_SHIFT; - - do_munmap(current->mm, vma->vm_start, len, NULL); - - if (action->error_hook) { - /* We may want to filter the error. */ - err = action->error_hook(err); - - /* The caller should not clear the error. */ - VM_WARN_ON_ONCE(!err); - } - return err; + len = vma_pages(vma) << PAGE_SHIFT; + do_munmap(current->mm, vma->vm_start, len, NULL); + if (action->error_hook) { + /* We may want to filter the error. */ + err = action->error_hook(err); + /* The caller should not clear the error. */ + VM_WARN_ON_ONCE(!err); } - - if (action->success_hook) - return action->success_hook(vma); - - return 0; + return err; } #ifdef CONFIG_MMU /** * mmap_action_prepare - Perform preparatory setup for an VMA descriptor * action which need to be performed. - * @desc: The VMA descriptor to prepare for @action. - * @action: The action to perform. + * @desc: The VMA descriptor to prepare for its @desc->action. + * + * Returns: %0 on success, otherwise error. 
*/ -void mmap_action_prepare(struct mmap_action *action, - struct vm_area_desc *desc) +int mmap_action_prepare(struct vm_area_desc *desc) { - switch (action->type) { + switch (desc->action.type) { case MMAP_NOTHING: - break; + return 0; case MMAP_REMAP_PFN: - remap_pfn_range_prepare(desc, action->remap.start_pfn); - break; + return remap_pfn_range_prepare(desc); case MMAP_IO_REMAP_PFN: - io_remap_pfn_range_prepare(desc, action->remap.start_pfn, - action->remap.size); - break; + return io_remap_pfn_range_prepare(desc); + case MMAP_SIMPLE_IO_REMAP: + return simple_ioremap_prepare(desc); + case MMAP_MAP_KERNEL_PAGES: + return map_kernel_pages_prepare(desc); } + + WARN_ON_ONCE(1); + return -EINVAL; } EXPORT_SYMBOL(mmap_action_prepare); /** * mmap_action_complete - Execute VMA descriptor action. - * @action: The action to perform. * @vma: The VMA to perform the action upon. + * @action: The action to perform. * * Similar to mmap_action_prepare(). * * Return: 0 on success, or error, at which point the VMA will be unmapped. */ -int mmap_action_complete(struct mmap_action *action, - struct vm_area_struct *vma) +int mmap_action_complete(struct vm_area_struct *vma, + struct mmap_action *action) { int err = 0; @@ -1354,37 +1475,42 @@ int mmap_action_complete(struct mmap_action *action, case MMAP_NOTHING: break; case MMAP_REMAP_PFN: - err = remap_pfn_range_complete(vma, action->remap.start, - action->remap.start_pfn, action->remap.size, - action->remap.pgprot); + err = remap_pfn_range_complete(vma, action); + break; + case MMAP_MAP_KERNEL_PAGES: + err = map_kernel_pages_complete(vma, action); break; case MMAP_IO_REMAP_PFN: - err = io_remap_pfn_range_complete(vma, action->remap.start, - action->remap.start_pfn, action->remap.size, - action->remap.pgprot); + case MMAP_SIMPLE_IO_REMAP: + /* Should have been delegated. 
*/ + WARN_ON_ONCE(1); + err = -EINVAL; break; } - return mmap_action_finish(action, vma, err); + return mmap_action_finish(vma, action, err); } EXPORT_SYMBOL(mmap_action_complete); #else -void mmap_action_prepare(struct mmap_action *action, - struct vm_area_desc *desc) +int mmap_action_prepare(struct vm_area_desc *desc) { - switch (action->type) { + switch (desc->action.type) { case MMAP_NOTHING: break; case MMAP_REMAP_PFN: case MMAP_IO_REMAP_PFN: + case MMAP_SIMPLE_IO_REMAP: + case MMAP_MAP_KERNEL_PAGES: WARN_ON_ONCE(1); /* nommu cannot handle these. */ break; } + + return 0; } EXPORT_SYMBOL(mmap_action_prepare); -int mmap_action_complete(struct mmap_action *action, - struct vm_area_struct *vma) +int mmap_action_complete(struct vm_area_struct *vma, + struct mmap_action *action) { int err = 0; @@ -1393,13 +1519,15 @@ int mmap_action_complete(struct mmap_action *action, break; case MMAP_REMAP_PFN: case MMAP_IO_REMAP_PFN: + case MMAP_SIMPLE_IO_REMAP: + case MMAP_MAP_KERNEL_PAGES: WARN_ON_ONCE(1); /* nommu cannot handle this. */ err = -EINVAL; break; } - return mmap_action_finish(action, vma, err); + return mmap_action_finish(vma, action, err); } EXPORT_SYMBOL(mmap_action_complete); #endif @@ -38,13 +38,11 @@ struct mmap_state { /* Determine if we can check KSM flags early in mmap() logic. */ bool check_ksm_early :1; - /* If we map new, hold the file rmap lock on mapping. */ - bool hold_file_rmap_lock :1; /* If .mmap_prepare changed the file, we don't need to pin. 
*/ bool file_doesnt_need_get :1; }; -#define MMAP_STATE(name, mm_, vmi_, addr_, len_, pgoff_, vm_flags_, file_) \ +#define MMAP_STATE(name, mm_, vmi_, addr_, len_, pgoff_, vma_flags_, file_) \ struct mmap_state name = { \ .mm = mm_, \ .vmi = vmi_, \ @@ -52,9 +50,9 @@ struct mmap_state { .end = (addr_) + (len_), \ .pgoff = pgoff_, \ .pglen = PHYS_PFN(len_), \ - .vm_flags = vm_flags_, \ + .vma_flags = vma_flags_, \ .file = file_, \ - .page_prot = vm_get_page_prot(vm_flags_), \ + .page_prot = vma_get_page_prot(vma_flags_), \ } #define VMG_MMAP_STATE(name, map_, vma_) \ @@ -63,7 +61,7 @@ struct mmap_state { .vmi = (map_)->vmi, \ .start = (map_)->addr, \ .end = (map_)->end, \ - .vm_flags = (map_)->vm_flags, \ + .vma_flags = (map_)->vma_flags, \ .pgoff = (map_)->pgoff, \ .file = (map_)->file, \ .prev = (map_)->prev, \ @@ -86,10 +84,15 @@ static bool vma_is_fork_child(struct vm_area_struct *vma) static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next) { struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev; + vma_flags_t diff; if (!mpol_equal(vmg->policy, vma_policy(vma))) return false; - if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE) + + diff = vma_flags_diff_pair(&vma->flags, &vmg->vma_flags); + vma_flags_clear_mask(&diff, VMA_IGNORE_MERGE_FLAGS); + + if (!vma_flags_empty(&diff)) return false; if (vma->vm_file != vmg->file) return false; @@ -180,7 +183,7 @@ static void init_multi_vma_prep(struct vma_prepare *vp, } /* - * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) + * Return true if we can merge this (vma_flags,anon_vma,file,vm_pgoff) * in front of (at a lower virtual address and file offset than) the vma. 
* * We cannot merge two vmas if they have differently assigned (non-NULL) @@ -206,7 +209,7 @@ static bool can_vma_merge_before(struct vma_merge_struct *vmg) } /* - * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) + * Return true if we can merge this (vma_flags,anon_vma,file,vm_pgoff) * beyond (at a higher virtual address and file offset than) the vma. * * We cannot merge two vmas if they have differently assigned (non-NULL) @@ -590,7 +593,7 @@ out_free_vma: static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) { - if (vma->vm_mm->map_count >= sysctl_max_map_count) + if (vma->vm_mm->map_count >= get_sysctl_max_map_count()) return -ENOMEM; return __split_vma(vmi, vma, addr, new_below); @@ -805,7 +808,8 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) { - vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY; + vma_flags_t sticky_flags = vma_flags_and_mask(&vmg->vma_flags, + VMA_STICKY_FLAGS); struct vm_area_struct *middle = vmg->middle; struct vm_area_struct *prev = vmg->prev; struct vm_area_struct *next; @@ -844,7 +848,8 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( * furthermost left or right side of the VMA, then we have no chance of * merging and should abort. 
*/ - if (vmg->vm_flags & VM_SPECIAL || (!left_side && !right_side)) + if (vma_flags_test_any_mask(&vmg->vma_flags, VMA_SPECIAL_FLAGS) || + (!left_side && !right_side)) return NULL; if (left_side) @@ -898,15 +903,22 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( vma_start_write(middle); if (merge_right) { + vma_flags_t next_sticky; + vma_start_write(next); vmg->target = next; - sticky_flags |= (next->vm_flags & VM_STICKY); + next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS); + vma_flags_set_mask(&sticky_flags, next_sticky); } if (merge_left) { + vma_flags_t prev_sticky; + vma_start_write(prev); vmg->target = prev; - sticky_flags |= (prev->vm_flags & VM_STICKY); + + prev_sticky = vma_flags_and_mask(&prev->flags, VMA_STICKY_FLAGS); + vma_flags_set_mask(&sticky_flags, prev_sticky); } if (merge_both) { @@ -976,7 +988,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort; - vm_flags_set(vmg->target, sticky_flags); + vma_set_flags_mask(vmg->target, sticky_flags); khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target; @@ -1059,7 +1071,8 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) vmg->state = VMA_MERGE_NOMERGE; /* Special VMAs are unmergeable, also if no prev/next. 
*/ - if ((vmg->vm_flags & VM_SPECIAL) || (!prev && !next)) + if (vma_flags_test_any_mask(&vmg->vma_flags, VMA_SPECIAL_FLAGS) || + (!prev && !next)) return NULL; can_merge_left = can_vma_merge_left(vmg); @@ -1154,12 +1167,16 @@ int vma_expand(struct vma_merge_struct *vmg) struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next; bool remove_next = false; - vm_flags_t sticky_flags; + vma_flags_t sticky_flags = + vma_flags_and_mask(&vmg->vma_flags, VMA_STICKY_FLAGS); + vma_flags_t target_sticky; int ret = 0; mmap_assert_write_locked(vmg->mm); vma_start_write(target); + target_sticky = vma_flags_and_mask(&target->flags, VMA_STICKY_FLAGS); + if (next && target != next && vmg->end == next->vm_end) remove_next = true; @@ -1174,10 +1191,7 @@ int vma_expand(struct vma_merge_struct *vmg) VM_WARN_ON_VMG(target->vm_start < vmg->start || target->vm_end > vmg->end, vmg); - sticky_flags = vmg->vm_flags & VM_STICKY; - sticky_flags |= target->vm_flags & VM_STICKY; - if (remove_next) - sticky_flags |= next->vm_flags & VM_STICKY; + vma_flags_set_mask(&sticky_flags, target_sticky); /* * If we are removing the next VMA or copying from a VMA @@ -1194,13 +1208,18 @@ int vma_expand(struct vma_merge_struct *vmg) return ret; if (remove_next) { + vma_flags_t next_sticky; + vma_start_write(next); vmg->__remove_next = true; + + next_sticky = vma_flags_and_mask(&next->flags, VMA_STICKY_FLAGS); + vma_flags_set_mask(&sticky_flags, next_sticky); } if (commit_merge(vmg)) goto nomem; - vm_flags_set(target, sticky_flags); + vma_set_flags_mask(target, sticky_flags); return 0; nomem: @@ -1394,7 +1413,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * its limit temporarily, to help free resources as expected. 
*/ if (vms->end < vms->vma->vm_end && - vms->vma->vm_mm->map_count >= sysctl_max_map_count) { + vms->vma->vm_mm->map_count >= get_sysctl_max_map_count()) { error = -ENOMEM; goto map_count_exceeded; } @@ -1440,17 +1459,17 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, nrpages = vma_pages(next); vms->nr_pages += nrpages; - if (next->vm_flags & VM_LOCKED) + if (vma_test(next, VMA_LOCKED_BIT)) vms->locked_vm += nrpages; - if (next->vm_flags & VM_ACCOUNT) + if (vma_test(next, VMA_ACCOUNT_BIT)) vms->nr_accounted += nrpages; if (is_exec_mapping(next->vm_flags)) vms->exec_vm += nrpages; else if (is_stack_mapping(next->vm_flags)) vms->stack_vm += nrpages; - else if (is_data_mapping(next->vm_flags)) + else if (is_data_mapping_vma_flags(&next->flags)) vms->data_vm += nrpages; if (vms->uf) { @@ -1689,13 +1708,13 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg) struct vm_area_struct *vma_modify_flags(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, - vm_flags_t *vm_flags_ptr) + vma_flags_t *vma_flags_ptr) { VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); - const vm_flags_t vm_flags = *vm_flags_ptr; + const vma_flags_t vma_flags = *vma_flags_ptr; struct vm_area_struct *ret; - vmg.vm_flags = vm_flags; + vmg.vma_flags = vma_flags; ret = vma_modify(&vmg); if (IS_ERR(ret)) @@ -1707,7 +1726,7 @@ struct vm_area_struct *vma_modify_flags(struct vma_iterator *vmi, * them to the caller. 
*/ if (vmg.state == VMA_MERGE_SUCCESS) - *vm_flags_ptr = ret->vm_flags; + *vma_flags_ptr = ret->flags; return ret; } @@ -1737,12 +1756,13 @@ struct vm_area_struct *vma_modify_policy(struct vma_iterator *vmi, struct vm_area_struct *vma_modify_flags_uffd(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, - unsigned long start, unsigned long end, vm_flags_t vm_flags, - struct vm_userfaultfd_ctx new_ctx, bool give_up_on_oom) + unsigned long start, unsigned long end, + const vma_flags_t *vma_flags, struct vm_userfaultfd_ctx new_ctx, + bool give_up_on_oom) { VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); - vmg.vm_flags = vm_flags; + vmg.vma_flags = *vma_flags; vmg.uffd_ctx = new_ctx; if (give_up_on_oom) vmg.give_up_on_oom = true; @@ -1950,10 +1970,15 @@ out: */ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) { + vma_flags_t diff = vma_flags_diff_pair(&a->flags, &b->flags); + + vma_flags_clear_mask(&diff, VMA_ACCESS_FLAGS); + vma_flags_clear_mask(&diff, VMA_IGNORE_MERGE_FLAGS); + return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file && - !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) && + vma_flags_empty(&diff) && b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); } @@ -2041,14 +2066,13 @@ static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) static bool vma_is_shared_writable(struct vm_area_struct *vma) { - return (vma->vm_flags & (VM_WRITE | VM_SHARED)) == - (VM_WRITE | VM_SHARED); + return vma_test_all(vma, VMA_WRITE_BIT, VMA_SHARED_BIT); } static bool vma_fs_can_writeback(struct vm_area_struct *vma) { /* No managed pages to writeback. 
*/ - if (vma->vm_flags & VM_PFNMAP) + if (vma_test(vma, VMA_PFNMAP_BIT)) return false; return vma->vm_file && vma->vm_file->f_mapping && @@ -2314,8 +2338,10 @@ void mm_drop_all_locks(struct mm_struct *mm) * We account for memory if it's a private writeable mapping, * not hugepages and VM_NORESERVE wasn't set. */ -static bool accountable_mapping(struct file *file, vm_flags_t vm_flags) +static bool accountable_mapping(struct mmap_state *map) { + const struct file *file = map->file; + /* * hugetlb has its own accounting separate from the core VM * VM_HUGETLB may not be set yet so we cannot check for that flag. @@ -2323,7 +2349,9 @@ static bool accountable_mapping(struct file *file, vm_flags_t vm_flags) if (file && is_file_hugepages(file)) return false; - return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE; + return vma_flags_test(&map->vma_flags, VMA_WRITE_BIT) && + !vma_flags_test_any(&map->vma_flags, VMA_NORESERVE_BIT, + VMA_SHARED_BIT); } /* @@ -2361,7 +2389,7 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, static void update_ksm_flags(struct mmap_state *map) { - map->vm_flags = ksm_vma_flags(map->mm, map->file, map->vm_flags); + map->vma_flags = ksm_vma_flags(map->mm, map->file, map->vma_flags); } static void set_desc_from_map(struct vm_area_desc *desc, @@ -2422,11 +2450,11 @@ static int __mmap_setup(struct mmap_state *map, struct vm_area_desc *desc, } /* Check against address space limit. */ - if (!may_expand_vm(map->mm, map->vm_flags, map->pglen - vms->nr_pages)) + if (!may_expand_vm(map->mm, &map->vma_flags, map->pglen - vms->nr_pages)) return -ENOMEM; /* Private writable mapping: check memory availability. 
*/ - if (accountable_mapping(map->file, map->vm_flags)) { + if (accountable_mapping(map)) { map->charged = map->pglen; map->charged -= vms->nr_accounted; if (map->charged) { @@ -2436,7 +2464,7 @@ static int __mmap_setup(struct mmap_state *map, struct vm_area_desc *desc, } vms->nr_accounted = 0; - map->vm_flags |= VM_ACCOUNT; + vma_flags_set(&map->vma_flags, VMA_ACCOUNT_BIT); } /* @@ -2484,12 +2512,12 @@ static int __mmap_new_file_vma(struct mmap_state *map, * Drivers should not permit writability when previously it was * disallowed. */ - VM_WARN_ON_ONCE(map->vm_flags != vma->vm_flags && - !(map->vm_flags & VM_MAYWRITE) && - (vma->vm_flags & VM_MAYWRITE)); + VM_WARN_ON_ONCE(!vma_flags_same_pair(&map->vma_flags, &vma->flags) && + !vma_flags_test(&map->vma_flags, VMA_MAYWRITE_BIT) && + vma_test(vma, VMA_MAYWRITE_BIT)); map->file = vma->vm_file; - map->vm_flags = vma->vm_flags; + map->vma_flags = vma->flags; return 0; } @@ -2500,10 +2528,12 @@ static int __mmap_new_file_vma(struct mmap_state *map, * * @map: Mapping state. * @vmap: Output pointer for the new VMA. + * @action: Any mmap_prepare action that is still to complete. * * Returns: Zero on success, or an error. 
  */
-static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
+static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
+	struct mmap_action *action)
 {
 	struct vma_iterator *vmi = map->vmi;
 	int error = 0;
@@ -2520,7 +2550,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)

 	vma_iter_config(vmi, map->addr, map->end);
 	vma_set_range(vma, map->addr, map->end, map->pgoff);
-	vm_flags_init(vma, map->vm_flags);
+	vma->flags = map->vma_flags;
 	vma->vm_page_prot = map->page_prot;

 	if (vma_iter_prealloc(vmi, vma)) {
@@ -2530,7 +2560,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)

 	if (map->file)
 		error = __mmap_new_file_vma(map, vma);
-	else if (map->vm_flags & VM_SHARED)
+	else if (vma_flags_test(&map->vma_flags, VMA_SHARED_BIT))
 		error = shmem_zero_setup(vma);
 	else
 		vma_set_anonymous(vma);
@@ -2540,7 +2570,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)

 	if (!map->check_ksm_early) {
 		update_ksm_flags(map);
-		vm_flags_init(vma, map->vm_flags);
+		vma->flags = map->vma_flags;
 	}

 #ifdef CONFIG_SPARC64
@@ -2552,7 +2582,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	vma_start_write(vma);
 	vma_iter_store_new(vmi, vma);
 	map->mm->map_count++;
-	vma_link_file(vma, map->hold_file_rmap_lock);
+	vma_link_file(vma, action->hide_from_rmap_until_complete);

 	/*
 	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
@@ -2580,7 +2610,6 @@ free_vma:
 static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = map->mm;
-	vm_flags_t vm_flags = vma->vm_flags;

 	perf_event_mmap(vma);

@@ -2588,11 +2617,9 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
 	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);

 	vm_stat_account(mm, vma->vm_flags, map->pglen);
-	if (vm_flags & VM_LOCKED) {
-		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
-					is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(mm))
-			vm_flags_clear(vma, VM_LOCKED_MASK);
+	if (vma_test(vma, VMA_LOCKED_BIT)) {
+		if (!vma_supports_mlock(vma))
+			vma_clear_flags_mask(vma, VMA_LOCKED_MASK);
 		else
 			mm->locked_vm += map->pglen;
 	}
@@ -2608,20 +2635,21 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
 	 * a completely new data area).
 	 */
 	if (pgtable_supports_soft_dirty())
-		vm_flags_set(vma, VM_SOFTDIRTY);
+		vma_set_flags(vma, VMA_SOFTDIRTY_BIT);

 	vma_set_page_prot(vma);
 }

-static void call_action_prepare(struct mmap_state *map,
-				struct vm_area_desc *desc)
+static int call_action_prepare(struct mmap_state *map,
+			       struct vm_area_desc *desc)
 {
-	struct mmap_action *action = &desc->action;
+	int err;

-	mmap_action_prepare(action, desc);
+	err = mmap_action_prepare(desc);
+	if (err)
+		return err;

-	if (action->hide_from_rmap_until_complete)
-		map->hold_file_rmap_lock = true;
+	return 0;
 }

 /*
@@ -2645,7 +2673,9 @@ static int call_mmap_prepare(struct mmap_state *map,
 	if (err)
 		return err;

-	call_action_prepare(map, desc);
+	err = call_action_prepare(map, desc);
+	if (err)
+		return err;

 	/* Update fields permitted to be changed. */
 	map->pgoff = desc->pgoff;
@@ -2699,33 +2729,15 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
 	return false;
 }

-static int call_action_complete(struct mmap_state *map,
-				struct vm_area_desc *desc,
-				struct vm_area_struct *vma)
-{
-	struct mmap_action *action = &desc->action;
-	int ret;
-
-	ret = mmap_action_complete(action, vma);
-
-	/* If we held the file rmap we need to release it. */
-	if (map->hold_file_rmap_lock) {
-		struct file *file = vma->vm_file;
-
-		i_mmap_unlock_write(file->f_mapping);
-	}
-	return ret;
-}
-
 static unsigned long __mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		unsigned long len, vma_flags_t vma_flags,
+		unsigned long pgoff, struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	bool have_mmap_prepare = file && file->f_op->mmap_prepare;
 	VMA_ITERATOR(vmi, mm, addr);
-	MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
+	MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vma_flags, file);
 	struct vm_area_desc desc = {
 		.mm = mm,
 		.file = file,
@@ -2756,7 +2768,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,

 	/* ...but if we can't, allocate a new VMA. */
 	if (!vma) {
-		error = __mmap_new_vma(&map, &vma);
+		error = __mmap_new_vma(&map, &vma, &desc.action);
 		if (error)
 			goto unacct_error;
 		allocated_new = true;
@@ -2768,8 +2780,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 	__mmap_complete(&map, vma);

 	if (have_mmap_prepare && allocated_new) {
-		error = call_action_complete(&map, &desc, vma);
-
+		error = mmap_action_complete(vma, &desc.action);
 		if (error)
 			return error;
 	}
@@ -2816,16 +2827,17 @@ abort_munmap:
  * been performed.
  */
 unsigned long mmap_region(struct file *file, unsigned long addr,
-			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-			  struct list_head *uf)
+			  unsigned long len, vm_flags_t vm_flags,
+			  unsigned long pgoff, struct list_head *uf)
 {
 	unsigned long ret;
 	bool writable_file_mapping = false;
+	const vma_flags_t vma_flags = legacy_to_vma_flags(vm_flags);

 	mmap_assert_write_locked(current->mm);

 	/* Check to see if MDWE is applicable. */
-	if (map_deny_write_exec(vm_flags, vm_flags))
+	if (map_deny_write_exec(&vma_flags, &vma_flags))
 		return -EACCES;

 	/* Allow architectures to sanity-check the vm_flags. */
@@ -2833,7 +2845,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		return -EINVAL;

 	/* Map writable and ensure this isn't a sealed memfd. */
-	if (file && is_shared_maywrite_vm_flags(vm_flags)) {
+	if (file && is_shared_maywrite(&vma_flags)) {
 		int error = mapping_map_writable(file->f_mapping);

 		if (error)
@@ -2841,7 +2853,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		writable_file_mapping = true;
 	}

-	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	ret = __mmap_region(file, addr, len, vma_flags, pgoff, uf);

 	/* Clear our write mapping regardless of error. */
 	if (writable_file_mapping)
@@ -2851,20 +2863,22 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	return ret;
 }

-/*
+/**
  * do_brk_flags() - Increase the brk vma if the flags match.
  * @vmi: The vma iterator
  * @addr: The start address
  * @len: The length of the increase
  * @vma: The vma,
- * @vm_flags: The VMA Flags
+ * @vma_flags: The VMA Flags
  *
  * Extend the brk VMA from addr to addr + len. If the VMA is NULL or the flags
  * do not match then create a new anonymous VMA. Eventually we may be able to
  * do some brk-specific accounting here.
+ *
+ * Returns: %0 on success, or otherwise an error.
  */
 int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		 unsigned long addr, unsigned long len, vm_flags_t vm_flags)
+		 unsigned long addr, unsigned long len, vma_flags_t vma_flags)
 {
 	struct mm_struct *mm = current->mm;

@@ -2872,12 +2886,15 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 * Check against address space limits by the changed size
 	 * Note: This happens *after* clearing old mappings in some code paths.
 	 */
-	vm_flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
-	vm_flags = ksm_vma_flags(mm, NULL, vm_flags);
-	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
+	vma_flags_set_mask(&vma_flags, VMA_DATA_DEFAULT_FLAGS);
+	vma_flags_set(&vma_flags, VMA_ACCOUNT_BIT);
+	vma_flags_set_mask(&vma_flags, mm->def_vma_flags);
+
+	vma_flags = ksm_vma_flags(mm, NULL, vma_flags);
+	if (!may_expand_vm(mm, &vma_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;

-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count > get_sysctl_max_map_count())
 		return -ENOMEM;

 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
@@ -2888,7 +2905,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 * occur after forking, so the expand will only happen on new VMAs.
 	 */
 	if (vma && vma->vm_end == addr) {
-		VMG_STATE(vmg, mm, vmi, addr, addr + len, vm_flags, PHYS_PFN(addr));
+		VMG_STATE(vmg, mm, vmi, addr, addr + len, vma_flags, PHYS_PFN(addr));

 		vmg.prev = vma;
 		/* vmi is positioned at prev, which this mode expects. */
@@ -2909,8 +2926,8 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,

 	vma_set_anonymous(vma);
 	vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
-	vm_flags_init(vma, vm_flags);
-	vma->vm_page_prot = vm_get_page_prot(vm_flags);
+	vma->flags = vma_flags;
+	vma->vm_page_prot = vm_get_page_prot(vma_flags_to_legacy(vma_flags));
 	vma_start_write(vma);
 	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
 		goto mas_store_fail;
@@ -2921,10 +2938,10 @@ out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
 	mm->data_vm += len >> PAGE_SHIFT;
-	if (vm_flags & VM_LOCKED)
+	if (vma_flags_test(&vma_flags, VMA_LOCKED_BIT))
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	if (pgtable_supports_soft_dirty())
-		vm_flags_set(vma, VM_SOFTDIRTY);
+		vma_set_flags(vma, VMA_SOFTDIRTY_BIT);
 	return 0;

 mas_store_fail:
@@ -2973,7 +2990,8 @@ retry:
 	gap = vma_iter_addr(&vmi) + info->start_gap;
 	gap += (info->align_offset - gap) & info->align_mask;
 	tmp = vma_next(&vmi);
-	if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
+	/* Avoid prev check if possible */
+	if (tmp && vma_test_any_mask(tmp, VMA_STARTGAP_FLAGS)) {
 		if (vm_start_gap(tmp) < gap + length - 1) {
 			low_limit = tmp->vm_end;
 			vma_iter_reset(&vmi);
@@ -3025,7 +3043,8 @@ retry:
 	gap -= (gap - info->align_offset) & info->align_mask;
 	gap_end = vma_iter_end(&vmi);
 	tmp = vma_next(&vmi);
-	if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
+	/* Avoid prev check if possible */
+	if (tmp && vma_test_any_mask(tmp, VMA_STARTGAP_FLAGS)) {
 		if (vm_start_gap(tmp) < gap_end) {
 			high_limit = vm_start_gap(tmp);
 			vma_iter_reset(&vmi);
@@ -3055,7 +3074,7 @@ static int acct_stack_growth(struct vm_area_struct *vma,
 	unsigned long new_start;

 	/* address space limit tests */
-	if (!may_expand_vm(mm, vma->vm_flags, grow))
+	if (!may_expand_vm(mm, &vma->flags, grow))
 		return -ENOMEM;

 	/* Stack limit test */
@@ -3063,12 +3082,16 @@ static int acct_stack_growth(struct vm_area_struct *vma,
 		return -ENOMEM;

 	/* mlock limit tests */
-	if (!mlock_future_ok(mm, vma->vm_flags & VM_LOCKED, grow << PAGE_SHIFT))
+	if (!mlock_future_ok(mm, vma_test(vma, VMA_LOCKED_BIT),
+			     grow << PAGE_SHIFT))
 		return -ENOMEM;

 	/* Check to ensure the stack will not grow into a hugetlb-only region */
-	new_start = (vma->vm_flags & VM_GROWSUP) ? vma->vm_start :
-			vma->vm_end - size;
+	new_start = vma->vm_end - size;
+#ifdef CONFIG_STACK_GROWSUP
+	if (vma_test(vma, VMA_GROWSUP_BIT))
+		new_start = vma->vm_start;
+#endif
 	if (is_hugepage_only_range(vma->vm_mm, new_start, size))
 		return -EFAULT;

@@ -3082,7 +3105,7 @@ static int acct_stack_growth(struct vm_area_struct *vma,
 	return 0;
 }

-#if defined(CONFIG_STACK_GROWSUP)
+#ifdef CONFIG_STACK_GROWSUP
 /*
  * PA-RISC uses this for its stack.
  * vma is the last one with address > vma->vm_end. Have to extend vma.
@@ -3095,7 +3118,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	int error = 0;
 	VMA_ITERATOR(vmi, mm, vma->vm_start);

-	if (!(vma->vm_flags & VM_GROWSUP))
+	if (!vma_test(vma, VMA_GROWSUP_BIT))
 		return -EFAULT;

 	mmap_assert_write_locked(mm);
@@ -3115,7 +3138,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)

 	next = find_vma_intersection(mm, vma->vm_end, gap_addr);
 	if (next && vma_is_accessible(next)) {
-		if (!(next->vm_flags & VM_GROWSUP))
+		if (!vma_test(next, VMA_GROWSUP_BIT))
 			return -ENOMEM;
 		/* Check that both stack segments have the same anon_vma? */
 	}
@@ -3149,7 +3172,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff) {
 			error = acct_stack_growth(vma, size, grow);
 			if (!error) {
-				if (vma->vm_flags & VM_LOCKED)
+				if (vma_test(vma, VMA_LOCKED_BIT))
 					mm->locked_vm += grow;
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
@@ -3180,7 +3203,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 	int error = 0;
 	VMA_ITERATOR(vmi, mm, vma->vm_start);

-	if (!(vma->vm_flags & VM_GROWSDOWN))
+	if (!vma_test(vma, VMA_GROWSDOWN_BIT))
 		return -EFAULT;

 	mmap_assert_write_locked(mm);
@@ -3193,7 +3216,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 	prev = vma_prev(&vmi);
 	/* Check that both stack segments have the same anon_vma? */
 	if (prev) {
-		if (!(prev->vm_flags & VM_GROWSDOWN) &&
+		if (!vma_test(prev, VMA_GROWSDOWN_BIT) &&
 		    vma_is_accessible(prev) &&
 		    (address - prev->vm_end < stack_guard_gap))
 			return -ENOMEM;
@@ -3228,7 +3251,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 		if (grow <= vma->vm_pgoff) {
 			error = acct_stack_growth(vma, size, grow);
 			if (!error) {
-				if (vma->vm_flags & VM_LOCKED)
+				if (vma_test(vma, VMA_LOCKED_BIT))
 					mm->locked_vm += grow;
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
@@ -3274,11 +3297,10 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	unsigned long charged = vma_pages(vma);

-
 	if (find_vma_intersection(mm, vma->vm_start, vma->vm_end))
 		return -ENOMEM;

-	if ((vma->vm_flags & VM_ACCOUNT) &&
+	if (vma_test(vma, VMA_ACCOUNT_BIT) &&
 	     security_vm_enough_memory_mm(mm, charged))
 		return -ENOMEM;

@@ -3300,10 +3322,31 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	}

 	if (vma_link(mm, vma)) {
-		if (vma->vm_flags & VM_ACCOUNT)
+		if (vma_test(vma, VMA_ACCOUNT_BIT))
 			vm_unacct_memory(charged);
 		return -ENOMEM;
 	}

 	return 0;
 }
+
+/**
+ * vma_mmu_pagesize - Default MMU page size granularity for this VMA.
+ * @vma: The user mapping.
+ *
+ * In the common case, the default page size used by the MMU matches the
+ * default page size used by the kernel (see vma_kernel_pagesize()). On
+ * architectures where it differs, an architecture-specific 'strong' version
+ * of this symbol is required.
+ *
+ * The default MMU page size is not affected by Transparent Huge Pages
+ * being in effect, or any usage of larger MMU page sizes (either through
+ * architectural huge-page mappings or other explicit/implicit coalescing of
+ * virtual ranges performed by the MMU).
+ *
+ * Return: The default MMU page size granularity for this VMA.
+ */
+__weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
+{
+	return vma_kernel_pagesize(vma);
+}
@@ -98,7 +98,11 @@ struct vma_merge_struct {
 	unsigned long end;
 	pgoff_t pgoff;

-	vm_flags_t vm_flags;
+	union {
+		/* Temporary while VMA flags are being converted. */
+		vm_flags_t vm_flags;
+		vma_flags_t vma_flags;
+	};
 	struct file *file;
 	struct anon_vma *anon_vma;
 	struct mempolicy *policy;
@@ -233,13 +237,13 @@ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma,
 	return vma->vm_pgoff + PHYS_PFN(addr - vma->vm_start);
 }

-#define VMG_STATE(name, mm_, vmi_, start_, end_, vm_flags_, pgoff_)	\
+#define VMG_STATE(name, mm_, vmi_, start_, end_, vma_flags_, pgoff_)	\
 	struct vma_merge_struct name = {				\
 		.mm = mm_,						\
 		.vmi = vmi_,						\
 		.start = start_,					\
 		.end = end_,						\
-		.vm_flags = vm_flags_,					\
+		.vma_flags = vma_flags_,				\
 		.pgoff = pgoff_,					\
 		.state = VMA_MERGE_START,				\
 	}
@@ -296,7 +300,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
  * f_op->mmap() but which might have an underlying file system which implements
  * f_op->mmap_prepare().
  */
-static inline void set_vma_from_desc(struct vm_area_struct *vma,
+static inline void compat_set_vma_from_desc(struct vm_area_struct *vma,
 		struct vm_area_desc *desc)
 {
 	/*
@@ -338,24 +342,23 @@ void unmap_region(struct unmap_desc *unmap);
  * @vma: The VMA containing the range @start to @end to be updated.
  * @start: The start of the range to update. May be offset within @vma.
  * @end: The exclusive end of the range to update, may be offset within @vma.
- * @vm_flags_ptr: A pointer to the VMA flags that the @start to @end range is
+ * @vma_flags_ptr: A pointer to the VMA flags that the @start to @end range is
  * about to be set to. On merge, this will be updated to include sticky flags.
  *
  * IMPORTANT: The actual modification being requested here is NOT applied,
  * rather the VMA is perhaps split, perhaps merged to accommodate the change,
  * and the caller is expected to perform the actual modification.
  *
- * In order to account for sticky VMA flags, the @vm_flags_ptr parameter points
+ * In order to account for sticky VMA flags, the @vma_flags_ptr parameter points
  * to the requested flags which are then updated so the caller, should they
  * overwrite any existing flags, correctly retains these.
  *
  * Returns: A VMA which contains the range @start to @end ready to have its
- * flags altered to *@vm_flags.
+ * flags altered to *@vma_flags.
  */
 __must_check struct vm_area_struct *vma_modify_flags(struct vma_iterator *vmi,
 		struct vm_area_struct *prev, struct vm_area_struct *vma,
-		unsigned long start, unsigned long end,
-		vm_flags_t *vm_flags_ptr);
+		unsigned long start, unsigned long end, vma_flags_t *vma_flags_ptr);

 /**
  * vma_modify_name() - Perform any necessary split/merge in preparation for
@@ -414,7 +417,7 @@ __must_check struct vm_area_struct *vma_modify_policy(struct vma_iterator *vmi,
  * @vma: The VMA containing the range @start to @end to be updated.
  * @start: The start of the range to update. May be offset within @vma.
  * @end: The exclusive end of the range to update, may be offset within @vma.
- * @vm_flags: The VMA flags that the @start to @end range is about to be set to.
+ * @vma_flags: The VMA flags that the @start to @end range is about to be set to.
  * @new_ctx: The userfaultfd context that the @start to @end range is about to
  * be set to.
  * @give_up_on_oom: If an out of memory condition occurs on merge, simply give
@@ -425,11 +428,11 @@ __must_check struct vm_area_struct *vma_modify_policy(struct vma_iterator *vmi,
  * and the caller is expected to perform the actual modification.
  *
  * Returns: A VMA which contains the range @start to @end ready to have its VMA
- * flags changed to @vm_flags and its userfaultfd context changed to @new_ctx.
+ * flags changed to @vma_flags and its userfaultfd context changed to @new_ctx.
  */
 __must_check struct vm_area_struct *vma_modify_flags_uffd(struct vma_iterator *vmi,
 		struct vm_area_struct *prev, struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, vm_flags_t vm_flags,
+		unsigned long start, unsigned long end, const vma_flags_t *vma_flags,
 		struct vm_userfaultfd_ctx new_ctx, bool give_up_on_oom);

 __must_check struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg);
@@ -461,7 +464,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);

 int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
-		unsigned long addr, unsigned long request, unsigned long flags);
+		unsigned long addr, unsigned long request,
+		vma_flags_t vma_flags);

 unsigned long unmapped_area(struct vm_unmapped_area_info *info);
 unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
@@ -523,6 +527,11 @@ static inline bool is_data_mapping(vm_flags_t flags)
 	return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
 }

+static inline bool is_data_mapping_vma_flags(const vma_flags_t *vma_flags)
+{
+	return vma_flags_test(vma_flags, VMA_WRITE_BIT) &&
+		!vma_flags_test_any(vma_flags, VMA_SHARED_BIT, VMA_STACK_BIT);
+}

 static inline void vma_iter_config(struct vma_iterator *vmi,
 		unsigned long index, unsigned long last)
@@ -693,4 +702,55 @@ int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
 int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
 #endif

+#ifdef CONFIG_MMU
+/*
+ * Denies creating a writable executable mapping or gaining executable permissions.
+ *
+ * This denies the following:
+ *
+ *	a)	mmap(PROT_WRITE | PROT_EXEC)
+ *
+ *	b)	mmap(PROT_WRITE)
+ *		mprotect(PROT_EXEC)
+ *
+ *	c)	mmap(PROT_WRITE)
+ *		mprotect(PROT_READ)
+ *		mprotect(PROT_EXEC)
+ *
+ * But allows the following:
+ *
+ *	d)	mmap(PROT_READ | PROT_EXEC)
+ *		mmap(PROT_READ | PROT_EXEC | PROT_BTI)
+ *
+ * This is only applicable if the user has set the Memory-Deny-Write-Execute
+ * (MDWE) protection mask for the current process.
+ *
+ * @old specifies the VMA flags the VMA originally possessed, and @new the ones
+ * we propose to set.
+ *
+ * Return: false if proposed change is OK, true if not ok and should be denied.
+ */
+static inline bool map_deny_write_exec(const vma_flags_t *old,
+				       const vma_flags_t *new)
+{
+	/* If MDWE is disabled, we have nothing to deny. */
+	if (!mm_flags_test(MMF_HAS_MDWE, current->mm))
+		return false;
+
+	/* If the new VMA is not executable, we have nothing to deny. */
+	if (!vma_flags_test(new, VMA_EXEC_BIT))
+		return false;
+
+	/* Under MDWE we do not accept newly writably executable VMAs... */
+	if (vma_flags_test(new, VMA_WRITE_BIT))
+		return true;
+
+	/* ...nor previously non-executable VMAs becoming executable. */
+	if (!vma_flags_test(old, VMA_EXEC_BIT))
+		return true;
+
+	return false;
+}
+#endif
+
 #endif /* __MM_VMA_H */
diff --git a/mm/vma_exec.c b/mm/vma_exec.c
index 8134e1afca68..5cee8b7efa0f 100644
--- a/mm/vma_exec.c
+++ b/mm/vma_exec.c
@@ -36,7 +36,8 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
 	VMA_ITERATOR(vmi, mm, new_start);
-	VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff);
+	VMG_STATE(vmg, mm, &vmi, new_start, old_end, EMPTY_VMA_FLAGS,
+		  vma->vm_pgoff);
 	struct vm_area_struct *next;
 	struct mmu_gather tlb;
 	PAGETABLE_MOVE(pmc, vma, vma, old_start, new_start, length);
@@ -135,7 +136,7 @@ int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
 	 * use STACK_TOP because that can depend on attributes which aren't
 	 * configured yet.
 	 */
-	BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
+	VM_WARN_ON_ONCE(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
 	vma->vm_end = STACK_TOP_MAX;
 	vma->vm_start = vma->vm_end - PAGE_SIZE;
 	if (pgtable_supports_soft_dirty())
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a4402..b31b208f6ecb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1068,14 +1068,8 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static void drain_vmap_area_work(struct work_struct *work);
 static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);

-static __cacheline_aligned_in_smp atomic_long_t nr_vmalloc_pages;
 static __cacheline_aligned_in_smp atomic_long_t vmap_lazy_nr;

-unsigned long vmalloc_nr_pages(void)
-{
-	return atomic_long_read(&nr_vmalloc_pages);
-}
-
 static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root)
 {
 	struct rb_node *n = root->rb_node;
@@ -3189,7 +3183,7 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
 }

-static void clear_vm_uninitialized_flag(struct vm_struct *vm)
+void clear_vm_uninitialized_flag(struct vm_struct *vm)
 {
 	/*
 	 * Before removing VM_UNINITIALIZED,
@@ -3465,9 +3459,6 @@ void vfree(const void *addr)

 	if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
 		vm_reset_perms(vm);
-	/* All pages of vm should be charged to same memcg, so use first one. */
-	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
-		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
 	for (i = 0; i < vm->nr_pages; i++) {
 		struct page *page = vm->pages[i];

@@ -3476,11 +3467,11 @@ void vfree(const void *addr)
 		 * High-order allocs for huge vmallocs are split, so
 		 * can be freed as an array of order-0 allocations
 		 */
+		if (!(vm->flags & VM_MAP_PUT_PAGES))
+			mod_lruvec_page_state(page, NR_VMALLOC, -1);
 		__free_page(page);
 		cond_resched();
 	}
-	if (!(vm->flags & VM_MAP_PUT_PAGES))
-		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
 	kvfree(vm->pages);
 	kfree(vm);
 }
@@ -3668,6 +3659,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 				continue;
 			}

+			mod_lruvec_page_state(page, NR_VMALLOC, 1 << large_order);
+
 			split_page(page, large_order);
 			for (i = 0; i < (1U << large_order); i++)
 				pages[nr_allocated + i] = page + i;
@@ -3688,6 +3681,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	if (!order) {
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
+			int i;

 			/*
 			 * A maximum allowed request is hard-coded and is 100
@@ -3711,6 +3705,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 							nr_pages_request,
 							pages + nr_allocated);

+			for (i = nr_allocated; i < nr_allocated + nr; i++)
+				mod_lruvec_page_state(pages[i], NR_VMALLOC, 1);
+
 			nr_allocated += nr;

 			/*
@@ -3735,6 +3732,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		if (unlikely(!page))
 			break;

+		mod_lruvec_page_state(page, NR_VMALLOC, 1 << order);
+
 		/*
 		 * High-order allocations must be able to be treated as
 		 * independent small pages by callers (as they can with
@@ -3798,6 +3797,8 @@ static void defer_vm_area_cleanup(struct vm_struct *area)
  * non-blocking (no __GFP_DIRECT_RECLAIM) - memalloc_noreclaim_save()
  * GFP_NOFS - memalloc_nofs_save()
  * GFP_NOIO - memalloc_noio_save()
+ * __GFP_RETRY_MAYFAIL, __GFP_NORETRY - memalloc_noreclaim_save()
+ *                                      to prevent OOMs
  *
  * Returns a flag cookie to pair with restore.
  */
@@ -3806,7 +3807,8 @@ memalloc_apply_gfp_scope(gfp_t gfp_mask)
 {
 	unsigned int flags = 0;

-	if (!gfpflags_allow_blocking(gfp_mask))
+	if (!gfpflags_allow_blocking(gfp_mask) ||
+	    (gfp_mask & (__GFP_RETRY_MAYFAIL | __GFP_NORETRY)))
 		flags = memalloc_noreclaim_save();
 	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
 		flags = memalloc_nofs_save();
@@ -3877,12 +3879,6 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		vmalloc_gfp_adjust(gfp_mask, page_order), node,
 		page_order, nr_small_pages, area->pages);

-	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
-	/* All pages of vm should be charged to same memcg, so use first one. */
-	if (gfp_mask & __GFP_ACCOUNT && area->nr_pages)
-		mod_memcg_page_state(area->pages[0], MEMCG_VMALLOC,
-				     area->nr_pages);
-
 	/*
 	 * If not enough pages were obtained to accomplish an
 	 * allocation request, free them via vfree() if any.
@@ -3901,7 +3897,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		if (!fatal_signal_pending(current) && page_order == 0)
 			warn_alloc(gfp_mask, NULL,
 				"vmalloc error: size %lu, failed to allocate pages",
-				area->nr_pages * PAGE_SIZE);
+				nr_small_pages * PAGE_SIZE);
 		goto fail;
 	}

@@ -3940,7 +3936,8 @@ fail:
  * GFP_KERNEL_ACCOUNT. Xfs uses __GFP_NOLOCKDEP.
  */
 #define GFP_VMALLOC_SUPPORTED (GFP_KERNEL | GFP_ATOMIC | GFP_NOWAIT |\
-			       __GFP_NOFAIL | __GFP_ZERO | __GFP_NORETRY |\
+			       __GFP_NOFAIL | __GFP_ZERO |\
+			       __GFP_NORETRY | __GFP_RETRY_MAYFAIL |\
 			       GFP_NOFS | GFP_NOIO | GFP_KERNEL_ACCOUNT |\
 			       GFP_USER | __GFP_NOLOCKDEP)

@@ -3971,12 +3968,15 @@ static gfp_t vmalloc_fix_flags(gfp_t flags)
  * virtual range with protection @prot.
  *
  * Supported GFP classes: %GFP_KERNEL, %GFP_ATOMIC, %GFP_NOWAIT,
- * %GFP_NOFS and %GFP_NOIO. Zone modifiers are not supported.
+ * %__GFP_RETRY_MAYFAIL, %__GFP_NORETRY, %GFP_NOFS and %GFP_NOIO.
+ * Zone modifiers are not supported.
  * Please note %GFP_ATOMIC and %GFP_NOWAIT are supported only
  * by __vmalloc().
  *
- * Retry modifiers: only %__GFP_NOFAIL is supported; %__GFP_NORETRY
- * and %__GFP_RETRY_MAYFAIL are not supported.
+ * Retry modifiers: only %__GFP_NOFAIL is fully supported;
+ * %__GFP_NORETRY and %__GFP_RETRY_MAYFAIL are supported with limitation,
+ * i.e. page tables are allocated with NOWAIT semantic so they might fail
+ * under moderate memory pressure.
  *
  * %__GFP_NOWARN can be used to suppress failure messages.
  *
@@ -4575,20 +4575,20 @@ finished:
  * @count: number of bytes to be read.
  *
  * This function checks that addr is a valid vmalloc'ed area, and
- * copy data from that area to a given buffer. If the given memory range
+ * copies data from that area to a given iterator. If the given memory range
  * of [addr...addr+count) includes some valid address, data is copied to
- * proper area of @buf. If there are memory holes, they'll be zero-filled.
+ * proper area of @iter. If there are memory holes, they'll be zero-filled.
  * IOREMAP area is treated as memory hole and no copy is done.
  *
  * If [addr...addr+count) doesn't includes any intersects with alive
- * vm_struct area, returns 0. @buf should be kernel's buffer.
+ * vm_struct area, returns 0.
  *
- * Note: In usual ops, vread() is never necessary because the caller
+ * Note: In usual ops, vread_iter() is never necessary because the caller
  * should know vmalloc() area is valid and can use memcpy().
  * This is for routines which have to access vmalloc area without
  * any information, as /proc/kcore.
  *
- * Return: number of bytes for which addr and buf should be increased
+ * Return: number of bytes for which addr and iter should be advanced
  * (same number as @count) or %0 if [addr...addr+count) doesn't
  * include any intersection with valid vmalloc area
  */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..4bf091b1c8af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -44,7 +44,7 @@
 #include <linux/sysctl.h>
 #include <linux/memory-tiers.h>
 #include <linux/oom.h>
-#include <linux/pagevec.h>
+#include <linux/folio_batch.h>
 #include <linux/prefetch.h>
 #include <linux/printk.h>
 #include <linux/dax.h>
@@ -905,7 +905,7 @@ static enum folio_references folio_check_references(struct folio *folio,
 	if (referenced_ptes == -1)
 		return FOLIOREF_KEEP;

-	if (lru_gen_enabled()) {
+	if (lru_gen_enabled() && !lru_gen_switching()) {
 		if (!referenced_ptes)
 			return FOLIOREF_RECLAIM;

@@ -963,8 +963,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
 	 * They could be mistakenly treated as file lru. So further anon
 	 * test is needed.
 	 */
-	if (!folio_is_file_lru(folio) ||
-	    (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
+	if (!folio_is_file_lru(folio) || folio_test_lazyfree(folio)) {
 		*dirty = false;
 		*writeback = false;
 		return;
@@ -986,13 +985,11 @@ static void folio_check_dirty_writeback(struct folio *folio,
 static struct folio *alloc_demote_folio(struct folio *src,
 		unsigned long private)
 {
+	struct migration_target_control *mtc, target_nid_mtc;
 	struct folio *dst;
-	nodemask_t *allowed_mask;
-	struct migration_target_control *mtc;

 	mtc = (struct migration_target_control *)private;

-	allowed_mask = mtc->nmask;
 	/*
 	 * make sure we allocate from the target node first also trying to
 	 * demote or reclaim pages from the target node via kswapd if we are
@@ -1002,15 +999,13 @@ static struct folio *alloc_demote_folio(struct folio *src,
 	 * a demotion of cold pages from the target memtier. This can result
 	 * in the kernel placing hot pages in slower(lower) memory tiers.
 	 */
-	mtc->nmask = NULL;
-	mtc->gfp_mask |= __GFP_THISNODE;
-	dst = alloc_migration_target(src, (unsigned long)mtc);
+	target_nid_mtc = *mtc;
+	target_nid_mtc.nmask = NULL;
+	target_nid_mtc.gfp_mask |= __GFP_THISNODE;
+	dst = alloc_migration_target(src, (unsigned long)&target_nid_mtc);
 	if (dst)
 		return dst;

-	mtc->gfp_mask &= ~__GFP_THISNODE;
-	mtc->nmask = allowed_mask;
-
 	return alloc_migration_target(src, (unsigned long)mtc);
 }

@@ -1070,7 +1065,7 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
 	/*
 	 * We can "enter_fs" for swap-cache with only __GFP_IO
 	 * providing this isn't SWP_FS_OPS.
-	 * ->flags can be updated non-atomically (scan_swap_map_slots),
+	 * ->flags can be updated non-atomically,
 	 * but that will never affect SWP_FS_OPS, so the data_race
 	 * is safe.
 	 */
@@ -1508,7 +1503,7 @@ retry:
 			}
 		}

-		if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+		if (folio_test_lazyfree(folio)) {
 			/* follow __remove_mapping for reference */
 			if (!folio_ref_freeze(folio, 1))
 				goto keep_locked;
@@ -1984,7 +1979,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	unsigned long nr_taken;
 	struct reclaim_stat stat;
 	bool file = is_file_lru(lru);
-	enum vm_event_item item;
+	enum node_stat_item item;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	bool stalled = false;

@@ -2010,10 +2005,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,

 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	item = PGSCAN_KSWAPD + reclaimer_offset(sc);
-	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, nr_scanned);
-	count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
-	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
+	mod_lruvec_state(lruvec, item, nr_scanned);
+	mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned);

 	spin_unlock_irq(&lruvec->lru_lock);

@@ -2030,10 +2023,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 					stat.nr_demoted);
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
-	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, nr_reclaimed);
-	count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
-	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
+	mod_lruvec_state(lruvec, item, nr_reclaimed);
+	mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed);

 	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
 					nr_scanned - nr_reclaimed);
@@ -2120,9 +2111,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);

-	if (!cgroup_reclaim(sc))
-		__count_vm_events(PGREFILL, nr_scanned);
-	count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
+	mod_lruvec_state(lruvec, PGREFILL, nr_scanned);

 	spin_unlock_irq(&lruvec->lru_lock);

@@ -2319,7 +2308,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;

-	if (lru_gen_enabled())
+	if (lru_gen_enabled() && !lru_gen_switching())
 		return;

 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -2658,6 +2647,7 @@ static bool can_age_anon_pages(struct lruvec *lruvec,

 #ifdef CONFIG_LRU_GEN

+DEFINE_STATIC_KEY_FALSE(lru_switch);
 #ifdef CONFIG_LRU_GEN_ENABLED
 DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)	static_branch_likely(&lru_gen_caps[cap])
@@ -3506,6 +3496,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int gen = lru_gen_from_seq(max_seq);
+	unsigned int nr;
 	pmd_t pmdval;

 	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
@@ -3524,11 +3515,13 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,

 	lazy_mmu_mode_enable();
 restart:
-	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+	for (i = pte_index(start), addr = start; addr != end; i += nr, addr += nr * PAGE_SIZE) {
 		unsigned long pfn;
 		struct folio *folio;
-		pte_t ptent = ptep_get(pte + i);
+		pte_t *cur_pte = pte + i;
+		pte_t ptent = ptep_get(cur_pte);

+		nr = 1;
 		total++;
 		walk->mm_stats[MM_LEAF_TOTAL]++;

@@ -3540,7 +3533,16 @@ restart:
 		if (!folio)
 			continue;

-		if (!ptep_clear_young_notify(args->vma, addr, pte + i))
+		if (folio_test_large(folio)) {
+			const unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
+
+			nr = folio_pte_batch_flags(folio, NULL, cur_pte, &ptent,
+						   max_nr, FPB_MERGE_YOUNG_DIRTY);
+			total += nr - 1;
+			walk->mm_stats[MM_LEAF_TOTAL] += nr - 1;
+		}
+
+		if (!test_and_clear_young_ptes_notify(args->vma, addr, cur_pte, nr))
 			continue;

 		if (last != folio) {
@@ -3553,8 +3555,8 @@ restart:
 		if (pte_dirty(ptent))
 			dirty = true;

-		young++;
-		walk->mm_stats[MM_LEAF_YOUNG]++;
+		young += nr;
+		walk->mm_stats[MM_LEAF_YOUNG] += nr;
 	}

 	walk_update_folio(walk, last, gen, dirty);
@@ -3631,7 +3633,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 		if (!folio)
 			goto next;

-		if (!pmdp_clear_young_notify(vma, addr, pmd + i))
+		if (!pmdp_test_and_clear_young_notify(vma, addr, pmd + i))
 			goto next;

 		if (last != folio) {
@@ -4198,7 +4200,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  * the PTE table to the Bloom filter. This forms a feedback loop between the
  * eviction and the aging.
*/ -bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr) { int i; bool dirty; @@ -4221,7 +4223,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) lockdep_assert_held(pvmw->ptl); VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); - if (!ptep_clear_young_notify(vma, addr, pte)) + if (!test_and_clear_young_ptes_notify(vma, addr, pte, nr)) return false; if (spin_is_contended(pvmw->ptl)) @@ -4255,10 +4257,12 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) pte -= (addr - start) / PAGE_SIZE; - for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { + for (i = 0, addr = start; addr != end; + i += nr, pte += nr, addr += nr * PAGE_SIZE) { unsigned long pfn; - pte_t ptent = ptep_get(pte + i); + pte_t ptent = ptep_get(pte); + nr = 1; pfn = get_pte_pfn(ptent, vma, addr, pgdat); if (pfn == -1) continue; @@ -4267,7 +4271,14 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) if (!folio) continue; - if (!ptep_clear_young_notify(vma, addr, pte + i)) + if (folio_test_large(folio)) { + const unsigned int max_nr = (end - addr) >> PAGE_SHIFT; + + nr = folio_pte_batch_flags(folio, NULL, pte, &ptent, + max_nr, FPB_MERGE_YOUNG_DIRTY); + } + + if (!test_and_clear_young_ptes_notify(vma, addr, pte, nr)) continue; if (last != folio) { @@ -4280,7 +4291,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) if (pte_dirty(ptent)) dirty = true; - young++; + young += nr; } walk_update_folio(walk, last, gen, dirty); @@ -4543,7 +4554,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, { int i; int gen; - enum vm_event_item item; + enum node_stat_item item; int sorted = 0; int scanned = 0; int isolated = 0; @@ -4551,7 +4562,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, int scan_batch = min(nr_to_scan, MAX_LRU_BATCH); int remaining = scan_batch; struct lru_gen_folio *lrugen = &lruvec->lrugen; - struct 
mem_cgroup *memcg = lruvec_memcg(lruvec); VM_WARN_ON_ONCE(!list_empty(list)); @@ -4602,13 +4612,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, } item = PGSCAN_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) { - __count_vm_events(item, isolated); - __count_vm_events(PGREFILL, sorted); - } - count_memcg_events(memcg, item, isolated); - count_memcg_events(memcg, PGREFILL, sorted); - __count_vm_events(PGSCAN_ANON + type, isolated); + mod_lruvec_state(lruvec, item, isolated); + mod_lruvec_state(lruvec, PGREFILL, sorted); + mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated); trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch, scanned, skipped, isolated, type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); @@ -4693,7 +4699,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, LIST_HEAD(clean); struct folio *folio; struct folio *next; - enum vm_event_item item; + enum node_stat_item item; struct reclaim_stat stat; struct lru_gen_mm_walk *walk; bool skip_retry = false; @@ -4757,10 +4763,8 @@ retry: stat.nr_demoted); item = PGSTEAL_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) - __count_vm_events(item, reclaimed); - count_memcg_events(memcg, item, reclaimed); - __count_vm_events(PGSTEAL_ANON + type, reclaimed); + mod_lruvec_state(lruvec, item, reclaimed); + mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); spin_unlock_irq(&lruvec->lru_lock); @@ -5178,6 +5182,8 @@ static void lru_gen_change_state(bool enabled) if (enabled == lru_gen_enabled()) goto unlock; + static_branch_enable_cpuslocked(&lru_switch); + if (enabled) static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); else @@ -5208,6 +5214,9 @@ static void lru_gen_change_state(bool enabled) cond_resched(); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); + + static_branch_disable_cpuslocked(&lru_switch); + unlock: mutex_unlock(&state_mutex); put_online_mems(); @@ -5780,9 +5789,12 @@ static void 
shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) bool proportional_reclaim; struct blk_plug plug; - if (lru_gen_enabled() && !root_reclaim(sc)) { + if ((lru_gen_enabled() || lru_gen_switching()) && !root_reclaim(sc)) { lru_gen_shrink_lruvec(lruvec, sc); - return; + + if (!lru_gen_switching()) + return; + } get_scan_count(lruvec, sc, nr); @@ -6042,10 +6054,13 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) struct lruvec *target_lruvec; bool reclaimable = false; - if (lru_gen_enabled() && root_reclaim(sc)) { + if ((lru_gen_enabled() || lru_gen_switching()) && root_reclaim(sc)) { memset(&sc->nr, 0, sizeof(sc->nr)); lru_gen_shrink_node(pgdat, sc); - return; + + if (!lru_gen_switching()) + return; + } target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); @@ -6315,7 +6330,7 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) struct lruvec *target_lruvec; unsigned long refaults; - if (lru_gen_enabled()) + if (lru_gen_enabled() && !lru_gen_switching()) return; target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); @@ -6596,11 +6611,11 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, return 1; set_task_reclaim_state(current, &sc.reclaim_state); - trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask); + trace_mm_vmscan_direct_reclaim_begin(sc.gfp_mask, order, 0); nr_reclaimed = do_try_to_free_pages(zonelist, &sc); - trace_mm_vmscan_direct_reclaim_end(nr_reclaimed); + trace_mm_vmscan_direct_reclaim_end(nr_reclaimed, 0); set_task_reclaim_state(current, NULL); return nr_reclaimed; @@ -6629,8 +6644,9 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); - trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order, - sc.gfp_mask); + trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.gfp_mask, + sc.order, + memcg); /* * NOTE: Although we can get the priority field, using it @@ 
-6641,7 +6657,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, */ shrink_lruvec(lruvec, &sc); - trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); + trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed, memcg); *nr_scanned = sc.nr_scanned; @@ -6677,13 +6693,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); set_task_reclaim_state(current, &sc.reclaim_state); - trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask); + trace_mm_vmscan_memcg_reclaim_begin(sc.gfp_mask, 0, memcg); noreclaim_flag = memalloc_noreclaim_save(); nr_reclaimed = do_try_to_free_pages(zonelist, &sc); memalloc_noreclaim_restore(noreclaim_flag); - trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); + trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed, memcg); set_task_reclaim_state(current, NULL); return nr_reclaimed; @@ -6704,9 +6720,12 @@ static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc) struct mem_cgroup *memcg; struct lruvec *lruvec; - if (lru_gen_enabled()) { + if (lru_gen_enabled() || lru_gen_switching()) { lru_gen_age_node(pgdat, sc); - return; + + if (!lru_gen_switching()) + return; + } lruvec = mem_cgroup_lruvec(NULL, pgdat); @@ -7657,7 +7676,7 @@ static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, delayacct_freepages_end(); psi_memstall_leave(&pflags); - trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed); + trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed, 0); return sc->nr_reclaimed; } diff --git a/mm/vmstat.c b/mm/vmstat.c index ac9affbe48b7..c360c1b29ac9 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -547,7 +547,7 @@ EXPORT_SYMBOL(__dec_node_page_state); #ifdef CONFIG_HAVE_CMPXCHG_LOCAL /* * If we have cmpxchg_local support then we do not need to incur the overhead - * that comes with local_irq_save/restore if we use this_cpu_cmpxchg. 
+ * that comes with local_irq_save/restore if we use this_cpu_try_cmpxchg(). * * mod_state() modifies the zone counter state through atomic per cpu * operations. @@ -1255,6 +1255,7 @@ const char * const vmstat_text[] = { [I(NR_KERNEL_MISC_RECLAIMABLE)] = "nr_kernel_misc_reclaimable", [I(NR_FOLL_PIN_ACQUIRED)] = "nr_foll_pin_acquired", [I(NR_FOLL_PIN_RELEASED)] = "nr_foll_pin_released", + [I(NR_VMALLOC)] = "nr_vmalloc", [I(NR_KERNEL_STACK_KB)] = "nr_kernel_stack", #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) [I(NR_KERNEL_SCS_KB)] = "nr_shadow_call_stack", @@ -1276,6 +1277,19 @@ const char * const vmstat_text[] = { [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", + [I(PGSTEAL_ANON)] = "pgsteal_anon", + [I(PGSTEAL_FILE)] = "pgsteal_file", + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", + [I(PGSCAN_DIRECT)] = "pgscan_direct", + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", + [I(PGSCAN_ANON)] = "pgscan_anon", + [I(PGSCAN_FILE)] = "pgscan_file", + [I(PGREFILL)] = "pgrefill", #ifdef CONFIG_HUGETLB_PAGE [I(NR_HUGETLB)] = "nr_hugetlb", #endif @@ -1320,21 +1334,8 @@ const char * const vmstat_text[] = { [I(PGMAJFAULT)] = "pgmajfault", [I(PGLAZYFREED)] = "pglazyfreed", - [I(PGREFILL)] = "pgrefill", [I(PGREUSE)] = "pgreuse", - [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", - [I(PGSTEAL_DIRECT)] = "pgsteal_direct", - [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", - [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", - [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", - [I(PGSCAN_DIRECT)] = "pgscan_direct", - [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", - [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", [I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", - [I(PGSCAN_ANON)] = "pgscan_anon", - 
[I(PGSCAN_FILE)] = "pgscan_file", - [I(PGSTEAL_ANON)] = "pgsteal_anon", - [I(PGSTEAL_FILE)] = "pgsteal_file", #ifdef CONFIG_NUMA [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", diff --git a/mm/workingset.c b/mm/workingset.c index 13422d304715..37a94979900f 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -16,6 +16,7 @@ #include <linux/dax.h> #include <linux/fs.h> #include <linux/mm.h> +#include "swap_table.h" #include "internal.h" /* @@ -184,7 +185,9 @@ #define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ WORKINGSET_SHIFT + NODES_SHIFT + \ MEM_CGROUP_ID_SHIFT) +#define EVICTION_SHIFT_ANON (EVICTION_SHIFT + SWAP_COUNT_SHIFT) #define EVICTION_MASK (~0UL >> EVICTION_SHIFT) +#define EVICTION_MASK_ANON (~0UL >> EVICTION_SHIFT_ANON) /* * Eviction timestamps need to be able to cover the full range of @@ -194,12 +197,12 @@ * that case, we have to sacrifice granularity for distance, and group * evictions into coarser buckets by shaving off lower timestamp bits. */ -static unsigned int bucket_order __read_mostly; +static unsigned int bucket_order[ANON_AND_FILE] __read_mostly; static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, - bool workingset) + bool workingset, bool file) { - eviction &= EVICTION_MASK; + eviction &= file ? 
EVICTION_MASK : EVICTION_MASK_ANON; eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; eviction = (eviction << NODES_SHIFT) | pgdat->node_id; eviction = (eviction << WORKINGSET_SHIFT) | workingset; @@ -244,7 +247,8 @@ static void *lru_gen_eviction(struct folio *folio) struct mem_cgroup *memcg = folio_memcg(folio); struct pglist_data *pgdat = folio_pgdat(folio); - BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); + BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > + BITS_PER_LONG - max(EVICTION_SHIFT, EVICTION_SHIFT_ANON)); lruvec = mem_cgroup_lruvec(memcg, pgdat); lrugen = &lruvec->lrugen; @@ -254,7 +258,7 @@ static void *lru_gen_eviction(struct folio *folio) hist = lru_hist_from_seq(min_seq); atomic_long_add(delta, &lrugen->evicted[hist][type][tier]); - return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset); + return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset, type); } /* @@ -262,7 +266,7 @@ static void *lru_gen_eviction(struct folio *folio) * Fills in @lruvec, @token, @workingset with the values unpacked from shadow. */ static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec, - unsigned long *token, bool *workingset) + unsigned long *token, bool *workingset, bool file) { int memcg_id; unsigned long max_seq; @@ -275,7 +279,7 @@ static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec, *lruvec = mem_cgroup_lruvec(memcg, pgdat); max_seq = READ_ONCE((*lruvec)->lrugen.max_seq); - max_seq &= EVICTION_MASK >> LRU_REFS_WIDTH; + max_seq &= (file ? 
EVICTION_MASK : EVICTION_MASK_ANON) >> LRU_REFS_WIDTH; return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS; } @@ -293,7 +297,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow) rcu_read_lock(); - recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset); + recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset, type); if (lruvec != folio_lruvec(folio)) goto unlock; @@ -331,7 +335,7 @@ static void *lru_gen_eviction(struct folio *folio) } static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec, - unsigned long *token, bool *workingset) + unsigned long *token, bool *workingset, bool file) { return false; } @@ -381,6 +385,7 @@ void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages) void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) { struct pglist_data *pgdat = folio_pgdat(folio); + int file = folio_is_file_lru(folio); unsigned long eviction; struct lruvec *lruvec; int memcgid; @@ -397,10 +402,10 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) /* XXX: target_memcg can be NULL, go through lruvec */ memcgid = mem_cgroup_private_id(lruvec_memcg(lruvec)); eviction = atomic_long_read(&lruvec->nonresident_age); - eviction >>= bucket_order; + eviction >>= bucket_order[file]; workingset_age_nonresident(lruvec, folio_nr_pages(folio)); return pack_shadow(memcgid, pgdat, eviction, - folio_test_workingset(folio)); + folio_test_workingset(folio), file); } /** @@ -431,14 +436,15 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset, bool recent; rcu_read_lock(); - recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction, workingset); + recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction, + workingset, file); rcu_read_unlock(); return recent; } rcu_read_lock(); unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset); - eviction <<= bucket_order; + eviction <<= bucket_order[file]; /* * Look 
up the memcg associated with the stored ID. It might @@ -495,7 +501,8 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset, * longest time, so the occasional inappropriate activation * leading to pressure on the active list is not a problem. */ - refault_distance = (refault - eviction) & EVICTION_MASK; + refault_distance = ((refault - eviction) & + (file ? EVICTION_MASK : EVICTION_MASK_ANON)); /* * Compare the distance to the existing workingset size. We @@ -780,8 +787,8 @@ static struct lock_class_key shadow_nodes_key; static int __init workingset_init(void) { + unsigned int timestamp_bits, timestamp_bits_anon; struct shrinker *workingset_shadow_shrinker; - unsigned int timestamp_bits; unsigned int max_order; int ret = -ENOMEM; @@ -794,11 +801,15 @@ static int __init workingset_init(void) * double the initial memory by using totalram_pages as-is. */ timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT; + timestamp_bits_anon = BITS_PER_LONG - EVICTION_SHIFT_ANON; max_order = fls_long(totalram_pages() - 1); - if (max_order > timestamp_bits) - bucket_order = max_order - timestamp_bits; - pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n", - timestamp_bits, max_order, bucket_order); + if (max_order > (BITS_PER_LONG - EVICTION_SHIFT)) + bucket_order[WORKINGSET_FILE] = max_order - timestamp_bits; + if (max_order > timestamp_bits_anon) + bucket_order[WORKINGSET_ANON] = max_order - timestamp_bits_anon; + pr_info("workingset: timestamp_bits=%d (anon: %d) max_order=%d bucket_order=%u (anon: %d)\n", + timestamp_bits, timestamp_bits_anon, max_order, + bucket_order[WORKINGSET_FILE], bucket_order[WORKINGSET_ANON]); workingset_shadow_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 2c1430bf8d57..63128ddb7959 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1727,7 +1727,19 @@ static int zs_page_migrate(struct page *newpage, struct page *page, if 
(!zspage_write_trylock(zspage)) { spin_unlock(&class->lock); write_unlock(&pool->lock); - return -EINVAL; + /* + * Return -EBUSY but not -EAGAIN: the zspage's reader-lock + * owner may hold the lock for an unbounded duration due to a + * slow decompression or reader-lock owner preemption. + * Since migration retries are bounded by + * NR_MAX_MIGRATE_PAGES_RETRY and performed with virtually no + * delay between attempts, there is no guarantee the lock will + * be released in time for a retry to succeed. + * -EAGAIN implies "try again soon", which does not hold here. + * -EBUSY more accurately conveys "resource is occupied, + * migration cannot proceed". + */ + return -EBUSY; } /* We're committed, tell the world that this is a Zsmalloc page. */ @@ -1741,6 +1753,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page, */ d_addr = kmap_local_zpdesc(newzpdesc); copy_page(d_addr, s_addr); + kmsan_copy_page_meta(zpdesc_page(newzpdesc), zpdesc_page(zpdesc)); kunmap_local(d_addr); for (addr = s_addr + offset; addr < s_addr + PAGE_SIZE; diff --git a/mm/zswap.c b/mm/zswap.c index 16b2ef7223e1..0823cadd02b6 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1595,11 +1595,11 @@ int zswap_load(struct folio *folio) { swp_entry_t swp = folio->swap; pgoff_t offset = swp_offset(swp); - bool swapcache = folio_test_swapcache(folio); struct xarray *tree = swap_zswap_tree(swp); struct zswap_entry *entry; VM_WARN_ON_ONCE(!folio_test_locked(folio)); + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); if (zswap_never_enabled()) return -ENOENT; @@ -1630,22 +1630,15 @@ int zswap_load(struct folio *folio) count_objcg_events(entry->objcg, ZSWPIN, 1); /* - * When reading into the swapcache, invalidate our entry. The - * swapcache can be the authoritative owner of the page and + * We are reading into the swapcache, invalidate zswap entry. 
+ * The swapcache is the authoritative owner of the page and * its mappings, and the pressure that results from having two * in-memory copies outweighs any benefits of caching the * compression work. - * - * (Most swapins go through the swapcache. The notable - * exception is the singleton fault on SWP_SYNCHRONOUS_IO - * files, which reads into a private page and may free it if - * the fault fails. We remain the primary owner of the entry.) */ - if (swapcache) { - folio_mark_dirty(folio); - xa_erase(tree, offset); - zswap_entry_free(entry); - } + folio_mark_dirty(folio); + xa_erase(tree, offset); + zswap_entry_free(entry); folio_unlock(folio); return 0; |
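Note on the mm/workingset.c hunk above: workingset_init() now derives a separate bucket_order for anon and file shadow entries, because anon entries give up SWAP_COUNT_SHIFT additional bits. The arithmetic can be sketched in userspace as follows. This is a minimal sketch, not the kernel code: fls_long_sketch(), bucket_order_sketch() and SKETCH_BITS_PER_LONG are hypothetical stand-ins, a 64-bit long is assumed, and the shift values in the usage note are invented for illustration (the real EVICTION_SHIFT depends on the kernel configuration).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-bit longs, standing in for the kernel's BITS_PER_LONG. */
#define SKETCH_BITS_PER_LONG 64u

/*
 * Hypothetical stand-in for the kernel's fls_long(): returns the 1-based
 * position of the highest set bit, or 0 when x is 0.
 */
static unsigned int fls_long_sketch(uint64_t x)
{
	unsigned int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/*
 * Mirrors the computation in the hunk: if the number of bits needed to
 * cover all of RAM (max_order) exceeds the bits left for the eviction
 * timestamp, the excess low bits are shaved off (the bucket order).
 * Otherwise no granularity is sacrificed and the order is 0.
 */
static unsigned int bucket_order_sketch(uint64_t totalram_pages,
					unsigned int eviction_shift)
{
	unsigned int timestamp_bits, max_order;

	assert(totalram_pages > 0);
	assert(eviction_shift < SKETCH_BITS_PER_LONG);

	timestamp_bits = SKETCH_BITS_PER_LONG - eviction_shift;
	max_order = fls_long_sketch(totalram_pages - 1);

	return max_order > timestamp_bits ? max_order - timestamp_bits : 0;
}
```

With 2^20 pages, max_order is 20. A made-up shift of 40 leaves 24 timestamp bits, so bucket_order_sketch(1ULL << 20, 40) yields 0; a larger made-up shift of 50 (as an anon entry with extra swap-count bits might have) leaves 14 bits, so bucket_order_sketch(1ULL << 20, 50) yields 6. That difference is why the patch turns bucket_order into a two-element array indexed by anon/file.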
