summaryrefslogtreecommitdiff
path: root/mm/debug.c
AgeCommit message (Collapse)Author
2025-03-17mm/debug: add line breaksLiu Ye
Missing a newline character at the end of the format string. Link: https://lkml.kernel.org/r/20250312093717.364031-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: move _pincount in folio to page[2] on 32bitDavid Hildenbrand
Let's free up some space on 32bit in page[1] by moving the _pincount to page[2]. For order-1 folios (never anon folios!) on 32bit, we will now also use the GUP_PIN_COUNTING_BIAS approach. A fully-mapped order-1 folio requires 2 references. With GUP_PIN_COUNTING_BIAS being 1024, we'd detect such folios as "maybe pinned" with 512 full mappings, instead of 1024 for order-0. As anon folios are out of the picture (which are the most relevant users of checking for pinnings on *mapped* pages) and we are talking about 32bit, this is not expected to cause any trouble. In __dump_page(), copy one additional folio page if we detect a folio with an order > 1, so we can dump the pincount on order > 1 folios reliably. Note that THPs on 32bit are not particularly common (and we don't care too much about performance), but we want to keep it working reliably, because likely we want to use large folios there as well in the future, independent of PMD leaf support. Once we dynamically allocate "struct folio", fortunately the 32bit specifics will likely go away again; even small folios could then have a pincount and folio_has_pincount() would essentially always return "true". Link: https://lkml.kernel.org/r/20250303163014.1128035-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/debug: print vm_refcnt state when dumping the vmaSuren Baghdasaryan
vm_refcnt encodes a number of useful states: - whether vma is attached or detached - the number of current vma readers - presence of a vma writer Let's include it in the vma dump. Link: https://lkml.kernel.org/r/20250213224655.1680278-15-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: eliminate adj_start parameter from commit_merge()Lorenzo Stoakes
Introduce internal vmg->__adjust_middle_start and vmg->__adjust_next_start merge flags, enabling us to indicate to commit_merge() that we are performing a merge which either spans only part of vmg->middle, or part of vmg->next respectively. In the former instance, we change the start of vmg->middle to match the attributes of vmg->prev, without spanning all of vmg->middle. This implies that vmg->prev->vm_end and vmg->middle->vm_start are both increased to form the new merged VMA (vmg->prev) and the new subsequent VMA (vmg->middle). In the latter case, we change the end of vmg->middle to match the attributes of vmg->next, without spanning all of vmg->next. This implies that vmg->middle->vm_end and vmg->next->vm_start are both decreased to form the new merged VMA (vmg->next) and the new prior VMA (vmg->middle). Since we now have a stable set of prev, middle, next VMAs threaded through vmg and with these flags set know what is happening, we can perform the calculation in commit_merge() instead. This allows us to drop the confusing adj_start parameter and instead pass semantic information to commit_merge(). In the latter case the -(middle->vm_end - start) calculation becomes -(middle->vm-end - vmg->end), however this is correct as vmg->end is set to the start parameter. This is because in this case (rather confusingly), we manipulate vmg->middle, but ultimately return vmg->next, whose range will be correctly specified. At this point vmg->start, end is the new range for the prior VMA rather than the merged one. This patch has no change in functional behaviour. Link: https://lkml.kernel.org/r/bcec0cd980b373a5eb02236cb033034ce1effe42.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: further refactor commit_merge()Lorenzo Stoakes
The current VMA merge mechanism contains a number of confusing mechanisms around removal of VMAs on merge and the shrinking of the VMA adjacent to vma->target in the case of merges which result in a partial merge with that adjacent VMA. Since we now have a STABLE set of VMAs - prev, middle, next - we are now able to have the caller of commit_merge() explicitly tell us which VMAs need deleting, using newly introduced internal VMA merge flags. Doing so allows us to embed this state within the VMG and remove the confusing remove, remove2 parameters from commit_merge(). We additionally are able to eliminate the highly confusing and misleading 'expanded' parameter - a parameter that in reality refers to whether or not the return VMA is the target one or the one immediately adjacent. We can infer which is the case from whether or not the adj_start parameter is negative. This also allows us to simplify further logic around iterator configuration and VMA iterator stores. Doing so means we can also eliminate the adjust parameter, as we are able to infer which VMA ought to be adjusted from adj_start - a positive value implies we adjust the start of 'middle', a negative one implies we adjust the start of 'next'. We are then able to have commit_merge() explicitly return the target VMA, or NULL on inability to pre-allocate memory. Errors were previously filtered so behaviour does not change. We additionally move from the slightly odd use of a bitwise-flag enum vmg->merge_flags field to vmg bitfields. This patch has no change in functional behaviour. Link: https://lkml.kernel.org/r/7bf2ed24af68aac18672b7acebbd9102f48c5b03.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: simplify vma merge structure and expand commentsLorenzo Stoakes
Patch series "mm: further simplify VMA merge operation", v3. While significant efforts have been made to improve the VMA merge operation, there remains remnants of the bad (or rather confusing) old days, which make the code difficult to understand, more bug prone and thus harder to modify. This series attempts to significantly improve matters in a number of respects - with a focus on simplifying the commit_merge() function which actually actions the merge operation - and importantly, adjusting the two most confusing merge cases - those in which we 'adjust' the VMA immediately adjacent to the one being merged. One source of confusion are the VMAs being threaded through the operation themselves - vmg->prev, vmg->vma and vmg->next. At the start of the operation, vmg->vma is either NULL if a new VMA is propose to be added, or if not then a pointer to an existing VMA being modified, and prev/next are (perhaps not present) VMAs sat immediately before and after the range specified in vmg->start, end, respectively. However, during the VMA merge operation, we change vmg->start, end and pgoff to span the newly merged range and vmg->vma to either be: a. The ultimately returned VMA (in most cases) or b. A VMA which we will manipulate, but ultimately instead return vmg->next. Case b. especially here is confusing for somebody reading this code, but the fact we update this state, along with vmg->start, end, pgoff only makes matters worse. We simplify things by replacing vmg->vma with vmg->middle and never changing it - this is always either NULL (for a new VMA) or the VMA being modified between vmg->prev and vmg->next. We further simplify by placing the merged VMA in a new vmg->target field - whether case b. above is the case or not. The reader of the code can now simply rely on vmg->middle being the middle VMA and vmg->target being the ultimately merged VMA. We additionally tackle the confusing cases where we 'adjust' VMAs other than the one we ultimately return as the merged VMA (this includes case b. above). These are: (1) merge <-----------> |------||--------| |------------|---| | prev || middle | -> | target | m | |------||--------| |------------|---| In which case middle must be adjusted so middle->vm_start is increased as well as performing the merge. (2) (equivalent to case b. above) <-------------> |---------||------| |---|-------------| | middle || next | -> | m | target | |---------||------| |---|-------------| In which case next must be adjusted so next->vm_start is decreased as well as performing the merge. This cases have previously been performed by calculating and passing around a dubious and confusing 'adj_start' parameter along side a pointer to an 'adjust' VMA indicating which VMA requires additional adjustment (middle in case 1 and next in case 2). With the VMG structure in place we are able to avoid this by simply setting a merge flag to describe each case: (1) Sets the vmg->__adjust_middle_start flag (2) Sets the vmg->__adjust_next_start flag By doing so it turns out we can vastly simplify the logic and calculate what is required to perform the operation. Taken together the refactorings make it far easier to understand what is being done even in these more confusing cases, make the code far more maintainable, debuggable, and testable, providing more internal state indicating what is happening in the merge operation. The changes have no functional net impact on the merge operation and everything should still behave as it did before. This patch (of 5): The merge code, while much improved, still has a number of points of confusion. As part of a broader series cleaning this up to make this more maintainable, we start by addressing some confusion around vma_merge_struct fields. So far, the caller either provides no vmg->vma (a new VMA) or supplies the existing VMA which is being altered, setting vmg->start,end,pgoff to the proposed VMA dimensions. vmg->vma is then updated, as are vmg->start,end,pgoff as the merge process proceeds and the appropriate merge strategy is determined. This is rather confusing, as vmg->vma starts off as the 'middle' VMA between vmg->prev,next, but becomes the 'target' VMA, except in one specific edge case (merge next, shrink middle). Int his patch we introduce vmg->middle to describe the VMA that is between vmg->prev and vmg->next, and does NOT change during the merge operation. We replace vmg->vma with vmg->target, and use this only during the merge operation itself. Aside from the merge right, shrink middle case, this becomes the VMA that forms the basis of the VMA that is returned. This edge case can be addressed in a future commit. We also add a number of comments to explain what is going on. Finally, we adjust the ASCII diagrams showing each merge case in vma_merge_existing_range() to be clearer - the arrow range previously showed the vmg->start, end spanned area, but it is clearer to change this to show the final merged VMA. This patch has no change in functional behaviour. Link: https://lkml.kernel.org/r/cover.1738326519.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/4dfe60f1419d55e5d0516f56349695d73a57184c.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/debug: introduce VM_WARN_ON_VMG() to dump VMA merge stateLorenzo Stoakes
Patch series "mm/debug: introduce and use VM_WARN_ON_VMG()". We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set, during VMA merge operations to ensure state is as expected. However, when syzkaller or the like encounters these asserts, often the information provided by the report is insufficient to narrow down what the problem is. We noticed this recently in [0], where a non-repro issue resisted debugging due to simply not having sufficient information to go on. This series improves the situation by providing VM_WARN_ON_VMG() which acts like VM_WARN_ON() (i.e. only actually being invoked if CONFIG_DEBUG_VM is set), while dumping significant information about the VMA merge state, the mm_struct describing the virtual address space, all associated VMAs and, if CONFIG_DEBUG_VM_MAPLE_TREE is set, the associated maple tree. [0]:https://lore.kernel.org/all/6774c98f.050a0220.25abdd.0991.GAE@google.com/ This patch (of 2): We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set, during VMA merge operations to ensure state is as expected. However, when syzkaller or the like encounters these asserts, often the information provided by the report is insufficient to narrow down what the problem is. This might not be so much of an issue if the reported problem is reproducible, but if it is a rarely encountered race or some other case which precludes a repro, it is a very big problem (see [0] for the motivating case). It is therefore sensible to provide a means by which we can easily and conveniently dump a lot more information in these circumstances. The aggregation of merge state into a single struct threaded through the operation makes this trivial - we can simply introduce a variant on VM_WARN_ON() which takes the VMA merge state object (vmg) and use that to dump information. This patch therefore introduces VM_WARN_ON_VMG() which provides this functionality. It additionally dumps full mm state, VMA state for each of the three VMAs the vmg contains (prev, next, vma) and if CONFIG_DEBUG_VM_MAPLE_TREE is enabled, dumps the maple tree from the provided VMA iterator if non-NULL. This patch has no functional impact if CONFIG_DEBUG_VM is not set. [0]:https://lore.kernel.org/all/6774c98f.050a0220.25abdd.0991.GAE@google.com/ Link: https://lkml.kernel.org/r/cover.1735932169.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/13b09b52d4d103ee86acaf0ae612539648ae29e0.1735932169.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: open-code page_folio() in dump_page()Matthew Wilcox (Oracle)
page_folio() calls page_fixed_fake_head() which will misidentify this page as being a fake head and load off the end of 'precise'. We may have a pointer to a fake head, but that's OK because it contains the right information for dump_page(). gcc-15 is smart enough to catch this with -Warray-bounds: In function 'page_fixed_fake_head', inlined from '_compound_head' at ../include/linux/page-flags.h:251:24, inlined from '__dump_page' at ../mm/debug.c:123:11: ../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is outside +array bounds of 'struct page[1]' [-Warray-bounds=] Link: https://lkml.kernel.org/r/20241125201721.2963278-2-willy@infradead.org Fixes: fae7d834c43c ("mm: add __dump_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: support only one page_type per pageMatthew Wilcox (Oracle)
By using a few values in the top byte, users of page_type can store up to 24 bits of additional data in page_type. It also reduces the code size as (with replacement of READ_ONCE() with data_race()), the kernel can check just a single byte. eg: ffffffff811e3a79: 8b 47 30 mov 0x30(%rdi),%eax ffffffff811e3a7c: 55 push %rbp ffffffff811e3a7d: 48 89 e5 mov %rsp,%rbp ffffffff811e3a80: 25 00 00 00 82 and $0x82000000,%eax ffffffff811e3a85: 3d 00 00 00 80 cmp $0x80000000,%eax ffffffff811e3a8a: 74 4d je ffffffff811e3ad9 <folio_mapping+0x69> becomes: ffffffff811e3a69: 80 7f 33 f5 cmpb $0xf5,0x33(%rdi) ffffffff811e3a6d: 55 push %rbp ffffffff811e3a6e: 48 89 e5 mov %rsp,%rbp ffffffff811e3a71: 74 4d je ffffffff811e3ac0 <folio_mapping+0x60> replacing three instructions with one. [wangkefeng.wang@huawei.com: fix ubsan warnings] Link: https://lkml.kernel.org/r/2d19c48a-c550-4345-bf36-d05cd303c5de@huawei.com Link: https://lkml.kernel.org/r/20240821173914.2270383-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03printf: remove %pGt supportMatthew Wilcox (Oracle)
Patch series "Increase the number of bits available in page_type". Kent wants more than 16 bits in page_type, so I resurrected this old patch and expanded it a bit. It's a bit more efficient than our current scheme (1 4-byte insn vs 3 insns of 13 bytes total) to test a single page type. This patch (of 4): An upcoming patch will convert page type from being a bitfield to a single byte, so we will not be able to use %pG to print the page type any more. The printing of the symbolic name will be restored in that patch. Link: https://lkml.kernel.org/r/20240821173914.2270383-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240821173914.2270383-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/debug: print only page mapcount (excluding folio entire mapcount) in ↵David Hildenbrand
__dump_folio() Let's simplify and only print the page mapcount: we already print the large folio mapcount and the entire folio mapcount for large folios separately; that should be sufficient to figure out what's happening. While at it, print the page mapcount also if it had an underflow, filtering out only typed pages. Link: https://lkml.kernel.org/r/20240409192301.907377-18-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: track mapcount of large folios in single valueDavid Hildenbrand
Let's track the mapcount of large folios in a single value. The mapcount of a large folio currently corresponds to the sum of the entire mapcount and all page mapcounts. This sum is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and more widely used, we want to avoid looping over all pages of a folio just to obtain the mapcount of large folios. The comment "In the common case, avoid the loop when no pages mapped by PTE" in folio_total_mapcount() does no longer hold for mTHP that are always mapped by PTE. Further, we are planning on using folio_mapcount() more frequently, and might even want to remove page mapcounts for large folios in some kernel configs. Therefore, allow for reading the mapcount of large folios efficiently and atomically without looping over any pages. Maintain the mapcount also for hugetlb pages for simplicity. Use the new mapcount to implement folio_mapcount() and folio_mapped(). Make page_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped(). _nr_pages_mapped is now only used in rmap code and for debugging purposes. Keep folio_nr_pages_mapped() around, but document that its use should be limited to rmap internals and debugging purposes. This change implies one additional atomic add/sub whenever mapping/unmapping (parts of) a large folio. As we now batch RMAP operations for PTE-mapped THP during fork(), during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust the large mapcount for a PTE batch only once, the added overhead in the common case is small. Only when unmapping individual pages of a large folio (e.g., during COW), the overhead might be bigger in comparison, but it's essentially one additional atomic operation. Note that before the new mapcount would overflow, already our refcount would overflow: each mapping requires a folio reference. Extend the focumentation of folio_mapcount(). Link: https://lkml.kernel.org/r/20240409192301.907377-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: switch mm->get_unmapped_area() to a flagRick Edgecombe
The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: improve dumping of mapcount and page_typeMatthew Wilcox (Oracle)
For pages that have a page_type, set the mapcount to 0, which will reduce the confusion in people reading page dumps ("Why does this page have a mapcount of -128?"). Now that hugetlbfs is a page_type, read the entire_mapcount for any large folio; this is fine for all folios as no user reuses the entire_mapcount field. For pages which do not have a page type, do not print it to reduce clutter. Link: https://lkml.kernel.org/r/20240321142448.1645400-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06mm: make dump_page() take a const argumentMatthew Wilcox (Oracle)
Now that __dump_page() takes a const argument, we can make dump_page() take a const struct page too. Link: https://lkml.kernel.org/r/20240227192337.757313-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06mm: add __dump_folio()Matthew Wilcox (Oracle)
Turn __dump_page() into a wrapper around __dump_folio(). Snapshot the page & folio into a stack variable so we don't hit BUG_ON() if an allocation is freed under us and what was a folio pointer becomes a pointer to a tail page. [willy@infradead.org: fix build issue] Link: https://lkml.kernel.org/r/ZeAKCyTn_xS3O9cE@casper.infradead.org [willy@infradead.org: fix __dump_folio] Link: https://lkml.kernel.org/r/ZeJJegP8zM7S9GTy@casper.infradead.org [willy@infradead.org: fix pointer confusion] Link: https://lkml.kernel.org/r/ZeYa00ixxC4k1ot-@casper.infradead.org [akpm@linux-foundation.org: s/printk/pr_warn/] Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09mm: update validate_mm() to use vma iteratorLiam R. Howlett
Use the vma iterator in the validation code and combine the code to check the maple tree into the main validate_mm() function. Introduce a new function vma_iter_dump_tree() to dump the maple tree in hex layout. Replace all calls to validate_mm_mt() with validate_mm(). [Liam.Howlett@oracle.com: update validate_mm() to use vma iterator CONFIG flag] Link: https://lkml.kernel.org/r/20230606183538.588190-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230518145544.1722059-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm/debug: use %pGt to display page_type in dump_page()Hyeonggon Yoo
Some page flags are stored in page_type rather than ->flags field. Use newly introduced page type %pGt in dump_page(). Below are some examples: page:00000000da7184dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x101cb3 flags: 0x2ffff0000000000(node=0|zone=2|lastcpupid=0xffff) page_type: 0xffffffff() raw: 02ffff0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: newly allocated page page:00000000da7184dd refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x101cb3 flags: 0x2ffff0000000000(node=0|zone=2|lastcpupid=0xffff) page_type: 0xffffff7f(buddy) raw: 02ffff0000000000 ffff88813fff8e80 ffff88813fff8e80 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000 page dumped because: freed page page:0000000042202316 refcount:3 mapcount:2 mapping:0000000000000000 index:0x7f634722a pfn:0x11994e memcg:ffff888100135000 anon flags: 0x2ffff0000080024(uptodate|active|swapbacked|node=0|zone=2|lastcpupid=0xffff) page_type: 0x1() raw: 02ffff0000080024 0000000000000000 dead000000000122 ffff8881193398f1 raw: 00000007f634722a 0000000000000000 0000000300000001 ffff888100135000 page dumped because: user-mapped page Link: https://lkml.kernel.org/r/20230130042514.2418-4-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm, printk: introduce new format %pGt for page_typeHyeonggon Yoo
%pGp format is used to display 'flags' field of a struct page. However, some page flags (i.e. PG_buddy, see page-flags.h for more details) are stored in page_type field. To display human-readable output of page_type, introduce %pGt format. It is important to note the meaning of bits are different in page_type. if page_type is 0xffffffff, no flags are set. Setting PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit actually means setting a flag. Bits in page_type are inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values. if it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part] Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09mm: export dump_mm()Suren Baghdasaryan
mmap_assert_write_locked() is used in vm_flags modifiers. Because mmap_assert_write_locked() uses dump_mm() and vm_flags are sometimes modified from inside a module, it's necessary to export dump_mm() function. Link: https://lkml.kernel.org/r/20230126193752.297968-8-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm/debug: remove call to head_compound_mapcount()Matthew Wilcox (Oracle)
Call folio_entire_mapcount() instead. Link: https://lkml.kernel.org/r/20230111142915.1001531-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: convert head_subpages_mapcount() into folio_nr_pages_mapped()Matthew Wilcox (Oracle)
Calling this 'mapcount' is confusing since mapcount is usually the number of times something is mapped; instead this is the number of mapped pages. It's also better to enforce that this is a folio rather than a head page. Move folio_nr_pages_mapped() into mm/internal.h since this is not something we want device drivers or filesystems poking at. Get rid of folio_subpages_mapcount_ptr() and use folio->_nr_pages_mapped directly. Link: https://lkml.kernel.org/r/20230111142915.1001531-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: remove folio_pincount_ptr() and head_compound_pincount()Matthew Wilcox (Oracle)
We can use folio->_pincount directly, since all users are guarded by tests of compound/large. Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm,thp,rmap: simplify compound page mapcount handlingHugh Dickins
Compound page (folio) mapcount calculations have been different for anon and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. And each huge mapping and unmapping of a file (or shmem) THP involved atomically incrementing and decrementing the mapcount of every subpage of that huge page, dirtying many struct page cachelines. Add subpages_mapcount field to the struct folio and first tail page, so that the total of subpage mapcounts is available in one place near the head: then page_mapcount() and total_mapcount() and page_mapped(), and their folio equivalents, are so quick that anon and file and hugetlb don't need to be optimized differently. Delete the unloved PageDoubleMap. page_add and page_remove rmap functions must now maintain the subpages_mapcount as well as the subpage _mapcount, when dealing with pte mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and NR_FILE_MAPPED statistics still needs reading through the subpages, using nr_subpages_unmapped() - but only when first or last pmd mapping finds subpages_mapcount raised (double-map case, not the common case). But are those counts (used to decide when to split an anon THP, and in vmscan's pagecache_reclaimable heuristic) correctly maintained? Not quite: since page_remove_rmap() (and also split_huge_pmd()) is often called without page lock, there can be races when a subpage pte mapcount 0<->1 while compound pmd mapcount 0<->1 is scanning - races which the previous implementation had prevented. The statistics might become inaccurate, and even drift down until they underflow through 0. That is not good enough, but is better dealt with in a followup patch. Update a few comments on first and second tail page overlaid fields. hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but subpages_mapcount and compound_pincount are already correctly at 0, so delete its reinitialization of compound_pincount. A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took 18 seconds on small pages, and used to take 1 second on huge pages, but now takes 119 milliseconds on huge pages. Mapping by pmds a second time used to take 860ms and now takes 92ms; mapping by pmds after mapping by ptes (when the scan is needed) used to take 870ms and now takes 495ms. But there might be some benchmarks which would show a slowdown, because tail struct pages now fall out of cache until final freeing checks them. Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: remove the vma linked listLiam R. Howlett
Replace any vm_next use with vma_find(). Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the maple tree. Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap(). At the same time, alter the loop to be more compact. Now that free_pgtables() and unmap_vmas() take a maple tree as an argument, rearrange do_mas_align_munmap() to use the new tree to hold the vmas to remove. Remove __vma_link_list() and __vma_unlink_list() as they are exclusively used to update the linked list. Drop linked list update from __insert_vm_struct(). Rework validation of tree as it was depending on the linked list. [yang.lee@linux.alibaba.com: fix one kernel-doc comment] Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949 Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.comLink: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm: remove vmacacheLiam R. Howlett
By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-03-24mm: unexport page_init_poisonChristoph Hellwig
page_init_poison is only used in core MM code, so unexport it. Link: https://lkml.kernel.org/r/20220207063446.1833404-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-21mm: Turn head_compound_mapcount() into folio_entire_mapcount()Matthew Wilcox (Oracle)
Adjust documentation to be more clear. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm: Make compound_pincount always availableMatthew Wilcox (Oracle)
Move compound_pincount from the third page to the second page, which means it's available for all compound pages. That lets us delete hpage_pincount_available(). On 32-bit systems, there isn't enough space for both compound_pincount and compound_nr in the second page (it would collide with page->private, which is in use for pages in the swap cache), so revert the optimisation of storing both compound_order and compound_nr on 32-bit systems. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-15mm,fs: split dump_mapping() out from dump_page()Matthew Wilcox (Oracle)
dump_mapping() is a big chunk of dump_page(), and it'd be handy to be able to call it when we don't have a struct page. Split it out and move it to fs/inode.c. Take the opportunity to simplify some of the debug messages a little. Link: https://lkml.kernel.org/r/20211121121056.2870061-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06mm/migrate: de-duplicate migrate_reason stringsJohn Hubbard
In order to remove the need to manually keep three different files in synch, provide a common definition of the mapping between enum migrate_reason, and the associated strings for each enum item. 1. Use the tracing system's mapping of enums to strings, by redefining and reusing the MIGRATE_REASON and supporting macros, and using that to populate the string array in mm/debug.c. 2. Move enum migrate_reason to migrate_mode.h. This is not strictly necessary for this patch, but migrate mode and migrate reason go together, so this will slightly clarify things. Link: https://lkml.kernel.org/r/20210922041755.141817-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Weizhao Ouyang <o451686892@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-03Merge branch 'per_signal_struct_coredumps-for-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull per signal_struct coredumps from Eric Biederman: "Current coredumps are mixed up with the exit code, the signal handling code, and the ptrace code making coredumps much more complicated than necessary and difficult to follow. This series of changes starts with ptrace_stop and cleans it up, making it easier to follow what is happening in ptrace_stop. Then cleans up the exec interactions with coredumps. Then cleans up the coredump interactions with exit. Finally the coredump interactions with the signal handling code is cleaned up. The first and last changes are bug fixes for minor bugs. I believe the fact that vfork followed by execve can kill the process the called vfork if exec fails is sufficient justification to change the userspace visible behavior. In previous discussions some of these changes were organized differently and individually appeared to make the code base worse. As currently written I believe they all stand on their own as cleanups and bug fixes. Which means that even if the worst should happen and the last change needs to be reverted for some unimaginable reason, the code base will still be improved. If the worst does not happen there are a more cleanups that can be made. Signals that generate coredumps can easily become eligible for short circuit delivery in complete_signal. The entire rendezvous for generating a coredump can move into get_signal. The function force_sig_info_to_task be written in a way that does not modify the signal handling state of the target task (because coredumps are eligible for short circuit delivery). Many of these future cleanups can be done another way but nothing so cleanly as if coredumps become per signal_struct" * 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: coredump: Limit coredumps to a single thread group coredump: Don't perform any cleanups before dumping core exit: Factor coredump_exit_mm out of exit_mm exec: Check for a pending fatal signal instead of core_state ptrace: Remove the unnecessary arguments from arch_ptrace_stop signal: Remove the bogus sigkill_pending in ptrace_stop
2021-11-02Merge tag 'printk-for-5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Extend %pGp print format to print hex value of the page flags - Use kvmalloc instead of kmalloc to allocate devkmsg buffers - Misc cleanup and warning fixes * tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: vsprintf: Update %pGp documentation about that it prints hex value lib/vsprintf.c: Amend static asserts for format specifier flags vsprintf: Make %pGp print the hex value test_printf: Append strings more efficiently test_printf: Remove custom appending of '|' test_printf: Remove separate page_flags variable test_printf: Make pft array const ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK printk: use gnu_printf format attribute for printk_sprint() printk: avoid -Wsometimes-uninitialized warning printk: use kvmalloc instead of kmalloc for devkmsg_user
2021-10-27vsprintf: Make %pGp print the hex valueMatthew Wilcox (Oracle)
All existing users of %pGp want the hex value as well as the decoded flag names. This looks awkward (passing the same parameter to printf twice), so move that functionality into the core. If we want, we can make that optional with flag arguments to %pGp in the future. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20211019142621.2810043-6-willy@infradead.org
2021-10-08coredump: Limit coredumps to a single thread groupEric W. Biederman
Today when a signal is delivered with a handler of SIG_DFL whose default behavior is to generate a core dump not only that process but every process that shares the mm is killed. In the case of vfork this looks like a real world problem. Consider the following well defined sequence. if (vfork() == 0) { execve(...); _exit(EXIT_FAILURE); } If a signal that generates a core dump is received after vfork but before the execve changes the mm the process that called vfork will also be killed (as the mm is shared). Similarly if the execve fails after the point of no return the kernel delivers SIGSEGV which will kill both the exec'ing process and because the mm is shared the process that called vfork as well. As far as I can tell this behavior is a violation of people's reasonable expectations, POSIX, and is unnecessarily fragile when the system is low on memory. Solve this by making a userspace visible change to only kill a single process/thread group. This is possible because Jann Horn recently modified[1] the coredump code so that the mm can safely be modified while the coredump is happening. With LinuxThreads long gone I don't expect anyone to have a notice this behavior change in practice. To accomplish this move the core_state pointer from mm_struct to signal_struct, which allows different thread groups to coredump simultatenously. In zap_threads remove the work to kill anything except for the current thread group. v2: Remove core_state from the VM_BUG_ON_MM print to fix compile failure when CONFIG_DEBUG_VM is enabled. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> [1] a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3") History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133 Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-09-24mm/debug: sync up latest migrate_reason to migrate_reason_namesWeizhao Ouyang
Sync up MR_DEMOTION to migrate_reason_names and add a synch prompt. Link: https://lkml.kernel.org/r/20210921064553.293905-3-o451686892@gmail.com Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim") Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24mm/debug: sync up MR_CONTIG_RANGE and MR_LONGTERM_PINWeizhao Ouyang
Sync up MR_CONTIG_RANGE and MR_LONGTERM_PIN to migrate_reason_names. Link: https://lkml.kernel.org/r/20210921064553.293905-2-o451686892@gmail.com Fixes: 310253514bbf ("mm/migrate: rename migration reason MR_CMA to MR_CONTIG_RANGE") Fixes: d1e153fea2a8 ("mm/gup: migrate pinned pages out of movable zone") Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm/debug: factor PagePoisoned out of __dump_pageMatthew Wilcox (Oracle)
Move the PagePoisoned test into dump_page(). Skip the hex print for poisoned pages -- we know they're full of ffffffff. Move the reason printing from __dump_page() to dump_page(). Link: https://lkml.kernel.org/r/20210416231531.2521383-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29mm: make __dump_page staticMatthew Wilcox (Oracle)
Patch series "Constify struct page arguments". While working on various solutions to the 32-bit struct page size regression, one of the problems I found was the networking stack expects to be able to pass const struct page pointers around, and the mm doesn't provide a lot of const-friendly functions to call. The root tangle of problems is that a lot of functions call VM_BUG_ON_PAGE(), which calls dump_page(), which calls a lot of functions which don't take a const struct page (but could be const). This patch (of 6): The only caller of __dump_page() now opencodes dump_page(), so remove it as an externally visible symbol. Link: https://lkml.kernel.org/r/20210416231531.2521383-1-willy@infradead.org Link: https://lkml.kernel.org/r/20210416231531.2521383-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24mm/debug: improve memcg debuggingMatthew Wilcox (Oracle)
The memcg_data is only valid on the head page, not the tail pages. Change the format and location of the printout within the dump to match the other parts of struct page better. Link: https://lkml.kernel.org/r/20210114190200.1894484-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-02mm: memcontrol: Use helpers to read page's memcg dataRoman Gushchin
Patch series "mm: allow mapping accounted kernel pages to userspace", v6. Currently a non-slab kernel page which has been charged to a memory cgroup can't be mapped to userspace. The underlying reason is simple: PageKmemcg flag is defined as a page type (like buddy, offline, etc), so it takes a bit from a page->mapped counter. Pages with a type set can't be mapped to userspace. But in general the kmemcg flag has nothing to do with mapping to userspace. It only means that the page has been accounted by the page allocator, so it has to be properly uncharged on release. Some bpf maps are mapping the vmalloc-based memory to userspace, and their memory can't be accounted because of this implementation detail. This patchset removes this limitation by moving the PageKmemcg flag into one of the free bits of the page->mem_cgroup pointer. Also it formalizes accesses to the page->mem_cgroup and page->obj_cgroups using new helpers, adds several checks and removes a couple of obsolete functions. As the result the code became more robust with fewer open-coded bit tricks. This patch (of 4): Currently there are many open-coded reads of the page->mem_cgroup pointer, as well as a couple of read helpers, which are barely used. It creates an obstacle on a way to reuse some bits of the pointer for storing additional bits of information. In fact, we already do this for slab pages, where the last bit indicates that a pointer has an attached vector of objcg pointers instead of a regular memcg pointer. This commits uses 2 existing helpers and introduces a new helper to converts all read sides to calls of these helpers: struct mem_cgroup *page_memcg(struct page *page); struct mem_cgroup *page_memcg_rcu(struct page *page); struct mem_cgroup *page_memcg_check(struct page *page); page_memcg_check() is intended to be used in cases when the page can be a slab page and have a memcg pointer pointing at objcg vector. It does check the lowest bit, and if set, returns NULL. page_memcg() contains a VM_BUG_ON_PAGE() check for the page not being a slab page. To make sure nobody uses a direct access, struct page's mem_cgroup/obj_cgroups is converted to unsigned long memcg_data. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
2020-10-13mm, dump_page: rename head_mapcount() --> head_compound_mapcount()John Hubbard
Rename head_pincount() --> head_compound_pincount(). These names are more accurate (or less misleading) than the original ones. Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Qian Cai <cai@lca.pw> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: https://lkml.kernel.org/r/20200807183358.105097-1-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/debug.c: do not dereference i_ino blindlyMatthew Wilcox (Oracle)
__dump_page() checks i_dentry is fetchable and i_ino is earlier in the struct than i_ino, so it ought to work fine, but it's possible that struct randomisation has reordered i_ino after i_dentry and the pointer is just wild enough that i_dentry is fetchable and i_ino isn't. Also print the inode number if the dentry is invalid. Reported-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lkml.kernel.org/r/20200819185710.28180-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, dump_page: do not crash with bad compound_mapcount()John Hubbard
If a compound page is being split while dump_page() is being run on that page, we can end up calling compound_mapcount() on a page that is no longer compound. This leads to a crash (already seen at least once in the field), due to the VM_BUG_ON_PAGE() assertion inside compound_mapcount(). (The above is from Matthew Wilcox's analysis of Qian Cai's bug report.) A similar problem is possible, via compound_pincount() instead of compound_mapcount(). In order to avoid this kind of crash, make dump_page() slightly more robust, by providing a pair of simpler routines that don't contain assertions: head_mapcount() and head_pincount(). For debug tools, we don't want to go *too* far in this direction, but this is a simple small fix, and the crash has already been seen, so it's a good trade-off. Reported-by: Qian Cai <cai@lca.pw> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200804214807.169256-1-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/debug: print hashed address of struct pageMatthew Wilcox (Oracle)
The actual address of the struct page isn't particularly helpful, while the hashed address helps match with other messages elsewhere. Add the PFN that the page refers to in order to help diagnose problems where the page is improperly aligned for the purpose. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200709202117.7216-7-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/debug: print the inode number in dump_pageMatthew Wilcox (Oracle)
The inode number helps correlate this page with debug messages elsewhere in the kernel. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200709202117.7216-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/debug: switch dump_page to get_kernel_nofaultMatthew Wilcox (Oracle)
This is simpler to use than copy_from_kernel_nofault(). Also make some of the related error messages less verbose. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200709202117.7216-5-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/debug: print head flags in dump_pageMatthew Wilcox (Oracle)
Tail page flags contain very little useful information. Print the head page's flags instead. While the flags will contain "head" for tail pages, this should not be too confusing as the previous line starts with the word "head:" and so the flags should be interpreted as belonging to the head page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200709202117.7216-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/debug: dump compound page information on a second lineMatthew Wilcox (Oracle)
Simplify both the implementation and the output by splitting all the compound page information onto a second line. Reported-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200709202117.7216-3-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>