<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/mm/vmscan.c, branch v6.10-rc4</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v6.10-rc4</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v6.10-rc4'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2024-06-06T02:19:23+00:00</updated>
<entry>
<title>mm: drop the 'anon_' prefix for swap-out mTHP counters</title>
<updated>2024-06-06T02:19:23+00:00</updated>
<author>
<name>Baolin Wang</name>
<email>baolin.wang@linux.alibaba.com</email>
</author>
<published>2024-05-23T02:36:39+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=0d648dd5c899f33154b98a6aef6e3dab0f4de613'/>
<id>urn:sha1:0d648dd5c899f33154b98a6aef6e3dab0f4de613</id>
<content type='text'>
The mTHP swap related counters: 'anon_swpout' and 'anon_swpout_fallback'
are confusing with an 'anon_' prefix, since the shmem can swap out
non-anonymous pages.  So drop the 'anon_' prefix to keep consistent with
the old swap counter names.

This is needed in 6.10-rcX to avoid having an inconsistent ABI out in the
field.

Link: https://lkml.kernel.org/r/7a8989c13299920d7589007a30065c3e2c19f0e0.1716431702.git.baolin.wang@linux.alibaba.com
Fixes: d0f048ac39f6 ("mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters")
Fixes: 42248b9d34ea ("mm: add docs for per-order mTHP counters and transhuge_page ABI")
Signed-off-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Suggested-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Acked-by: Barry Song &lt;baohua@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Lance Yang &lt;ioworker0@gmail.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmscan: remove ignore_references argument of reclaim_folio_list()</title>
<updated>2024-05-07T17:37:02+00:00</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2024-04-29T22:44:51+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=c961bddb7df086a557dfcd8c1846f49e75d62888'/>
<id>urn:sha1:c961bddb7df086a557dfcd8c1846f49e75d62888</id>
<content type='text'>
All reclaim_folio_list() callers are passing 'true' for
'ignore_references' parameter.  In other words, the parameter is not
really being used.  Simplify the code by removing the parameter.

Link: https://lkml.kernel.org/r/20240429224451.67081-5-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmscan: remove ignore_references argument of reclaim_pages()</title>
<updated>2024-05-07T17:37:02+00:00</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2024-04-29T22:44:50+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=14f5be2a2d9bb7eb21807b6e62de73dd24082b73'/>
<id>urn:sha1:14f5be2a2d9bb7eb21807b6e62de73dd24082b73</id>
<content type='text'>
All reclaim_pages() callers are setting 'ignore_references' parameter
'true'.  In other words, the parameter is not really being used.  Remove
the argument to make it simple.

Link: https://lkml.kernel.org/r/20240429224451.67081-4-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters</title>
<updated>2024-05-06T00:53:35+00:00</updated>
<author>
<name>Barry Song</name>
<email>v-songbaohua@oppo.com</email>
</author>
<published>2024-04-12T11:48:56+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=d0f048ac39f6a71566d3f49a5922dfd7fa0d585b'/>
<id>urn:sha1:d0f048ac39f6a71566d3f49a5922dfd7fa0d585b</id>
<content type='text'>
This helps to display the fragmentation situation of the swapfile, knowing
the proportion of how much we haven't split large folios.  So far, we only
support non-split swapout for anon memory, with the possibility of
expanding to shmem in the future.  So, we add the "anon" prefix to the
counter names.

Link: https://lkml.kernel.org/r/20240412114858.407208-3-21cnbao@gmail.com
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Reviewed-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Domenico Cerasuolo &lt;cerasuolodomenico@gmail.com&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: vmscan: avoid split during shrink_folio_list()</title>
<updated>2024-04-26T03:56:38+00:00</updated>
<author>
<name>Ryan Roberts</name>
<email>ryan.roberts@arm.com</email>
</author>
<published>2024-04-08T18:39:45+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=5ed890ce5147855c5360affd5e5419ed68a54100'/>
<id>urn:sha1:5ed890ce5147855c5360affd5e5419ed68a54100</id>
<content type='text'>
Now that swap supports storing all mTHP sizes, avoid splitting large
folios before swap-out.  This benefits performance of the swap-out path by
eliding split_folio_to_list(), which is expensive, and also sets us up for
swapping in large folios in a future series.

If the folio is partially mapped, we continue to split it since we want to
avoid the extra IO overhead and storage of writing out pages
uneccessarily.

THP_SWPOUT and THP_SWPOUT_FALLBACK counters should continue to count
events only for PMD-mappable folios to avoid user confusion.  THP_SWPOUT
already has the appropriate guard.  Add a guard for THP_SWPOUT_FALLBACK. 
It may be appropriate to add per-size counters in future.

Link: https://lkml.kernel.org/r/20240408183946.2991168-7-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Barry Song &lt;21cnbao@gmail.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Lance Yang &lt;ioworker0@gmail.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: hold PTL from the first PTE while reclaiming a large folio</title>
<updated>2024-04-26T03:56:08+00:00</updated>
<author>
<name>Barry Song</name>
<email>v-songbaohua@oppo.com</email>
</author>
<published>2024-03-06T09:52:19+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=73bc32875ee9b1881dd780308c6793fe463fe803'/>
<id>urn:sha1:73bc32875ee9b1881dd780308c6793fe463fe803</id>
<content type='text'>
Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
modifications preceded by pte clear.  While iterating over PTEs of a large
folio, it only starts acquiring PTL from the first valid (present) PTE. 
PTE modifications can temporarily set PTEs to pte_none.  Consequently, the
initial PTEs of a large folio might be skipped in try_to_unmap_one().

For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
try_to_unmap_one().

So folio will be still mapped, the folio fails to be reclaimed and is put
back to LRU in this round.

This also breaks up PTEs optimization such as CONT-PTE on this large folio
and may lead to accident folio_split() afterwards.  And since a part of
PTEs are now swap entries, accessing those parts will introduce overhead -
do_swap_page.  Although the kernel can withstand all of the above issues,
the situation still seems quite awkward and warrants making it more ideal.

The same race also occurs with small folios, but they have only one PTE,
thus, it won't be possible for them to be partially unmapped.

This patch holds PTL from PTE0, allowing us to avoid reading PTE values
that are in the process of being transformed.  With stable PTE values, we
can ensure that this large folio is either completely reclaimed or that
all PTEs remain untouched in this round.

A corner case is that if we hold PTL from PTE0 and most initial PTEs have
been really unmapped before that, we may increase the duration of holding
PTL.  Thus we only apply this optimization to folios which are still
entirely mapped (not in deferred_split list).

[akpm@linux-foundation.org: rewrap comment, per Matthew]
Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Huang, Ying &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: record the migration reason for struct migration_target_control</title>
<updated>2024-04-26T03:56:06+00:00</updated>
<author>
<name>Baolin Wang</name>
<email>baolin.wang@linux.alibaba.com</email>
</author>
<published>2024-03-06T10:13:26+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=e42dfe4e0a51b476dcc6f1461c51fdb1b76573aa'/>
<id>urn:sha1:e42dfe4e0a51b476dcc6f1461c51fdb1b76573aa</id>
<content type='text'>
Patch series "make the hugetlb migration strategy consistent", v2.

As discussed in previous thread [1], there is an inconsistency when
handling hugetlb migration.  When handling the migration of freed hugetlb,
it prevents fallback to other NUMA nodes in
alloc_and_dissolve_hugetlb_folio().  However, when dealing with in-use
hugetlb, it allows fallback to other NUMA nodes in
alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool
and might result in unexpected failures when node bound workloads doesn't
get what is asssumed available.

This patchset tries to make the hugetlb migration strategy more clear
and consistent. Please find details in each patch.

[1]
https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/


This patch (of 2):

To support different hugetlb allocation strategies during hugetlb
migration based on various migration reasons, record the migration reason
in the migration_target_control structure as a preparation.

Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Naoya Horiguchi &lt;nao.horiguchi@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'master' into mm-stable</title>
<updated>2024-03-18T16:47:52+00:00</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2024-03-18T16:47:52+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=5e28061128646febc71c0942609619e29f41ff00'/>
<id>urn:sha1:5e28061128646febc71c0942609619e29f41ff00</id>
<content type='text'>
</content>
</entry>
<entry>
<title>mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure</title>
<updated>2024-03-12T20:07:17+00:00</updated>
<author>
<name>Byungchul Park</name>
<email>byungchul@sk.com</email>
</author>
<published>2024-03-04T06:27:37+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=d221dd5fea6415cc6c7acb7d68c8e5dfd61c3902'/>
<id>urn:sha1:d221dd5fea6415cc6c7acb7d68c8e5dfd61c3902</id>
<content type='text'>
With cache_trim_mode on, reclaim logic doesn't bother reclaiming anon
pages.  However, it should be more careful to use the mode because it's
going to prevent anon pages from being reclaimed even if there are a huge
number of anon pages that are cold and should be reclaimed.  Even worse,
that leads kswapd_failures to reach MAX_RECLAIM_RETRIES and stopping
kswapd from functioning until direct reclaim eventually works to resume
kswapd.

So kswapd needs to retry its scan priority loop with cache_trim_mode off
again if the mode doesn't work for reclaim.

The problematic behavior can be reproduced by:

   CONFIG_NUMA_BALANCING enabled
   sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
   numa node0 (8GB local memory, 16 CPUs)
   numa node1 (8GB slow tier memory, no CPUs)

   Sequence:

   1) echo 3 &gt; /proc/sys/vm/drop_caches
   2) To emulate the system with full of cold memory in local DRAM, run
      the following dummy program and never touch the region:

         mmap(0, 8 * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
              MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);

   3) Run any memory intensive work e.g. XSBench.
   4) Check if numa balancing is working e.i. promotion/demotion.
   5) Iterate 1) ~ 4) until numa balancing stops.

With this, you could see that promotion/demotion are not working because
kswapd has stopped due to -&gt;kswapd_failures &gt;= MAX_RECLAIM_RETRIES.

Interesting vmstat delta's differences between before and after are like:

   +-----------------------+-------------------------------+
   | interesting vmstat    | before        | after         |
   +-----------------------+-------------------------------+
   | nr_inactive_anon      | 321935        | 1664772       |
   | nr_active_anon        | 1780700       | 437834        |
   | nr_inactive_file      | 30425         | 40882         |
   | nr_active_file        | 14961         | 3012          |
   | pgpromote_success     | 356           | 1293122       |
   | pgpromote_candidate   | 21953245      | 1824148       |
   | pgactivate            | 1844523       | 3311907       |
   | pgdeactivate          | 50634         | 1554069       |
   | pgfault               | 31100294      | 6518806       |
   | pgdemote_kswapd       | 30856         | 2230821       |
   | pgscan_kswapd         | 1861981       | 7667629       |
   | pgscan_anon           | 1822930       | 7610583       |
   | pgscan_file           | 39051         | 57046         |
   | pgsteal_anon          | 386           | 2192033       |
   | pgsteal_file          | 30470         | 38788         |
   | pageoutrun            | 30            | 412           |
   | numa_hint_faults      | 27418279      | 2875955       |
   | numa_pages_migrated   | 356           | 1293122       |
   +-----------------------+-------------------------------+

Link: https://lkml.kernel.org/r/20240304082118.20499-1-byungchul@sk.com
Signed-off-by: Byungchul Park &lt;byungchul@sk.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove folio from deferred split list before uncharging it</title>
<updated>2024-03-12T20:07:16+00:00</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2024-03-11T19:18:34+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=47932e7048df9156e96133ee90fb3e9df68dbd15'/>
<id>urn:sha1:47932e7048df9156e96133ee90fb3e9df68dbd15</id>
<content type='text'>
When freeing a large folio, we must remove it from the deferred split list
before we uncharge it as each memcg has its own deferred split list (with
associated lock) and removing a folio from the deferred split list while
holding the wrong lock will corrupt that list and cause various related
problems.

Link: https://lore.kernel.org/linux-mm/367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com/
Link: https://lkml.kernel.org/r/20240311191835.312162-1-willy@infradead.org
Fixes: f77171d241e3 (mm: allow non-hugetlb large folios to be batch processed)
Fixes: 29f3843026cf (mm: free folios directly in move_folios_to_lru())
Fixes: bc2ff4cbc329 (mm: free folios in a batch in shrink_folio_list())
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Debugged-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Tested-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
