From 1f318a9b0dc3990490e98eef48f21e6f15185781 Mon Sep 17 00:00:00 2001 From: Jaewon Kim Date: Wed, 3 Jun 2020 16:01:15 -0700 Subject: mm/vmscan: count layzfree pages and fix nr_isolated_* mismatch Fix an nr_isolate_* mismatch problem between cma and dirty lazyfree pages. If try_to_unmap_one is used for reclaim and it detects a dirty lazyfree page, then the lazyfree page is changed to a normal anon page having SwapBacked by commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages"). Even with the change, reclaim context correctly counts isolated files because it uses is_file_lru to distinguish file. And the change to anon is not happened if try_to_unmap_one is used for migration. So migration context like compaction also correctly counts isolated files even though it uses page_is_file_lru insted of is_file_lru. Recently page_is_file_cache was renamed to page_is_file_lru by commit 9de4f22a60f7 ("mm: code cleanup for MADV_FREE"). But the nr_isolate_* mismatch problem happens on cma alloc. There is reclaim_clean_pages_from_list which is being used only by cma. It was introduced by commit 02c6de8d757c ("mm: cma: discard clean pages during contiguous allocation instead of migration") to reclaim clean file pages without migration. The cma alloc uses both reclaim_clean_pages_from_list and migrate_pages, and it uses page_is_file_lru to count isolated files. If there are dirty lazyfree pages allocated from cma memory region, the pages are counted as isolated file at the beginging but are counted as isolated anon after finished. Mem-Info: Node 0 active_anon:3045904kB inactive_anon:611448kB active_file:14892kB inactive_file:205636kB unevictable:10416kB isolated(anon):0kB isolated(file):37664kB mapped:630216kB dirty:384kB writeback:0kB shmem:42576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Like log above, there were too much isolated files, 37664kB, which triggers too_many_isolated in reclaim even when there is no actually isolated file in system wide. It could be reproducible by running two programs, writing on MADV_FREE page and doing cma alloc, respectively. Although isolated anon is 0, I found that the internal value of isolated anon was the negative value of isolated file. Fix this by compensating the isolated count for both LRU lists. Count non-discarded lazyfree pages in shrink_page_list, then compensate the counted number in reclaim_clean_pages_from_list. Reported-by: Yong-Taek Lee Suggested-by: Minchan Kim Signed-off-by: Jaewon Kim Signed-off-by: Andrew Morton Acked-by: Minchan Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Marek Szyprowski Cc: Michal Nazarewicz Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200426011718.30246-1-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds --- include/linux/vmstat.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/vmstat.h') diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 292485f3d24d..10cc932e209a 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -29,6 +29,7 @@ struct reclaim_stat { unsigned nr_activate[2]; unsigned nr_ref_keep; unsigned nr_unmap_fail; + unsigned nr_lazyfree_fail; }; enum writeback_stat_item { -- cgit v1.2.3 From 96f8bf4fb1dd2656ae3e92326be9ebf003bbfd45 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 3 Jun 2020 16:03:09 -0700 Subject: mm: vmscan: reclaim writepage is IO cost The VM tries to balance reclaim pressure between anon and file so as to reduce the amount of IO incurred due to the memory shortage. It already counts refaults and swapins, but in addition it should also count writepage calls during reclaim. For swap, this is obvious: it's IO that wouldn't have occurred if the anonymous memory hadn't been under memory pressure. From a relative balancing point of view this makes sense as well: even if anon is cold and reclaimable, a cache that isn't thrashing may have equally cold pages that don't require IO to reclaim. For file writeback, it's trickier: some of the reclaim writepage IO would have likely occurred anyway due to dirty expiration. But not all of it - premature writeback reduces batching and generates additional writes. Since the flushers are already woken up by the time the VM starts writing cache pages one by one, let's assume that we'e likely causing writes that wouldn't have happened without memory pressure. In addition, the per-page cost of IO would have probably been much cheaper if written in larger batches from the flusher thread rather than the single-page-writes from kswapd. For our purposes - getting the trend right to accelerate convergence on a stable state that doesn't require paging at all - this is sufficiently accurate. If we later wanted to optimize for sustained thrashing, we can still refine the measurements. Count all writepage calls from kswapd as IO cost toward the LRU that the page belongs to. Why do this dynamically? Don't we know in advance that anon pages require IO to reclaim, and so could build in a static bias? First, scanning is not the same as reclaiming. If all the anon pages are referenced, we may not swap for a while just because we're scanning the anon list. During this time, however, it's important that we age anonymous memory and the page cache at the same rate so that their hot-cold gradients are comparable. Everything else being equal, we still want to reclaim the coldest memory overall. Second, we keep copies in swap unless the page changes. If there is swap-backed data that's mostly read (tmpfs file) and has been swapped out before, we can reclaim it without incurring additional IO. Signed-off-by: Johannes Weiner Signed-off-by: Andrew Morton Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org Signed-off-by: Linus Torvalds --- include/linux/swap.h | 4 +++- include/linux/vmstat.h | 1 + mm/swap.c | 16 ++++++++++------ mm/swap_state.c | 2 +- mm/vmscan.c | 3 +++ mm/workingset.c | 2 +- 6 files changed, 19 insertions(+), 9 deletions(-) (limited to 'include/linux/vmstat.h') diff --git a/include/linux/swap.h b/include/linux/swap.h index 0b71bf75fb67..4c5974bb9ba9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -334,7 +334,9 @@ extern unsigned long nr_free_pagecache_pages(void); /* linux/mm/swap.c */ -extern void lru_note_cost(struct page *); +extern void lru_note_cost(struct lruvec *lruvec, bool file, + unsigned int nr_pages); +extern void lru_note_cost_page(struct page *); extern void lru_cache_add(struct page *); extern void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *head); diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 10cc932e209a..3d12c34cd42a 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -26,6 +26,7 @@ struct reclaim_stat { unsigned nr_congested; unsigned nr_writeback; unsigned nr_immediate; + unsigned nr_pageout; unsigned nr_activate[2]; unsigned nr_ref_keep; unsigned nr_unmap_fail; diff --git a/mm/swap.c b/mm/swap.c index 4dff2123f695..343675d629ae 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -278,18 +278,16 @@ void rotate_reclaimable_page(struct page *page) } } -void lru_note_cost(struct page *page) +void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) { - struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); - do { unsigned long lrusize; /* Record cost event */ - if (page_is_file_lru(page)) - lruvec->file_cost++; + if (file) + lruvec->file_cost += nr_pages; else - lruvec->anon_cost++; + lruvec->anon_cost += nr_pages; /* * Decay previous events @@ -311,6 +309,12 @@ void lru_note_cost(struct page *page) } while ((lruvec = parent_lruvec(lruvec))); } +void lru_note_cost_page(struct page *page) +{ + lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)), + page_is_file_lru(page), hpage_nr_pages(page)); +} + static void __activate_page(struct page *page, struct lruvec *lruvec, void *arg) { diff --git a/mm/swap_state.c b/mm/swap_state.c index 1cd0b345ff7e..9d20b00627af 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -442,7 +442,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, /* XXX: Move to lru_cache_add() when it supports new vs putback */ spin_lock_irq(&page_pgdat(page)->lru_lock); - lru_note_cost(page); + lru_note_cost_page(page); spin_unlock_irq(&page_pgdat(page)->lru_lock); /* Caller will initiate read into locked page */ diff --git a/mm/vmscan.c b/mm/vmscan.c index d08640f0235c..14ffe9ccf7ef 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1359,6 +1359,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, case PAGE_ACTIVATE: goto activate_locked; case PAGE_SUCCESS: + stat->nr_pageout += hpage_nr_pages(page); + if (PageWriteback(page)) goto keep; if (PageDirty(page)) @@ -1964,6 +1966,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, move_pages_to_lru(lruvec, &page_list); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); + lru_note_cost(lruvec, file, stat.nr_pageout); item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT; if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_reclaimed); diff --git a/mm/workingset.c b/mm/workingset.c index a6a2a740ed0b..d481ea452eeb 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -367,7 +367,7 @@ void workingset_refault(struct page *page, void *shadow) SetPageWorkingset(page); /* XXX: Move to lru_cache_add() when it supports new vs putback */ spin_lock_irq(&page_pgdat(page)->lru_lock); - lru_note_cost(page); + lru_note_cost_page(page); spin_unlock_irq(&page_pgdat(page)->lru_lock); inc_lruvec_state(lruvec, WORKINGSET_RESTORE); } -- cgit v1.2.3