mm: swap: check if swap backing device is congested or not

Swap readahead would read in a few pages regardless if the underlying device is busy or not. It may incur long waiting time if the device is congested, and it may also exacerbate the congestion. Use inode_read_congested() to check if the underlying device is busy or not like what file page readahead does. Get inode from swap_info_struct. Although we can add inode information in swap_address_space (address_space->host), it may lead some unexpected side effect, i.e. it may break mapping_cap_account_dirty(). Using inode from swap_info_struct seems simple and good enough. Just does the check in vma_cluster_readahead() since swap_vma_readahead() is just used for non-rotational device which much less likely has congestion than traditional HDD. Although swap slots may be consecutive on swap partition, it still may be fragmented on swap file. This check would help to reduce excessive stall for such case. The test with page_fault1 of will-it-scale (sometimes tracing may just show runtest.py that is the wrapper script of page_fault1), which basically launches NR_CPU threads to generate 128MB anonymous pages for each thread, on my virtual machine with congested HDD shows long tail latency is reduced significantly. Without the patch page_fault1_thr-1490 [023] 129.311706: funcgraph_entry: #57377.796 us | do_swap_page(); page_fault1_thr-1490 [023] 129.369103: funcgraph_entry: 5.642us | do_swap_page(); page_fault1_thr-1490 [023] 129.369119: funcgraph_entry: #1289.592 us | do_swap_page(); page_fault1_thr-1490 [023] 129.370411: funcgraph_entry: 4.957us | do_swap_page(); page_fault1_thr-1490 [023] 129.370419: funcgraph_entry: 1.940us | do_swap_page(); page_fault1_thr-1490 [023] 129.378847: funcgraph_entry: #1411.385 us | do_swap_page(); page_fault1_thr-1490 [023] 129.380262: funcgraph_entry: 3.916us | do_swap_page(); page_fault1_thr-1490 [023] 129.380275: funcgraph_entry: #4287.751 us | do_swap_page(); With the patch runtest.py-1417 [020] 301.925911: funcgraph_entry: #9870.146 us | do_swap_page(); runtest.py-1417 [020] 301.935785: funcgraph_entry: 9.802us | do_swap_page(); runtest.py-1417 [020] 301.935799: funcgraph_entry: 3.551us | do_swap_page(); runtest.py-1417 [020] 301.935806: funcgraph_entry: 2.142us | do_swap_page(); runtest.py-1417 [020] 301.935853: funcgraph_entry: 6.938us | do_swap_page(); runtest.py-1417 [020] 301.935864: funcgraph_entry: 3.765us | do_swap_page(); runtest.py-1417 [020] 301.935871: funcgraph_entry: 3.600us | do_swap_page(); runtest.py-1417 [020] 301.935878: funcgraph_entry: 7.202us | do_swap_page(); [akpm@linux-foundation.org: code cleanup] [yang.shi@linux.alibaba.com: add comment] Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Tim Chen <tim.c.chen@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Hugh Dickins <hughd@google.com Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Yang Shi <yang.shi@linux.alibaba.com> 2019-03-05 15:44:11 -0800
committer: Linus Torvalds <torvalds@linux-foundation.org> 2019-03-05 21:07:16 -0800
commit: 8fd2e0b505d124bbb046ab15de0ff6f8d4babf56 (patch)
tree: 869b2427d0f5166f66834c6543e86e021983b401
parent: 14ef1fc72a3e86cea6498193c5e04b4619cb8622 (diff)
download: lwn-8fd2e0b505d124bbb046ab15de0ff6f8d4babf56.tar.gz
lwn-8fd2e0b505d124bbb046ab15de0ff6f8d4babf56.zip
1 files changed, 7 insertions, 0 deletions
diff --git a/mm/swap_state.c b/mm/swap_state.c
index fd2f21e1c60a..56a3cc4f1068 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -543,6 +543,13 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!mask)
 		goto skip;
 
+	/* Test swap type to make sure the dereference is safe */
+	if (likely(si->flags & (SWP_BLKDEV | SWP_FS))) {
+		struct inode *inode = si->swap_file->f_mapping->host;
+		if (inode_read_congested(inode))
+			goto skip;
+	}
+
 	do_poll = false;
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
author	Yang Shi <yang.shi@linux.alibaba.com>	2019-03-05 15:44:11 -0800
committer	Linus Torvalds <torvalds@linux-foundation.org>	2019-03-05 21:07:16 -0800
commit	8fd2e0b505d124bbb046ab15de0ff6f8d4babf56 (patch)
tree	869b2427d0f5166f66834c6543e86e021983b401
parent	14ef1fc72a3e86cea6498193c5e04b4619cb8622 (diff)
download	lwn-8fd2e0b505d124bbb046ab15de0ff6f8d4babf56.tar.gz lwn-8fd2e0b505d124bbb046ab15de0ff6f8d4babf56.zip