From b3b0e6accb5b3d249760e071f2cf77f696961158 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:31:10 -0800 Subject: kasan: update documentation This patch updates KASAN documentation to reflect the addition of the new tag-based mode. Link: http://lkml.kernel.org/r/aabef9de317c54b8a3919a4946ce534c6576726a.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Cc: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/dev-tools/kasan.rst | 232 +++++++++++++++++++++++--------------- 1 file changed, 138 insertions(+), 94 deletions(-) (limited to 'Documentation') diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index aabc8738b3d8..b72d07d70239 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -4,15 +4,25 @@ The Kernel Address Sanitizer (KASAN) Overview -------- -KernelAddressSANitizer (KASAN) is a dynamic memory error detector. It provides -a fast and comprehensive solution for finding use-after-free and out-of-bounds -bugs. +KernelAddressSANitizer (KASAN) is a dynamic memory error detector designed to +find out-of-bound and use-after-free bugs. KASAN has two modes: generic KASAN +(similar to userspace ASan) and software tag-based KASAN (similar to userspace +HWASan). -KASAN uses compile-time instrumentation for checking every memory access, -therefore you will need a GCC version 4.9.2 or later. GCC 5.0 or later is -required for detection of out-of-bounds accesses to stack or global variables. +KASAN uses compile-time instrumentation to insert validity checks before every +memory access, and therefore requires a compiler version that supports that. -Currently KASAN is supported only for the x86_64 and arm64 architectures. +Generic KASAN is supported in both GCC and Clang. With GCC it requires version +4.9.2 or later for basic support and version 5.0 or later for detection of +out-of-bounds accesses for stack and global variables and for inline +instrumentation mode (see the Usage section). With Clang it requires version +7.0.0 or later and it doesn't support detection of out-of-bounds accesses for +global variables yet. + +Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later. + +Currently generic KASAN is supported for the x86_64, arm64, xtensa and s390 +architectures, and tag-based KASAN is supported only for arm64. Usage ----- @@ -21,12 +31,14 @@ To enable KASAN configure kernel with:: CONFIG_KASAN = y -and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline and -inline are compiler instrumentation types. The former produces smaller binary -the latter is 1.1 - 2 times faster. Inline instrumentation requires a GCC -version 5.0 or later. +and choose between CONFIG_KASAN_GENERIC (to enable generic KASAN) and +CONFIG_KASAN_SW_TAGS (to enable software tag-based KASAN). + +You also need to choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. +Outline and inline are compiler instrumentation types. The former produces +smaller binary while the latter is 1.1 - 2 times faster. -KASAN works with both SLUB and SLAB memory allocators. +Both KASAN modes work with both SLUB and SLAB memory allocators. For better bug detection and nicer reporting, enable CONFIG_STACKTRACE. To disable instrumentation for specific files or directories, add a line @@ -43,85 +55,85 @@ similar to the following to the respective kernel Makefile: Error reports ~~~~~~~~~~~~~ -A typical out of bounds access report looks like this:: +A typical out-of-bounds access generic KASAN report looks like this:: ================================================================== - BUG: AddressSanitizer: out of bounds access in kmalloc_oob_right+0x65/0x75 [test_kasan] at addr ffff8800693bc5d3 - Write of size 1 by task modprobe/1689 - ============================================================================= - BUG kmalloc-128 (Not tainted): kasan error - ----------------------------------------------------------------------------- - - Disabling lock debugging due to kernel taint - INFO: Allocated in kmalloc_oob_right+0x3d/0x75 [test_kasan] age=0 cpu=0 pid=1689 - __slab_alloc+0x4b4/0x4f0 - kmem_cache_alloc_trace+0x10b/0x190 - kmalloc_oob_right+0x3d/0x75 [test_kasan] - init_module+0x9/0x47 [test_kasan] - do_one_initcall+0x99/0x200 - load_module+0x2cb3/0x3b20 - SyS_finit_module+0x76/0x80 - system_call_fastpath+0x12/0x17 - INFO: Slab 0xffffea0001a4ef00 objects=17 used=7 fp=0xffff8800693bd728 flags=0x100000000004080 - INFO: Object 0xffff8800693bc558 @offset=1368 fp=0xffff8800693bc720 - - Bytes b4 ffff8800693bc548: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ - Object ffff8800693bc558: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk - Object ffff8800693bc568: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk - Object ffff8800693bc578: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk - Object ffff8800693bc588: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk - Object ffff8800693bc598: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk - Object ffff8800693bc5a8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk - Object ffff8800693bc5b8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk - Object ffff8800693bc5c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. - Redzone ffff8800693bc5d8: cc cc cc cc cc cc cc cc ........ - Padding ffff8800693bc718: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ - CPU: 0 PID: 1689 Comm: modprobe Tainted: G B 3.18.0-rc1-mm1+ #98 - Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 - ffff8800693bc000 0000000000000000 ffff8800693bc558 ffff88006923bb78 - ffffffff81cc68ae 00000000000000f3 ffff88006d407600 ffff88006923bba8 - ffffffff811fd848 ffff88006d407600 ffffea0001a4ef00 ffff8800693bc558 + BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [test_kasan] + Write of size 1 at addr ffff8801f44ec37b by task insmod/2760 + + CPU: 1 PID: 2760 Comm: insmod Not tainted 4.19.0-rc3+ #698 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: - [] dump_stack+0x46/0x58 - [] print_trailer+0xf8/0x160 - [] ? kmem_cache_oob+0xc3/0xc3 [test_kasan] - [] object_err+0x35/0x40 - [] ? kmalloc_oob_right+0x65/0x75 [test_kasan] - [] kasan_report_error+0x38a/0x3f0 - [] ? kasan_poison_shadow+0x2f/0x40 - [] ? kasan_unpoison_shadow+0x14/0x40 - [] ? kasan_poison_shadow+0x2f/0x40 - [] ? kmem_cache_oob+0xc3/0xc3 [test_kasan] - [] __asan_store1+0x75/0xb0 - [] ? kmem_cache_oob+0x1d/0xc3 [test_kasan] - [] ? kmalloc_oob_right+0x65/0x75 [test_kasan] - [] kmalloc_oob_right+0x65/0x75 [test_kasan] - [] init_module+0x9/0x47 [test_kasan] - [] do_one_initcall+0x99/0x200 - [] ? __vunmap+0xec/0x160 - [] load_module+0x2cb3/0x3b20 - [] ? m_show+0x240/0x240 - [] SyS_finit_module+0x76/0x80 - [] system_call_fastpath+0x12/0x17 + dump_stack+0x94/0xd8 + print_address_description+0x73/0x280 + kasan_report+0x144/0x187 + __asan_report_store1_noabort+0x17/0x20 + kmalloc_oob_right+0xa8/0xbc [test_kasan] + kmalloc_tests_init+0x16/0x700 [test_kasan] + do_one_initcall+0xa5/0x3ae + do_init_module+0x1b6/0x547 + load_module+0x75df/0x8070 + __do_sys_init_module+0x1c6/0x200 + __x64_sys_init_module+0x6e/0xb0 + do_syscall_64+0x9f/0x2c0 + entry_SYSCALL_64_after_hwframe+0x44/0xa9 + RIP: 0033:0x7f96443109da + RSP: 002b:00007ffcf0b51b08 EFLAGS: 00000202 ORIG_RAX: 00000000000000af + RAX: ffffffffffffffda RBX: 000055dc3ee521a0 RCX: 00007f96443109da + RDX: 00007f96445cff88 RSI: 0000000000057a50 RDI: 00007f9644992000 + RBP: 000055dc3ee510b0 R08: 0000000000000003 R09: 0000000000000000 + R10: 00007f964430cd0a R11: 0000000000000202 R12: 00007f96445cff88 + R13: 000055dc3ee51090 R14: 0000000000000000 R15: 0000000000000000 + + Allocated by task 2760: + save_stack+0x43/0xd0 + kasan_kmalloc+0xa7/0xd0 + kmem_cache_alloc_trace+0xe1/0x1b0 + kmalloc_oob_right+0x56/0xbc [test_kasan] + kmalloc_tests_init+0x16/0x700 [test_kasan] + do_one_initcall+0xa5/0x3ae + do_init_module+0x1b6/0x547 + load_module+0x75df/0x8070 + __do_sys_init_module+0x1c6/0x200 + __x64_sys_init_module+0x6e/0xb0 + do_syscall_64+0x9f/0x2c0 + entry_SYSCALL_64_after_hwframe+0x44/0xa9 + + Freed by task 815: + save_stack+0x43/0xd0 + __kasan_slab_free+0x135/0x190 + kasan_slab_free+0xe/0x10 + kfree+0x93/0x1a0 + umh_complete+0x6a/0xa0 + call_usermodehelper_exec_async+0x4c3/0x640 + ret_from_fork+0x35/0x40 + + The buggy address belongs to the object at ffff8801f44ec300 + which belongs to the cache kmalloc-128 of size 128 + The buggy address is located 123 bytes inside of + 128-byte region [ffff8801f44ec300, ffff8801f44ec380) + The buggy address belongs to the page: + page:ffffea0007d13b00 count:1 mapcount:0 mapping:ffff8801f7001640 index:0x0 + flags: 0x200000000000100(slab) + raw: 0200000000000100 ffffea0007d11dc0 0000001a0000001a ffff8801f7001640 + raw: 0000000000000000 0000000080150015 00000001ffffffff 0000000000000000 + page dumped because: kasan: bad access detected + Memory state around the buggy address: - ffff8800693bc300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc380: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 fc - ffff8800693bc400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc500: fc fc fc fc fc fc fc fc fc fc fc 00 00 00 00 00 - >ffff8800693bc580: 00 00 00 00 00 00 00 00 00 00 03 fc fc fc fc fc - ^ - ffff8800693bc600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc - ffff8800693bc700: fc fc fc fc fb fb fb fb fb fb fb fb fb fb fb fb - ffff8800693bc780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb - ffff8800693bc800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb + ffff8801f44ec200: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb + ffff8801f44ec280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc + >ffff8801f44ec300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 + ^ + ffff8801f44ec380: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb + ffff8801f44ec400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ================================================================== -The header of the report discribe what kind of bug happened and what kind of -access caused it. It's followed by the description of the accessed slub object -(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and -the description of the accessed memory page. +The header of the report provides a short summary of what kind of bug happened +and what kind of access caused it. It's followed by a stack trace of the bad +access, a stack trace of where the accessed memory was allocated (in case bad +access happens on a slab object), and a stack trace of where the object was +freed (in case of a use-after-free bug report). Next comes a description of +the accessed slab object and information about the accessed memory page. In the last section the report shows memory state around the accessed address. Reading this part requires some understanding of how KASAN works. @@ -138,18 +150,24 @@ inaccessible memory like redzones or freed memory (see mm/kasan/kasan.h). In the report above the arrows point to the shadow byte 03, which means that the accessed address is partially accessible. +For tag-based KASAN this last report section shows the memory tags around the +accessed address (see Implementation details section). + Implementation details ---------------------- +Generic KASAN +~~~~~~~~~~~~~ + From a high level, our approach to memory error detection is similar to that of kmemcheck: use shadow memory to record whether each byte of memory is safe -to access, and use compile-time instrumentation to check shadow memory on each -memory access. +to access, and use compile-time instrumentation to insert checks of shadow +memory on each memory access. -AddressSanitizer dedicates 1/8 of kernel memory to its shadow memory -(e.g. 16TB to cover 128TB on x86_64) and uses direct mapping with a scale and -offset to translate a memory address to its corresponding shadow address. +Generic KASAN dedicates 1/8th of kernel memory to its shadow memory (e.g. 16TB +to cover 128TB on x86_64) and uses direct mapping with a scale and offset to +translate a memory address to its corresponding shadow address. Here is the function which translates an address to its corresponding shadow address:: @@ -162,12 +180,38 @@ address:: where ``KASAN_SHADOW_SCALE_SHIFT = 3``. -Compile-time instrumentation used for checking memory accesses. Compiler inserts -function calls (__asan_load*(addr), __asan_store*(addr)) before each memory -access of size 1, 2, 4, 8 or 16. These functions check whether memory access is -valid or not by checking corresponding shadow memory. +Compile-time instrumentation is used to insert memory access checks. Compiler +inserts function calls (__asan_load*(addr), __asan_store*(addr)) before each +memory access of size 1, 2, 4, 8 or 16. These functions check whether memory +access is valid or not by checking corresponding shadow memory. GCC 5.0 has possibility to perform inline instrumentation. Instead of making function calls GCC directly inserts the code to check the shadow memory. This option significantly enlarges kernel but it gives x1.1-x2 performance boost over outline instrumented kernel. + +Software tag-based KASAN +~~~~~~~~~~~~~~~~~~~~~~~~ + +Tag-based KASAN uses the Top Byte Ignore (TBI) feature of modern arm64 CPUs to +store a pointer tag in the top byte of kernel pointers. Like generic KASAN it +uses shadow memory to store memory tags associated with each 16-byte memory +cell (therefore it dedicates 1/16th of the kernel memory for shadow memory). + +On each memory allocation tag-based KASAN generates a random tag, tags the +allocated memory with this tag, and embeds this tag into the returned pointer. +Software tag-based KASAN uses compile-time instrumentation to insert checks +before each memory access. These checks make sure that tag of the memory that +is being accessed is equal to tag of the pointer that is used to access this +memory. In case of a tag mismatch tag-based KASAN prints a bug report. + +Software tag-based KASAN also has two instrumentation modes (outline, that +emits callbacks to check memory accesses; and inline, that performs the shadow +memory checks inline). With outline instrumentation mode, a bug report is +simply printed from the function that performs the access check. With inline +instrumentation a brk instruction is emitted by the compiler, and a dedicated +brk handler is used to print bug reports. + +A potential expansion of this mode is a hardware tag-based mode, which would +use hardware memory tagging support instead of compiler instrumentation and +manual shadow memory manipulation. -- cgit v1.2.3 From 1c30844d2dfe272d58c8fc000960b835d13aa2ac Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 28 Dec 2018 00:35:52 -0800 Subject: mm: reclaim small amounts of memory when an external fragmentation event occurs An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: David Rientjes Cc: Michal Hocko Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 21 +++++++ include/linux/mm.h | 1 + include/linux/mmzone.h | 11 ++-- kernel/sysctl.c | 8 +++ mm/page_alloc.c | 43 +++++++++++++- mm/vmscan.c | 133 +++++++++++++++++++++++++++++++++++++++++--- 6 files changed, 202 insertions(+), 15 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 7d73882e2c27..187ce4f599a2 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm: - swappiness - user_reserve_kbytes - vfs_cache_pressure +- watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -856,6 +857,26 @@ ten times more freeable objects than there are. ============================================================= +watermark_boost_factor: + +This factor controls the level of reclaim when memory is being fragmented. +It defines the percentage of the high watermark of a zone that will be +reclaimed if pages of different mobility are being mixed within pageblocks. +The intent is that compaction has less work to do in the future and to +increase the success rate of future high-order allocations such as SLUB +allocations, THP and hugetlbfs pages. + +To make it sensible with respect to the watermark_scale_factor parameter, +the unit is in fractions of 10,000. The default value of 15,000 means +that up to 150% of the high watermark will be reclaimed in the event of +a pageblock being mixed due to fragmentation. The level of reclaim is +determined by the number of fragmentation events that occurred in the +recent past. If this value is smaller than a pageblock then a pageblocks +worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor +of 0 will disable the feature. + +============================================================= + watermark_scale_factor: This factor controls the aggressiveness of kswapd. It defines the diff --git a/include/linux/mm.h b/include/linux/mm.h index 1d2be4c2d34a..031b2ce983f9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2256,6 +2256,7 @@ extern void zone_pcp_reset(struct zone *zone); /* page_alloc.c */ extern int min_free_kbytes; +extern int watermark_boost_factor; extern int watermark_scale_factor; /* nommu.c */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dcf1b66a96ab..5b4bfb90fb94 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -269,10 +269,10 @@ enum zone_watermarks { NR_WMARK }; -#define min_wmark_pages(z) (z->_watermark[WMARK_MIN]) -#define low_wmark_pages(z) (z->_watermark[WMARK_LOW]) -#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH]) -#define wmark_pages(z, i) (z->_watermark[i]) +#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) +#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) +#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) +#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) struct per_cpu_pages { int count; /* number of pages in the list */ @@ -364,6 +364,7 @@ struct zone { /* zone watermarks, access with *_wmark_pages(zone) macros */ unsigned long _watermark[NR_WMARK]; + unsigned long watermark_boost; unsigned long nr_reserved_highatomic; @@ -890,6 +891,8 @@ static inline int is_highmem(struct zone *zone) struct ctl_table; int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int watermark_boost_factor_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 5fc724e4e454..1825f712e73b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1462,6 +1462,14 @@ static struct ctl_table vm_table[] = { .proc_handler = min_free_kbytes_sysctl_handler, .extra1 = &zero, }, + { + .procname = "watermark_boost_factor", + .data = &watermark_boost_factor, + .maxlen = sizeof(watermark_boost_factor), + .mode = 0644, + .proc_handler = watermark_boost_factor_sysctl_handler, + .extra1 = &zero, + }, { .procname = "watermark_scale_factor", .data = &watermark_scale_factor, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 32b3e121a388..80373eca453d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -262,6 +262,7 @@ compound_page_dtor * const compound_page_dtors[] = { int min_free_kbytes = 1024; int user_min_free_kbytes = -1; +int watermark_boost_factor __read_mostly = 15000; int watermark_scale_factor = 10; static unsigned long nr_kernel_pages __meminitdata; @@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } +static inline void boost_watermark(struct zone *zone) +{ + unsigned long max_boost; + + if (!watermark_boost_factor) + return; + + max_boost = mult_frac(zone->_watermark[WMARK_HIGH], + watermark_boost_factor, 10000); + max_boost = max(pageblock_nr_pages, max_boost); + + zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, + max_boost); +} + /* * This function implements actual steal behaviour. If order is large enough, * we can steal whole pageblock. If not, we first move freepages in this @@ -2138,7 +2154,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt) * itself, so pages freed in the future will be put on the correct free list. */ static void steal_suitable_fallback(struct zone *zone, struct page *page, - int start_type, bool whole_block) + unsigned int alloc_flags, int start_type, bool whole_block) { unsigned int current_order = page_order(page); struct free_area *area; @@ -2160,6 +2176,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page; } + /* + * Boost watermarks to increase reclaim pressure to reduce the + * likelihood of future fallbacks. Wake kswapd now as the node + * may be balanced overall and kswapd will not wake naturally. + */ + boost_watermark(zone); + if (alloc_flags & ALLOC_KSWAPD) + wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + /* We are not allowed to try stealing from the whole block */ if (!whole_block) goto single_page; @@ -2443,7 +2468,8 @@ do_steal: page = list_first_entry(&area->free_list[fallback_mt], struct page, lru); - steal_suitable_fallback(zone, page, start_migratetype, can_steal); + steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, + can_steal); trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); @@ -7454,6 +7480,7 @@ static void __setup_per_zone_wmarks(void) zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + zone->watermark_boost = 0; spin_unlock_irqrestore(&zone->lock, flags); } @@ -7554,6 +7581,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write, return 0; } +int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + return 0; +} + int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 24ab1f7394ab..bd8971a29204 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -88,6 +88,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1; + /* e.g. boosted watermark reclaim leaves slabs alone */ + unsigned int may_shrinkslab:1; + /* * Cgroups are not reclaimed below their configured memory.low, * unless we threaten to OOM. If any cgroups are skipped due to @@ -2756,8 +2759,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - shrink_slab(sc->gfp_mask, pgdat->node_id, + if (sc->may_shrinkslab) { + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority); + } /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -3239,6 +3244,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .may_shrinkslab = 1, }; /* @@ -3283,6 +3289,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, .may_unmap = 1, .reclaim_idx = MAX_NR_ZONES - 1, .may_swap = !noswap, + .may_shrinkslab = 1, }; unsigned long lru_pages; @@ -3329,6 +3336,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = may_swap, + .may_shrinkslab = 1, }; /* @@ -3379,6 +3387,30 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } +static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx) +{ + int i; + struct zone *zone; + + /* + * Check for watermark boosts top-down as the higher zones + * are more likely to be boosted. Both watermarks and boosts + * should not be checked at the time time as reclaim would + * start prematurely when there is no boosting and a lower + * zone is balanced. + */ + for (i = classzone_idx; i >= 0; i--) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + if (zone->watermark_boost) + return true; + } + + return false; +} + /* * Returns true if there is an eligible zone balanced for the request order * and classzone_idx @@ -3389,6 +3421,10 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) unsigned long mark = -1; struct zone *zone; + /* + * Check watermarks bottom-up as lower zones are more likely to + * meet watermarks. + */ for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; @@ -3517,14 +3553,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; unsigned long pflags; + unsigned long nr_boost_reclaim; + unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; + bool boosted; struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .order = order, - .priority = DEF_PRIORITY, - .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = 1, }; psi_memstall_enter(&pflags); @@ -3532,9 +3568,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) count_vm_event(PAGEOUTRUN); + /* + * Account for the reclaim boost. Note that the zone boost is left in + * place so that parallel allocations that are near the watermark will + * stall or direct reclaim until kswapd is finished. + */ + nr_boost_reclaim = 0; + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + nr_boost_reclaim += zone->watermark_boost; + zone_boosts[i] = zone->watermark_boost; + } + boosted = nr_boost_reclaim; + +restart: + sc.priority = DEF_PRIORITY; do { unsigned long nr_reclaimed = sc.nr_reclaimed; bool raise_priority = true; + bool balanced; bool ret; sc.reclaim_idx = classzone_idx; @@ -3561,13 +3616,40 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) } /* - * Only reclaim if there are no eligible zones. Note that - * sc.reclaim_idx is not used as buffer_heads_over_limit may - * have adjusted it. + * If the pgdat is imbalanced then ignore boosting and preserve + * the watermarks for a later time and restart. Note that the + * zone watermarks will be still reset at the end of balancing + * on the grounds that the normal reclaim should be enough to + * re-evaluate if boosting is required when kswapd next wakes. + */ + balanced = pgdat_balanced(pgdat, sc.order, classzone_idx); + if (!balanced && nr_boost_reclaim) { + nr_boost_reclaim = 0; + goto restart; + } + + /* + * If boosting is not active then only reclaim if there are no + * eligible zones. Note that sc.reclaim_idx is not used as + * buffer_heads_over_limit may have adjusted it. */ - if (pgdat_balanced(pgdat, sc.order, classzone_idx)) + if (!nr_boost_reclaim && balanced) goto out; + /* Limit the priority of boosting to avoid reclaim writeback */ + if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) + raise_priority = false; + + /* + * Do not writeback or swap pages for boosted reclaim. The + * intent is to relieve pressure not issue sub-optimal IO + * from reclaim context. If no pages are reclaimed, the + * reclaim will be aborted. + */ + sc.may_writepage = !laptop_mode && !nr_boost_reclaim; + sc.may_swap = !nr_boost_reclaim; + sc.may_shrinkslab = !nr_boost_reclaim; + /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. All @@ -3619,6 +3701,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * progress in reclaiming pages */ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; + nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed); + + /* + * If reclaim made no progress for a boost, stop reclaim as + * IO cannot be queued and it could be an infinite loop in + * extreme circumstances. + */ + if (nr_boost_reclaim && !nr_reclaimed) + break; + if (raise_priority || !nr_reclaimed) sc.priority--; } while (sc.priority >= 1); @@ -3627,6 +3719,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) pgdat->kswapd_failures++; out: + /* If reclaim was boosted, account for the reclaim done in this pass */ + if (boosted) { + unsigned long flags; + + for (i = 0; i <= classzone_idx; i++) { + if (!zone_boosts[i]) + continue; + + /* Increments are under the zone lock */ + zone = pgdat->node_zones + i; + spin_lock_irqsave(&zone->lock, flags); + zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]); + spin_unlock_irqrestore(&zone->lock, flags); + } + + /* + * As there is now likely space, wakeup kcompact to defragment + * pageblocks. + */ + wakeup_kcompactd(pgdat, pageblock_order, classzone_idx); + } + snapshot_refaults(NULL, pgdat); __fs_reclaim_release(); psi_memstall_leave(&pflags); @@ -3855,7 +3969,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, /* Hopeless node, leave it to direct reclaim if possible */ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES || - pgdat_balanced(pgdat, order, classzone_idx)) { + (pgdat_balanced(pgdat, order, classzone_idx) && + !pgdat_watermark_boosted(pgdat, classzone_idx))) { /* * There may be plenty of free memory available, but it's too * fragmented for high-order allocations. Wake up kcompactd -- cgit v1.2.3 From e82592c4fd7eafe8dec12a70436e93e3afb28556 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 28 Dec 2018 00:36:44 -0800 Subject: zram: introduce ZRAM_IDLE flag To support idle page writeback with upcoming patches, this patch introduces a new ZRAM_IDLE flag. Userspace can mark zram slots as "idle" via "echo all > /sys/block/zramX/idle" which marks every allocated zram slot as ZRAM_IDLE. User could see it by /sys/kernel/debug/zram/zram0/block_state. 300 75.033841 ...i 301 63.806904 s..i 302 63.806919 ..hi Once there is IO for the slot, the mark will be disappeared. 300 75.033841 ... 301 63.806904 s..i 302 63.806919 ..hi Therefore, 300th block is idle zpage. With this feature, user can how many zram has idle pages which are waste of memory. Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Reviewed-by: Joey Pabalinas Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 8 +++++ Documentation/blockdev/zram.txt | 10 +++--- drivers/block/zram/zram_drv.c | 57 ++++++++++++++++++++++++++++-- drivers/block/zram/zram_drv.h | 1 + 4 files changed, 69 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index c1513c756af1..04c9a5980bc7 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -98,3 +98,11 @@ Description: The backing_dev file is read-write and set up backing device for zram to write incompressible pages. For using, user should enable CONFIG_ZRAM_WRITEBACK. + +What: /sys/block/zram/idle +Date: November 2018 +Contact: Minchan Kim +Description: + idle file is write-only and mark zram slot as idle. + If system has mounted debugfs, user can see which slots + are idle via /sys/kernel/debug/zram/zram/block_state diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 3c1b5ab54bc0..f3bcd716d8a9 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -169,6 +169,7 @@ comp_algorithm RW show and change the compression algorithm compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes backing_dev RW set up backend storage for zram to write out +idle WO mark allocated slot as idle User space is advised to use the following files to read the device statistics. @@ -251,16 +252,17 @@ pages of the process with*pagemap. If you enable the feature, you could see block state via /sys/kernel/debug/zram/zram0/block_state". The output is as follows, - 300 75.033841 .wh - 301 63.806904 s.. - 302 63.806919 ..h + 300 75.033841 .wh. + 301 63.806904 s... + 302 63.806919 ..hi First column is zram's block index. Second column is access time since the system was booted Third column is state of the block. (s: same page w: written page to backing store -h: huge page) +h: huge page +i: idle page) First line of above example says 300th block is accessed at 75.033841sec and the block's state is huge so it is written back to the backing diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 4457d0395bfb..180613b478a6 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -281,6 +281,47 @@ static ssize_t mem_used_max_store(struct device *dev, return len; } +static ssize_t idle_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t len) +{ + struct zram *zram = dev_to_zram(dev); + unsigned long nr_pages = zram->disksize >> PAGE_SHIFT; + int index; + char mode_buf[8]; + ssize_t sz; + + sz = strscpy(mode_buf, buf, sizeof(mode_buf)); + if (sz <= 0) + return -EINVAL; + + /* ignore trailing new line */ + if (mode_buf[sz - 1] == '\n') + mode_buf[sz - 1] = 0x00; + + if (strcmp(mode_buf, "all")) + return -EINVAL; + + down_read(&zram->init_lock); + if (!init_done(zram)) { + up_read(&zram->init_lock); + return -EINVAL; + } + + for (index = 0; index < nr_pages; index++) { + zram_slot_lock(zram, index); + if (!zram_allocated(zram, index)) + goto next; + + zram_set_flag(zram, index, ZRAM_IDLE); +next: + zram_slot_unlock(zram, index); + } + + up_read(&zram->init_lock); + + return len; +} + #ifdef CONFIG_ZRAM_WRITEBACK static void reset_bdev(struct zram *zram) { @@ -638,6 +679,7 @@ static void zram_debugfs_destroy(void) static void zram_accessed(struct zram *zram, u32 index) { + zram_clear_flag(zram, index, ZRAM_IDLE); zram->table[index].ac_time = ktime_get_boottime(); } @@ -670,12 +712,13 @@ static ssize_t read_block_state(struct file *file, char __user *buf, ts = ktime_to_timespec64(zram->table[index].ac_time); copied = snprintf(kbuf + written, count, - "%12zd %12lld.%06lu %c%c%c\n", + "%12zd %12lld.%06lu %c%c%c%c\n", index, (s64)ts.tv_sec, ts.tv_nsec / NSEC_PER_USEC, zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.', zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.', - zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.'); + zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.', + zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.'); if (count < copied) { zram_slot_unlock(zram, index); @@ -720,7 +763,10 @@ static void zram_debugfs_unregister(struct zram *zram) #else static void zram_debugfs_create(void) {}; static void zram_debugfs_destroy(void) {}; -static void zram_accessed(struct zram *zram, u32 index) {}; +static void zram_accessed(struct zram *zram, u32 index) +{ + zram_clear_flag(zram, index, ZRAM_IDLE); +}; static void zram_debugfs_register(struct zram *zram) {}; static void zram_debugfs_unregister(struct zram *zram) {}; #endif @@ -924,6 +970,9 @@ static void zram_free_page(struct zram *zram, size_t index) #ifdef CONFIG_ZRAM_MEMORY_TRACKING zram->table[index].ac_time = 0; #endif + if (zram_test_flag(zram, index, ZRAM_IDLE)) + zram_clear_flag(zram, index, ZRAM_IDLE); + if (zram_test_flag(zram, index, ZRAM_HUGE)) { zram_clear_flag(zram, index, ZRAM_HUGE); atomic64_dec(&zram->stats.huge_pages); @@ -1589,6 +1638,7 @@ static DEVICE_ATTR_RO(initstate); static DEVICE_ATTR_WO(reset); static DEVICE_ATTR_WO(mem_limit); static DEVICE_ATTR_WO(mem_used_max); +static DEVICE_ATTR_WO(idle); static DEVICE_ATTR_RW(max_comp_streams); static DEVICE_ATTR_RW(comp_algorithm); #ifdef CONFIG_ZRAM_WRITEBACK @@ -1602,6 +1652,7 @@ static struct attribute *zram_disk_attrs[] = { &dev_attr_compact.attr, &dev_attr_mem_limit.attr, &dev_attr_mem_used_max.attr, + &dev_attr_idle.attr, &dev_attr_max_comp_streams.attr, &dev_attr_comp_algorithm.attr, #ifdef CONFIG_ZRAM_WRITEBACK diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 4f83f1f14b0a..a84611b97867 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -48,6 +48,7 @@ enum zram_pageflags { ZRAM_SAME, /* Page consists the same element */ ZRAM_WB, /* page is stored on backing_device */ ZRAM_HUGE, /* Incompressible page */ + ZRAM_IDLE, /* not accessed page since last idle marking */ __NR_ZRAM_PAGEFLAGS, }; -- cgit v1.2.3 From a939888ec38bf1f33e4a903056677e92a4844244 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 28 Dec 2018 00:36:47 -0800 Subject: zram: support idle/huge page writeback Add a new feature "zram idle/huge page writeback". In the zram-swap use case, zram usually has many idle/huge swap pages. It's pointless to keep them in memory (ie, zram). To solve this problem, this feature introduces idle/huge page writeback to the backing device so the goal is to save more memory space on embedded systems. Normal sequence to use idle/huge page writeback feature is as follows, while (1) { # mark allocated zram slot to idle echo all > /sys/block/zram0/idle # leave system working for several hours # Unless there is no access for some blocks on zram, # they are still IDLE marked pages. echo "idle" > /sys/block/zram0/writeback or/and echo "huge" > /sys/block/zram0/writeback # write the IDLE or/and huge marked slot into backing device # and free the memory. } Per the discussion at https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u, This patch removes direct incommpressibe page writeback feature (d2afd25114f4 ("zram: write incompressible pages to backing device")). Below concerns from Sergey: == &< == "IDLE writeback" is superior to "incompressible writeback". "incompressible writeback" is completely unpredictable and uncontrollable; it depens on data patterns and compression algorithms. While "IDLE writeback" is predictable. I even suspect, that, *ideally*, we can remove "incompressible writeback". "IDLE pages" is a super set which also includes "incompressible" pages. So, technically, we still can do "incompressible writeback" from "IDLE writeback" path; but a much more reasonable one, based on a page idling period. I understand that you want to keep "direct incompressible writeback" around. ZRAM is especially popular on devices which do suffer from flash wearout, so I can see "incompressible writeback" path becoming a dead code, long term. == &< == Below concerns from Minchan: == &< == My concern is if we enable CONFIG_ZRAM_WRITEBACK in this implementation, both hugepage/idlepage writeck will turn on. However someuser want to enable only idlepage writeback so we need to introduce turn on/off knob for hugepage or new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecase. I don't want to make it complicated *if possible*. Long term, I imagine we need to make VM aware of new swap hierarchy a little bit different with as-is. For example, first high priority swap can return -EIO or -ENOCOMP, swap try to fallback to next lower priority swap device. With that, hugepage writeback will work tranparently. So we could regard it as regression because incompressible pages doesn't go to backing storage automatically. Instead, user should do it via "echo huge" > /sys/block/zram/writeback" manually. == &< == Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org Signed-off-by: Minchan Kim Reviewed-by: Joey Pabalinas Reviewed-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 7 + Documentation/blockdev/zram.txt | 28 +++- drivers/block/zram/Kconfig | 5 +- drivers/block/zram/zram_drv.c | 247 ++++++++++++++++++++--------- drivers/block/zram/zram_drv.h | 1 + 5 files changed, 209 insertions(+), 79 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 04c9a5980bc7..d1f80b077885 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -106,3 +106,10 @@ Description: idle file is write-only and mark zram slot as idle. If system has mounted debugfs, user can see which slots are idle via /sys/kernel/debug/zram/zram/block_state + +What: /sys/block/zram/writeback +Date: November 2018 +Contact: Minchan Kim +Description: + The writeback file is write-only and trigger idle and/or + huge page writeback to backing device. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index f3bcd716d8a9..806cdaabac83 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -238,11 +238,31 @@ line of text and contains the following stats separated by whitespace: = writeback -With incompressible pages, there is no memory saving with zram. -Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page +With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page to backing storage rather than keeping it in memory. -User should set up backing device via /sys/block/zramX/backing_dev -before disksize setting. +To use the feature, admin should set up backing device via + + "echo /dev/sda5 > /sys/block/zramX/backing_dev" + +before disksize setting. It supports only partition at this moment. +If admin want to use incompressible page writeback, they could do via + + "echo huge > /sys/block/zramX/write" + +To use idle page writeback, first, user need to declare zram pages +as idle. + + "echo all > /sys/block/zramX/idle" + +From now on, any pages on zram are idle pages. The idle mark +will be removed until someone request access of the block. +IOW, unless there is access request, those pages are still idle pages. + +Admin can request writeback of those idle pages at right timing via + + "echo idle > /sys/block/zramX/writeback" + +With the command, zram writeback idle pages from memory to the storage. = memory tracking diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index fcd055457364..1ffc64770643 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -15,7 +15,7 @@ config ZRAM See Documentation/blockdev/zram.txt for more information. config ZRAM_WRITEBACK - bool "Write back incompressible page to backing device" + bool "Write back incompressible or idle page to backing device" depends on ZRAM help With incompressible page, there is no memory saving to keep it @@ -23,6 +23,9 @@ config ZRAM_WRITEBACK For this feature, admin should set up backing device via /sys/block/zramX/backing_dev. + With /sys/block/zramX/{idle,writeback}, application could ask + idle page's writeback to the backing device to save in memory. + See Documentation/blockdev/zram.txt for more information. config ZRAM_MEMORY_TRACKING diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 180613b478a6..6b5a886c8f32 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -52,6 +52,9 @@ static unsigned int num_devices = 1; static size_t huge_class_size; static void zram_free_page(struct zram *zram, size_t index); +static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec, + u32 index, int offset, struct bio *bio); + static int zram_slot_trylock(struct zram *zram, u32 index) { @@ -73,13 +76,6 @@ static inline bool init_done(struct zram *zram) return zram->disksize; } -static inline bool zram_allocated(struct zram *zram, u32 index) -{ - - return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) || - zram->table[index].handle; -} - static inline struct zram *dev_to_zram(struct device *dev) { return (struct zram *)dev_to_disk(dev)->private_data; @@ -138,6 +134,13 @@ static void zram_set_obj_size(struct zram *zram, zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size; } +static inline bool zram_allocated(struct zram *zram, u32 index) +{ + return zram_get_obj_size(zram, index) || + zram_test_flag(zram, index, ZRAM_SAME) || + zram_test_flag(zram, index, ZRAM_WB); +} + #if PAGE_SIZE != 4096 static inline bool is_partial_io(struct bio_vec *bvec) { @@ -308,10 +311,14 @@ static ssize_t idle_store(struct device *dev, } for (index = 0; index < nr_pages; index++) { + /* + * Do not mark ZRAM_UNDER_WB slot as ZRAM_IDLE to close race. + * See the comment in writeback_store. + */ zram_slot_lock(zram, index); - if (!zram_allocated(zram, index)) + if (!zram_allocated(zram, index) || + zram_test_flag(zram, index, ZRAM_UNDER_WB)) goto next; - zram_set_flag(zram, index, ZRAM_IDLE); next: zram_slot_unlock(zram, index); @@ -546,6 +553,158 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec, return 1; } +#define HUGE_WRITEBACK 0x1 +#define IDLE_WRITEBACK 0x2 + +static ssize_t writeback_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t len) +{ + struct zram *zram = dev_to_zram(dev); + unsigned long nr_pages = zram->disksize >> PAGE_SHIFT; + unsigned long index; + struct bio bio; + struct bio_vec bio_vec; + struct page *page; + ssize_t ret, sz; + char mode_buf[8]; + unsigned long mode = -1UL; + unsigned long blk_idx = 0; + + sz = strscpy(mode_buf, buf, sizeof(mode_buf)); + if (sz <= 0) + return -EINVAL; + + /* ignore trailing newline */ + if (mode_buf[sz - 1] == '\n') + mode_buf[sz - 1] = 0x00; + + if (!strcmp(mode_buf, "idle")) + mode = IDLE_WRITEBACK; + else if (!strcmp(mode_buf, "huge")) + mode = HUGE_WRITEBACK; + + if (mode == -1UL) + return -EINVAL; + + down_read(&zram->init_lock); + if (!init_done(zram)) { + ret = -EINVAL; + goto release_init_lock; + } + + if (!zram->backing_dev) { + ret = -ENODEV; + goto release_init_lock; + } + + page = alloc_page(GFP_KERNEL); + if (!page) { + ret = -ENOMEM; + goto release_init_lock; + } + + for (index = 0; index < nr_pages; index++) { + struct bio_vec bvec; + + bvec.bv_page = page; + bvec.bv_len = PAGE_SIZE; + bvec.bv_offset = 0; + + if (!blk_idx) { + blk_idx = alloc_block_bdev(zram); + if (!blk_idx) { + ret = -ENOSPC; + break; + } + } + + zram_slot_lock(zram, index); + if (!zram_allocated(zram, index)) + goto next; + + if (zram_test_flag(zram, index, ZRAM_WB) || + zram_test_flag(zram, index, ZRAM_SAME) || + zram_test_flag(zram, index, ZRAM_UNDER_WB)) + goto next; + + if ((mode & IDLE_WRITEBACK && + !zram_test_flag(zram, index, ZRAM_IDLE)) && + (mode & HUGE_WRITEBACK && + !zram_test_flag(zram, index, ZRAM_HUGE))) + goto next; + /* + * Clearing ZRAM_UNDER_WB is duty of caller. + * IOW, zram_free_page never clear it. + */ + zram_set_flag(zram, index, ZRAM_UNDER_WB); + /* Need for hugepage writeback racing */ + zram_set_flag(zram, index, ZRAM_IDLE); + zram_slot_unlock(zram, index); + if (zram_bvec_read(zram, &bvec, index, 0, NULL)) { + zram_slot_lock(zram, index); + zram_clear_flag(zram, index, ZRAM_UNDER_WB); + zram_clear_flag(zram, index, ZRAM_IDLE); + zram_slot_unlock(zram, index); + continue; + } + + bio_init(&bio, &bio_vec, 1); + bio_set_dev(&bio, zram->bdev); + bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9); + bio.bi_opf = REQ_OP_WRITE | REQ_SYNC; + + bio_add_page(&bio, bvec.bv_page, bvec.bv_len, + bvec.bv_offset); + /* + * XXX: A single page IO would be inefficient for write + * but it would be not bad as starter. + */ + ret = submit_bio_wait(&bio); + if (ret) { + zram_slot_lock(zram, index); + zram_clear_flag(zram, index, ZRAM_UNDER_WB); + zram_clear_flag(zram, index, ZRAM_IDLE); + zram_slot_unlock(zram, index); + continue; + } + + /* + * We released zram_slot_lock so need to check if the slot was + * changed. If there is freeing for the slot, we can catch it + * easily by zram_allocated. + * A subtle case is the slot is freed/reallocated/marked as + * ZRAM_IDLE again. To close the race, idle_store doesn't + * mark ZRAM_IDLE once it found the slot was ZRAM_UNDER_WB. + * Thus, we could close the race by checking ZRAM_IDLE bit. + */ + zram_slot_lock(zram, index); + if (!zram_allocated(zram, index) || + !zram_test_flag(zram, index, ZRAM_IDLE)) { + zram_clear_flag(zram, index, ZRAM_UNDER_WB); + zram_clear_flag(zram, index, ZRAM_IDLE); + goto next; + } + + zram_free_page(zram, index); + zram_clear_flag(zram, index, ZRAM_UNDER_WB); + zram_set_flag(zram, index, ZRAM_WB); + zram_set_element(zram, index, blk_idx); + blk_idx = 0; + atomic64_inc(&zram->stats.pages_stored); +next: + zram_slot_unlock(zram, index); + } + + if (blk_idx) + free_block_bdev(zram, blk_idx); + ret = len; + __free_page(page); +release_init_lock: + up_read(&zram->init_lock); + + return ret; +} + struct zram_work { struct work_struct work; struct zram *zram; @@ -603,57 +762,8 @@ static int read_from_bdev(struct zram *zram, struct bio_vec *bvec, else return read_from_bdev_async(zram, bvec, entry, parent); } - -static int write_to_bdev(struct zram *zram, struct bio_vec *bvec, - u32 index, struct bio *parent, - unsigned long *pentry) -{ - struct bio *bio; - unsigned long entry; - - bio = bio_alloc(GFP_ATOMIC, 1); - if (!bio) - return -ENOMEM; - - entry = alloc_block_bdev(zram); - if (!entry) { - bio_put(bio); - return -ENOSPC; - } - - bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9); - bio_set_dev(bio, zram->bdev); - if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len, - bvec->bv_offset)) { - bio_put(bio); - free_block_bdev(zram, entry); - return -EIO; - } - - if (!parent) { - bio->bi_opf = REQ_OP_WRITE | REQ_SYNC; - bio->bi_end_io = zram_page_end_io; - } else { - bio->bi_opf = parent->bi_opf; - bio_chain(bio, parent); - } - - submit_bio(bio); - *pentry = entry; - - return 0; -} - #else static inline void reset_bdev(struct zram *zram) {}; -static int write_to_bdev(struct zram *zram, struct bio_vec *bvec, - u32 index, struct bio *parent, - unsigned long *pentry) - -{ - return -EIO; -} - static int read_from_bdev(struct zram *zram, struct bio_vec *bvec, unsigned long entry, struct bio *parent, bool sync) { @@ -1006,7 +1116,8 @@ out: atomic64_dec(&zram->stats.pages_stored); zram_set_handle(zram, index, 0); zram_set_obj_size(zram, index, 0); - WARN_ON_ONCE(zram->table[index].flags & ~(1UL << ZRAM_LOCK)); + WARN_ON_ONCE(zram->table[index].flags & + ~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB)); } static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index, @@ -1115,7 +1226,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, struct page *page = bvec->bv_page; unsigned long element = 0; enum zram_pageflags flags = 0; - bool allow_wb = true; mem = kmap_atomic(page); if (page_same_filled(mem, &element)) { @@ -1140,21 +1250,8 @@ compress_again: return ret; } - if (unlikely(comp_len >= huge_class_size)) { + if (comp_len >= huge_class_size) comp_len = PAGE_SIZE; - if (zram->backing_dev && allow_wb) { - zcomp_stream_put(zram->comp); - ret = write_to_bdev(zram, bvec, index, bio, &element); - if (!ret) { - flags = ZRAM_WB; - ret = 1; - goto out; - } - allow_wb = false; - goto compress_again; - } - } - /* * handle allocation has 2 paths: * a) fast path is executed with preemption disabled (for @@ -1643,6 +1740,7 @@ static DEVICE_ATTR_RW(max_comp_streams); static DEVICE_ATTR_RW(comp_algorithm); #ifdef CONFIG_ZRAM_WRITEBACK static DEVICE_ATTR_RW(backing_dev); +static DEVICE_ATTR_WO(writeback); #endif static struct attribute *zram_disk_attrs[] = { @@ -1657,6 +1755,7 @@ static struct attribute *zram_disk_attrs[] = { &dev_attr_comp_algorithm.attr, #ifdef CONFIG_ZRAM_WRITEBACK &dev_attr_backing_dev.attr, + &dev_attr_writeback.attr, #endif &dev_attr_io_stat.attr, &dev_attr_mm_stat.attr, diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index a84611b97867..1ad74f030b6d 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -47,6 +47,7 @@ enum zram_pageflags { ZRAM_LOCK = ZRAM_FLAG_SHIFT, ZRAM_SAME, /* Page consists the same element */ ZRAM_WB, /* page is stored on backing_device */ + ZRAM_UNDER_WB, /* page is under writeback */ ZRAM_HUGE, /* Incompressible page */ ZRAM_IDLE, /* not accessed page since last idle marking */ -- cgit v1.2.3 From 23eddf39b2c28c05cb8f8203d38e61807d701b38 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 28 Dec 2018 00:36:51 -0800 Subject: zram: add bd_stat statistics bd_stat represents things that happened in the backing device. Currently it supports bd_counts, bd_reads and bd_writes which are helpful to understand wearout of flash and memory saving. [minchan@kernel.org: v4] Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Cc: Joey Pabalinas Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 8 ++++++++ Documentation/blockdev/zram.txt | 11 +++++++++++ drivers/block/zram/zram_drv.c | 29 +++++++++++++++++++++++++++++ drivers/block/zram/zram_drv.h | 5 +++++ 4 files changed, 53 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index d1f80b077885..65fc33b2f53b 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -113,3 +113,11 @@ Contact: Minchan Kim Description: The writeback file is write-only and trigger idle and/or huge page writeback to backing device. + +What: /sys/block/zram/bd_stat +Date: November 2018 +Contact: Minchan Kim +Description: + The bd_stat file is read-only and represents backing device's + statistics (bd_count, bd_reads, bd_writes) in a format + similar to block layer statistics file format. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 806cdaabac83..906df97527a7 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -221,6 +221,17 @@ line of text and contains the following stats separated by whitespace: pages_compacted the number of pages freed during compaction huge_pages the number of incompressible pages +File /sys/block/zram/bd_stat + +The stat file represents device's backing device statistics. It consists of +a single line of text and contains the following stats separated by whitespace: + bd_count size of data written in backing device. + Unit: 4K bytes + bd_reads the number of reads from backing device + Unit: 4K bytes + bd_writes the number of writes to backing device + Unit: 4K bytes + 9) Deactivate: swapoff /dev/zram0 umount /dev/zram1 diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 6b5a886c8f32..f1832fa3ba41 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -502,6 +502,7 @@ retry: if (test_and_set_bit(blk_idx, zram->bitmap)) goto retry; + atomic64_inc(&zram->stats.bd_count); return blk_idx; } @@ -511,6 +512,7 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx) was_set = test_and_clear_bit(blk_idx, zram->bitmap); WARN_ON_ONCE(!was_set); + atomic64_dec(&zram->stats.bd_count); } static void zram_page_end_io(struct bio *bio) @@ -668,6 +670,7 @@ static ssize_t writeback_store(struct device *dev, continue; } + atomic64_inc(&zram->stats.bd_writes); /* * We released zram_slot_lock so need to check if the slot was * changed. If there is freeing for the slot, we can catch it @@ -757,6 +760,7 @@ static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec, static int read_from_bdev(struct zram *zram, struct bio_vec *bvec, unsigned long entry, struct bio *parent, bool sync) { + atomic64_inc(&zram->stats.bd_reads); if (sync) return read_from_bdev_sync(zram, bvec, entry, parent); else @@ -1013,6 +1017,25 @@ static ssize_t mm_stat_show(struct device *dev, return ret; } +#ifdef CONFIG_ZRAM_WRITEBACK +static ssize_t bd_stat_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct zram *zram = dev_to_zram(dev); + ssize_t ret; + + down_read(&zram->init_lock); + ret = scnprintf(buf, PAGE_SIZE, + "%8llu %8llu %8llu\n", + (u64)atomic64_read(&zram->stats.bd_count) * (PAGE_SHIFT - 12), + (u64)atomic64_read(&zram->stats.bd_reads) * (PAGE_SHIFT - 12), + (u64)atomic64_read(&zram->stats.bd_writes) * (PAGE_SHIFT - 12)); + up_read(&zram->init_lock); + + return ret; +} +#endif + static ssize_t debug_stat_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -1033,6 +1056,9 @@ static ssize_t debug_stat_show(struct device *dev, static DEVICE_ATTR_RO(io_stat); static DEVICE_ATTR_RO(mm_stat); +#ifdef CONFIG_ZRAM_WRITEBACK +static DEVICE_ATTR_RO(bd_stat); +#endif static DEVICE_ATTR_RO(debug_stat); static void zram_meta_free(struct zram *zram, u64 disksize) @@ -1759,6 +1785,9 @@ static struct attribute *zram_disk_attrs[] = { #endif &dev_attr_io_stat.attr, &dev_attr_mm_stat.attr, +#ifdef CONFIG_ZRAM_WRITEBACK + &dev_attr_bd_stat.attr, +#endif &dev_attr_debug_stat.attr, NULL, }; diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 1ad74f030b6d..bc477803530d 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -82,6 +82,11 @@ struct zram_stats { atomic_long_t max_used_pages; /* no. of maximum pages stored */ atomic64_t writestall; /* no. of write slow paths */ atomic64_t miss_free; /* no. of missed free */ +#ifdef CONFIG_ZRAM_WRITEBACK + atomic64_t bd_count; /* no. of pages in backing device */ + atomic64_t bd_reads; /* no. of reads from backing device */ + atomic64_t bd_writes; /* no. of writes from backing device */ +#endif }; struct zram { -- cgit v1.2.3 From bb416d18b850faaa44bd3bb67c9728922c3cce98 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 28 Dec 2018 00:36:54 -0800 Subject: zram: writeback throttle If there are lots of write IO with flash device, it could have a wearout problem of storage. To overcome the problem, admin needs to design write limitation to guarantee flash health for entire product life. This patch creates a new knob "writeback_limit" for zram. writeback_limit's default value is 0 so that it doesn't limit any writeback. If admin want to measure writeback count in a certain period, he could know it via /sys/block/zram0/bd_stat's 3rd column. If admin want to limit writeback as per-day 400M, he could do it like below. MB_SHIFT=20 4K_SHIFT=12 echo $((400<>4K_SHIFT)) > \ /sys/block/zram0/writeback_limit. If admin want to allow further write again, he could do it like below echo 0 > /sys/block/zram0/writeback_limit If admin want to see remaining writeback budget, cat /sys/block/zram0/writeback_limit The writeback_limit count will reset whenever you reset zram (e.g., system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. [minchan@kernel.org: v4] Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Cc: Joey Pabalinas Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 9 ++++++ Documentation/blockdev/zram.txt | 31 ++++++++++++++++++ drivers/block/zram/zram_drv.c | 52 ++++++++++++++++++++++++++++-- drivers/block/zram/zram_drv.h | 2 ++ 4 files changed, 91 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 65fc33b2f53b..9d2339a485c8 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -121,3 +121,12 @@ Description: The bd_stat file is read-only and represents backing device's statistics (bd_count, bd_reads, bd_writes) in a format similar to block layer statistics file format. + +What: /sys/block/zram/writeback_limit +Date: November 2018 +Contact: Minchan Kim +Description: + The writeback_limit file is read-write and specifies the maximum + amount of writeback ZRAM can do. The limit could be changed + in run time and "0" means disable the limit. + No limit is the initial state. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 906df97527a7..436c5e98e1b6 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -164,6 +164,8 @@ reset WO trigger device reset mem_used_max WO reset the `mem_used_max' counter (see later) mem_limit WO specifies the maximum amount of memory ZRAM can use to store the compressed data +writeback_limit WO specifies the maximum amount of write IO zram can + write out to backing device as 4KB unit max_comp_streams RW the number of possible concurrent compress operations comp_algorithm RW show and change the compression algorithm compact WO trigger memory compaction @@ -275,6 +277,35 @@ Admin can request writeback of those idle pages at right timing via With the command, zram writeback idle pages from memory to the storage. +If there are lots of write IO with flash device, potentially, it has +flash wearout problem so that admin needs to design write limitation +to guarantee storage health for entire product life. +To overcome the concern, zram supports "writeback_limit". +The "writeback_limit"'s default value is 0 so that it doesn't limit +any writeback. If admin want to measure writeback count in a certain +period, he could know it via /sys/block/zram0/bd_stat's 3rd column. + +If admin want to limit writeback as per-day 400M, he could do it +like below. + + MB_SHIFT=20 + 4K_SHIFT=12 + echo $((400<>4K_SHIFT)) > \ + /sys/block/zram0/writeback_limit. + +If admin want to allow further write again, he could do it like below + + echo 0 > /sys/block/zram0/writeback_limit + +If admin want to see remaining writeback budget since he set, + + cat /sys/block/zram0/writeback_limit + +The writeback_limit count will reset whenever you reset zram(e.g., +system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of +writeback happened until you reset the zram to allocate extra writeback +budget in next setting is user's job. + = memory tracking With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index f1832fa3ba41..33c5cc879f24 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -330,6 +330,39 @@ next: } #ifdef CONFIG_ZRAM_WRITEBACK +static ssize_t writeback_limit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t len) +{ + struct zram *zram = dev_to_zram(dev); + u64 val; + ssize_t ret = -EINVAL; + + if (kstrtoull(buf, 10, &val)) + return ret; + + down_read(&zram->init_lock); + atomic64_set(&zram->stats.bd_wb_limit, val); + if (val == 0) + zram->stop_writeback = false; + up_read(&zram->init_lock); + ret = len; + + return ret; +} + +static ssize_t writeback_limit_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + u64 val; + struct zram *zram = dev_to_zram(dev); + + down_read(&zram->init_lock); + val = atomic64_read(&zram->stats.bd_wb_limit); + up_read(&zram->init_lock); + + return scnprintf(buf, PAGE_SIZE, "%llu\n", val); +} + static void reset_bdev(struct zram *zram) { struct block_device *bdev; @@ -612,6 +645,11 @@ static ssize_t writeback_store(struct device *dev, bvec.bv_len = PAGE_SIZE; bvec.bv_offset = 0; + if (zram->stop_writeback) { + ret = -EIO; + break; + } + if (!blk_idx) { blk_idx = alloc_block_bdev(zram); if (!blk_idx) { @@ -694,6 +732,11 @@ static ssize_t writeback_store(struct device *dev, zram_set_element(zram, index, blk_idx); blk_idx = 0; atomic64_inc(&zram->stats.pages_stored); + if (atomic64_add_unless(&zram->stats.bd_wb_limit, + -1 << (PAGE_SHIFT - 12), 0)) { + if (atomic64_read(&zram->stats.bd_wb_limit) == 0) + zram->stop_writeback = true; + } next: zram_slot_unlock(zram, index); } @@ -1018,6 +1061,7 @@ static ssize_t mm_stat_show(struct device *dev, } #ifdef CONFIG_ZRAM_WRITEBACK +#define FOUR_K(x) ((x) * (1 << (PAGE_SHIFT - 12))) static ssize_t bd_stat_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -1027,9 +1071,9 @@ static ssize_t bd_stat_show(struct device *dev, down_read(&zram->init_lock); ret = scnprintf(buf, PAGE_SIZE, "%8llu %8llu %8llu\n", - (u64)atomic64_read(&zram->stats.bd_count) * (PAGE_SHIFT - 12), - (u64)atomic64_read(&zram->stats.bd_reads) * (PAGE_SHIFT - 12), - (u64)atomic64_read(&zram->stats.bd_writes) * (PAGE_SHIFT - 12)); + FOUR_K((u64)atomic64_read(&zram->stats.bd_count)), + FOUR_K((u64)atomic64_read(&zram->stats.bd_reads)), + FOUR_K((u64)atomic64_read(&zram->stats.bd_writes))); up_read(&zram->init_lock); return ret; @@ -1767,6 +1811,7 @@ static DEVICE_ATTR_RW(comp_algorithm); #ifdef CONFIG_ZRAM_WRITEBACK static DEVICE_ATTR_RW(backing_dev); static DEVICE_ATTR_WO(writeback); +static DEVICE_ATTR_RW(writeback_limit); #endif static struct attribute *zram_disk_attrs[] = { @@ -1782,6 +1827,7 @@ static struct attribute *zram_disk_attrs[] = { #ifdef CONFIG_ZRAM_WRITEBACK &dev_attr_backing_dev.attr, &dev_attr_writeback.attr, + &dev_attr_writeback_limit.attr, #endif &dev_attr_io_stat.attr, &dev_attr_mm_stat.attr, diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index bc477803530d..4bd3afd15e83 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -86,6 +86,7 @@ struct zram_stats { atomic64_t bd_count; /* no. of pages in backing device */ atomic64_t bd_reads; /* no. of reads from backing device */ atomic64_t bd_writes; /* no. of writes from backing device */ + atomic64_t bd_wb_limit; /* writeback limit of backing device */ #endif }; @@ -113,6 +114,7 @@ struct zram { */ bool claim; /* Protected by bdev->bd_mutex */ struct file *backing_dev; + bool stop_writeback; #ifdef CONFIG_ZRAM_WRITEBACK struct block_device *bdev; unsigned int old_block_size; -- cgit v1.2.3 From 7550c6079846a24f30d15ac75a941c8515dbedfb Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 28 Dec 2018 00:38:17 -0800 Subject: mm, proc: be more verbose about unstable VMA flags in /proc//smaps Patch series "THP eligibility reporting via proc". This series of three patches aims at making THP eligibility reporting much more robust and long term sustainable. The trigger for the change is a regression report [2] and the long follow up discussion. In short the specific application didn't have good API to query whether a particular mapping can be backed by THP so it has used VMA flags to workaround that. These flags represent a deep internal state of VMAs and as such they should be used by userspace with a great deal of caution. A similar has happened for [3] when users complained that VM_MIXEDMAP is no longer set on DAX mappings. Again a lack of a proper API led to an abuse. The first patch in the series tries to emphasise that that the semantic of flags might change and any application consuming those should be really careful. The remaining two patches provide a more suitable interface to address [2] and provide a consistent API to query the THP status both for each VMA and process wide as well. [1] http://lkml.kernel.org/r/20181120103515.25280-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com [3] http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz This patch (of 3): Even though vma flags exported via /proc//smaps are explicitly documented to be not guaranteed for future compatibility the warning doesn't go far enough because it doesn't mention semantic changes to those flags. And they are important as well because these flags are a deep implementation internal to the MM code and the semantic might change at any time. Let's consider two recent examples: http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the : mean time certain customer of ours started poking into /proc//smaps : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA : flags, the application just fails to start complaining that DAX support is : missing in the kernel. http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") : introduced a regression in that userspace cannot always determine the set : of vmas where thp is ineligible. : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps : to determine if a vma is eligible to be backed by hugepages. : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to : be disabled and emit "nh" as a flag for the corresponding vmas as part of : /proc/pid/smaps. After the commit, thp is disabled by means of an mm : flag and "nh" is not emitted. : This causes smaps parsing libraries to assume a vma is eligible for thp : and ends up puzzling the user on why its memory is not backed by thp. In both cases userspace was relying on a semantic of a specific VMA flag. The primary reason why that happened is a lack of a proper interface. While this has been worked on and it will be fixed properly, it seems that our wording could see some refinement and be more vocal about semantic aspect of these flags as well. Link: http://lkml.kernel.org/r/20181211143641.3503-2-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Jan Kara Acked-by: Dan Williams Acked-by: David Rientjes Acked-by: Mike Rapoport Acked-by: Vlastimil Babka Cc: Dan Williams Cc: David Rientjes Cc: Paul Oppenheimer Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 12a5e6e693b6..2a4e63f5122c 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -496,7 +496,9 @@ manner. The codes are the following: Note that there is no guarantee that every flag and associated mnemonic will be present in all further kernel releases. Things get changed, the flags may -be vanished or the reverse -- new added. +be vanished or the reverse -- new added. Interpretation of their meaning +might change in future as well. So each consumer of these flags has to +follow each specific kernel version for the exact semantic. This file is only present if the CONFIG_MMU kernel configuration option is enabled. -- cgit v1.2.3 From 7635d9cbe8327e131a1d3d8517dc186c2796ce2e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 28 Dec 2018 00:38:21 -0800 Subject: mm, thp, proc: report THP eligibility for each vma Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc//smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Dan Williams Cc: David Rientjes Cc: Jan Kara Cc: Mike Rapoport Cc: Paul Oppenheimer Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 3 +++ fs/proc/task_mmu.c | 2 ++ include/linux/huge_mm.h | 13 ++++++++++++- mm/huge_memory.c | 12 +++++++++++- mm/memory.c | 4 ++-- 5 files changed, 30 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2a4e63f5122c..cd465304bec4 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -425,6 +425,7 @@ SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB +THPeligible: 0 VmFlags: rd ex mr mw me dw the first of these lines shows the same information as is displayed for the @@ -462,6 +463,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. +"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if +true, 0 otherwise. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter encoded diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index b3ddceb003bc..f0ec9edab2f3 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -790,6 +790,8 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss); + seq_printf(m, "THPeligible: %d\n", transparent_hugepage_enabled(vma)); + if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); show_smap_vma_flags(m, vma); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 4663ee96cf59..381e872bfde0 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma); extern unsigned long transparent_hugepage_flags; -static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma) +/* + * to be used on vmas which are known to support THP. + * Use transparent_hugepage_enabled otherwise + */ +static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) { if (vma->vm_flags & VM_NOHUGEPAGE) return false; @@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma) return false; } +bool transparent_hugepage_enabled(struct vm_area_struct *vma); + #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ (1<ptl); alloc: - if (transparent_hugepage_enabled(vma) && + if (__transparent_hugepage_enabled(vma) && !transparent_hugepage_debug_cow()) { huge_gfp = alloc_hugepage_direct_gfpmask(vma); new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER); diff --git a/mm/memory.c b/mm/memory.c index b7a8bfe5f5ec..2dd2f9ab57f4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3831,7 +3831,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, vmf.pud = pud_alloc(mm, p4d, address); if (!vmf.pud) return VM_FAULT_OOM; - if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) { + if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -3857,7 +3857,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, vmf.pmd = pmd_alloc(mm, vmf.pud, address); if (!vmf.pmd) return VM_FAULT_OOM; - if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) { + if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; -- cgit v1.2.3 From a1400af755631f5267f7cc3d0fda5ba72f58d7d3 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 28 Dec 2018 00:38:25 -0800 Subject: mm, proc: report PR_SET_THP_DISABLE in proc David Rientjes has reported that commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") has changed the way how we report THPable VMAs to the userspace. Their monitoring tool is triggering false alarms on PR_SET_THP_DISABLE tasks because it considers an insufficient THP usage as a memory fragmentation resp. memory pressure issue. Before the said commit each newly created VMA inherited VM_NOHUGEPAGE flag and that got exposed to the userspace via /proc//smaps file. This implementation had its downsides as explained in the commit message but it is true that the userspace doesn't have any means to query for the process wide THP enabled/disabled status. PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to export in the process wide context rather than per-vma. Introduce a new field to /proc//status which export this status. If PR_SET_THP_DISABLE is used then it reports false same as when the THP is not compiled in. It doesn't consider the global THP status because we already export that information via sysfs Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Reported-by: David Rientjes Cc: Dan Williams Cc: Jan Kara Cc: Mike Rapoport Cc: Paul Oppenheimer Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 3 +++ fs/proc/array.c | 10 ++++++++++ 2 files changed, 13 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index cd465304bec4..b24fd9bccc99 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -182,6 +182,7 @@ read the file /proc/PID/status: VmSwap: 0 kB HugetlbPages: 0 kB CoreDumping: 0 + THP_enabled: 1 Threads: 1 SigQ: 0/28578 SigPnd: 0000000000000000 @@ -256,6 +257,8 @@ Table 1-2: Contents of the status files (as of 4.8) HugetlbPages size of hugetlb memory portions CoreDumping process's memory is currently being dumped (killing the process may lead to a corrupted core) + THP_enabled process is allowed to use THP (returns 0 when + PR_SET_THP_DISABLE is set on the process Threads number of threads SigQ number of signals queued/max. number for queue SigPnd bitmap of pending signals for the thread diff --git a/fs/proc/array.c b/fs/proc/array.c index 0ceb3b6b37e7..9d428d5a0ac8 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -392,6 +392,15 @@ static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm) seq_putc(m, '\n'); } +static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm) +{ + bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE); + + if (thp_enabled) + thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags); + seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); +} + int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { @@ -406,6 +415,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, if (mm) { task_mem(m, mm); task_core_dumping(m, mm); + task_thp_status(m, mm); mmput(mm); } task_sig(m, task); -- cgit v1.2.3