Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/ksm.txt          18
-rw-r--r--  Documentation/vm/transhuge.txt    10
-rw-r--r--  Documentation/vm/userfaultfd.txt  87
3 files changed, 112 insertions, 3 deletions
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index f34a8ee6f860..6b0ca7feb135 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -38,6 +38,10 @@ the range for whenever the KSM daemon is started; even if the range
 cannot contain any pages which KSM could actually merge; even if
 MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.

+If a region of memory must be split into at least one new MADV_MERGEABLE
+or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
+will exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
+
 Like other madvise calls, they are intended for use on mapped areas of
 the user address space: they will report ENOMEM if the specified range
 includes unmapped gaps (though working on the intervening mapped areas),
@@ -80,6 +84,20 @@ run - set 0 to stop ksmd from running but keep merged pages,
 Default: 0 (must be changed to 1 to activate KSM,
 except if CONFIG_SYSFS is disabled)

+use_zero_pages - specifies whether empty pages (i.e. allocated pages
+ that only contain zeroes) should be treated specially.
+ When set to 1, empty pages are merged with the kernel
+ zero page(s) instead of with each other as it would
+ happen normally. This can improve the performance on
+ architectures with coloured zero pages, depending on
+ the workload. Care should be taken when enabling this
+ setting, as it can potentially degrade the performance
+ of KSM for some workloads, for example if the checksums
+ of pages candidate for merging match the checksum of
+ an empty page. This setting can be changed at any time,
+ it is only effective for pages merged after the change.
+ Default: 0 (normal KSM behaviour as in earlier releases)
+
 The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:

 pages_shared - how many shared pages are being used
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index c4171e4519c2..cd28d5ee5273 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -110,6 +110,7 @@ MADV_HUGEPAGE region.

 echo always >/sys/kernel/mm/transparent_hugepage/defrag
 echo defer >/sys/kernel/mm/transparent_hugepage/defrag
+echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
 echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
 echo never >/sys/kernel/mm/transparent_hugepage/defrag

@@ -120,10 +121,15 @@ that benefit heavily from THP use and are willing to delay the VM start
 to utilise them.

 "defer" means that an application will wake kswapd in the background
-to reclaim pages and wake kcompact to compact memory so that THP is
+to reclaim pages and wake kcompactd to compact memory so that THP is
 available in the near future. It's the responsibility of khugepaged to
 then install the THP pages later.

+"defer+madvise" will enter direct reclaim and compaction like "always", but
+only for regions that have used madvise(MADV_HUGEPAGE); all other regions
+will wake kswapd in the background to reclaim pages and wake kcompactd to
+compact memory so that THP is available in the near future.
+
 "madvise" will enter direct reclaim like "always" but only for regions
 that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

@@ -296,7 +302,7 @@ thp_split_page is incremented every time a huge page is split into base
 reason is that a huge page is old and is being reclaimed.
 This action implies splitting all PMD the page mapped with.

-thp_split_page_failed is is incremented if kernel fails to split huge
+thp_split_page_failed is incremented if kernel fails to split huge
 page. This can happen if the page was pinned by somebody.

 thp_deferred_split_page is incremented when a huge page is put onto split
diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
index 70a3c94d1941..bb2f945f87ab 100644
--- a/Documentation/vm/userfaultfd.txt
+++ b/Documentation/vm/userfaultfd.txt
@@ -54,6 +54,26 @@ uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
 respectively all the available features of the read(2) protocol and
 the generic ioctl available.

+The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
+defines what memory types are supported by the userfaultfd and what
+events, except page fault notifications, may be generated.
+
+If the kernel supports registering userfaultfd ranges on hugetlbfs
+virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
+uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
+set if the kernel supports registering userfaultfd ranges on shared
+memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
+MAP_SHARED, memfd_create, etc).
+
+The userland application that wants to use userfaultfd with hugetlbfs
+or shared memory need to set the corresponding flag in
+uffdio_api.features to enable those features.
+
+If the userland desires to receive notifications for events other than
+page faults, it has to verify that uffdio_api.features has appropriate
+UFFD_FEATURE_EVENT_* bits set. These events are described in more
+detail below in "Non-cooperative userfaultfd" section.
+
 Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
 be invoked (if present in the returned uffdio_api.ioctls bitmask) to
 register a memory range in the userfaultfd by setting the
@@ -129,7 +149,7 @@ migration thread in the QEMU running in the destination node will
 receive the page that triggered the userfault and it'll map it as
 usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
 was spontaneously sent by the source or if it was an urgent page
-requested through an userfault).
+requested through a userfault).

 By the time the userfaults start, the QEMU in the destination node
 doesn't need to keep any per-page state bitmap relative to the live
@@ -142,3 +162,68 @@ course the bitmap is updated accordingly. It's also useful to avoid
 sending the same page twice (in case the userfault is read by the
 postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
 thread).
+
+== Non-cooperative userfaultfd ==
+
+When the userfaultfd is monitored by an external manager, the manager
+must be able to track changes in the process virtual memory
+layout. Userfaultfd can notify the manager about such changes using
+the same read(2) protocol as for the page fault notifications. The
+manager has to explicitly enable these events by setting appropriate
+bits in uffdio_api.features passed to UFFDIO_API ioctl:
+
+UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When
+this feature is enabled, the userfaultfd context of the parent process
+is duplicated into the newly created process. The manager receives
+UFFD_EVENT_FORK with file descriptor of the new userfaultfd context in
+the uffd_msg.fork.
+
+UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap()
+calls. When the non-cooperative process moves a virtual memory area to
+a different location, the manager will receive UFFD_EVENT_REMAP. The
+uffd_msg.remap will contain the old and new addresses of the area and
+its original length.
+
+UFFD_FEATURE_EVENT_REMOVE - enable notifications about
+madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event
+UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The
+uffd_msg.remove will contain start and end addresses of the removed
+area.
+
+UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory
+unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove
+containing start and end addresses of the unmapped area.
+
+Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
+are pretty similar, they quite differ in the action expected from the
+userfaultfd manager. In the former case, the virtual memory is
+removed, but the area is not, the area remains monitored by the
+userfaultfd, and if a page fault occurs in that area it will be
+delivered to the manager. The proper resolution for such page fault is
+to zeromap the faulting address. However, in the latter case, when an
+area is unmapped, either explicitly (with munmap() system call), or
+implicitly (e.g. during mremap()), the area is removed and in turn the
+userfaultfd context for such area disappears too and the manager will
+not get further userland page faults from the removed area. Still, the
+notification is required in order to prevent manager from using
+UFFDIO_COPY on the unmapped area.
+
+Unlike userland page faults which have to be synchronous and require
+explicit or implicit wakeup, all the events are delivered
+asynchronously and the non-cooperative process resumes execution as
+soon as manager executes read(). The userfaultfd manager should
+carefully synchronize calls to UFFDIO_COPY with the events
+processing. To aid the synchronization, the UFFDIO_COPY ioctl will
+return -ENOSPC when the monitored process exits at the time of
+UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
+its virtual memory layout simultaneously with outstanding UFFDIO_COPY
+operation.
+
+The current asynchronous model of the event delivery is optimal for
+single threaded non-cooperative userfaultfd manager implementations. A
+synchronous event delivery model can be added later as a new
+userfaultfd feature to facilitate multithreading enhancements of the
+non cooperative manager, for example to allow UFFDIO_COPY ioctls to
+run in parallel to the event reception. Single threaded
+implementations should continue to use the current async event
+delivery model instead.
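
The feature negotiation described in the userfaultfd.txt hunks above can
be sketched in C roughly as follows. This is an illustrative sketch only,
not part of the patch: the helper names uffd_missing_features(),
uffd_open_with_features() and uffd_event_name(), and the
MANAGER_EVENT_FEATURES macro, are hypothetical. It assumes kernel headers
that already define the UFFD_FEATURE_EVENT_* and UFFD_EVENT_* constants
introduced alongside this documentation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical: events a non-cooperative manager might ask for. */
#define MANAGER_EVENT_FEATURES                                  \
	(UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |   \
	 UFFD_FEATURE_EVENT_REMOVE | UFFD_FEATURE_EVENT_UNMAP)

/* Return the bits of `wanted` that are not set in `available`. */
uint64_t uffd_missing_features(uint64_t available, uint64_t wanted)
{
	return wanted & ~available;
}

/*
 * Open a userfaultfd and pass `wanted` in uffdio_api.features to the
 * UFFDIO_API ioctl. On success, return the descriptor and store the
 * bitmask reported by the kernel in *granted. On failure, return -1
 * with errno set (a kernel may reject unknown feature bits outright).
 */
int uffd_open_with_features(uint64_t wanted, uint64_t *granted)
{
	struct uffdio_api api;
	int fd, saved;

	assert(granted != NULL);

	fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = wanted;
	if (ioctl(fd, UFFDIO_API, &api) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}

	/* Verify the reported bitmask before relying on any event. */
	if (uffd_missing_features(api.features, wanted) != 0) {
		close(fd);
		errno = ENOTSUP;
		return -1;
	}

	*granted = api.features;
	return fd;
}

/* Map the uffd_msg.event field to a short name for logging. */
const char *uffd_event_name(uint8_t event)
{
	switch (event) {
	case UFFD_EVENT_PAGEFAULT: return "pagefault";
	case UFFD_EVENT_FORK:      return "fork";
	case UFFD_EVENT_REMAP:     return "remap";
	case UFFD_EVENT_REMOVE:    return "remove";
	case UFFD_EVENT_UNMAP:     return "unmap";
	default:                   return "unknown";
	}
}
```

A manager would presumably call
uffd_open_with_features(MANAGER_EVENT_FEATURES, &granted) once, register
its ranges with UFFDIO_REGISTER, and then switch on uffd_msg.event for
each message obtained through read(). Creating the descriptor may fail
for unprivileged callers on some systems, so the -1 path should be
treated as an expected outcome rather than a bug.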