From ad56b738c5dd223a2f66685830f82194025a6138 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:47 +0200 Subject: docs/vm: rename documentation files to .rst Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/filesystems/proc.txt | 4 ++-- Documentation/filesystems/tmpfs.txt | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2a84bb334894..2d3984c70feb 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -515,7 +515,7 @@ guarantees: The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). +soft-dirty bit on pte (see Documentation/vm/soft-dirty.rst for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -536,7 +536,7 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. +/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.rst. The /proc/pid/numa_maps is an extension based on maps, showing the memory locality and binding policy, as well as the memory usage (in pages) of diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index a85355cf85f4..627389a34f77 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -105,7 +105,7 @@ policy for the file will revert to "default" policy. NUMA memory allocation policies have optional flags that can be used in conjunction with their modes. These optional flags can be specified when tmpfs is mounted by appending them to the mode before the NodeList. -See Documentation/vm/numa_memory_policy.txt for a list of all available +See Documentation/vm/numa_memory_policy.rst for a list of all available memory allocation policy mode flags and their effect on memory policy. =static is equivalent to MPOL_F_STATIC_NODES -- cgit v1.2.3 From 1ad1335dc58646764eda7bb054b350934a1b23ec Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:49 +0300 Subject: docs/admin-guide/mm: start moving here files from Documentation/vm Several documents in Documentation/vm fit quite well into the "admin/user guide" category. The documents that don't overload the reader with lots of implementation details and provide coherent description of certain feature can be moved to Documentation/admin-guide/mm. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/ABI/stable/sysfs-devices-node | 2 +- .../ABI/testing/sysfs-kernel-mm-hugepages | 2 +- Documentation/admin-guide/mm/hugetlbpage.rst | 381 +++++++++++++++++++++ .../admin-guide/mm/idle_page_tracking.rst | 115 +++++++ Documentation/admin-guide/mm/index.rst | 9 + Documentation/admin-guide/mm/pagemap.rst | 197 +++++++++++ Documentation/admin-guide/mm/soft-dirty.rst | 47 +++ Documentation/admin-guide/mm/userfaultfd.rst | 241 +++++++++++++ Documentation/filesystems/proc.txt | 6 +- Documentation/sysctl/vm.txt | 4 +- Documentation/vm/00-INDEX | 10 - Documentation/vm/hugetlbpage.rst | 381 --------------------- Documentation/vm/hwpoison.rst | 2 +- Documentation/vm/idle_page_tracking.rst | 115 ------- Documentation/vm/index.rst | 5 - Documentation/vm/pagemap.rst | 197 ----------- Documentation/vm/soft-dirty.rst | 47 --- Documentation/vm/userfaultfd.rst | 241 ------------- fs/Kconfig | 2 +- fs/proc/task_mmu.c | 4 +- mm/Kconfig | 5 +- 21 files changed, 1005 insertions(+), 1008 deletions(-) create mode 100644 Documentation/admin-guide/mm/hugetlbpage.rst create mode 100644 Documentation/admin-guide/mm/idle_page_tracking.rst create mode 100644 Documentation/admin-guide/mm/pagemap.rst create mode 100644 Documentation/admin-guide/mm/soft-dirty.rst create mode 100644 Documentation/admin-guide/mm/userfaultfd.rst delete mode 100644 Documentation/vm/hugetlbpage.rst delete mode 100644 Documentation/vm/idle_page_tracking.rst delete mode 100644 Documentation/vm/pagemap.rst delete mode 100644 Documentation/vm/soft-dirty.rst delete mode 100644 Documentation/vm/userfaultfd.rst (limited to 'Documentation/filesystems') diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index b38f4b734567..3e90e1f3bf0a 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -90,4 +90,4 @@ Date: December 2009 Contact: Lee Schermerhorn Description: The node's huge page size control/query attributes. - See Documentation/vm/hugetlbpage.rst \ No newline at end of file + See Documentation/admin-guide/mm/hugetlbpage.rst \ No newline at end of file diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages index 5140b233356c..fdaa2162fae1 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages +++ b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages @@ -12,4 +12,4 @@ Description: free_hugepages surplus_hugepages resv_hugepages - See Documentation/vm/hugetlbpage.rst for details. + See Documentation/admin-guide/mm/hugetlbpage.rst for details. diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst new file mode 100644 index 000000000000..2b374d10284d --- /dev/null +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -0,0 +1,381 @@ +.. _hugetlbpage: + +============= +HugeTLB Pages +============= + +Overview +======== + +The intent of this file is to give a brief summary of hugetlbpage support in +the Linux kernel. This support is built on top of multiple page size support +that is provided by most modern architectures. For example, x86 CPUs normally +support 4K and 2M (1G if architecturally supported) page sizes, ia64 +architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M, +256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical +translations. Typically this is a very scarce resource on processor. +Operating systems try to make best use of limited number of TLB resources. +This optimization is more critical now as bigger and bigger physical memories +(several GBs) are more readily available. + +Users can use the huge page support in Linux kernel by either using the mmap +system call or standard SYSV shared memory system calls (shmget, shmat). + +First the Linux kernel needs to be built with the CONFIG_HUGETLBFS +(present under "File systems") and CONFIG_HUGETLB_PAGE (selected +automatically when CONFIG_HUGETLBFS is selected) configuration +options. + +The ``/proc/meminfo`` file provides information about the total number of +persistent hugetlb pages in the kernel's huge page pool. It also displays +default huge page size and information about the number of free, reserved +and surplus huge pages in the pool of huge pages of default size. +The huge page size is needed for generating the proper alignment and +size of the arguments to system calls that map huge page regions. + +The output of ``cat /proc/meminfo`` will include lines like:: + + HugePages_Total: uuu + HugePages_Free: vvv + HugePages_Rsvd: www + HugePages_Surp: xxx + Hugepagesize: yyy kB + Hugetlb: zzz kB + +where: + +HugePages_Total + is the size of the pool of huge pages. +HugePages_Free + is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd + is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. +HugePages_Surp + is short for "surplus," and is the number of huge pages in + the pool above the value in ``/proc/sys/vm/nr_hugepages``. The + maximum number of surplus huge pages is controlled by + ``/proc/sys/vm/nr_overcommit_hugepages``. +Hugepagesize + is the default hugepage size (in Kb). +Hugetlb + is the total amount of memory (in kB), consumed by huge + pages of all sizes. + If huge pages of different sizes are in use, this number + will exceed HugePages_Total \* Hugepagesize. To get more + detailed information, please, refer to + ``/sys/kernel/mm/hugepages`` (described below). + + +``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" +configured in the kernel. + +``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge +pages in the kernel's huge page pool. "Persistent" huge pages will be +returned to the huge page pool when freed by a task. A user with root +privileges can dynamically allocate more or free some persistent huge pages +by increasing or decreasing the value of ``nr_hugepages``. + +Pages that are used as huge pages are reserved inside the kernel and cannot +be used for other purposes. Huge pages cannot be swapped out under +memory pressure. + +Once a number of huge pages have been pre-allocated to the kernel huge page +pool, a user with appropriate privilege can use either the mmap system call +or shared memory system calls to use the huge pages. See the discussion of +:ref:`Using Huge Pages `, below. + +The administrator can allocate persistent huge pages on the kernel boot +command line by specifying the "hugepages=N" parameter, where 'N' = the +number of huge pages requested. This is the most reliable method of +allocating huge pages as memory has not yet become fragmented. + +Some platforms support multiple huge page sizes. To allocate huge pages +of a specific size, one must precede the huge pages boot command parameters +with a huge page size selection parameter "hugepagesz=". must +be specified in bytes with optional scale suffix [kKmMgG]. The default huge +page size may be selected with the "default_hugepagesz=" boot parameter. + +When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` +indicates the current number of pre-allocated huge pages of the default size. +Thus, one can use the following command to dynamically allocate/deallocate +default sized persistent huge pages:: + + echo 20 > /proc/sys/vm/nr_hugepages + +This command will try to adjust the number of default sized huge pages in the +huge page pool to 20, allocating or freeing huge pages, as required. + +On a NUMA platform, the kernel will attempt to distribute the huge page pool +over all the set of allowed nodes specified by the NUMA memory policy of the +task that modifies ``nr_hugepages``. The default for the allowed nodes--when the +task has default memory policy--is all on-line nodes with memory. Allowed +nodes with insufficient available, contiguous memory for a huge page will be +silently skipped when allocating persistent huge pages. See the +:ref:`discussion below ` +of the interaction of task memory policy, cpusets and per node attributes +with the allocation and freeing of persistent huge pages. + +The success or failure of huge page allocation depends on the amount of +physically contiguous memory that is present in system at the time of the +allocation attempt. If the kernel is unable to allocate huge pages from +some nodes in a NUMA system, it will attempt to make up the difference by +allocating extra pages on other nodes with sufficient available contiguous +memory, if any. + +System administrators may want to put this command in one of the local rc +init files. This will enable the kernel to allocate huge pages early in +the boot process when the possibility of getting physical contiguous pages +is still very high. Administrators can verify the number of huge pages +actually allocated by checking the sysctl or meminfo. To check the per node +distribution of huge pages in a NUMA system, use:: + + cat /sys/devices/system/node/node*/meminfo | fgrep Huge + +``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of +huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are +requested by applications. Writing any non-zero value into this file +indicates that the hugetlb subsystem is allowed to try to obtain that +number of "surplus" huge pages from the kernel's normal page pool, when the +persistent huge page pool is exhausted. As these surplus huge pages become +unused, they are freed back to the kernel's normal page pool. + +When increasing the huge page pool size via ``nr_hugepages``, any existing +surplus pages will first be promoted to persistent huge pages. Then, additional +huge pages will be allocated, if necessary and if possible, to fulfill +the new persistent huge page pool size. + +The administrator may shrink the pool of persistent huge pages for +the default huge page size by setting the ``nr_hugepages`` sysctl to a +smaller value. The kernel will attempt to balance the freeing of huge pages +across all nodes in the memory policy of the task modifying ``nr_hugepages``. +Any free huge pages on the selected nodes will be freed back to the kernel's +normal page pool. + +Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that +it becomes less than the number of huge pages in use will convert the balance +of the in-use huge pages to surplus huge pages. This will occur even if +the number of surplus pages would exceed the overcommit value. As long as +this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is +increased sufficiently, or the surplus huge pages go out of use and are freed-- +no more surplus huge pages will be allowed to be allocated. + +With support for multiple huge page pools at run-time available, much of +the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in +sysfs. +The ``/proc`` interfaces discussed above have been retained for backwards +compatibility. The root huge page control directory in sysfs is:: + + /sys/kernel/mm/hugepages + +For each huge page size supported by the running kernel, a subdirectory +will exist, of the form:: + + hugepages-${size}kB + +Inside each of these directories, the same set of files will exist:: + + nr_hugepages + nr_hugepages_mempolicy + nr_overcommit_hugepages + free_hugepages + resv_hugepages + surplus_hugepages + +which function as described above for the default huge page-sized case. + +.. _mem_policy_and_hp_alloc: + +Interaction of Task Memory Policy with Huge Page Allocation/Freeing +=================================================================== + +Whether huge pages are allocated and freed via the ``/proc`` interface or +the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the +NUMA nodes from which huge pages are allocated or freed are controlled by the +NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy`` +sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy +is ignored. + +The recommended method to allocate or free huge pages to/from the kernel +huge page pool, using the ``nr_hugepages`` example above, is:: + + numactl --interleave echo 20 \ + >/proc/sys/vm/nr_hugepages_mempolicy + +or, more succinctly:: + + numactl -m echo 20 >/proc/sys/vm/nr_hugepages_mempolicy + +This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes +specified in , depending on whether number of persistent huge pages +is initially less than or greater than 20, respectively. No huge pages will be +allocated nor freed on any node not included in the specified . + +When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any +memory policy mode--bind, preferred, local or interleave--may be used. The +resulting effect on persistent huge page allocation is as follows: + +#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst], + persistent huge pages will be distributed across the node or nodes + specified in the mempolicy as if "interleave" had been specified. + However, if a node in the policy does not contain sufficient contiguous + memory for a huge page, the allocation will not "fallback" to the nearest + neighbor node with sufficient contiguous memory. To do this would cause + undesirable imbalance in the distribution of the huge page pool, or + possibly, allocation of persistent huge pages on nodes not allowed by + the task's memory policy. + +#. One or more nodes may be specified with the bind or interleave policy. + If more than one node is specified with the preferred policy, only the + lowest numeric id will be used. Local policy will select the node where + the task is running at the time the nodes_allowed mask is constructed. + For local policy to be deterministic, the task must be bound to a cpu or + cpus in a single node. Otherwise, the task could be migrated to some + other node at any time after launch and the resulting node will be + indeterminate. Thus, local policy is not very useful for this purpose. + Any of the other mempolicy modes may be used to specify a single node. + +#. The nodes allowed mask will be derived from any non-default task mempolicy, + whether this policy was set explicitly by the task itself or one of its + ancestors, such as numactl. This means that if the task is invoked from a + shell with non-default policy, that policy will be used. One can specify a + node list of "all" with numactl --interleave or --membind [-m] to achieve + interleaving over all nodes in the system or cpuset. + +#. Any task mempolicy specified--e.g., using numactl--will be constrained by + the resource limits of any cpuset in which the task runs. Thus, there will + be no way for a task with non-default policy running in a cpuset with a + subset of the system nodes to allocate huge pages outside the cpuset + without first moving to a cpuset that contains all of the desired nodes. + +#. Boot-time huge page allocation attempts to distribute the requested number + of huge pages over all on-lines nodes with memory. + +Per Node Hugepages Attributes +============================= + +A subset of the contents of the root huge page control directory in sysfs, +described above, will be replicated under each the system device of each +NUMA node with memory in:: + + /sys/devices/system/node/node[0-9]*/hugepages/ + +Under this directory, the subdirectory for each supported huge page size +contains the following attribute files:: + + nr_hugepages + free_hugepages + surplus_hugepages + +The free\_' and surplus\_' attribute files are read-only. They return the number +of free and surplus [overcommitted] huge pages, respectively, on the parent +node. + +The ``nr_hugepages`` attribute returns the total number of huge pages on the +specified node. When this attribute is written, the number of persistent huge +pages on the parent node will be adjusted to the specified value, if sufficient +resources exist, regardless of the task's mempolicy or cpuset constraints. + +Note that the number of overcommit and reserve pages remain global quantities, +as we don't know until fault time, when the faulting task's mempolicy is +applied, from which node the huge page allocation will be attempted. + +.. _using_huge_pages: + +Using Huge Pages +================ + +If the user applications are going to request huge pages using mmap system +call, then it is required that system administrator mount a file system of +type hugetlbfs:: + + mount -t hugetlbfs \ + -o uid=,gid=,mode=,pagesize=,size=,\ + min_size=,nr_inodes= none /mnt/huge + +This command mounts a (pseudo) filesystem of type hugetlbfs on the directory +``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages. + +The ``uid`` and ``gid`` options sets the owner and group of the root of the +file system. By default the ``uid`` and ``gid`` of the current process +are taken. + +The ``mode`` option sets the mode of root of file system to value & 01777. +This value is given in octal. By default the value 0755 is picked. + +If the platform supports multiple huge page sizes, the ``pagesize`` option can +be used to specify the huge page size and associated pool. ``pagesize`` +is specified in bytes. If ``pagesize`` is not specified the platform's +default huge page size and associated pool will be used. + +The ``size`` option sets the maximum value of memory (huge pages) allowed +for that filesystem (``/mnt/huge``). The ``size`` option can be specified +in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``). +The size is rounded down to HPAGE_SIZE boundary. + +The ``min_size`` option sets the minimum value of memory (huge pages) allowed +for the filesystem. ``min_size`` can be specified in the same way as ``size``, +either bytes or a percentage of the huge page pool. +At mount time, the number of huge pages specified by ``min_size`` are reserved +for use by the filesystem. +If there are not enough free huge pages available, the mount will fail. +As huge pages are allocated to the filesystem and freed, the reserve count +is adjusted so that the sum of allocated and reserved huge pages is always +at least ``min_size``. + +The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge`` +can use. + +If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on +command line then no limits are set. + +For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can +use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. +For example, size=2K has the same meaning as size=2048. + +While read system calls are supported on files that reside on hugetlb +file systems, write system calls are not. + +Regular chown, chgrp, and chmod commands (with right permissions) could be +used to change the file attributes on hugetlbfs. + +Also, it is important to note that no such mount command is required if +applications are going to use only shmat/shmget system calls or mmap with +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see +:ref:`map_hugetlb ` below. + +Users who wish to use hugetlb memory via shared memory segment should be +members of a supplementary group and system admin needs to configure that gid +into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different +applications to use any combination of mmaps and shm* calls, though the mount of +filesystem will be required for using mmap calls without MAP_HUGETLB. + +Syscalls that operate on memory backed by hugetlb pages only have their lengths +aligned to the native page size of the processor; they will normally fail with +errno set to EINVAL or exclude hugetlb pages that extend beyond the length if +not hugepage aligned. For example, munmap(2) will fail if memory is backed by +a hugetlb page and the length is smaller than the hugepage size. + + +Examples +======== + +.. _map_hugetlb: + +``map_hugetlb`` + see tools/testing/selftests/vm/map_hugetlb.c + +``hugepage-shm`` + see tools/testing/selftests/vm/hugepage-shm.c + +``hugepage-mmap`` + see tools/testing/selftests/vm/hugepage-mmap.c + +The `libhugetlbfs`_ library provides a wide range of userspace tools +to help with huge page usability, environment setup, and control. + +.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst new file mode 100644 index 000000000000..92e3a25d2deb --- /dev/null +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -0,0 +1,115 @@ +.. _idle_page_tracking: + +================== +Idle Page Tracking +================== + +Motivation +========== + +The idle page tracking feature allows to track which memory pages are being +accessed by a workload and which are idle. This information can be useful for +estimating the workload's working set size, which, in turn, can be taken into +account when configuring the workload parameters, setting memory cgroup limits, +or deciding where to place the workload within a compute cluster. + +It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. + +.. _user_api: + +User API +======== + +The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. +Currently, it consists of the only read-write file, +``/sys/kernel/mm/page_idle/bitmap``. + +The file implements a bitmap where each bit corresponds to a memory page. The +bitmap is represented by an array of 8-byte integers, and the page at PFN #i is +mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is +set, the corresponding page is idle. + +A page is considered idle if it has not been accessed since it was marked idle +(for more details on what "accessed" actually means see the :ref:`Implementation +Details ` section). +To mark a page idle one has to set the bit corresponding to +the page by writing to the file. A value written to the file is OR-ed with the +current bitmap value. + +Only accesses to user memory pages are tracked. These are pages mapped to a +process address space, page cache and buffer pages, swap cache pages. For other +page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, +and hence such pages are never reported idle. + +For huge pages the idle flag is set only on the head page, so one has to read +``/proc/kpageflags`` in order to correctly count idle huge pages. + +Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return +-EINVAL if you are not starting the read/write on an 8-byte boundary, or +if the size of the read/write is not a multiple of 8 bytes. Writing to +this file beyond max PFN will return -ENXIO. + +That said, in order to estimate the amount of pages that are not used by a +workload one should: + + 1. Mark all the workload's pages as idle by setting corresponding bits in + ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading + ``/proc/pid/pagemap`` if the workload is represented by a process, or by + filtering out alien pages using ``/proc/kpagecgroup`` in case the workload + is placed in a memory cgroup. + + 2. Wait until the workload accesses its working set. + + 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. + If one wants to ignore certain types of pages, e.g. mlocked pages since they + are not reclaimable, he or she can filter them out using + ``/proc/kpageflags``. + +See Documentation/admin-guide/mm/pagemap.rst for more information about +``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. + +.. _impl_details: + +Implementation Details +====================== + +The kernel internally keeps track of accesses to user memory pages in order to +reclaim unreferenced pages first on memory shortage conditions. A page is +considered referenced if it has been recently accessed via a process address +space, in which case one or more PTEs it is mapped to will have the Accessed bit +set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The +latter happens when: + + - a userspace process reads or writes a page using a system call (e.g. read(2) + or write(2)) + + - a page that is used for storing filesystem buffers is read or written, + because a process needs filesystem metadata stored in it (e.g. lists a + directory tree) + + - a page is accessed by a device driver using get_user_pages() + +When a dirty page is written to swap or disk as a result of memory reclaim or +exceeding the dirty memory limit, it is not marked referenced. + +The idle memory tracking feature adds a new page flag, the Idle flag. This flag +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the +:ref:`User API ` +section), and cleared automatically whenever a page is referenced as defined +above. + +When a page is marked idle, the Accessed bit must be cleared in all PTEs it is +mapped to, otherwise we will not be able to detect accesses to the page coming +from a process address space. To avoid interference with the reclaimer, which, +as noted above, uses the Accessed bit to promote actively referenced pages, one +more page flag is introduced, the Young flag. When the PTE Accessed bit is +cleared as a result of setting or updating a page's Idle flag, the Young flag +is set on the page. The reclaimer treats the Young flag as an extra PTE +Accessed bit and therefore will consider such a page as referenced. + +Since the idle memory tracking feature is based on the memory reclaimer logic, +it only works with pages that are on an LRU list, other pages are silently +ignored. That means it will ignore a user memory page if it is isolated, but +since there are usually not many of them, it should not affect the overall +result noticeably. In order not to stall scanning of the idle page bitmap, +locked pages may be skipped too. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c47c16e13a18..6c8b554464bb 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -17,3 +17,12 @@ are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. Here we document in detail how to interact with various mechanisms in the Linux memory management. + +.. toctree:: + :maxdepth: 1 + + hugetlbpage + idle_page_tracking + pagemap + soft-dirty + userfaultfd diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst new file mode 100644 index 000000000000..053ca64fd47a --- /dev/null +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -0,0 +1,197 @@ +.. _pagemap: + +============================= +Examining Process Page Tables +============================= + +pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow +userspace programs to examine the page tables and related information by +reading files in ``/proc``. + +There are four components to pagemap: + + * ``/proc/pid/pagemap``. This file lets a userspace process find out which + physical frame each virtual page is mapped to. It contains one 64-bit + value for each virtual page, containing the following data (from + ``fs/proc/task_mmu.c``, above pagemap_read): + + * Bits 0-54 page frame number (PFN) if present + * Bits 0-4 swap type if swapped + * Bits 5-54 swap offset if swapped + * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) + * Bit 56 page exclusively mapped (since 4.2) + * Bits 57-60 zero + * Bit 61 page is file-page or shared-anon (since 3.5) + * Bit 62 page swapped + * Bit 63 page present + + Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. + In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from + 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. + Reason: information about PFNs helps in exploiting Rowhammer vulnerability. + + If the page is not present but in swap, then the PFN contains an + encoding of the swap file number and the page's offset into the + swap. Unmapped pages return a null PFN. This allows determining + precisely which pages are mapped (or in swap) and comparing mapped + pages between processes. + + Efficient users of this interface will use ``/proc/pid/maps`` to + determine which areas of memory are actually mapped and llseek to + skip over unmapped regions. + + * ``/proc/kpagecount``. This file contains a 64-bit count of the number of + times each page is mapped, indexed by PFN. + + * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each + page, indexed by PFN. + + The flags are (from ``fs/proc/page.c``, above kpageflags_read): + + 0. LOCKED + 1. ERROR + 2. REFERENCED + 3. UPTODATE + 4. DIRTY + 5. LRU + 6. ACTIVE + 7. SLAB + 8. WRITEBACK + 9. RECLAIM + 10. BUDDY + 11. MMAP + 12. ANON + 13. SWAPCACHE + 14. SWAPBACKED + 15. COMPOUND_HEAD + 16. COMPOUND_TAIL + 17. HUGE + 18. UNEVICTABLE + 19. HWPOISON + 20. NOPAGE + 21. KSM + 22. THP + 23. BALLOON + 24. ZERO_PAGE + 25. IDLE + + * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the + memory cgroup each page is charged to, indexed by PFN. Only available when + CONFIG_MEMCG is set. + +Short descriptions to the page flags +==================================== + +0 - LOCKED + page is being locked for exclusive access, e.g. by undergoing read/write IO +7 - SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. +10 - BUDDY + a free memory block managed by the buddy system allocator + The buddy system organizes free memory in blocks of various orders. + An order N block has 2^N physically contiguous pages, with the BUDDY flag + set for and _only_ for the first page. +15 - COMPOUND_HEAD + A compound page with order N consists of 2^N physically contiguous pages. + A compound page with order 2 takes the form of "HTTT", where H donates its + head page and T donates its tail page(s). The major consumers of compound + pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst), the SLUB etc. + memory allocators and various device drivers. However in this interface, + only huge/giga pages are made visible to end users. +16 - COMPOUND_TAIL + A compound page tail (see description above). +17 - HUGE + this is an integral part of a HugeTLB page +19 - HWPOISON + hardware detected memory corruption on this page: don't touch the data! +20 - NOPAGE + no page frame exists at the requested address +21 - KSM + identical memory pages dynamically shared between one or more processes +22 - THP + contiguous pages which construct transparent hugepages +23 - BALLOON + balloon compaction page +24 - ZERO_PAGE + zero page for pfn_zero or huge_zero page +25 - IDLE + page has not been accessed since it was marked idle (see + Documentation/admin-guide/mm/idle_page_tracking.rst). Note that this flag may be + stale in case the page was accessed via a PTE. To make sure the flag + is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. + +IO related page flags +--------------------- + +1 - ERROR + IO error occurred +3 - UPTODATE + page has up-to-date data + ie. for file backed page: (in-memory data revision >= on-disk one) +4 - DIRTY + page has been written to, hence contains new data + i.e. for file backed page: (in-memory data revision > on-disk one) +8 - WRITEBACK + page is being synced to disk + +LRU related page flags +---------------------- + +5 - LRU + page is in one of the LRU lists +6 - ACTIVE + page is in the active LRU list +18 - UNEVICTABLE + page is in the unevictable (non-)LRU list It is somehow pinned and + not a candidate for LRU page reclaims, e.g. ramfs pages, + shmctl(SHM_LOCK) and mlock() memory segments +2 - REFERENCED + page has been referenced since last LRU list enqueue/requeue +9 - RECLAIM + page will be reclaimed soon after its pageout IO completed +11 - MMAP + a memory mapped page +12 - ANON + a memory mapped page that is not part of a file +13 - SWAPCACHE + page is mapped to swap space, i.e. has an associated swap entry +14 - SWAPBACKED + page is backed by swap/RAM + +The page-types tool in the tools/vm directory can be used to query the +above flags. + +Using pagemap to do something useful +==================================== + +The general procedure for using pagemap to find out about a process' memory +usage goes like this: + + 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are + mapped to what. + 2. Select the maps you are interested in -- all of them, or a particular + library, or the stack or the heap, etc. + 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. + 4. Read a u64 for each page from pagemap. + 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you + just read, seek to that entry in the file, and read the data you want. + +For example, to find the "unique set size" (USS), which is the amount of +memory that a process is using that is not shared with any other process, +you can go through every map in the process, find the PFNs, look those up +in kpagecount, and tally up the number of pages that are only referenced +once. + +Other notes +=========== + +Reading from any of the files will return -EINVAL if you are not starting +the read on an 8-byte boundary (e.g., if you sought an odd number of bytes +into the file), or if the size of the read is not a multiple of 8 bytes. + +Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is +always 12 at most architectures). Since Linux 3.11 their meaning changes +after first clear of soft-dirty bits. Since Linux 4.2 they are used for +flags unconditionally. diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst new file mode 100644 index 000000000000..cb0cfd6672fa --- /dev/null +++ b/Documentation/admin-guide/mm/soft-dirty.rst @@ -0,0 +1,47 @@ +.. _soft_dirty: + +=============== +Soft-Dirty PTEs +=============== + +The soft-dirty is a bit on a PTE which helps to track which pages a task +writes to. In order to do this tracking one should + + 1. Clear soft-dirty bits from the task's PTEs. + + This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the + task in question. + + 2. Wait some time. + + 3. Read soft-dirty bits from the PTEs. + + This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the + 64-bit qword is the soft-dirty one. If set, the respective PTE was + written to since step 1. + + +Internally, to do this tracking, the writable bit is cleared from PTEs +when the soft-dirty bit is cleared. So, after this, when the task tries to +modify a page at some virtual address the #PF occurs and the kernel sets +the soft-dirty bit on the respective PTE. + +Note, that although all the task's address space is marked as r/o after the +soft-dirty bits clear, the #PF-s that occur after that are processed fast. +This is so, since the pages are still mapped to physical memory, and thus all +the kernel does is finds this fact out and puts both writable and soft-dirty +bits on the PTE. + +While in most cases tracking memory changes by #PF-s is more than enough +there is still a scenario when we can lose soft dirty bits -- a task +unmaps a previously mapped memory region and then maps a new one at exactly +the same place. When unmap is called, the kernel internally clears PTE values +including soft dirty bits. To notify user space application about such +memory region renewal the kernel always marks new memory regions (and +expanded regions) as soft dirty. + +This feature is actively used by the checkpoint-restore project. You +can find more details about it on http://criu.org + + +-- Pavel Emelyanov, Apr 9, 2013 diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst new file mode 100644 index 000000000000..5048cf661a8a --- /dev/null +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -0,0 +1,241 @@ +.. _userfaultfd: + +=========== +Userfaultfd +=========== + +Objective +========= + +Userfaults allow the implementation of on-demand paging from userland +and more generally they allow userland to take control of various +memory page faults, something otherwise only the kernel code could do. + +For example userfaults allows a proper and more optimal implementation +of the PROT_NONE+SIGSEGV trick. + +Design +====== + +Userfaults are delivered and resolved through the userfaultfd syscall. + +The userfaultfd (aside from registering and unregistering virtual +memory ranges) provides two primary functionalities: + +1) read/POLLIN protocol to notify a userland thread of the faults + happening + +2) various UFFDIO_* ioctls that can manage the virtual memory regions + registered in the userfaultfd that allows userland to efficiently + resolve the userfaults it receives via 1) or to manage the virtual + memory in the background + +The real advantage of userfaults if compared to regular virtual memory +management of mremap/mprotect is that the userfaults in all their +operations never involve heavyweight structures like vmas (in fact the +userfaultfd runtime load never takes the mmap_sem for writing). + +Vmas are not suitable for page- (or hugepage) granular fault tracking +when dealing with virtual address spaces that could span +Terabytes. Too many vmas would be needed for that. + +The userfaultfd once opened by invoking the syscall, can also be +passed using unix domain sockets to a manager process, so the same +manager process could handle the userfaults of a multitude of +different processes without them being aware about what is going on +(well of course unless they later try to use the userfaultfd +themselves on the same region the manager is already tracking, which +is a corner case that would currently return -EBUSY). + +API +=== + +When first opened the userfaultfd must be enabled invoking the +UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or +a later API version) which will specify the read/POLLIN protocol +userland intends to speak on the UFFD and the uffdio_api.features +userland requires. The UFFDIO_API ioctl if successful (i.e. if the +requested uffdio_api.api is spoken also by the running kernel and the +requested features are going to be enabled) will return into +uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of +respectively all the available features of the read(2) protocol and +the generic ioctl available. + +The uffdio_api.features bitmask returned by the UFFDIO_API ioctl +defines what memory types are supported by the userfaultfd and what +events, except page fault notifications, may be generated. + +If the kernel supports registering userfaultfd ranges on hugetlbfs +virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in +uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be +set if the kernel supports registering userfaultfd ranges on shared +memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero +MAP_SHARED, memfd_create, etc). + +The userland application that wants to use userfaultfd with hugetlbfs +or shared memory need to set the corresponding flag in +uffdio_api.features to enable those features. + +If the userland desires to receive notifications for events other than +page faults, it has to verify that uffdio_api.features has appropriate +UFFD_FEATURE_EVENT_* bits set. These events are described in more +detail below in "Non-cooperative userfaultfd" section. + +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should +be invoked (if present in the returned uffdio_api.ioctls bitmask) to +register a memory range in the userfaultfd by setting the +uffdio_register structure accordingly. The uffdio_register.mode +bitmask will specify to the kernel which kind of faults to track for +the range (UFFDIO_REGISTER_MODE_MISSING would track missing +pages). The UFFDIO_REGISTER ioctl will return the +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve +userfaults on the range registered. Not all ioctls will necessarily be +supported for all memory types depending on the underlying virtual +memory backend (anonymous memory vs tmpfs vs real filebacked +mappings). + +Userland can use the uffdio_register.ioctls to manage the virtual +address space in the background (to add or potentially also remove +memory from the userfaultfd registered range). This means a userfault +could be triggering just before userland maps in the background the +user-faulted page. + +The primary ioctl to resolve userfaults is UFFDIO_COPY. That +atomically copies a page into the userfault registered range and wakes +up the blocked userfaults (unless uffdio_copy.mode & +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to +UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an +half copied page since it'll keep userfaulting until the copy has +finished. + +QEMU/KVM +======== + +QEMU/KVM is using the userfaultfd syscall to implement postcopy live +migration. Postcopy live migration is one form of memory +externalization consisting of a virtual machine running with part or +all of its memory residing on a different node in the cloud. The +userfaultfd abstraction is generic enough that not a single line of +KVM kernel code had to be modified in order to add postcopy live +migration to QEMU. + +Guest async page faults, FOLL_NOWAIT and all other GUP features work +just fine in combination with userfaults. Userfaults trigger async +page faults in the guest scheduler so those guest processes that +aren't waiting for userfaults (i.e. network bound) can keep running in +the guest vcpus. + +It is generally beneficial to run one pass of precopy live migration +just before starting postcopy live migration, in order to avoid +generating userfaults for readonly guest regions. + +The implementation of postcopy live migration currently uses one +single bidirectional socket but in the future two different sockets +will be used (to reduce the latency of the userfaults to the minimum +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). + +The QEMU in the source node writes all pages that it knows are missing +in the destination node, into the socket, and the migration thread of +the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE +ioctls on the userfaultfd in order to map the received pages into the +guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). + +A different postcopy thread in the destination node listens with +poll() to the userfaultfd in parallel. When a POLLIN event is +generated after a userfault triggers, the postcopy thread read() from +the userfaultfd and receives the fault address (or -EAGAIN in case the +userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run +by the parallel QEMU migration thread). + +After the QEMU postcopy thread (running in the destination node) gets +the userfault address it writes the information about the missing page +into the socket. The QEMU source node receives the information and +roughly "seeks" to that page address and continues sending all +remaining missing pages from that new page offset. Soon after that +(just the time to flush the tcp_wmem queue through the network) the +migration thread in the QEMU running in the destination node will +receive the page that triggered the userfault and it'll map it as +usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it +was spontaneously sent by the source or if it was an urgent page +requested through a userfault). + +By the time the userfaults start, the QEMU in the destination node +doesn't need to keep any per-page state bitmap relative to the live +migration around and a single per-page bitmap has to be maintained in +the QEMU running in the source node to know which pages are still +missing in the destination node. The bitmap in the source node is +checked to find which missing pages to send in round robin and we seek +over it when receiving incoming userfaults. After sending each page of +course the bitmap is updated accordingly. It's also useful to avoid +sending the same page twice (in case the userfault is read by the +postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration +thread). + +Non-cooperative userfaultfd +=========================== + +When the userfaultfd is monitored by an external manager, the manager +must be able to track changes in the process virtual memory +layout. Userfaultfd can notify the manager about such changes using +the same read(2) protocol as for the page fault notifications. The +manager has to explicitly enable these events by setting appropriate +bits in uffdio_api.features passed to UFFDIO_API ioctl: + +UFFD_FEATURE_EVENT_FORK + enable userfaultfd hooks for fork(). When this feature is + enabled, the userfaultfd context of the parent process is + duplicated into the newly created process. The manager + receives UFFD_EVENT_FORK with file descriptor of the new + userfaultfd context in the uffd_msg.fork. + +UFFD_FEATURE_EVENT_REMAP + enable notifications about mremap() calls. When the + non-cooperative process moves a virtual memory area to a + different location, the manager will receive + UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and + new addresses of the area and its original length. + +UFFD_FEATURE_EVENT_REMOVE + enable notifications about madvise(MADV_REMOVE) and + madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will + be generated upon these calls to madvise. The uffd_msg.remove + will contain start and end addresses of the removed area. + +UFFD_FEATURE_EVENT_UNMAP + enable notifications about memory unmapping. The manager will + get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and + end addresses of the unmapped area. + +Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP +are pretty similar, they quite differ in the action expected from the +userfaultfd manager. In the former case, the virtual memory is +removed, but the area is not, the area remains monitored by the +userfaultfd, and if a page fault occurs in that area it will be +delivered to the manager. The proper resolution for such page fault is +to zeromap the faulting address. However, in the latter case, when an +area is unmapped, either explicitly (with munmap() system call), or +implicitly (e.g. during mremap()), the area is removed and in turn the +userfaultfd context for such area disappears too and the manager will +not get further userland page faults from the removed area. Still, the +notification is required in order to prevent manager from using +UFFDIO_COPY on the unmapped area. + +Unlike userland page faults which have to be synchronous and require +explicit or implicit wakeup, all the events are delivered +asynchronously and the non-cooperative process resumes execution as +soon as manager executes read(). The userfaultfd manager should +carefully synchronize calls to UFFDIO_COPY with the events +processing. To aid the synchronization, the UFFDIO_COPY ioctl will +return -ENOSPC when the monitored process exits at the time of +UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed +its virtual memory layout simultaneously with outstanding UFFDIO_COPY +operation. + +The current asynchronous model of the event delivery is optimal for +single threaded non-cooperative userfaultfd manager implementations. A +synchronous event delivery model can be added later as a new +userfaultfd feature to facilitate multithreading enhancements of the +non cooperative manager, for example to allow UFFDIO_COPY ioctls to +run in parallel to the event reception. Single threaded +implementations should continue to use the current async event +delivery model instead. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2d3984c70feb..ef53f808288d 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -515,7 +515,8 @@ guarantees: The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/vm/soft-dirty.rst for details). +soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst +for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.rst. +/proc/kpagecount. For detailed explanation, see +Documentation/admin-guide/mm/pagemap.rst. The /proc/pid/numa_maps is an extension based on maps, showing the memory locality and binding policy, as well as the memory usage (in pages) of diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index c8e6d5b031e4..697ef8c225df 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -515,7 +515,7 @@ nr_hugepages Change the minimum size of the hugepage pool. -See Documentation/vm/hugetlbpage.rst +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== @@ -524,7 +524,7 @@ nr_overcommit_hugepages Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages. -See Documentation/vm/hugetlbpage.rst +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index cda564d55b3c..f8a96ca16b7a 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -12,14 +12,10 @@ highmem.rst - Outline of highmem and common issues. hmm.rst - Documentation of heterogeneous memory management -hugetlbpage.rst - - a brief summary of hugetlbpage support in the Linux kernel. hugetlbfs_reserv.rst - A brief overview of hugetlbfs reservation design/implementation. hwpoison.rst - explains what hwpoison is -idle_page_tracking.rst - - description of the idle page tracking feature. ksm.rst - how to use the Kernel Samepage Merging feature. mmu_notifier.rst @@ -34,16 +30,12 @@ page_frags.rst - description of page fragments allocator page_migration.rst - description of page migration in NUMA systems. -pagemap.rst - - pagemap, from the userspace perspective page_owner.rst - tracking about who allocated each page remap_file_pages.rst - a note about remap_file_pages() system call slub.rst - a short users guide for SLUB. -soft-dirty.rst - - short explanation for soft-dirty PTEs split_page_table_lock.rst - Separate per-table lock to improve scalability of the old page_table_lock. swap_numa.rst @@ -52,8 +44,6 @@ transhuge.rst - Transparent Hugepage Support, alternative way of using hugepages. unevictable-lru.rst - Unevictable LRU infrastructure -userfaultfd.rst - - description of userfaultfd system call z3fold.txt - outline of z3fold allocator for storing compressed pages zsmalloc.rst diff --git a/Documentation/vm/hugetlbpage.rst b/Documentation/vm/hugetlbpage.rst deleted file mode 100644 index 2b374d10284d..000000000000 --- a/Documentation/vm/hugetlbpage.rst +++ /dev/null @@ -1,381 +0,0 @@ -.. _hugetlbpage: - -============= -HugeTLB Pages -============= - -Overview -======== - -The intent of this file is to give a brief summary of hugetlbpage support in -the Linux kernel. This support is built on top of multiple page size support -that is provided by most modern architectures. For example, x86 CPUs normally -support 4K and 2M (1G if architecturally supported) page sizes, ia64 -architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M, -256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical -translations. Typically this is a very scarce resource on processor. -Operating systems try to make best use of limited number of TLB resources. -This optimization is more critical now as bigger and bigger physical memories -(several GBs) are more readily available. - -Users can use the huge page support in Linux kernel by either using the mmap -system call or standard SYSV shared memory system calls (shmget, shmat). - -First the Linux kernel needs to be built with the CONFIG_HUGETLBFS -(present under "File systems") and CONFIG_HUGETLB_PAGE (selected -automatically when CONFIG_HUGETLBFS is selected) configuration -options. - -The ``/proc/meminfo`` file provides information about the total number of -persistent hugetlb pages in the kernel's huge page pool. It also displays -default huge page size and information about the number of free, reserved -and surplus huge pages in the pool of huge pages of default size. -The huge page size is needed for generating the proper alignment and -size of the arguments to system calls that map huge page regions. - -The output of ``cat /proc/meminfo`` will include lines like:: - - HugePages_Total: uuu - HugePages_Free: vvv - HugePages_Rsvd: www - HugePages_Surp: xxx - Hugepagesize: yyy kB - Hugetlb: zzz kB - -where: - -HugePages_Total - is the size of the pool of huge pages. -HugePages_Free - is the number of huge pages in the pool that are not yet - allocated. -HugePages_Rsvd - is short for "reserved," and is the number of huge pages for - which a commitment to allocate from the pool has been made, - but no allocation has yet been made. Reserved huge pages - guarantee that an application will be able to allocate a - huge page from the pool of huge pages at fault time. -HugePages_Surp - is short for "surplus," and is the number of huge pages in - the pool above the value in ``/proc/sys/vm/nr_hugepages``. The - maximum number of surplus huge pages is controlled by - ``/proc/sys/vm/nr_overcommit_hugepages``. -Hugepagesize - is the default hugepage size (in Kb). -Hugetlb - is the total amount of memory (in kB), consumed by huge - pages of all sizes. - If huge pages of different sizes are in use, this number - will exceed HugePages_Total \* Hugepagesize. To get more - detailed information, please, refer to - ``/sys/kernel/mm/hugepages`` (described below). - - -``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" -configured in the kernel. - -``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge -pages in the kernel's huge page pool. "Persistent" huge pages will be -returned to the huge page pool when freed by a task. A user with root -privileges can dynamically allocate more or free some persistent huge pages -by increasing or decreasing the value of ``nr_hugepages``. - -Pages that are used as huge pages are reserved inside the kernel and cannot -be used for other purposes. Huge pages cannot be swapped out under -memory pressure. - -Once a number of huge pages have been pre-allocated to the kernel huge page -pool, a user with appropriate privilege can use either the mmap system call -or shared memory system calls to use the huge pages. See the discussion of -:ref:`Using Huge Pages `, below. - -The administrator can allocate persistent huge pages on the kernel boot -command line by specifying the "hugepages=N" parameter, where 'N' = the -number of huge pages requested. This is the most reliable method of -allocating huge pages as memory has not yet become fragmented. - -Some platforms support multiple huge page sizes. To allocate huge pages -of a specific size, one must precede the huge pages boot command parameters -with a huge page size selection parameter "hugepagesz=". must -be specified in bytes with optional scale suffix [kKmMgG]. The default huge -page size may be selected with the "default_hugepagesz=" boot parameter. - -When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` -indicates the current number of pre-allocated huge pages of the default size. -Thus, one can use the following command to dynamically allocate/deallocate -default sized persistent huge pages:: - - echo 20 > /proc/sys/vm/nr_hugepages - -This command will try to adjust the number of default sized huge pages in the -huge page pool to 20, allocating or freeing huge pages, as required. - -On a NUMA platform, the kernel will attempt to distribute the huge page pool -over all the set of allowed nodes specified by the NUMA memory policy of the -task that modifies ``nr_hugepages``. The default for the allowed nodes--when the -task has default memory policy--is all on-line nodes with memory. Allowed -nodes with insufficient available, contiguous memory for a huge page will be -silently skipped when allocating persistent huge pages. See the -:ref:`discussion below ` -of the interaction of task memory policy, cpusets and per node attributes -with the allocation and freeing of persistent huge pages. - -The success or failure of huge page allocation depends on the amount of -physically contiguous memory that is present in system at the time of the -allocation attempt. If the kernel is unable to allocate huge pages from -some nodes in a NUMA system, it will attempt to make up the difference by -allocating extra pages on other nodes with sufficient available contiguous -memory, if any. - -System administrators may want to put this command in one of the local rc -init files. This will enable the kernel to allocate huge pages early in -the boot process when the possibility of getting physical contiguous pages -is still very high. Administrators can verify the number of huge pages -actually allocated by checking the sysctl or meminfo. To check the per node -distribution of huge pages in a NUMA system, use:: - - cat /sys/devices/system/node/node*/meminfo | fgrep Huge - -``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of -huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are -requested by applications. Writing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain that -number of "surplus" huge pages from the kernel's normal page pool, when the -persistent huge page pool is exhausted. As these surplus huge pages become -unused, they are freed back to the kernel's normal page pool. - -When increasing the huge page pool size via ``nr_hugepages``, any existing -surplus pages will first be promoted to persistent huge pages. Then, additional -huge pages will be allocated, if necessary and if possible, to fulfill -the new persistent huge page pool size. - -The administrator may shrink the pool of persistent huge pages for -the default huge page size by setting the ``nr_hugepages`` sysctl to a -smaller value. The kernel will attempt to balance the freeing of huge pages -across all nodes in the memory policy of the task modifying ``nr_hugepages``. -Any free huge pages on the selected nodes will be freed back to the kernel's -normal page pool. - -Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that -it becomes less than the number of huge pages in use will convert the balance -of the in-use huge pages to surplus huge pages. This will occur even if -the number of surplus pages would exceed the overcommit value. As long as -this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is -increased sufficiently, or the surplus huge pages go out of use and are freed-- -no more surplus huge pages will be allowed to be allocated. - -With support for multiple huge page pools at run-time available, much of -the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in -sysfs. -The ``/proc`` interfaces discussed above have been retained for backwards -compatibility. The root huge page control directory in sysfs is:: - - /sys/kernel/mm/hugepages - -For each huge page size supported by the running kernel, a subdirectory -will exist, of the form:: - - hugepages-${size}kB - -Inside each of these directories, the same set of files will exist:: - - nr_hugepages - nr_hugepages_mempolicy - nr_overcommit_hugepages - free_hugepages - resv_hugepages - surplus_hugepages - -which function as described above for the default huge page-sized case. - -.. _mem_policy_and_hp_alloc: - -Interaction of Task Memory Policy with Huge Page Allocation/Freeing -=================================================================== - -Whether huge pages are allocated and freed via the ``/proc`` interface or -the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the -NUMA nodes from which huge pages are allocated or freed are controlled by the -NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy`` -sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy -is ignored. - -The recommended method to allocate or free huge pages to/from the kernel -huge page pool, using the ``nr_hugepages`` example above, is:: - - numactl --interleave echo 20 \ - >/proc/sys/vm/nr_hugepages_mempolicy - -or, more succinctly:: - - numactl -m echo 20 >/proc/sys/vm/nr_hugepages_mempolicy - -This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes -specified in , depending on whether number of persistent huge pages -is initially less than or greater than 20, respectively. No huge pages will be -allocated nor freed on any node not included in the specified . - -When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any -memory policy mode--bind, preferred, local or interleave--may be used. The -resulting effect on persistent huge page allocation is as follows: - -#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst], - persistent huge pages will be distributed across the node or nodes - specified in the mempolicy as if "interleave" had been specified. - However, if a node in the policy does not contain sufficient contiguous - memory for a huge page, the allocation will not "fallback" to the nearest - neighbor node with sufficient contiguous memory. To do this would cause - undesirable imbalance in the distribution of the huge page pool, or - possibly, allocation of persistent huge pages on nodes not allowed by - the task's memory policy. - -#. One or more nodes may be specified with the bind or interleave policy. - If more than one node is specified with the preferred policy, only the - lowest numeric id will be used. Local policy will select the node where - the task is running at the time the nodes_allowed mask is constructed. - For local policy to be deterministic, the task must be bound to a cpu or - cpus in a single node. Otherwise, the task could be migrated to some - other node at any time after launch and the resulting node will be - indeterminate. Thus, local policy is not very useful for this purpose. - Any of the other mempolicy modes may be used to specify a single node. - -#. The nodes allowed mask will be derived from any non-default task mempolicy, - whether this policy was set explicitly by the task itself or one of its - ancestors, such as numactl. This means that if the task is invoked from a - shell with non-default policy, that policy will be used. One can specify a - node list of "all" with numactl --interleave or --membind [-m] to achieve - interleaving over all nodes in the system or cpuset. - -#. Any task mempolicy specified--e.g., using numactl--will be constrained by - the resource limits of any cpuset in which the task runs. Thus, there will - be no way for a task with non-default policy running in a cpuset with a - subset of the system nodes to allocate huge pages outside the cpuset - without first moving to a cpuset that contains all of the desired nodes. - -#. Boot-time huge page allocation attempts to distribute the requested number - of huge pages over all on-lines nodes with memory. - -Per Node Hugepages Attributes -============================= - -A subset of the contents of the root huge page control directory in sysfs, -described above, will be replicated under each the system device of each -NUMA node with memory in:: - - /sys/devices/system/node/node[0-9]*/hugepages/ - -Under this directory, the subdirectory for each supported huge page size -contains the following attribute files:: - - nr_hugepages - free_hugepages - surplus_hugepages - -The free\_' and surplus\_' attribute files are read-only. They return the number -of free and surplus [overcommitted] huge pages, respectively, on the parent -node. - -The ``nr_hugepages`` attribute returns the total number of huge pages on the -specified node. When this attribute is written, the number of persistent huge -pages on the parent node will be adjusted to the specified value, if sufficient -resources exist, regardless of the task's mempolicy or cpuset constraints. - -Note that the number of overcommit and reserve pages remain global quantities, -as we don't know until fault time, when the faulting task's mempolicy is -applied, from which node the huge page allocation will be attempted. - -.. _using_huge_pages: - -Using Huge Pages -================ - -If the user applications are going to request huge pages using mmap system -call, then it is required that system administrator mount a file system of -type hugetlbfs:: - - mount -t hugetlbfs \ - -o uid=,gid=,mode=,pagesize=,size=,\ - min_size=,nr_inodes= none /mnt/huge - -This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages. - -The ``uid`` and ``gid`` options sets the owner and group of the root of the -file system. By default the ``uid`` and ``gid`` of the current process -are taken. - -The ``mode`` option sets the mode of root of file system to value & 01777. -This value is given in octal. By default the value 0755 is picked. - -If the platform supports multiple huge page sizes, the ``pagesize`` option can -be used to specify the huge page size and associated pool. ``pagesize`` -is specified in bytes. If ``pagesize`` is not specified the platform's -default huge page size and associated pool will be used. - -The ``size`` option sets the maximum value of memory (huge pages) allowed -for that filesystem (``/mnt/huge``). The ``size`` option can be specified -in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``). -The size is rounded down to HPAGE_SIZE boundary. - -The ``min_size`` option sets the minimum value of memory (huge pages) allowed -for the filesystem. ``min_size`` can be specified in the same way as ``size``, -either bytes or a percentage of the huge page pool. -At mount time, the number of huge pages specified by ``min_size`` are reserved -for use by the filesystem. -If there are not enough free huge pages available, the mount will fail. -As huge pages are allocated to the filesystem and freed, the reserve count -is adjusted so that the sum of allocated and reserved huge pages is always -at least ``min_size``. - -The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge`` -can use. - -If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on -command line then no limits are set. - -For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can -use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. -For example, size=2K has the same meaning as size=2048. - -While read system calls are supported on files that reside on hugetlb -file systems, write system calls are not. - -Regular chown, chgrp, and chmod commands (with right permissions) could be -used to change the file attributes on hugetlbfs. - -Also, it is important to note that no such mount command is required if -applications are going to use only shmat/shmget system calls or mmap with -MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see -:ref:`map_hugetlb ` below. - -Users who wish to use hugetlb memory via shared memory segment should be -members of a supplementary group and system admin needs to configure that gid -into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different -applications to use any combination of mmaps and shm* calls, though the mount of -filesystem will be required for using mmap calls without MAP_HUGETLB. - -Syscalls that operate on memory backed by hugetlb pages only have their lengths -aligned to the native page size of the processor; they will normally fail with -errno set to EINVAL or exclude hugetlb pages that extend beyond the length if -not hugepage aligned. For example, munmap(2) will fail if memory is backed by -a hugetlb page and the length is smaller than the hugepage size. - - -Examples -======== - -.. _map_hugetlb: - -``map_hugetlb`` - see tools/testing/selftests/vm/map_hugetlb.c - -``hugepage-shm`` - see tools/testing/selftests/vm/hugepage-shm.c - -``hugepage-mmap`` - see tools/testing/selftests/vm/hugepage-mmap.c - -The `libhugetlbfs`_ library provides a wide range of userspace tools -to help with huge page usability, environment setup, and control. - -.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst index 070aa1e716b7..09bd24a92784 100644 --- a/Documentation/vm/hwpoison.rst +++ b/Documentation/vm/hwpoison.rst @@ -155,7 +155,7 @@ Testing value). This allows stress testing of many kinds of pages. The page_flags are the same as in /proc/kpageflags. The flag bits are defined in include/linux/kernel-page-flags.h and - documented in Documentation/vm/pagemap.rst + documented in Documentation/admin-guide/mm/pagemap.rst * Architecture specific MCE injector diff --git a/Documentation/vm/idle_page_tracking.rst b/Documentation/vm/idle_page_tracking.rst deleted file mode 100644 index d1c4609a5220..000000000000 --- a/Documentation/vm/idle_page_tracking.rst +++ /dev/null @@ -1,115 +0,0 @@ -.. _idle_page_tracking: - -================== -Idle Page Tracking -================== - -Motivation -========== - -The idle page tracking feature allows to track which memory pages are being -accessed by a workload and which are idle. This information can be useful for -estimating the workload's working set size, which, in turn, can be taken into -account when configuring the workload parameters, setting memory cgroup limits, -or deciding where to place the workload within a compute cluster. - -It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. - -.. _user_api: - -User API -======== - -The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. -Currently, it consists of the only read-write file, -``/sys/kernel/mm/page_idle/bitmap``. - -The file implements a bitmap where each bit corresponds to a memory page. The -bitmap is represented by an array of 8-byte integers, and the page at PFN #i is -mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is -set, the corresponding page is idle. - -A page is considered idle if it has not been accessed since it was marked idle -(for more details on what "accessed" actually means see the :ref:`Implementation -Details ` section). -To mark a page idle one has to set the bit corresponding to -the page by writing to the file. A value written to the file is OR-ed with the -current bitmap value. - -Only accesses to user memory pages are tracked. These are pages mapped to a -process address space, page cache and buffer pages, swap cache pages. For other -page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, -and hence such pages are never reported idle. - -For huge pages the idle flag is set only on the head page, so one has to read -``/proc/kpageflags`` in order to correctly count idle huge pages. - -Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return --EINVAL if you are not starting the read/write on an 8-byte boundary, or -if the size of the read/write is not a multiple of 8 bytes. Writing to -this file beyond max PFN will return -ENXIO. - -That said, in order to estimate the amount of pages that are not used by a -workload one should: - - 1. Mark all the workload's pages as idle by setting corresponding bits in - ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading - ``/proc/pid/pagemap`` if the workload is represented by a process, or by - filtering out alien pages using ``/proc/kpagecgroup`` in case the workload - is placed in a memory cgroup. - - 2. Wait until the workload accesses its working set. - - 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. - If one wants to ignore certain types of pages, e.g. mlocked pages since they - are not reclaimable, he or she can filter them out using - ``/proc/kpageflags``. - -See Documentation/vm/pagemap.rst for more information about -``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. - -.. _impl_details: - -Implementation Details -====================== - -The kernel internally keeps track of accesses to user memory pages in order to -reclaim unreferenced pages first on memory shortage conditions. A page is -considered referenced if it has been recently accessed via a process address -space, in which case one or more PTEs it is mapped to will have the Accessed bit -set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The -latter happens when: - - - a userspace process reads or writes a page using a system call (e.g. read(2) - or write(2)) - - - a page that is used for storing filesystem buffers is read or written, - because a process needs filesystem metadata stored in it (e.g. lists a - directory tree) - - - a page is accessed by a device driver using get_user_pages() - -When a dirty page is written to swap or disk as a result of memory reclaim or -exceeding the dirty memory limit, it is not marked referenced. - -The idle memory tracking feature adds a new page flag, the Idle flag. This flag -is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the -:ref:`User API ` -section), and cleared automatically whenever a page is referenced as defined -above. - -When a page is marked idle, the Accessed bit must be cleared in all PTEs it is -mapped to, otherwise we will not be able to detect accesses to the page coming -from a process address space. To avoid interference with the reclaimer, which, -as noted above, uses the Accessed bit to promote actively referenced pages, one -more page flag is introduced, the Young flag. When the PTE Accessed bit is -cleared as a result of setting or updating a page's Idle flag, the Young flag -is set on the page. The reclaimer treats the Young flag as an extra PTE -Accessed bit and therefore will consider such a page as referenced. - -Since the idle memory tracking feature is based on the memory reclaimer logic, -it only works with pages that are on an LRU list, other pages are silently -ignored. That means it will ignore a user memory page if it is isolated, but -since there are usually not many of them, it should not affect the overall -result noticeably. In order not to stall scanning of the idle page bitmap, -locked pages may be skipped too. diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 6c451421a01e..ed58cb9f9675 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -13,15 +13,10 @@ various features of the Linux memory management .. toctree:: :maxdepth: 1 - hugetlbpage - idle_page_tracking ksm numa_memory_policy - pagemap transhuge - soft-dirty swap_numa - userfaultfd zswap Kernel developers MM documentation diff --git a/Documentation/vm/pagemap.rst b/Documentation/vm/pagemap.rst deleted file mode 100644 index 7ba8cbd57ad3..000000000000 --- a/Documentation/vm/pagemap.rst +++ /dev/null @@ -1,197 +0,0 @@ -.. _pagemap: - -============================= -Examining Process Page Tables -============================= - -pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow -userspace programs to examine the page tables and related information by -reading files in ``/proc``. - -There are four components to pagemap: - - * ``/proc/pid/pagemap``. This file lets a userspace process find out which - physical frame each virtual page is mapped to. It contains one 64-bit - value for each virtual page, containing the following data (from - ``fs/proc/task_mmu.c``, above pagemap_read): - - * Bits 0-54 page frame number (PFN) if present - * Bits 0-4 swap type if swapped - * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst) - * Bit 56 page exclusively mapped (since 4.2) - * Bits 57-60 zero - * Bit 61 page is file-page or shared-anon (since 3.5) - * Bit 62 page swapped - * Bit 63 page present - - Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. - In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from - 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. - Reason: information about PFNs helps in exploiting Rowhammer vulnerability. - - If the page is not present but in swap, then the PFN contains an - encoding of the swap file number and the page's offset into the - swap. Unmapped pages return a null PFN. This allows determining - precisely which pages are mapped (or in swap) and comparing mapped - pages between processes. - - Efficient users of this interface will use ``/proc/pid/maps`` to - determine which areas of memory are actually mapped and llseek to - skip over unmapped regions. - - * ``/proc/kpagecount``. This file contains a 64-bit count of the number of - times each page is mapped, indexed by PFN. - - * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each - page, indexed by PFN. - - The flags are (from ``fs/proc/page.c``, above kpageflags_read): - - 0. LOCKED - 1. ERROR - 2. REFERENCED - 3. UPTODATE - 4. DIRTY - 5. LRU - 6. ACTIVE - 7. SLAB - 8. WRITEBACK - 9. RECLAIM - 10. BUDDY - 11. MMAP - 12. ANON - 13. SWAPCACHE - 14. SWAPBACKED - 15. COMPOUND_HEAD - 16. COMPOUND_TAIL - 17. HUGE - 18. UNEVICTABLE - 19. HWPOISON - 20. NOPAGE - 21. KSM - 22. THP - 23. BALLOON - 24. ZERO_PAGE - 25. IDLE - - * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the - memory cgroup each page is charged to, indexed by PFN. Only available when - CONFIG_MEMCG is set. - -Short descriptions to the page flags -==================================== - -0 - LOCKED - page is being locked for exclusive access, e.g. by undergoing read/write IO -7 - SLAB - page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator - When compound page is used, SLUB/SLQB will only set this flag on the head - page; SLOB will not flag it at all. -10 - BUDDY - a free memory block managed by the buddy system allocator - The buddy system organizes free memory in blocks of various orders. - An order N block has 2^N physically contiguous pages, with the BUDDY flag - set for and _only_ for the first page. -15 - COMPOUND_HEAD - A compound page with order N consists of 2^N physically contiguous pages. - A compound page with order 2 takes the form of "HTTT", where H donates its - head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc. - memory allocators and various device drivers. However in this interface, - only huge/giga pages are made visible to end users. -16 - COMPOUND_TAIL - A compound page tail (see description above). -17 - HUGE - this is an integral part of a HugeTLB page -19 - HWPOISON - hardware detected memory corruption on this page: don't touch the data! -20 - NOPAGE - no page frame exists at the requested address -21 - KSM - identical memory pages dynamically shared between one or more processes -22 - THP - contiguous pages which construct transparent hugepages -23 - BALLOON - balloon compaction page -24 - ZERO_PAGE - zero page for pfn_zero or huge_zero page -25 - IDLE - page has not been accessed since it was marked idle (see - Documentation/vm/idle_page_tracking.rst). Note that this flag may be - stale in case the page was accessed via a PTE. To make sure the flag - is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. - -IO related page flags ---------------------- - -1 - ERROR - IO error occurred -3 - UPTODATE - page has up-to-date data - ie. for file backed page: (in-memory data revision >= on-disk one) -4 - DIRTY - page has been written to, hence contains new data - i.e. for file backed page: (in-memory data revision > on-disk one) -8 - WRITEBACK - page is being synced to disk - -LRU related page flags ----------------------- - -5 - LRU - page is in one of the LRU lists -6 - ACTIVE - page is in the active LRU list -18 - UNEVICTABLE - page is in the unevictable (non-)LRU list It is somehow pinned and - not a candidate for LRU page reclaims, e.g. ramfs pages, - shmctl(SHM_LOCK) and mlock() memory segments -2 - REFERENCED - page has been referenced since last LRU list enqueue/requeue -9 - RECLAIM - page will be reclaimed soon after its pageout IO completed -11 - MMAP - a memory mapped page -12 - ANON - a memory mapped page that is not part of a file -13 - SWAPCACHE - page is mapped to swap space, i.e. has an associated swap entry -14 - SWAPBACKED - page is backed by swap/RAM - -The page-types tool in the tools/vm directory can be used to query the -above flags. - -Using pagemap to do something useful -==================================== - -The general procedure for using pagemap to find out about a process' memory -usage goes like this: - - 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are - mapped to what. - 2. Select the maps you are interested in -- all of them, or a particular - library, or the stack or the heap, etc. - 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. - 4. Read a u64 for each page from pagemap. - 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you - just read, seek to that entry in the file, and read the data you want. - -For example, to find the "unique set size" (USS), which is the amount of -memory that a process is using that is not shared with any other process, -you can go through every map in the process, find the PFNs, look those up -in kpagecount, and tally up the number of pages that are only referenced -once. - -Other notes -=========== - -Reading from any of the files will return -EINVAL if you are not starting -the read on an 8-byte boundary (e.g., if you sought an odd number of bytes -into the file), or if the size of the read is not a multiple of 8 bytes. - -Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is -always 12 at most architectures). Since Linux 3.11 their meaning changes -after first clear of soft-dirty bits. Since Linux 4.2 they are used for -flags unconditionally. diff --git a/Documentation/vm/soft-dirty.rst b/Documentation/vm/soft-dirty.rst deleted file mode 100644 index cb0cfd6672fa..000000000000 --- a/Documentation/vm/soft-dirty.rst +++ /dev/null @@ -1,47 +0,0 @@ -.. _soft_dirty: - -=============== -Soft-Dirty PTEs -=============== - -The soft-dirty is a bit on a PTE which helps to track which pages a task -writes to. In order to do this tracking one should - - 1. Clear soft-dirty bits from the task's PTEs. - - This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the - task in question. - - 2. Wait some time. - - 3. Read soft-dirty bits from the PTEs. - - This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the - 64-bit qword is the soft-dirty one. If set, the respective PTE was - written to since step 1. - - -Internally, to do this tracking, the writable bit is cleared from PTEs -when the soft-dirty bit is cleared. So, after this, when the task tries to -modify a page at some virtual address the #PF occurs and the kernel sets -the soft-dirty bit on the respective PTE. - -Note, that although all the task's address space is marked as r/o after the -soft-dirty bits clear, the #PF-s that occur after that are processed fast. -This is so, since the pages are still mapped to physical memory, and thus all -the kernel does is finds this fact out and puts both writable and soft-dirty -bits on the PTE. - -While in most cases tracking memory changes by #PF-s is more than enough -there is still a scenario when we can lose soft dirty bits -- a task -unmaps a previously mapped memory region and then maps a new one at exactly -the same place. When unmap is called, the kernel internally clears PTE values -including soft dirty bits. To notify user space application about such -memory region renewal the kernel always marks new memory regions (and -expanded regions) as soft dirty. - -This feature is actively used by the checkpoint-restore project. You -can find more details about it on http://criu.org - - --- Pavel Emelyanov, Apr 9, 2013 diff --git a/Documentation/vm/userfaultfd.rst b/Documentation/vm/userfaultfd.rst deleted file mode 100644 index 5048cf661a8a..000000000000 --- a/Documentation/vm/userfaultfd.rst +++ /dev/null @@ -1,241 +0,0 @@ -.. _userfaultfd: - -=========== -Userfaultfd -=========== - -Objective -========= - -Userfaults allow the implementation of on-demand paging from userland -and more generally they allow userland to take control of various -memory page faults, something otherwise only the kernel code could do. - -For example userfaults allows a proper and more optimal implementation -of the PROT_NONE+SIGSEGV trick. - -Design -====== - -Userfaults are delivered and resolved through the userfaultfd syscall. - -The userfaultfd (aside from registering and unregistering virtual -memory ranges) provides two primary functionalities: - -1) read/POLLIN protocol to notify a userland thread of the faults - happening - -2) various UFFDIO_* ioctls that can manage the virtual memory regions - registered in the userfaultfd that allows userland to efficiently - resolve the userfaults it receives via 1) or to manage the virtual - memory in the background - -The real advantage of userfaults if compared to regular virtual memory -management of mremap/mprotect is that the userfaults in all their -operations never involve heavyweight structures like vmas (in fact the -userfaultfd runtime load never takes the mmap_sem for writing). - -Vmas are not suitable for page- (or hugepage) granular fault tracking -when dealing with virtual address spaces that could span -Terabytes. Too many vmas would be needed for that. - -The userfaultfd once opened by invoking the syscall, can also be -passed using unix domain sockets to a manager process, so the same -manager process could handle the userfaults of a multitude of -different processes without them being aware about what is going on -(well of course unless they later try to use the userfaultfd -themselves on the same region the manager is already tracking, which -is a corner case that would currently return -EBUSY). - -API -=== - -When first opened the userfaultfd must be enabled invoking the -UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or -a later API version) which will specify the read/POLLIN protocol -userland intends to speak on the UFFD and the uffdio_api.features -userland requires. The UFFDIO_API ioctl if successful (i.e. if the -requested uffdio_api.api is spoken also by the running kernel and the -requested features are going to be enabled) will return into -uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of -respectively all the available features of the read(2) protocol and -the generic ioctl available. - -The uffdio_api.features bitmask returned by the UFFDIO_API ioctl -defines what memory types are supported by the userfaultfd and what -events, except page fault notifications, may be generated. - -If the kernel supports registering userfaultfd ranges on hugetlbfs -virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in -uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be -set if the kernel supports registering userfaultfd ranges on shared -memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero -MAP_SHARED, memfd_create, etc). - -The userland application that wants to use userfaultfd with hugetlbfs -or shared memory need to set the corresponding flag in -uffdio_api.features to enable those features. - -If the userland desires to receive notifications for events other than -page faults, it has to verify that uffdio_api.features has appropriate -UFFD_FEATURE_EVENT_* bits set. These events are described in more -detail below in "Non-cooperative userfaultfd" section. - -Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should -be invoked (if present in the returned uffdio_api.ioctls bitmask) to -register a memory range in the userfaultfd by setting the -uffdio_register structure accordingly. The uffdio_register.mode -bitmask will specify to the kernel which kind of faults to track for -the range (UFFDIO_REGISTER_MODE_MISSING would track missing -pages). The UFFDIO_REGISTER ioctl will return the -uffdio_register.ioctls bitmask of ioctls that are suitable to resolve -userfaults on the range registered. Not all ioctls will necessarily be -supported for all memory types depending on the underlying virtual -memory backend (anonymous memory vs tmpfs vs real filebacked -mappings). - -Userland can use the uffdio_register.ioctls to manage the virtual -address space in the background (to add or potentially also remove -memory from the userfaultfd registered range). This means a userfault -could be triggering just before userland maps in the background the -user-faulted page. - -The primary ioctl to resolve userfaults is UFFDIO_COPY. That -atomically copies a page into the userfault registered range and wakes -up the blocked userfaults (unless uffdio_copy.mode & -UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to -UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an -half copied page since it'll keep userfaulting until the copy has -finished. - -QEMU/KVM -======== - -QEMU/KVM is using the userfaultfd syscall to implement postcopy live -migration. Postcopy live migration is one form of memory -externalization consisting of a virtual machine running with part or -all of its memory residing on a different node in the cloud. The -userfaultfd abstraction is generic enough that not a single line of -KVM kernel code had to be modified in order to add postcopy live -migration to QEMU. - -Guest async page faults, FOLL_NOWAIT and all other GUP features work -just fine in combination with userfaults. Userfaults trigger async -page faults in the guest scheduler so those guest processes that -aren't waiting for userfaults (i.e. network bound) can keep running in -the guest vcpus. - -It is generally beneficial to run one pass of precopy live migration -just before starting postcopy live migration, in order to avoid -generating userfaults for readonly guest regions. - -The implementation of postcopy live migration currently uses one -single bidirectional socket but in the future two different sockets -will be used (to reduce the latency of the userfaults to the minimum -possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). - -The QEMU in the source node writes all pages that it knows are missing -in the destination node, into the socket, and the migration thread of -the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE -ioctls on the userfaultfd in order to map the received pages into the -guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). - -A different postcopy thread in the destination node listens with -poll() to the userfaultfd in parallel. When a POLLIN event is -generated after a userfault triggers, the postcopy thread read() from -the userfaultfd and receives the fault address (or -EAGAIN in case the -userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run -by the parallel QEMU migration thread). - -After the QEMU postcopy thread (running in the destination node) gets -the userfault address it writes the information about the missing page -into the socket. The QEMU source node receives the information and -roughly "seeks" to that page address and continues sending all -remaining missing pages from that new page offset. Soon after that -(just the time to flush the tcp_wmem queue through the network) the -migration thread in the QEMU running in the destination node will -receive the page that triggered the userfault and it'll map it as -usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it -was spontaneously sent by the source or if it was an urgent page -requested through a userfault). - -By the time the userfaults start, the QEMU in the destination node -doesn't need to keep any per-page state bitmap relative to the live -migration around and a single per-page bitmap has to be maintained in -the QEMU running in the source node to know which pages are still -missing in the destination node. The bitmap in the source node is -checked to find which missing pages to send in round robin and we seek -over it when receiving incoming userfaults. After sending each page of -course the bitmap is updated accordingly. It's also useful to avoid -sending the same page twice (in case the userfault is read by the -postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration -thread). - -Non-cooperative userfaultfd -=========================== - -When the userfaultfd is monitored by an external manager, the manager -must be able to track changes in the process virtual memory -layout. Userfaultfd can notify the manager about such changes using -the same read(2) protocol as for the page fault notifications. The -manager has to explicitly enable these events by setting appropriate -bits in uffdio_api.features passed to UFFDIO_API ioctl: - -UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When this feature is - enabled, the userfaultfd context of the parent process is - duplicated into the newly created process. The manager - receives UFFD_EVENT_FORK with file descriptor of the new - userfaultfd context in the uffd_msg.fork. - -UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap() calls. When the - non-cooperative process moves a virtual memory area to a - different location, the manager will receive - UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and - new addresses of the area and its original length. - -UFFD_FEATURE_EVENT_REMOVE - enable notifications about madvise(MADV_REMOVE) and - madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will - be generated upon these calls to madvise. The uffd_msg.remove - will contain start and end addresses of the removed area. - -UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory unmapping. The manager will - get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and - end addresses of the unmapped area. - -Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP -are pretty similar, they quite differ in the action expected from the -userfaultfd manager. In the former case, the virtual memory is -removed, but the area is not, the area remains monitored by the -userfaultfd, and if a page fault occurs in that area it will be -delivered to the manager. The proper resolution for such page fault is -to zeromap the faulting address. However, in the latter case, when an -area is unmapped, either explicitly (with munmap() system call), or -implicitly (e.g. during mremap()), the area is removed and in turn the -userfaultfd context for such area disappears too and the manager will -not get further userland page faults from the removed area. Still, the -notification is required in order to prevent manager from using -UFFDIO_COPY on the unmapped area. - -Unlike userland page faults which have to be synchronous and require -explicit or implicit wakeup, all the events are delivered -asynchronously and the non-cooperative process resumes execution as -soon as manager executes read(). The userfaultfd manager should -carefully synchronize calls to UFFDIO_COPY with the events -processing. To aid the synchronization, the UFFDIO_COPY ioctl will -return -ENOSPC when the monitored process exits at the time of -UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed -its virtual memory layout simultaneously with outstanding UFFDIO_COPY -operation. - -The current asynchronous model of the event delivery is optimal for -single threaded non-cooperative userfaultfd manager implementations. A -synchronous event delivery model can be added later as a new -userfaultfd feature to facilitate multithreading enhancements of the -non cooperative manager, for example to allow UFFDIO_COPY ioctls to -run in parallel to the event reception. Single threaded -implementations should continue to use the current async event -delivery model instead. diff --git a/fs/Kconfig b/fs/Kconfig index ba53dc2a9691..ac4ac908f001 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -196,7 +196,7 @@ config HUGETLBFS help hugetlbfs is a filesystem backing for HugeTLB pages, based on ramfs. For architectures that support it, say Y here and read - for details. + for details. If unsure, say N. diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 333cda80c3dd..ed48b6e36202 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -937,7 +937,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, /* * The soft-dirty tracker uses #PF-s to catch writes * to pages, so write-protect the pte as well. See the - * Documentation/vm/soft-dirty.rst for full description + * Documentation/admin-guide/mm/soft-dirty.rst for full description * of how soft-dirty works. */ pte_t ptent = *pte; @@ -1417,7 +1417,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst) + * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped * Bits 57-60 zero * Bit 61 page is file-page or shared-anon diff --git a/mm/Kconfig b/mm/Kconfig index 9bdb0189caaf..2d7ef6207e1e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -530,7 +530,7 @@ config MEM_SOFT_DIRTY into a page just as regular dirty bit, but unlike the latter it can be cleared by hands. - See Documentation/vm/soft-dirty.rst for more details. + See Documentation/admin-guide/mm/soft-dirty.rst for more details. config ZSWAP bool "Compressed cache for swap pages (EXPERIMENTAL)" @@ -656,7 +656,8 @@ config IDLE_PAGE_TRACKING be useful to tune memory cgroup limits and/or for job placement within a compute cluster. - See Documentation/vm/idle_page_tracking.rst for more details. + See Documentation/admin-guide/mm/idle_page_tracking.rst for + more details. # arch_add_memory() comprehends device memory config ARCH_HAS_ZONE_DEVICE -- cgit v1.2.3 From 3ecf53e41a642d4172cff1f641b23fa1baaa229a Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 8 May 2018 10:02:10 +0300 Subject: docs/vm: move numa_memory_policy.rst to Documentation/admin-guide/mm The document describes userspace API and as such it belongs to Documentation/admin-guide/mm Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/mm/hugetlbpage.rst | 2 +- Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/numa_memory_policy.rst | 495 +++++++++++++++++++++ Documentation/filesystems/proc.txt | 2 +- Documentation/filesystems/tmpfs.txt | 5 +- Documentation/vm/00-INDEX | 2 - Documentation/vm/index.rst | 1 - Documentation/vm/numa.rst | 2 +- Documentation/vm/numa_memory_policy.rst | 495 --------------------- 9 files changed, 502 insertions(+), 503 deletions(-) create mode 100644 Documentation/admin-guide/mm/numa_memory_policy.rst delete mode 100644 Documentation/vm/numa_memory_policy.rst (limited to 'Documentation/filesystems') diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index a8b0806377bb..1cc0bc78d10e 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -220,7 +220,7 @@ memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: #. Regardless of mempolicy mode [see - :ref:`Documentation/vm/numa_memory_policy.rst `], + :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst `], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index ad28644fee35..a69aa69af255 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -24,6 +24,7 @@ the Linux memory management. hugetlbpage idle_page_tracking ksm + numa_memory_policy pagemap soft-dirty userfaultfd diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst new file mode 100644 index 000000000000..d78c5b315f72 --- /dev/null +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -0,0 +1,495 @@ +.. _numa_memory_policy: + +================== +NUMA Memory Policy +================== + +What is NUMA Memory Policy? +============================ + +In the Linux kernel, "memory policy" determines from which node the kernel will +allocate memory in a NUMA system or in an emulated NUMA system. Linux has +supported platforms with Non-Uniform Memory Access architectures since 2.4.?. +The current memory policy support was added to Linux 2.6 around May 2004. This +document attempts to describe the concepts and APIs of the 2.6 memory policy +support. + +Memory policies should not be confused with cpusets +(``Documentation/cgroup-v1/cpusets.txt``) +which is an administrative mechanism for restricting the nodes from which +memory may be allocated by a set of processes. Memory policies are a +programming interface that a NUMA-aware application can take advantage of. When +both cpusets and policies are applied to a task, the restrictions of the cpuset +takes priority. See :ref:`Memory Policies and cpusets ` +below for more details. + +Memory Policy Concepts +====================== + +Scope of Memory Policies +------------------------ + +The Linux kernel supports _scopes_ of memory policy, described here from +most general to most specific: + +System Default Policy + this policy is "hard coded" into the kernel. It is the policy + that governs all page allocations that aren't controlled by + one of the more specific policy scopes discussed below. When + the system is "up and running", the system default policy will + use "local allocation" described below. However, during boot + up, the system default policy will be set to interleave + allocations across all nodes with "sufficient" memory, so as + not to overload the initial boot node with boot-time + allocations. + +Task/Process Policy + this is an optional, per-task policy. When defined for a + specific task, this policy controls all page allocations made + by or on behalf of the task that aren't controlled by a more + specific scope. If a task does not define a task policy, then + all page allocations that would have been controlled by the + task policy "fall back" to the System Default Policy. + + The task policy applies to the entire address space of a task. Thus, + it is inheritable, and indeed is inherited, across both fork() + [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task + to establish the task policy for a child task exec()'d from an + executable image that has no awareness of memory policy. See the + :ref:`Memory Policy APIs ` section, + below, for an overview of the system call + that a task may use to set/change its task/process policy. + + In a multi-threaded task, task policies apply only to the thread + [Linux kernel task] that installs the policy and any threads + subsequently created by that thread. Any sibling threads existing + at the time a new task policy is installed retain their current + policy. + + A task policy applies only to pages allocated after the policy is + installed. Any pages already faulted in by the task when the task + changes its task policy remain where they were allocated based on + the policy at the time they were allocated. + +.. _vma_policy: + +VMA Policy + A "VMA" or "Virtual Memory Area" refers to a range of a task's + virtual address space. A task may define a specific policy for a range + of its virtual address space. See the + :ref:`Memory Policy APIs ` section, + below, for an overview of the mbind() system call used to set a VMA + policy. + + A VMA policy will govern the allocation of pages that back + this region of the address space. Any regions of the task's + address space that don't have an explicit VMA policy will fall + back to the task policy, which may itself fall back to the + System Default Policy. + + VMA policies have a few complicating details: + + * VMA policy applies ONLY to anonymous pages. These include + pages allocated for anonymous segments, such as the task + stack and heap, and any regions of the address space + mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is + applied to a file mapping, it will be ignored if the mapping + used the MAP_SHARED flag. If the file mapping used the + MAP_PRIVATE flag, the VMA policy will only be applied when + an anonymous page is allocated on an attempt to write to the + mapping-- i.e., at Copy-On-Write. + + * VMA policies are shared between all tasks that share a + virtual address space--a.k.a. threads--independent of when + the policy is installed; and they are inherited across + fork(). However, because VMA policies refer to a specific + region of a task's address space, and because the address + space is discarded and recreated on exec*(), VMA policies + are NOT inheritable across exec(). Thus, only NUMA-aware + applications may use VMA policies. + + * A task may install a new VMA policy on a sub-range of a + previously mmap()ed region. When this happens, Linux splits + the existing virtual memory area into 2 or 3 VMAs, each with + it's own policy. + + * By default, VMA policy applies only to pages allocated after + the policy is installed. Any pages already faulted into the + VMA range remain where they were allocated based on the + policy at the time they were allocated. However, since + 2.6.16, Linux supports page migration via the mbind() system + call, so that page contents can be moved to match a newly + installed policy. + +Shared Policy + Conceptually, shared policies apply to "memory objects" mapped + shared into one or more tasks' distinct address spaces. An + application installs shared policies the same way as VMA + policies--using the mbind() system call specifying a range of + virtual addresses that map the shared object. However, unlike + VMA policies, which can be considered to be an attribute of a + range of a task's address space, shared policies apply + directly to the shared object. Thus, all tasks that attach to + the object share the policy, and all pages allocated for the + shared object, by any task, will obey the shared policy. + + As of 2.6.22, only shared memory segments, created by shmget() or + mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared + policy support was added to Linux, the associated data structures were + added to hugetlbfs shmem segments. At the time, hugetlbfs did not + support allocation at fault time--a.k.a lazy allocation--so hugetlbfs + shmem segments were never "hooked up" to the shared policy support. + Although hugetlbfs segments now support lazy allocation, their support + for shared policy has not been completed. + + As mentioned above in :ref:`VMA policies ` section, + allocations of page cache pages for regular files mmap()ed + with MAP_SHARED ignore any VMA policy installed on the virtual + address range backed by the shared file mapping. Rather, + shared page cache pages, including pages backing private + mappings that have not yet been written by the task, follow + task policy, if any, else System Default Policy. + + The shared policy infrastructure supports different policies on subset + ranges of the shared object. However, Linux still splits the VMA of + the task that installs the policy for each range of distinct policy. + Thus, different tasks that attach to a shared memory segment can have + different VMA configurations mapping that one shared object. This + can be seen by examining the /proc//numa_maps of tasks sharing + a shared memory region, when one task has installed shared policy on + one or more ranges of the region. + +Components of Memory Policies +----------------------------- + +A NUMA memory policy consists of a "mode", optional mode flags, and +an optional set of nodes. The mode determines the behavior of the +policy, the optional mode flags determine the behavior of the mode, +and the optional set of nodes can be viewed as the arguments to the +policy behavior. + +Internally, memory policies are implemented by a reference counted +structure, struct mempolicy. Details of this structure will be +discussed in context, below, as required to explain the behavior. + +NUMA memory policy supports the following 4 behavioral modes: + +Default Mode--MPOL_DEFAULT + This mode is only used in the memory policy APIs. Internally, + MPOL_DEFAULT is converted to the NULL memory policy in all + policy scopes. Any existing non-default policy will simply be + removed when MPOL_DEFAULT is specified. As a result, + MPOL_DEFAULT means "fall back to the next most specific policy + scope." + + For example, a NULL or default task policy will fall back to the + system default policy. A NULL or default vma policy will fall + back to the task policy. + + When specified in one of the memory policy APIs, the Default mode + does not use the optional set of nodes. + + It is an error for the set of nodes specified for this policy to + be non-empty. + +MPOL_BIND + This mode specifies that memory must come from the set of + nodes specified by the policy. Memory will be allocated from + the node in the set with sufficient free memory that is + closest to the node where the allocation takes place. + +MPOL_PREFERRED + This mode specifies that the allocation should be attempted + from the single node specified in the policy. If that + allocation fails, the kernel will search other nodes, in order + of increasing distance from the preferred node based on + information provided by the platform firmware. + + Internally, the Preferred policy uses a single node--the + preferred_node member of struct mempolicy. When the internal + mode flag MPOL_F_LOCAL is set, the preferred_node is ignored + and the policy is interpreted as local allocation. "Local" + allocation policy can be viewed as a Preferred policy that + starts at the node containing the cpu where the allocation + takes place. + + It is possible for the user to specify that local allocation + is always preferred by passing an empty nodemask with this + mode. If an empty nodemask is passed, the policy cannot use + the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags + described below. + +MPOL_INTERLEAVED + This mode specifies that page allocations be interleaved, on a + page granularity, across the nodes specified in the policy. + This mode also behaves slightly differently, based on the + context where it is used: + + For allocation of anonymous pages and shared memory pages, + Interleave mode indexes the set of nodes specified by the + policy using the page offset of the faulting address into the + segment [VMA] containing the address modulo the number of + nodes specified by the policy. It then attempts to allocate a + page, starting at the selected node, as if the node had been + specified by a Preferred policy or had been selected by a + local allocation. That is, allocation will follow the per + node zonelist. + + For allocation of page cache pages, Interleave mode indexes + the set of nodes specified by the policy using a node counter + maintained per task. This counter wraps around to the lowest + specified node after it reaches the highest specified node. + This will tend to spread the pages out over the nodes + specified by the policy based on the order in which they are + allocated, rather than based on any page offset into an + address range or file. During system boot up, the temporary + interleaved system default policy works in this mode. + +NUMA memory policy supports the following optional mode flags: + +MPOL_F_STATIC_NODES + This flag specifies that the nodemask passed by + the user should not be remapped if the task or VMA's set of allowed + nodes changes after the memory policy has been defined. + + Without this flag, any time a mempolicy is rebound because of a + change in the set of allowed nodes, the node (Preferred) or + nodemask (Bind, Interleave) is remapped to the new set of + allowed nodes. This may result in nodes being used that were + previously undesired. + + With this flag, if the user-specified nodes overlap with the + nodes allowed by the task's cpuset, then the memory policy is + applied to their intersection. If the two sets of nodes do not + overlap, the Default policy is used. + + For example, consider a task that is attached to a cpuset with + mems 1-3 that sets an Interleave policy over the same set. If + the cpuset's mems change to 3-5, the Interleave will now occur + over nodes 3, 4, and 5. With this flag, however, since only node + 3 is allowed from the user's nodemask, the "interleave" only + occurs over that node. If no nodes from the user's nodemask are + now allowed, the Default behavior is used. + + MPOL_F_STATIC_NODES cannot be combined with the + MPOL_F_RELATIVE_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). + +MPOL_F_RELATIVE_NODES + This flag specifies that the nodemask passed + by the user will be mapped relative to the set of the task or VMA's + set of allowed nodes. The kernel stores the user-passed nodemask, + and if the allowed nodes changes, then that original nodemask will + be remapped relative to the new set of allowed nodes. + + Without this flag (and without MPOL_F_STATIC_NODES), anytime a + mempolicy is rebound because of a change in the set of allowed + nodes, the node (Preferred) or nodemask (Bind, Interleave) is + remapped to the new set of allowed nodes. That remap may not + preserve the relative nature of the user's passed nodemask to its + set of allowed nodes upon successive rebinds: a nodemask of + 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of + allowed nodes is restored to its original state. + + With this flag, the remap is done so that the node numbers from + the user's passed nodemask are relative to the set of allowed + nodes. In other words, if nodes 0, 2, and 4 are set in the user's + nodemask, the policy will be effected over the first (and in the + Bind or Interleave case, the third and fifth) nodes in the set of + allowed nodes. The nodemask passed by the user represents nodes + relative to task or VMA's set of allowed nodes. + + If the user's nodemask includes nodes that are outside the range + of the new set of allowed nodes (for example, node 5 is set in + the user's nodemask when the set of allowed nodes is only 0-3), + then the remap wraps around to the beginning of the nodemask and, + if not already set, sets the node in the mempolicy nodemask. + + For example, consider a task that is attached to a cpuset with + mems 2-5 that sets an Interleave policy over the same set with + MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the + interleave now occurs over nodes 3,5-7. If the cpuset's mems + then change to 0,2-3,5, then the interleave occurs over nodes + 0,2-3,5. + + Thanks to the consistent remapping, applications preparing + nodemasks to specify memory policies using this flag should + disregard their current, actual cpuset imposed memory placement + and prepare the nodemask as if they were always located on + memory nodes 0 to N-1, where N is the number of memory nodes the + policy is intended to manage. Let the kernel then remap to the + set of memory nodes allowed by the task's cpuset, as that may + change over time. + + MPOL_F_RELATIVE_NODES cannot be combined with the + MPOL_F_STATIC_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). + +Memory Policy Reference Counting +================================ + +To resolve use/free races, struct mempolicy contains an atomic reference +count field. Internal interfaces, mpol_get()/mpol_put() increment and +decrement this reference count, respectively. mpol_put() will only free +the structure back to the mempolicy kmem cache when the reference count +goes to zero. + +When a new memory policy is allocated, its reference count is initialized +to '1', representing the reference held by the task that is installing the +new policy. When a pointer to a memory policy structure is stored in another +structure, another reference is added, as the task's reference will be dropped +on completion of the policy installation. + +During run-time "usage" of the policy, we attempt to minimize atomic operations +on the reference count, as this can lead to cache lines bouncing between cpus +and NUMA nodes. "Usage" here means one of the following: + +1) querying of the policy, either by the task itself [using the get_mempolicy() + API discussed below] or by another task using the /proc//numa_maps + interface. + +2) examination of the policy to determine the policy mode and associated node + or node lists, if any, for page allocation. This is considered a "hot + path". Note that for MPOL_BIND, the "usage" extends across the entire + allocation process, which may sleep during page reclaimation, because the + BIND policy nodemask is used, by reference, to filter ineligible nodes. + +We can avoid taking an extra reference during the usages listed above as +follows: + +1) we never need to get/free the system default policy as this is never + changed nor freed, once the system is up and running. + +2) for querying the policy, we do not need to take an extra reference on the + target task's task policy nor vma policies because we always acquire the + task's mm's mmap_sem for read during the query. The set_mempolicy() and + mbind() APIs [see below] always acquire the mmap_sem for write when + installing or replacing task or vma policies. Thus, there is no possibility + of a task or thread freeing a policy while another task or thread is + querying it. + +3) Page allocation usage of task or vma policy occurs in the fault path where + we hold them mmap_sem for read. Again, because replacing the task or vma + policy requires that the mmap_sem be held for write, the policy can't be + freed out from under us while we're using it for page allocation. + +4) Shared policies require special consideration. One task can replace a + shared memory policy while another task, with a distinct mmap_sem, is + querying or allocating a page based on the policy. To resolve this + potential race, the shared policy infrastructure adds an extra reference + to the shared policy during lookup while holding a spin lock on the shared + policy management structure. This requires that we drop this extra + reference when we're finished "using" the policy. We must drop the + extra reference on shared policies in the same query/allocation paths + used for non-shared policies. For this reason, shared policies are marked + as such, and the extra reference is dropped "conditionally"--i.e., only + for shared policies. + + Because of this extra reference counting, and because we must lookup + shared policies in a tree structure under spinlock, shared policies are + more expensive to use in the page allocation path. This is especially + true for shared policies on shared memory regions shared by tasks running + on different NUMA nodes. This extra overhead can be avoided by always + falling back to task or system default policy for shared memory regions, + or by prefaulting the entire shared memory region into memory and locking + it down. However, this might not be appropriate for all applications. + +.. _memory_policy_apis: + +Memory Policy APIs +================== + +Linux supports 3 system calls for controlling memory policy. These APIS +always affect only the calling task, the calling task's address space, or +some shared object mapped into the calling task's address space. + +.. note:: + the headers that define these APIs and the parameter data types for + user space applications reside in a package that is not part of the + Linux kernel. The kernel system call interfaces, with the 'sys\_' + prefix, are defined in ; the mode and flag + definitions are defined in . + +Set [Task] Memory Policy:: + + long set_mempolicy(int mode, const unsigned long *nmask, + unsigned long maxnode); + +Set's the calling task's "task/process memory policy" to mode +specified by the 'mode' argument and the set of nodes defined by +'nmask'. 'nmask' points to a bit mask of node ids containing at least +'maxnode' ids. Optional mode flags may be passed by combining the +'mode' argument with the flag (for example: MPOL_INTERLEAVE | +MPOL_F_STATIC_NODES). + +See the set_mempolicy(2) man page for more details + + +Get [Task] Memory Policy or Related Information:: + + long get_mempolicy(int *mode, + const unsigned long *nmask, unsigned long maxnode, + void *addr, int flags); + +Queries the "task/process memory policy" of the calling task, or the +policy or location of a specified virtual address, depending on the +'flags' argument. + +See the get_mempolicy(2) man page for more details + + +Install VMA/Shared Policy for a Range of Task's Address Space:: + + long mbind(void *start, unsigned long len, int mode, + const unsigned long *nmask, unsigned long maxnode, + unsigned flags); + +mbind() installs the policy specified by (mode, nmask, maxnodes) as a +VMA policy for the range of the calling task's address space specified +by the 'start' and 'len' arguments. Additional actions may be +requested via the 'flags' argument. + +See the mbind(2) man page for more details. + +Memory Policy Command Line Interface +==================================== + +Although not strictly part of the Linux implementation of memory policy, +a command line tool, numactl(8), exists that allows one to: + ++ set the task policy for a specified program via set_mempolicy(2), fork(2) and + exec(2) + ++ set the shared policy for a shared memory segment via mbind(2) + +The numactl(8) tool is packaged with the run-time version of the library +containing the memory policy system call wrappers. Some distributions +package the headers and compile-time libraries in a separate development +package. + +.. _mem_pol_and_cpusets: + +Memory Policies and cpusets +=========================== + +Memory policies work within cpusets as described above. For memory policies +that require a node or set of nodes, the nodes are restricted to the set of +nodes whose memories are allowed by the cpuset constraints. If the nodemask +specified for the policy contains nodes that are not allowed by the cpuset and +MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes +specified for the policy and the set of nodes with memory is used. If the +result is the empty set, the policy is considered invalid and cannot be +installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped +onto and folded into the task's set of allowed nodes as previously described. + +The interaction of memory policies and cpusets can be problematic when tasks +in two cpusets share access to a memory region, such as shared memory segments +created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and +any of the tasks install shared policy on the region, only nodes whose +memories are allowed in both cpusets may be used in the policies. Obtaining +this information requires "stepping outside" the memory policy APIs to use the +cpuset information and requires that one know in what cpusets other task might +be attaching to the shared region. Furthermore, if the cpusets' allowed +memory sets are disjoint, "local" allocation is the only valid policy. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index ef53f808288d..520f6a84cf50 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -566,7 +566,7 @@ address policy mapping details Where: "address" is the starting address for the mapping; -"policy" reports the NUMA memory policy set for the mapping (see vm/numa_memory_policy.txt); +"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); "mapping details" summarizes mapping data such as mapping type, page usage counters, node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page size, in KB, that is backing the mapping up. diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 627389a34f77..d06e9a59a9f4 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -105,8 +105,9 @@ policy for the file will revert to "default" policy. NUMA memory allocation policies have optional flags that can be used in conjunction with their modes. These optional flags can be specified when tmpfs is mounted by appending them to the mode before the NodeList. -See Documentation/vm/numa_memory_policy.rst for a list of all available -memory allocation policy mode flags and their effect on memory policy. +See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of +all available memory allocation policy mode flags and their effect on +memory policy. =static is equivalent to MPOL_F_STATIC_NODES =relative is equivalent to MPOL_F_RELATIVE_NODES diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index f8a96ca16b7a..f4a4f3e884cf 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -22,8 +22,6 @@ mmu_notifier.rst - a note about clearing pte/pmd and mmu notifications numa.rst - information about NUMA specific code in the Linux vm. -numa_memory_policy.rst - - documentation of concepts and APIs of the 2.6 memory policy support. overcommit-accounting.rst - description of the Linux kernels overcommit handling modes. page_frags.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index ed58cb9f9675..8e1cc667eef1 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -14,7 +14,6 @@ various features of the Linux memory management :maxdepth: 1 ksm - numa_memory_policy transhuge swap_numa zswap diff --git a/Documentation/vm/numa.rst b/Documentation/vm/numa.rst index aada84bc8c46..185d8a568168 100644 --- a/Documentation/vm/numa.rst +++ b/Documentation/vm/numa.rst @@ -110,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces, such as taskset(1) and numactl(1), and program interfaces such as sched_setaffinity(2). Further, one can modify the kernel's default local allocation behavior using Linux NUMA memory policy. -[see Documentation/vm/numa_memory_policy.rst.] +[see Documentation/admin-guide/mm/numa_memory_policy.rst.] System administrators can restrict the CPUs and nodes' memories that a non- privileged user can specify in the scheduling or NUMA commands and functions diff --git a/Documentation/vm/numa_memory_policy.rst b/Documentation/vm/numa_memory_policy.rst deleted file mode 100644 index d78c5b315f72..000000000000 --- a/Documentation/vm/numa_memory_policy.rst +++ /dev/null @@ -1,495 +0,0 @@ -.. _numa_memory_policy: - -================== -NUMA Memory Policy -================== - -What is NUMA Memory Policy? -============================ - -In the Linux kernel, "memory policy" determines from which node the kernel will -allocate memory in a NUMA system or in an emulated NUMA system. Linux has -supported platforms with Non-Uniform Memory Access architectures since 2.4.?. -The current memory policy support was added to Linux 2.6 around May 2004. This -document attempts to describe the concepts and APIs of the 2.6 memory policy -support. - -Memory policies should not be confused with cpusets -(``Documentation/cgroup-v1/cpusets.txt``) -which is an administrative mechanism for restricting the nodes from which -memory may be allocated by a set of processes. Memory policies are a -programming interface that a NUMA-aware application can take advantage of. When -both cpusets and policies are applied to a task, the restrictions of the cpuset -takes priority. See :ref:`Memory Policies and cpusets ` -below for more details. - -Memory Policy Concepts -====================== - -Scope of Memory Policies ------------------------- - -The Linux kernel supports _scopes_ of memory policy, described here from -most general to most specific: - -System Default Policy - this policy is "hard coded" into the kernel. It is the policy - that governs all page allocations that aren't controlled by - one of the more specific policy scopes discussed below. When - the system is "up and running", the system default policy will - use "local allocation" described below. However, during boot - up, the system default policy will be set to interleave - allocations across all nodes with "sufficient" memory, so as - not to overload the initial boot node with boot-time - allocations. - -Task/Process Policy - this is an optional, per-task policy. When defined for a - specific task, this policy controls all page allocations made - by or on behalf of the task that aren't controlled by a more - specific scope. If a task does not define a task policy, then - all page allocations that would have been controlled by the - task policy "fall back" to the System Default Policy. - - The task policy applies to the entire address space of a task. Thus, - it is inheritable, and indeed is inherited, across both fork() - [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task - to establish the task policy for a child task exec()'d from an - executable image that has no awareness of memory policy. See the - :ref:`Memory Policy APIs ` section, - below, for an overview of the system call - that a task may use to set/change its task/process policy. - - In a multi-threaded task, task policies apply only to the thread - [Linux kernel task] that installs the policy and any threads - subsequently created by that thread. Any sibling threads existing - at the time a new task policy is installed retain their current - policy. - - A task policy applies only to pages allocated after the policy is - installed. Any pages already faulted in by the task when the task - changes its task policy remain where they were allocated based on - the policy at the time they were allocated. - -.. _vma_policy: - -VMA Policy - A "VMA" or "Virtual Memory Area" refers to a range of a task's - virtual address space. A task may define a specific policy for a range - of its virtual address space. See the - :ref:`Memory Policy APIs ` section, - below, for an overview of the mbind() system call used to set a VMA - policy. - - A VMA policy will govern the allocation of pages that back - this region of the address space. Any regions of the task's - address space that don't have an explicit VMA policy will fall - back to the task policy, which may itself fall back to the - System Default Policy. - - VMA policies have a few complicating details: - - * VMA policy applies ONLY to anonymous pages. These include - pages allocated for anonymous segments, such as the task - stack and heap, and any regions of the address space - mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is - applied to a file mapping, it will be ignored if the mapping - used the MAP_SHARED flag. If the file mapping used the - MAP_PRIVATE flag, the VMA policy will only be applied when - an anonymous page is allocated on an attempt to write to the - mapping-- i.e., at Copy-On-Write. - - * VMA policies are shared between all tasks that share a - virtual address space--a.k.a. threads--independent of when - the policy is installed; and they are inherited across - fork(). However, because VMA policies refer to a specific - region of a task's address space, and because the address - space is discarded and recreated on exec*(), VMA policies - are NOT inheritable across exec(). Thus, only NUMA-aware - applications may use VMA policies. - - * A task may install a new VMA policy on a sub-range of a - previously mmap()ed region. When this happens, Linux splits - the existing virtual memory area into 2 or 3 VMAs, each with - it's own policy. - - * By default, VMA policy applies only to pages allocated after - the policy is installed. Any pages already faulted into the - VMA range remain where they were allocated based on the - policy at the time they were allocated. However, since - 2.6.16, Linux supports page migration via the mbind() system - call, so that page contents can be moved to match a newly - installed policy. - -Shared Policy - Conceptually, shared policies apply to "memory objects" mapped - shared into one or more tasks' distinct address spaces. An - application installs shared policies the same way as VMA - policies--using the mbind() system call specifying a range of - virtual addresses that map the shared object. However, unlike - VMA policies, which can be considered to be an attribute of a - range of a task's address space, shared policies apply - directly to the shared object. Thus, all tasks that attach to - the object share the policy, and all pages allocated for the - shared object, by any task, will obey the shared policy. - - As of 2.6.22, only shared memory segments, created by shmget() or - mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared - policy support was added to Linux, the associated data structures were - added to hugetlbfs shmem segments. At the time, hugetlbfs did not - support allocation at fault time--a.k.a lazy allocation--so hugetlbfs - shmem segments were never "hooked up" to the shared policy support. - Although hugetlbfs segments now support lazy allocation, their support - for shared policy has not been completed. - - As mentioned above in :ref:`VMA policies ` section, - allocations of page cache pages for regular files mmap()ed - with MAP_SHARED ignore any VMA policy installed on the virtual - address range backed by the shared file mapping. Rather, - shared page cache pages, including pages backing private - mappings that have not yet been written by the task, follow - task policy, if any, else System Default Policy. - - The shared policy infrastructure supports different policies on subset - ranges of the shared object. However, Linux still splits the VMA of - the task that installs the policy for each range of distinct policy. - Thus, different tasks that attach to a shared memory segment can have - different VMA configurations mapping that one shared object. This - can be seen by examining the /proc//numa_maps of tasks sharing - a shared memory region, when one task has installed shared policy on - one or more ranges of the region. - -Components of Memory Policies ------------------------------ - -A NUMA memory policy consists of a "mode", optional mode flags, and -an optional set of nodes. The mode determines the behavior of the -policy, the optional mode flags determine the behavior of the mode, -and the optional set of nodes can be viewed as the arguments to the -policy behavior. - -Internally, memory policies are implemented by a reference counted -structure, struct mempolicy. Details of this structure will be -discussed in context, below, as required to explain the behavior. - -NUMA memory policy supports the following 4 behavioral modes: - -Default Mode--MPOL_DEFAULT - This mode is only used in the memory policy APIs. Internally, - MPOL_DEFAULT is converted to the NULL memory policy in all - policy scopes. Any existing non-default policy will simply be - removed when MPOL_DEFAULT is specified. As a result, - MPOL_DEFAULT means "fall back to the next most specific policy - scope." - - For example, a NULL or default task policy will fall back to the - system default policy. A NULL or default vma policy will fall - back to the task policy. - - When specified in one of the memory policy APIs, the Default mode - does not use the optional set of nodes. - - It is an error for the set of nodes specified for this policy to - be non-empty. - -MPOL_BIND - This mode specifies that memory must come from the set of - nodes specified by the policy. Memory will be allocated from - the node in the set with sufficient free memory that is - closest to the node where the allocation takes place. - -MPOL_PREFERRED - This mode specifies that the allocation should be attempted - from the single node specified in the policy. If that - allocation fails, the kernel will search other nodes, in order - of increasing distance from the preferred node based on - information provided by the platform firmware. - - Internally, the Preferred policy uses a single node--the - preferred_node member of struct mempolicy. When the internal - mode flag MPOL_F_LOCAL is set, the preferred_node is ignored - and the policy is interpreted as local allocation. "Local" - allocation policy can be viewed as a Preferred policy that - starts at the node containing the cpu where the allocation - takes place. - - It is possible for the user to specify that local allocation - is always preferred by passing an empty nodemask with this - mode. If an empty nodemask is passed, the policy cannot use - the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags - described below. - -MPOL_INTERLEAVED - This mode specifies that page allocations be interleaved, on a - page granularity, across the nodes specified in the policy. - This mode also behaves slightly differently, based on the - context where it is used: - - For allocation of anonymous pages and shared memory pages, - Interleave mode indexes the set of nodes specified by the - policy using the page offset of the faulting address into the - segment [VMA] containing the address modulo the number of - nodes specified by the policy. It then attempts to allocate a - page, starting at the selected node, as if the node had been - specified by a Preferred policy or had been selected by a - local allocation. That is, allocation will follow the per - node zonelist. - - For allocation of page cache pages, Interleave mode indexes - the set of nodes specified by the policy using a node counter - maintained per task. This counter wraps around to the lowest - specified node after it reaches the highest specified node. - This will tend to spread the pages out over the nodes - specified by the policy based on the order in which they are - allocated, rather than based on any page offset into an - address range or file. During system boot up, the temporary - interleaved system default policy works in this mode. - -NUMA memory policy supports the following optional mode flags: - -MPOL_F_STATIC_NODES - This flag specifies that the nodemask passed by - the user should not be remapped if the task or VMA's set of allowed - nodes changes after the memory policy has been defined. - - Without this flag, any time a mempolicy is rebound because of a - change in the set of allowed nodes, the node (Preferred) or - nodemask (Bind, Interleave) is remapped to the new set of - allowed nodes. This may result in nodes being used that were - previously undesired. - - With this flag, if the user-specified nodes overlap with the - nodes allowed by the task's cpuset, then the memory policy is - applied to their intersection. If the two sets of nodes do not - overlap, the Default policy is used. - - For example, consider a task that is attached to a cpuset with - mems 1-3 that sets an Interleave policy over the same set. If - the cpuset's mems change to 3-5, the Interleave will now occur - over nodes 3, 4, and 5. With this flag, however, since only node - 3 is allowed from the user's nodemask, the "interleave" only - occurs over that node. If no nodes from the user's nodemask are - now allowed, the Default behavior is used. - - MPOL_F_STATIC_NODES cannot be combined with the - MPOL_F_RELATIVE_NODES flag. It also cannot be used for - MPOL_PREFERRED policies that were created with an empty nodemask - (local allocation). - -MPOL_F_RELATIVE_NODES - This flag specifies that the nodemask passed - by the user will be mapped relative to the set of the task or VMA's - set of allowed nodes. The kernel stores the user-passed nodemask, - and if the allowed nodes changes, then that original nodemask will - be remapped relative to the new set of allowed nodes. - - Without this flag (and without MPOL_F_STATIC_NODES), anytime a - mempolicy is rebound because of a change in the set of allowed - nodes, the node (Preferred) or nodemask (Bind, Interleave) is - remapped to the new set of allowed nodes. That remap may not - preserve the relative nature of the user's passed nodemask to its - set of allowed nodes upon successive rebinds: a nodemask of - 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of - allowed nodes is restored to its original state. - - With this flag, the remap is done so that the node numbers from - the user's passed nodemask are relative to the set of allowed - nodes. In other words, if nodes 0, 2, and 4 are set in the user's - nodemask, the policy will be effected over the first (and in the - Bind or Interleave case, the third and fifth) nodes in the set of - allowed nodes. The nodemask passed by the user represents nodes - relative to task or VMA's set of allowed nodes. - - If the user's nodemask includes nodes that are outside the range - of the new set of allowed nodes (for example, node 5 is set in - the user's nodemask when the set of allowed nodes is only 0-3), - then the remap wraps around to the beginning of the nodemask and, - if not already set, sets the node in the mempolicy nodemask. - - For example, consider a task that is attached to a cpuset with - mems 2-5 that sets an Interleave policy over the same set with - MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the - interleave now occurs over nodes 3,5-7. If the cpuset's mems - then change to 0,2-3,5, then the interleave occurs over nodes - 0,2-3,5. - - Thanks to the consistent remapping, applications preparing - nodemasks to specify memory policies using this flag should - disregard their current, actual cpuset imposed memory placement - and prepare the nodemask as if they were always located on - memory nodes 0 to N-1, where N is the number of memory nodes the - policy is intended to manage. Let the kernel then remap to the - set of memory nodes allowed by the task's cpuset, as that may - change over time. - - MPOL_F_RELATIVE_NODES cannot be combined with the - MPOL_F_STATIC_NODES flag. It also cannot be used for - MPOL_PREFERRED policies that were created with an empty nodemask - (local allocation). - -Memory Policy Reference Counting -================================ - -To resolve use/free races, struct mempolicy contains an atomic reference -count field. Internal interfaces, mpol_get()/mpol_put() increment and -decrement this reference count, respectively. mpol_put() will only free -the structure back to the mempolicy kmem cache when the reference count -goes to zero. - -When a new memory policy is allocated, its reference count is initialized -to '1', representing the reference held by the task that is installing the -new policy. When a pointer to a memory policy structure is stored in another -structure, another reference is added, as the task's reference will be dropped -on completion of the policy installation. - -During run-time "usage" of the policy, we attempt to minimize atomic operations -on the reference count, as this can lead to cache lines bouncing between cpus -and NUMA nodes. "Usage" here means one of the following: - -1) querying of the policy, either by the task itself [using the get_mempolicy() - API discussed below] or by another task using the /proc//numa_maps - interface. - -2) examination of the policy to determine the policy mode and associated node - or node lists, if any, for page allocation. This is considered a "hot - path". Note that for MPOL_BIND, the "usage" extends across the entire - allocation process, which may sleep during page reclaimation, because the - BIND policy nodemask is used, by reference, to filter ineligible nodes. - -We can avoid taking an extra reference during the usages listed above as -follows: - -1) we never need to get/free the system default policy as this is never - changed nor freed, once the system is up and running. - -2) for querying the policy, we do not need to take an extra reference on the - target task's task policy nor vma policies because we always acquire the - task's mm's mmap_sem for read during the query. The set_mempolicy() and - mbind() APIs [see below] always acquire the mmap_sem for write when - installing or replacing task or vma policies. Thus, there is no possibility - of a task or thread freeing a policy while another task or thread is - querying it. - -3) Page allocation usage of task or vma policy occurs in the fault path where - we hold them mmap_sem for read. Again, because replacing the task or vma - policy requires that the mmap_sem be held for write, the policy can't be - freed out from under us while we're using it for page allocation. - -4) Shared policies require special consideration. One task can replace a - shared memory policy while another task, with a distinct mmap_sem, is - querying or allocating a page based on the policy. To resolve this - potential race, the shared policy infrastructure adds an extra reference - to the shared policy during lookup while holding a spin lock on the shared - policy management structure. This requires that we drop this extra - reference when we're finished "using" the policy. We must drop the - extra reference on shared policies in the same query/allocation paths - used for non-shared policies. For this reason, shared policies are marked - as such, and the extra reference is dropped "conditionally"--i.e., only - for shared policies. - - Because of this extra reference counting, and because we must lookup - shared policies in a tree structure under spinlock, shared policies are - more expensive to use in the page allocation path. This is especially - true for shared policies on shared memory regions shared by tasks running - on different NUMA nodes. This extra overhead can be avoided by always - falling back to task or system default policy for shared memory regions, - or by prefaulting the entire shared memory region into memory and locking - it down. However, this might not be appropriate for all applications. - -.. _memory_policy_apis: - -Memory Policy APIs -================== - -Linux supports 3 system calls for controlling memory policy. These APIS -always affect only the calling task, the calling task's address space, or -some shared object mapped into the calling task's address space. - -.. note:: - the headers that define these APIs and the parameter data types for - user space applications reside in a package that is not part of the - Linux kernel. The kernel system call interfaces, with the 'sys\_' - prefix, are defined in ; the mode and flag - definitions are defined in . - -Set [Task] Memory Policy:: - - long set_mempolicy(int mode, const unsigned long *nmask, - unsigned long maxnode); - -Set's the calling task's "task/process memory policy" to mode -specified by the 'mode' argument and the set of nodes defined by -'nmask'. 'nmask' points to a bit mask of node ids containing at least -'maxnode' ids. Optional mode flags may be passed by combining the -'mode' argument with the flag (for example: MPOL_INTERLEAVE | -MPOL_F_STATIC_NODES). - -See the set_mempolicy(2) man page for more details - - -Get [Task] Memory Policy or Related Information:: - - long get_mempolicy(int *mode, - const unsigned long *nmask, unsigned long maxnode, - void *addr, int flags); - -Queries the "task/process memory policy" of the calling task, or the -policy or location of a specified virtual address, depending on the -'flags' argument. - -See the get_mempolicy(2) man page for more details - - -Install VMA/Shared Policy for a Range of Task's Address Space:: - - long mbind(void *start, unsigned long len, int mode, - const unsigned long *nmask, unsigned long maxnode, - unsigned flags); - -mbind() installs the policy specified by (mode, nmask, maxnodes) as a -VMA policy for the range of the calling task's address space specified -by the 'start' and 'len' arguments. Additional actions may be -requested via the 'flags' argument. - -See the mbind(2) man page for more details. - -Memory Policy Command Line Interface -==================================== - -Although not strictly part of the Linux implementation of memory policy, -a command line tool, numactl(8), exists that allows one to: - -+ set the task policy for a specified program via set_mempolicy(2), fork(2) and - exec(2) - -+ set the shared policy for a shared memory segment via mbind(2) - -The numactl(8) tool is packaged with the run-time version of the library -containing the memory policy system call wrappers. Some distributions -package the headers and compile-time libraries in a separate development -package. - -.. _mem_pol_and_cpusets: - -Memory Policies and cpusets -=========================== - -Memory policies work within cpusets as described above. For memory policies -that require a node or set of nodes, the nodes are restricted to the set of -nodes whose memories are allowed by the cpuset constraints. If the nodemask -specified for the policy contains nodes that are not allowed by the cpuset and -MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes -specified for the policy and the set of nodes with memory is used. If the -result is the empty set, the policy is considered invalid and cannot be -installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped -onto and folded into the task's set of allowed nodes as previously described. - -The interaction of memory policies and cpusets can be problematic when tasks -in two cpusets share access to a memory region, such as shared memory segments -created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and -any of the tasks install shared policy on the region, only nodes whose -memories are allowed in both cpusets may be used in the policies. Obtaining -this information requires "stepping outside" the memory policy APIs to use the -cpuset information and requires that one know in what cpusets other task might -be attaching to the shared region. Furthermore, if the cpusets' allowed -memory sets are disjoint, "local" allocation is the only valid policy. -- cgit v1.2.3