diff options
Diffstat (limited to 'Documentation/vm')
-rw-r--r-- | Documentation/vm/cleancache.txt | 4 | ||||
-rw-r--r-- | Documentation/vm/hugetlbpage.txt | 55 | ||||
-rw-r--r-- | Documentation/vm/unevictable-lru.txt | 38 | ||||
-rw-r--r-- | Documentation/vm/zsmalloc.txt | 70 |
4 files changed, 129 insertions, 38 deletions
diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt index 01d76282444e..e4b49df7a048 100644 --- a/Documentation/vm/cleancache.txt +++ b/Documentation/vm/cleancache.txt @@ -28,9 +28,7 @@ IMPLEMENTATION OVERVIEW A cleancache "backend" that provides transcendent memory registers itself to the kernel's cleancache "frontend" by calling cleancache_register_ops, passing a pointer to a cleancache_ops structure with funcs set appropriately. -Note that cleancache_register_ops returns the previous settings so that -chaining can be performed if desired. The functions provided must conform to -certain semantics as follows: +The functions provided must conform to certain semantics as follows: Most important, cleancache is "ephemeral". Pages which are copied into cleancache have an indefinite lifetime which is completely unknowable diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index f2d3a100fe38..030977fb8d2d 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -267,21 +267,34 @@ call, then it is required that system administrator mount a file system of type hugetlbfs: mount -t hugetlbfs \ - -o uid=<value>,gid=<value>,mode=<value>,size=<value>,nr_inodes=<value> \ - none /mnt/huge + -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\ + min_size=<value>,nr_inodes=<value> none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory /mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid options sets the owner and group of the root of the file system. By default the uid and gid of the current process are taken. The mode option sets the mode of root of file system to value & 01777. This value is given in octal. -By default the value 0755 is picked. The size option sets the maximum value of -memory (huge pages) allowed for that filesystem (/mnt/huge). The size is -rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum number of -inodes that /mnt/huge can use. If the size or nr_inodes option is not -provided on command line then no limits are set. For size and nr_inodes -options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For -example, size=2K has the same meaning as size=2048. +By default the value 0755 is picked. If the paltform supports multiple huge +page sizes, the pagesize option can be used to specify the huge page size and +associated pool. pagesize is specified in bytes. If pagesize is not specified +the paltform's default huge page size and associated pool will be used. The +size option sets the maximum value of memory (huge pages) allowed for that +filesystem (/mnt/huge). The size option can be specified in bytes, or as a +percentage of the specified huge page pool (nr_hugepages). The size is +rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum +value of memory (huge pages) allowed for the filesystem. min_size can be +specified in the same way as size, either bytes or a percentage of the +huge page pool. At mount time, the number of huge pages specified by +min_size are reserved for use by the filesystem. If there are not enough +free huge pages available, the mount will fail. As huge pages are allocated +to the filesystem and freed, the reserve count is adjusted so that the sum +of allocated and reserved huge pages is always at least min_size. The option +nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the +size, min_size or nr_inodes option is not provided on command line then +no limits are set. For pagesize, size, min_size and nr_inodes options, you +can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K +has the same meaning as size=2048. While read system calls are supported on files that reside on hugetlb file systems, write system calls are not. @@ -289,15 +302,23 @@ file systems, write system calls are not. Regular chown, chgrp, and chmod commands (with right permissions) could be used to change the file attributes on hugetlbfs. -Also, it is important to note that no such mount command is required if the +Also, it is important to note that no such mount command is required if applications are going to use only shmat/shmget system calls or mmap with -MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment -should be a member of a supplementary group and system admin needs to -configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for -same or different applications to use any combination of mmaps and shm* -calls, though the mount of filesystem will be required for using mmap calls -without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see -map_hugetlb.c. +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb +below. + +Users who wish to use hugetlb memory via shared memory segment should be a +member of a supplementary group and system admin needs to configure that gid +into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different +applications to use any combination of mmaps and shm* calls, though the mount of +filesystem will be required for using mmap calls without MAP_HUGETLB. + +Syscalls that operate on memory backed by hugetlb pages only have their lengths +aligned to the native page size of the processor; they will normally fail with +errno set to EINVAL or exclude hugetlb pages that extend beyond the length if +not hugepage aligned. For example, munmap(2) will fail if memory is backed by +a hugetlb page and the length is smaller than the hugepage size. + Examples ======== diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 744f82f86c58..3be0bfc4738d 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -22,6 +22,7 @@ CONTENTS - Filtering special vmas. - munlock()/munlockall() system call handling. - Migrating mlocked pages. + - Compacting mlocked pages. - mmap(MAP_LOCKED) system call handling. - munmap()/exit()/exec() system call handling. - try_to_unmap(). @@ -317,7 +318,7 @@ If the VMA passes some filtering as described in "Filtering Special Vmas" below, mlock_fixup() will attempt to merge the VMA with its neighbors or split off a subset of the VMA if the range does not cover the entire VMA. Once the VMA has been merged or split or neither, mlock_fixup() will call -__mlock_vma_pages_range() to fault in the pages via get_user_pages() and to +populate_vma_page_range() to fault in the pages via get_user_pages() and to mark the pages as mlocked via mlock_vma_page(). Note that the VMA being mlocked might be mapped with PROT_NONE. In this case, @@ -327,7 +328,7 @@ fault path or in vmscan. Also note that a page returned by get_user_pages() could be truncated or migrated out from under us, while we're trying to mlock it. To detect this, -__mlock_vma_pages_range() checks page_mapping() after acquiring the page lock. +populate_vma_page_range() checks page_mapping() after acquiring the page lock. If the page is still associated with its mapping, we'll go ahead and call mlock_vma_page(). If the mapping is gone, we just unlock the page and move on. In the worst case, this will result in a page mapped in a VM_LOCKED VMA @@ -392,7 +393,7 @@ ignored for munlock. If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the specified range. The range is then munlocked via the function -__mlock_vma_pages_range() - the same function used to mlock a VMA range - +populate_vma_page_range() - the same function used to mlock a VMA range - passing a flag to indicate that munlock() is being performed. Because the VMA access protections could have been changed to PROT_NONE after @@ -402,7 +403,7 @@ get_user_pages() was enhanced to accept a flag to ignore the permissions when fetching the pages - all of which should be resident as a result of previous mlocking. -For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling +For munlock(), populate_vma_page_range() unlocks individual pages by calling munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() use the Test*PageMlocked() function to handle the case where @@ -450,6 +451,17 @@ list because of a race between munlock and migration, page migration uses the putback_lru_page() function to add migrated pages back to the LRU. +COMPACTING MLOCKED PAGES +------------------------ + +The unevictable LRU can be scanned for compactable regions and the default +behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls +this behavior (see Documentation/sysctl/vm.txt). Once scanning of the +unevictable LRU is enabled, the work of compaction is mostly handled by +the page migration code and the same work flow as described in MIGRATING +MLOCKED PAGES will apply. + + mmap(MAP_LOCKED) SYSTEM CALL HANDLING ------------------------------------- @@ -463,21 +475,11 @@ populate the page table. To mlock a range of memory under the unevictable/mlock infrastructure, the mmap() handler and task address space expansion functions call -mlock_vma_pages_range() specifying the vma and the address range to mlock. -mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in -"Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have -already been set by the caller, in filtered VMAs. Thus these VMA's need not be -visited for munlock when the region is unmapped. - -For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to -fault/allocate the pages and mlock them. Again, like mlock_fixup(), -mlock_vma_pages_range() downgrades the mmap semaphore to read mode before -attempting to fault/allocate and mlock the pages and "upgrades" the semaphore -back to write mode before returning. - -The callers of mlock_vma_pages_range() will have already added the memory range +populate_vma_page_range() specifying the vma and the address range to mlock. + +The callers of populate_vma_page_range() will have already added the memory range to be mlocked to the task's "locked_vm". To account for filtered VMAs, -mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the +populate_vma_page_range() returns the number of pages NOT mlocked. All of the callers then subtract a non-negative return value from the task's locked_vm. A negative return value represent an error - for example, from get_user_pages() attempting to fault in a VMA with PROT_NONE access. In this case, we leave the diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 000000000000..64ed63c4f69d --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,70 @@ +zsmalloc +-------- + +This allocator is designed for use with zram. Thus, the allocator is +supposed to work well under low memory conditions. In particular, it +never attempts higher order page allocation which is very likely to +fail under memory pressure. On the other hand, if we just use single +(0-order) pages, it would suffer from very high fragmentation -- +any object of size PAGE_SIZE/2 or larger would occupy an entire page. +This was one of the major issues with its predecessor (xvmalloc). + +To overcome these issues, zsmalloc allocates a bunch of 0-order pages +and links them together using various 'struct page' fields. These linked +pages act as a single higher-order page i.e. an object can span 0-order +page boundaries. The code refers to these linked pages as a single entity +called zspage. + +For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE +since this satisfies the requirements of all its current users (in the +worst case, page is incompressible and is thus stored "as-is" i.e. in +uncompressed form). For allocation requests larger than this size, failure +is returned (see zs_malloc). + +Additionally, zs_malloc() does not return a dereferenceable pointer. +Instead, it returns an opaque handle (unsigned long) which encodes actual +location of the allocated object. The reason for this indirection is that +zsmalloc does not keep zspages permanently mapped since that would cause +issues on 32-bit systems where the VA region for kernel space mappings +is very small. So, before using the allocating memory, the object has to +be mapped using zs_map_object() to get a usable pointer and subsequently +unmapped using zs_unmap_object(). + +stat +---- + +With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via +/sys/kernel/debug/zsmalloc/<user name>. Here is a sample of stat output: + +# cat /sys/kernel/debug/zsmalloc/zram0/classes + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage + .. + .. + 9 176 0 1 186 129 8 4 + 10 192 1 0 2880 2872 135 3 + 11 208 0 1 819 795 42 2 + 12 224 0 1 219 159 12 4 + .. + .. + + +class: index +size: object size zspage stores +almost_empty: the number of ZS_ALMOST_EMPTY zspages(see below) +almost_full: the number of ZS_ALMOST_FULL zspages(see below) +obj_allocated: the number of objects allocated +obj_used: the number of objects allocated to the user +pages_used: the number of pages allocated for the class +pages_per_zspage: the number of 0-order pages to make a zspage + +We assign a zspage to ZS_ALMOST_EMPTY fullness group when: + n <= N / f, where +n = number of allocated objects +N = total number of objects zspage can store +f = fullness_threshold_frac(ie, 4 at the moment) + +Similarly, we assign zspage to: + ZS_ALMOST_FULL when n > N / f + ZS_EMPTY when n == 0 + ZS_FULL when n == N |