diff options
Diffstat (limited to 'Documentation/admin-guide/blockdev')
| -rw-r--r-- | Documentation/admin-guide/blockdev/index.rst | 1 | ||||
| -rw-r--r-- | Documentation/admin-guide/blockdev/paride.rst | 2 | ||||
| -rw-r--r-- | Documentation/admin-guide/blockdev/zoned_loop.rst | 190 | ||||
| -rw-r--r-- | Documentation/admin-guide/blockdev/zram.rst | 165 |
4 files changed, 280 insertions, 78 deletions
diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst index 957ccf617797..3262397ebe8f 100644 --- a/Documentation/admin-guide/blockdev/index.rst +++ b/Documentation/admin-guide/blockdev/index.rst @@ -11,6 +11,7 @@ Block Devices nbd paride ramdisk + zoned_loop zram drbd/index diff --git a/Documentation/admin-guide/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst index e85ad37cc0e5..b2f627d4c2f8 100644 --- a/Documentation/admin-guide/blockdev/paride.rst +++ b/Documentation/admin-guide/blockdev/paride.rst @@ -118,7 +118,7 @@ and high-level drivers that you would use: ================ ============ ======== All parports and all protocol drivers are probed automatically unless probe=0 -parameter is used. So just "modprobe epat" is enough for a Imation SuperDisk +parameter is used. So just "modprobe epat" is enough for an Imation SuperDisk drive to work. Manual device creation:: diff --git a/Documentation/admin-guide/blockdev/zoned_loop.rst b/Documentation/admin-guide/blockdev/zoned_loop.rst new file mode 100644 index 000000000000..f4f1f3121bf9 --- /dev/null +++ b/Documentation/admin-guide/blockdev/zoned_loop.rst @@ -0,0 +1,190 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Zoned Loop Block Device +======================= + +.. Contents: + + 1) Overview + 2) Creating a Zoned Device + 3) Deleting a Zoned Device + 4) Example + + +1) Overview +----------- + +The zoned loop block device driver (zloop) allows a user to create a zoned block +device using one regular file per zone as backing storage. This driver does not +directly control any hardware and uses read, write and truncate operations to +regular files of a file system to emulate a zoned block device. + +Using zloop, zoned block devices with a configurable capacity, zone size and +number of conventional zones can be created. The storage for each zone of the +device is implemented using a regular file with a maximum size equal to the zone +size. The size of a file backing a conventional zone is always equal to the zone +size. The size of a file backing a sequential zone indicates the amount of data +sequentially written to the file, that is, the size of the file directly +indicates the position of the write pointer of the zone. + +When resetting a sequential zone, its backing file size is truncated to zero. +Conversely, for a zone finish operation, the backing file is truncated to the +zone size. With this, the maximum capacity of a zloop zoned block device created +can be larger configured to be larger than the storage space available on the +backing file system. Of course, for such configuration, writing more data than +the storage space available on the backing file system will result in write +errors. + +The zoned loop block device driver implements a complete zone transition state +machine. That is, zones can be empty, implicitly opened, explicitly opened, +closed or full. The current implementation does not support any limits on the +maximum number of open and active zones. + +No user tools are necessary to create and delete zloop devices. + +2) Creating a Zoned Device +-------------------------- + +Once the zloop module is loaded (or if zloop is compiled in the kernel), the +character device file /dev/zloop-control can be used to add a zloop device. +This is done by writing an "add" command directly to the /dev/zloop-control +device:: + + $ modprobe zloop + $ ls -l /dev/zloop* + crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control + + $ mkdir -p <base directory/<device ID> + $ echo "add [options]" > /dev/zloop-control + +The options available for the add command can be listed by reading the +/dev/zloop-control device:: + + $ cat /dev/zloop-control + add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,max_open_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io,zone_append=%u,ordered_zone_append,discard_write_cache + remove id=%d + +In more details, the options that can be used with the "add" command are as +follows. + +=================== ========================================================= +id Device number (the X in /dev/zloopX). + Default: automatically assigned. +capacity_mb Device total capacity in MiB. This is always rounded up + to the nearest higher multiple of the zone size. + Default: 16384 MiB (16 GiB). +zone_size_mb Device zone size in MiB. Default: 256 MiB. +zone_capacity_mb Device zone capacity (must always be equal to or lower + than the zone size. Default: zone size. +conv_zones Total number of conventioanl zones starting from + sector 0 + Default: 8 +max_open_zones Maximum number of open sequential write required zones + (0 for no limit). + Default: 0 +base_dir Path to the base directory where to create the directory + containing the zone files of the device. + Default=/var/local/zloop. + The device directory containing the zone files is always + named with the device ID. E.g. the default zone file + directory for /dev/zloop0 is /var/local/zloop/0. +nr_queues Number of I/O queues of the zoned block device. This + value is always capped by the number of online CPUs + Default: 1 +queue_depth Maximum I/O queue depth per I/O queue. + Default: 64 +buffered_io Do buffered IOs instead of direct IOs (default: false) +zone_append Enable or disable a zloop device native zone append + support. + Default: 1 (enabled). + If native zone append support is disabled, the block layer + will emulate this operation using regular write + operations. +ordered_zone_append Enable zloop mitigation of zone append reordering. + Default: disabled. + This is useful for testing file systems file data mapping + (extents), as when enabled, this can significantly reduce + the number of data extents needed to for a file data + mapping. +discard_write_cache Discard all data that was not explicitly persisted using a + flush operation when the device is removed by truncating + each zone file to the size recorded during the last flush + operation. This simulates power fail events where + uncommitted data is lost. +=================== ========================================================= + +3) Deleting a Zoned Device +-------------------------- + +Deleting an unused zoned loop block device is done by issuing the "remove" +command to /dev/zloop-control, specifying the ID of the device to remove:: + + $ echo "remove id=X" > /dev/zloop-control + +The remove command does not have any option. + +A zoned device that was removed can be re-added again without any change to the +state of the device zones: the device zones are restored to their last state +before the device was removed. Adding again a zoned device after it was removed +must always be done using the same configuration as when the device was first +added. If a zone configuration change is detected, an error will be returned and +the zoned device will not be created. + +To fully delete a zoned device, after executing the remove operation, the device +base directory containing the backing files of the device zones must be deleted. + +4) Example +---------- + +The following sequence of commands creates a 2GB zoned device with zones of 64 +MB and a zone capacity of 63 MB:: + + $ modprobe zloop + $ mkdir -p /var/local/zloop/0 + $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity_mb=63" > /dev/zloop-control + +For the device created (/dev/zloop0), the zone backing files are all created +under the default base directory (/var/local/zloop):: + + $ ls -l /var/local/zloop/0 + total 0 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000008 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000009 + ... + +The zoned device created (/dev/zloop0) can then be used normally:: + + $ lsblk -z + NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN + zloop0 host-managed 64M 32 0 0 1M 4K + $ blkzone report /dev/zloop0 + start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + ... + +Deleting this device is done using the command:: + + $ echo "remove id=0" > /dev/zloop-control + +The removed device can be re-added again using the same "add" command as when +the device was first created. To fully delete a zoned device, its backing files +should also be deleted after executing the remove command:: + + $ rm -r /var/local/zloop/0 diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 1576fb93f06c..60b07a7e30cd 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -54,7 +54,7 @@ The list of possible return codes: If you use 'echo', the returned value is set by the 'echo' utility, and, in general case, something like:: - echo 3 > /sys/block/zram0/max_comp_streams + echo foo > /sys/block/zram0/comp_algorithm if [ $? -ne 0 ]; then handle_error fi @@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3} num_devices parameter is optional and tells zram how many devices should be pre-created. Default: 1. -2) Set max number of compression streams -======================================== - -Regardless of the value passed to this attribute, ZRAM will always -allocate multiple compression streams - one per online CPU - thus -allowing several concurrent compression operations. The number of -allocated compression streams goes down when some of the CPUs -become offline. There is no single-compression-stream mode anymore, -unless you are running a UP system or have only 1 CPU online. - -To find out how many streams are currently available:: - - cat /sys/block/zram0/max_comp_streams - -3) Select compression algorithm +2) Select compression algorithm =============================== Using comp_algorithm device attribute one can see available and @@ -107,7 +93,7 @@ Examples:: For the time being, the `comp_algorithm` content shows only compression algorithms that are supported by zram. -4) Set compression algorithm parameters: Optional +3) Set compression algorithm parameters: Optional ================================================= Compression algorithms may support specific parameters which can be @@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some algorithms), for other algorithms `level` is acceleration level (the higher the value the lower the compression ratio). -5) Set Disksize +4) Set Disksize =============== Set disk size by writing the value to sysfs node 'disksize'. @@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful. -6) Set memory limit: Optional +5) Set memory limit: Optional ============================= Set memory limit by writing the value to sysfs node 'mem_limit'. @@ -177,7 +163,7 @@ Examples:: # To disable memory limit echo 0 > /sys/block/zram0/mem_limit -7) Activate +6) Activate =========== :: @@ -188,7 +174,7 @@ Examples:: mkfs.ext4 /dev/zram1 mount /dev/zram1 /tmp -8) Add/remove zram devices +7) Add/remove zram devices ========================== zram provides a control interface, which enables dynamic (on-demand) device @@ -208,7 +194,7 @@ execute:: echo X > /sys/class/zram-control/hot_remove -9) Stats +8) Stats ======== Per-device statistics are exported as various nodes under /sys/block/zram<id>/ @@ -228,8 +214,9 @@ mem_limit WO specifies the maximum amount of memory ZRAM can writeback_limit WO specifies the maximum amount of write IO zram can write out to backing device as 4KB unit writeback_limit_enable RW show and set writeback_limit feature -max_comp_streams RW the number of possible concurrent compress - operations +writeback_batch_size RW show and set maximum number of in-flight + writeback operations +compressed_writeback RW show and set compressed writeback feature comp_algorithm RW show and change the compression algorithm algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction @@ -238,7 +225,6 @@ backing_dev RW set up backend storage for zram to write out idle WO mark allocated slot as idle ====================== ====== =============================================== - User space is advised to use the following files to read the device statistics. File /sys/block/zram<id>/stat @@ -310,7 +296,7 @@ a single line of text and contains the following stats separated by whitespace: Unit: 4K bytes ============== ============================================================= -10) Deactivate +9) Deactivate ============== :: @@ -318,7 +304,7 @@ a single line of text and contains the following stats separated by whitespace: swapoff /dev/zram0 umount /dev/zram1 -11) Reset +10) Reset ========= Write any positive value to 'reset' sysfs node:: @@ -333,6 +319,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. + writeback --------- @@ -347,24 +353,7 @@ If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: - - echo all > /sys/block/zramX/idle - -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be -marked as idle based on how long (in seconds) it's been since they were -last accessed:: - - echo 86400 > /sys/block/zramX/idle - -In this example all pages which haven't been accessed in more than 86400 -seconds (one day) will be marked idle. - -Admin can request writeback of those idle pages at right timing via:: +Admin can request writeback of idle pages at right timing via:: echo idle > /sys/block/zramX/writeback @@ -385,6 +374,23 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. @@ -430,13 +436,33 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. +By default zram stores written back pages in decompressed (raw) form, which +means that writeback operation involves decompression of the page before +writing it to the backing device. This behavior can be changed by enabling +`compressed_writeback` feature, which causes zram to write compressed pages +to the backing device, thus avoiding decompression overhead. To enable +this feature, execute:: + + $ echo yes > /sys/block/zramX/compressed_writeback + +Note that this feature should be configured before the `zramX` device is +initialized. + +Depending on backing device storage type, writeback operation may benefit +from a higher number of in-flight write requests (batched writes). The +number of maximum in-flight writeback operations can be configured via +`writeback_batch_size` attribute. To change the default value (which is 32), +execute:: + + $ echo 64 > /sys/block/zramX/writeback_batch_size + If admin wants to measure writeback count in a certain period, they could know it via /sys/block/zram0/bd_stat's 3rd column. recompression ------------- -With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative +With `CONFIG_ZRAM_MULTI_COMP`, zram can recompress pages using alternative (secondary) compression algorithms. The basic idea is that alternative compression algorithm can provide better compression ratio at a price of (potentially) slower compression/decompression speeds. Alternative compression @@ -445,7 +471,7 @@ that default algorithm failed to compress). Another application is idle pages recompression - pages that are cold and sit in the memory can be recompressed using more effective algorithm and, hence, reduce zsmalloc memory usage. -With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms: +With `CONFIG_ZRAM_MULTI_COMP`, zram supports up to 4 compression algorithms: one primary and up to 3 secondary ones. Primary zram compressor is explained in "3) Select compression algorithm", secondary algorithms are configured using recomp_algorithm device attribute. @@ -469,58 +495,43 @@ configuration::: #select deflate recompression algorithm, priority 2 echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm -Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress, +Another device attribute that `CONFIG_ZRAM_MULTI_COMP` enables is `recompress`, which controls recompression. Examples::: #IDLE pages recompression is activated by `idle` mode - echo "type=idle" > /sys/block/zramX/recompress + echo "type=idle priority=1" > /sys/block/zramX/recompress #HUGE pages recompression is activated by `huge` mode - echo "type=huge" > /sys/block/zram0/recompress + echo "type=huge priority=2" > /sys/block/zram0/recompress #HUGE_IDLE pages recompression is activated by `huge_idle` mode - echo "type=huge_idle" > /sys/block/zramX/recompress + echo "type=huge_idle priority=1" > /sys/block/zramX/recompress The number of idle pages can be significant, so user-space can pass a size threshold (in bytes) to the recompress knob: zram will recompress only pages of equal or greater size::: #recompress all pages larger than 3000 bytes - echo "threshold=3000" > /sys/block/zramX/recompress + echo "threshold=3000 priority=1" > /sys/block/zramX/recompress #recompress idle pages larger than 2000 bytes - echo "type=idle threshold=2000" > /sys/block/zramX/recompress + echo "type=idle threshold=2000 priority=1" > \ + /sys/block/zramX/recompress It is also possible to limit the number of pages zram re-compression will attempt to recompress::: - echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress - -Recompression of idle pages requires memory tracking. - -During re-compression for every page, that matches re-compression criteria, -ZRAM iterates the list of registered alternative compression algorithms in -order of their priorities. ZRAM stops either when re-compression was -successful (re-compressed object is smaller in size than the original one) -and matches re-compression criteria (e.g. size threshold) or when there are -no secondary algorithms left to try. If none of the secondary algorithms can -successfully re-compressed the page such a page is marked as incompressible, -so ZRAM will not attempt to re-compress it in the future. - -This re-compression behaviour, when it iterates through the list of -registered compression algorithms, increases our chances of finding the -algorithm that successfully compresses a particular page. Sometimes, however, -it is convenient (and sometimes even necessary) to limit recompression to -only one particular algorithm so that it will not try any other algorithms. -This can be achieved by providing a `algo` or `priority` parameter::: - - #use zstd algorithm only (if registered) - echo "type=huge algo=zstd" > /sys/block/zramX/recompress + echo "type=huge_idle priority=1 max_pages=42" > \ + /sys/block/zramX/recompress - #use zstd algorithm only (if zstd was registered under priority 1) - echo "type=huge priority=1" > /sys/block/zramX/recompress +It is advised to always specify `priority` parameter. While it is also +possible to specify `algo` parameter, so that `zram` will use algorithm's +name to determine the priority, it is not recommended, since it can lead to +unexpected results when the same algorithm is configured with different +priorities (e.g. different parameters). `priority` is the only way to +guarantee that the expected algorithm will be used. memory tracking =============== |
