diff options
Diffstat (limited to 'Documentation/admin-guide/device-mapper')
10 files changed, 424 insertions, 51 deletions
diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst index 4d667228e744..a1e673c0e782 100644 --- a/Documentation/admin-guide/device-mapper/delay.rst +++ b/Documentation/admin-guide/device-mapper/delay.rst @@ -3,7 +3,7 @@ dm-delay ======== Device-Mapper's "delay" target delays reads and/or writes -and/or flushs and optionally maps them to different devices. +and/or flushes and optionally maps them to different devices. Arguments:: @@ -18,7 +18,7 @@ Table line has to either have 3, 6 or 9 arguments: to write and flush operations on optionally different write_device with optionally different sector offset -9: same as 6 arguments plus define flush_offset and flush_delay explicitely +9: same as 6 arguments plus define flush_offset and flush_delay explicitly on/with optionally different flush_device/flush_offset. Offsets are specified in sectors. @@ -40,7 +40,7 @@ Example scripts #!/bin/sh # # Create mapped device delaying write and flush operations for 400ms and - # splitting reads to device $1 but writes and flushs to different device $2 + # splitting reads to device $1 but writes and flushes to different device $2 # to different offsets of 2048 and 4096 sectors respectively. # dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400" @@ -48,7 +48,7 @@ Example scripts :: #!/bin/sh # - # Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms + # Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms # onto the same backing device at offset 0 sectors. # dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333" diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index 9f8139ff97d6..4467f6d4b632 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -146,6 +146,11 @@ integrity:<bytes>:<type> integrity for the encrypted device. The additional space is then used for storing authentication tag (and persistent IV if needed). +integrity_key_size:<bytes> + Optionally set the integrity key size if it differs from the digest size. + It allows the use of wrapped key algorithms where the key size is + independent of the cryptographic key size. + sector_size:<bytes> Use <bytes> as the encryption unit instead of 512 bytes sectors. This option can be in range 512 - 4096 bytes and must be power of two. diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst index d8a5f14d0e3c..c2e18ecc065c 100644 --- a/Documentation/admin-guide/device-mapper/dm-integrity.rst +++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst @@ -92,6 +92,11 @@ Target arguments: allowed. This mode is useful for data recovery if the device cannot be activated in any of the other standard modes. + I - inline mode - in this mode, dm-integrity will store integrity + data directly in the underlying device sectors. + The underlying device must have an integrity profile that + allows storing user integrity data and provides enough + space for the selected integrity tag. 5. the number of additional arguments diff --git a/Documentation/admin-guide/device-mapper/dm-pcache.rst b/Documentation/admin-guide/device-mapper/dm-pcache.rst new file mode 100644 index 000000000000..09d327ef4b14 --- /dev/null +++ b/Documentation/admin-guide/device-mapper/dm-pcache.rst @@ -0,0 +1,202 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================= +dm-pcache — Persistent Cache +================================= + +*Author: Dongsheng Yang <dongsheng.yang@linux.dev>* + +This document describes *dm-pcache*, a Device-Mapper target that lets a +byte-addressable *DAX* (persistent-memory, “pmem”) region act as a +high-performance, crash-persistent cache in front of a slower block +device. The code lives in `drivers/md/dm-pcache/`. + +Quick feature summary +===================== + +* *Write-back* caching (only mode currently supported). +* *16 MiB segments* allocated on the pmem device. +* *Data CRC32* verification (optional, per cache). +* Crash-safe: every metadata structure is duplicated (`PCACHE_META_INDEX_MAX + == 2`) and protected with CRC+sequence numbers. +* *Multi-tree indexing* (indexing trees sharded by logical address) for high PMem parallelism +* Pure *DAX path* I/O – no extra BIO round-trips +* *Log-structured write-back* that preserves backend crash-consistency + + +Constructor +=========== + +:: + + pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>] + +========================= ==================================================== +``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…). + All metadata *and* cached blocks are stored here. + +``backing_dev`` The slow block device to be cached. + +``cache_mode`` Optional, Only ``writeback`` is accepted at the + moment. + +``data_crc`` Optional, default to ``false`` + + * ``true`` – store CRC32 for every cached entry + and verify on reads + * ``false`` – skip CRC (faster) +========================= ==================================================== + +Example +------- + +.. code-block:: shell + + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + +The first time a pmem device is used, dm-pcache formats it automatically +(super-block, cache_info, etc.). + + +Status line +=========== + +``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints: + +:: + + <sb_flags> <seg_total> <cache_segs> <segs_used> \ + <gc_percent> <cache_flags> \ + <key_head_seg>:<key_head_off> \ + <dirty_tail_seg>:<dirty_tail_off> \ + <key_tail_seg>:<key_tail_off> + +Field meanings +-------------- + +=============================== ============================================= +``sb_flags`` Super-block flags (e.g. endian marker). + +``seg_total`` Number of physical *pmem* segments. + +``cache_segs`` Number of segments used for cache. + +``segs_used`` Segments currently allocated (bitmap weight). + +``gc_percent`` Current GC high-water mark (0-90). + +``cache_flags`` Bit 0 – DATA_CRC enabled + Bit 1 – INIT_DONE (cache initialised) + Bits 2-5 – cache mode (0 == WB). + +``key_head`` Where new key-sets are being written. + +``dirty_tail`` First dirty key-set that still needs + write-back to the backing device. + +``key_tail`` First key-set that may be reclaimed by GC. +=============================== ============================================= + + +Messages +======== + +*Change GC trigger* + +:: + + dmsetup message <dev> 0 gc_percent <0-90> + + +Theory of operation +=================== + +Sub-devices +----------- + +==================== ========================================================= +backing_dev Any block device (SSD/HDD/loop/LVM, etc.). +cache_dev DAX device; must expose direct-access memory. +==================== ========================================================= + +Segments and key-sets +--------------------- + +* The pmem space is divided into *16 MiB segments*. +* Each write allocates space from a per-CPU *data_head* inside a segment. +* A *cache-key* records a logical range on the origin and where it lives + inside pmem (segment + offset + generation). +* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem + and are themselves crash-safe (CRC). +* The pair *(key_tail, dirty_tail)* delimit clean/dirty and live/dead ksets. + +Write-back +---------- + +Dirty keys are queued into a tree; a background worker copies data +back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the +upper layers forces an immediate metadata commit. + +Garbage collection +------------------ + +GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks +from *key_tail*, frees segments whose every key has been invalidated, and +advances *key_tail*. + +CRC verification +---------------- + +If ``data_crc is enabled`` dm-pcache computes a CRC32 over every cached data +range when it is inserted and stores it in the on-media key. Reads +validate the CRC before copying to the caller. + + +Failure handling +================ + +* *pmem media errors* – all metadata copies are read with + ``copy_mc_to_kernel``; an uncorrectable error logs and aborts initialisation. +* *Cache full* – if no free segment can be found, writes return ``-EBUSY``; + dm-pcache retries internally (request deferral). +* *System crash* – on attach, the driver replays ksets from *key_tail* to + rebuild the in-core trees; every segment’s generation guards against + use-after-free keys. + + +Limitations & TODO +================== + +* Only *write-back* mode; other modes planned. +* Only FIFO cache invalidate; other (LRU, ARC...) planned. +* Table reload is not supported currently. +* Discard planned. + + +Example workflow +================ + +.. code-block:: shell + + # 1. Create devices + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + + # 2. Put a filesystem on top + mkfs.ext4 /dev/mapper/pcache_sdb + mount /dev/mapper/pcache_sdb /mnt + + # 3. Tune GC threshold to 80 % + dmsetup message pcache_sdb 0 gc_percent 80 + + # 4. Observe status + watch -n1 'dmsetup status pcache_sdb' + + # 5. Shutdown + umount /mnt + dmsetup remove pcache_sdb + + +``dm-pcache`` is under active development; feedback, bug reports and patches +are very welcome! diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst index bb17e26e3c1b..3780f6e6b6bb 100644 --- a/Documentation/admin-guide/device-mapper/dm-raid.rst +++ b/Documentation/admin-guide/device-mapper/dm-raid.rst @@ -20,10 +20,10 @@ The target is named "raid" and it accepts the following parameters:: raid0 RAID0 striping (no resilience) raid1 RAID1 mirroring raid4 RAID4 with dedicated last parity disk - raid5_n RAID5 with dedicated last parity disk supporting takeover + raid5_n RAID5 with dedicated last parity disk supporting takeover from/to raid1 Same as raid4 - - Transitory layout + - Transitory layout for takeover from/to raid1 raid5_la RAID5 left asymmetric - rotating parity 0 with data continuation @@ -48,8 +48,8 @@ The target is named "raid" and it accepts the following parameters:: raid6_n_6 RAID6 with dedicate parity disks - parity and Q-syndrome on the last 2 disks; - layout for takeover from/to raid4/raid5_n - raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk + layout for takeover from/to raid0/raid4/raid5_n + raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk supporting takeover from/to raid5 - layout for takeover from raid5_la from/to raid6 raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk @@ -173,9 +173,9 @@ The target is named "raid" and it accepts the following parameters:: The delta_disks option value (-251 < N < +251) triggers device removal (negative value) or device addition (positive value) to any reshape supporting raid levels 4/5/6 and 10. - RAID levels 4/5/6 allow for addition of devices (metadata - and data device tuple), raid10_near and raid10_offset only - allow for device addition. raid10_far does not support any + RAID levels 4/5/6 allow for addition and removal of devices + (metadata and data device tuple), raid10_near and raid10_offset + only allow for device addition. raid10_far does not support any reshaping at all. A minimum of devices have to be kept to enforce resilience, which is 3 devices for raid4/5 and 4 devices for raid6. @@ -372,6 +372,72 @@ to safely enable discard support for RAID 4/5/6: 'devices_handle_discards_safely' +Takeover/Reshape Support +------------------------ +The target natively supports these two types of MDRAID conversions: + +o Takeover: Converts an array from one RAID level to another + +o Reshape: Changes the internal layout while maintaining the current RAID level + +Each operation is only valid under specific constraints imposed by the existing array's layout and configuration. + + +Takeover: +linear -> raid1 with N >= 2 mirrors +raid0 -> raid4 (add dedicated parity device) +raid0 -> raid5 (add dedicated parity device) +raid0 -> raid10 with near layout and N >= 2 mirror groups (raid0 stripes have to become first member within mirror groups) +raid1 -> linear +raid1 -> raid5 with 2 mirrors +raid4 -> raid5 w/ rotating parity +raid5 with dedicated parity device -> raid4 +raid5 -> raid6 (with dedicated Q-syndrome) +raid6 (with dedicated Q-syndrome) -> raid5 +raid10 with near layout and even number of disks -> raid0 (select any in-sync device from each mirror group) + +Reshape: +linear: not possible +raid0: not possible +raid1: change number of mirrors +raid4: add and remove stripes (minimum 3), change stripesize +raid5: add and remove stripes (minimum 3, special case 2 for raid1 takeover), change rotating parity algorithms, change stripesize +raid6: add and remove stripes (minimum 4), change rotating syndrome algorithms, change stripesize +raid10 near: add stripes (minimum 4), change stripesize, no stripe removal possible, change to offset layout +raid10 offset: add stripes, change stripesize, no stripe removal possible, change to near layout +raid10 far: not possible + +Table line examples: + +### raid1 -> raid5 +# +# 2 devices limitation in raid1. +# raid5 personality is able to just map 2 like raid1. +# Reshape after takeover to change to full raid5 layout + + 0 1960886272 raid raid1 3 0 region_size 2048 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + +# dm-0 and dm-2 are e.g. 4MiB large metadata devices, dm-1 and dm-3 have to be at least 1960886272 big. +# +# Table line to takeover to raid5 + + 0 1960886272 raid raid5 3 0 region_size 2048 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + +# Add required out-of-place reshape space to the beginniong of the given 2 data devices, +# allocate another metadata/data device tuple with the same sizes for the parity space +# and zero the first 4K of the metadata device. +# +# Example table of the out-of-place reshape space addition for one data device, e.g. dm-1 + + 0 8192 linear 8:0 0 1960903888 # <- must be free space segment + 8192 1960886272 linear 8:0 0 2048 # previous data segment + +# Mapping table for e.g. raid5_rs reshape causing the size of the raid device to double-fold once the reshape finishes. +# Check the status output (e.g. "dmsetup status $RaidDev") for progress. + + 0 $((2 * 1960886272)) raid raid5 7 0 region_size 2048 data_offset 8192 delta_disk 1 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + + Version History --------------- diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst index cc5aec861576..030d854628ac 100644 --- a/Documentation/admin-guide/device-mapper/index.rst +++ b/Documentation/admin-guide/device-mapper/index.rst @@ -18,6 +18,7 @@ Device Mapper dm-integrity dm-io dm-log + dm-pcache dm-queue-length dm-raid dm-service-time @@ -39,10 +40,3 @@ Device Mapper verity writecache zero - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/device-mapper/thin-provisioning.rst b/Documentation/admin-guide/device-mapper/thin-provisioning.rst index bafebf79da4b..b2fa49a5608a 100644 --- a/Documentation/admin-guide/device-mapper/thin-provisioning.rst +++ b/Documentation/admin-guide/device-mapper/thin-provisioning.rst @@ -80,11 +80,11 @@ less sharing than average you'll need a larger-than-average metadata device. As a guide, we suggest you calculate the number of bytes to use in the metadata device as 48 * $data_dev_size / $data_block_size but round it up -to 2MB if the answer is smaller. If you're creating large numbers of +to 2MiB if the answer is smaller. If you're creating large numbers of snapshots which are recording large amounts of change, you may find you need to increase this. -The largest size supported is 16GB: If the device is larger, +The largest size supported is 16GiB: If the device is larger, a warning will be issued and the excess space will not be used. Reloading a pool table @@ -107,13 +107,13 @@ Using an existing pool device $data_block_size gives the smallest unit of disk space that can be allocated at a time expressed in units of 512-byte sectors. -$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a -multiple of 128 (64KB). $data_block_size cannot be changed after the +$data_block_size must be between 128 (64KiB) and 2097152 (1GiB) and a +multiple of 128 (64KiB). $data_block_size cannot be changed after the thin-pool is created. People primarily interested in thin provisioning -may want to use a value such as 1024 (512KB). People doing lots of -snapshotting may want a smaller value such as 128 (64KB). If you are +may want to use a value such as 1024 (512KiB). People doing lots of +snapshotting may want a smaller value such as 128 (64KiB). If you are not zeroing newly-allocated data, a larger $data_block_size in the -region of 256000 (128MB) is suggested. +region of 262144 (128MiB) is suggested. $low_water_mark is expressed in blocks of size $data_block_size. If free space on the data device drops below this level then a dm event @@ -291,7 +291,7 @@ i) Constructor error_if_no_space: Error IOs, instead of queueing, if no space. - Data block size must be between 64KB (128 sectors) and 1GB + Data block size must be between 64KiB (128 sectors) and 1GiB (2097152 sectors) inclusive. diff --git a/Documentation/admin-guide/device-mapper/vdo-design.rst b/Documentation/admin-guide/device-mapper/vdo-design.rst index 3cd59decbec0..faa0ecd4a5ae 100644 --- a/Documentation/admin-guide/device-mapper/vdo-design.rst +++ b/Documentation/admin-guide/device-mapper/vdo-design.rst @@ -600,7 +600,7 @@ lock and return itself to the pool. All storage within vdo is managed as 4KB blocks, but it can accept writes as small as 512 bytes. Processing a write that is smaller than 4K requires a read-modify-write operation that reads the relevant 4K block, copies the -new data over the approriate sectors of the block, and then launches a +new data over the appropriate sectors of the block, and then launches a write operation for the modified data block. The read and write stages of this operation are nearly identical to the normal read and write operations, and a single data_vio is used throughout this operation. diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst index a14e6d3e787c..8a67b320a97b 100644 --- a/Documentation/admin-guide/device-mapper/vdo.rst +++ b/Documentation/admin-guide/device-mapper/vdo.rst @@ -1,5 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0-only +====== dm-vdo ====== diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst index a65c1602cb23..eb9475d7e196 100644 --- a/Documentation/admin-guide/device-mapper/verity.rst +++ b/Documentation/admin-guide/device-mapper/verity.rst @@ -87,35 +87,57 @@ panic_on_corruption Panic the device when a corrupted block is discovered. This option is not compatible with ignore_corruption and restart_on_corruption. +restart_on_error + Restart the system when an I/O error is detected. + This option can be combined with the restart_on_corruption option. + +panic_on_error + Panic the device when an I/O error is detected. This option is + not compatible with the restart_on_error option but can be combined + with the panic_on_corruption option. + ignore_zero_blocks Do not verify blocks that are expected to contain zeroes and always return zeroes instead. This may be useful if the partition contains unused blocks that are not guaranteed to contain zeroes. use_fec_from_device <fec_dev> - Use forward error correction (FEC) to recover from corruption if hash - verification fails. Use encoding data from the specified device. This - may be the same device where data and hash blocks reside, in which case - fec_start must be outside data and hash areas. + Use forward error correction (FEC) parity data from the specified device to + try to automatically recover from corruption and I/O errors. + + If this option is given, then <fec_roots> and <fec_blocks> must also be + given. <hash_block_size> must also be equal to <data_block_size>. + + <fec_dev> can be the same as <dev>, in which case <fec_start> must be + outside the data area. It can also be the same as <hash_dev>, in which case + <fec_start> must be outside the hash and optional additional metadata areas. - If the encoding data covers additional metadata, it must be accessible - on the hash device after the hash blocks. + If the data <dev> is encrypted, the <fec_dev> should be too. - Note: block sizes for data and hash devices must match. Also, if the - verity <dev> is encrypted the <fec_dev> should be too. + For more information, see `Forward error correction`_. fec_roots <num> - Number of generator roots. This equals to the number of parity bytes in - the encoding data. For example, in RS(M, N) encoding, the number of roots - is M-N. + The number of parity bytes in each 255-byte Reed-Solomon codeword. The + Reed-Solomon code used will be an RS(255, k) code where k = 255 - fec_roots. + + The supported values are 2 through 24 inclusive. Higher values provide + stronger error correction. However, the minimum value of 2 already provides + strong error correction due to the use of interleaving, so 2 is the + recommended value for most users. fec_roots=2 corresponds to an + RS(255, 253) code, which has a space overhead of about 0.8%. fec_blocks <num> - The number of encoding data blocks on the FEC device. The block size for - the FEC device is <data_block_size>. + The total number of <data_block_size> blocks that are error-checked using + FEC. This must be at least the sum of <num_data_blocks> and the number of + blocks needed by the hash tree. It can include additional metadata blocks, + which are assumed to be accessible on <hash_dev> following the hash blocks. + + Note that this is *not* the number of parity blocks. The number of parity + blocks is inferred from <fec_blocks>, <fec_roots>, and <data_block_size>. fec_start <offset> - This is the offset, in <data_block_size> blocks, from the start of the - FEC device to the beginning of the encoding data. + This is the offset, in <data_block_size> blocks, from the start of <fec_dev> + to the beginning of the parity data. check_at_most_once Verify data blocks only the first time they are read from the data device, @@ -142,8 +164,15 @@ root_hash_sig_key_desc <key_description> already in the secondary trusted keyring. try_verify_in_tasklet - If verity hashes are in cache, verify data blocks in kernel tasklet instead - of workqueue. This option can reduce IO latency. + If verity hashes are in cache and the IO size does not exceed the limit, + verify data blocks in bottom half instead of workqueue. This option can + reduce IO latency. The size limits can be configured via + /sys/module/dm_verity/parameters/use_bh_bytes. The four parameters + correspond to limits for IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, + IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn. + For example: + <none>,<rt>,<be>,<idle> + 4096,4096,4096,4096 Theory of operation =================== @@ -164,11 +193,6 @@ per-block basis. This allows for a lightweight hash computation on first read into the page cache. Block hashes are stored linearly, aligned to the nearest block size. -If forward error correction (FEC) support is enabled any recovery of -corrupted data will be verified using the cryptographic hash of the -corresponding data. This is why combining error correction with -integrity checking is essential. - Hash Tree --------- @@ -196,6 +220,80 @@ The tree looks something like: / ... \ / . . . \ / \ blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . blk_32767 +Forward error correction +------------------------ + +dm-verity's optional forward error correction (FEC) support adds strong error +correction capabilities to dm-verity. It allows systems that would be rendered +inoperable by errors to continue operating, albeit with reduced performance. + +FEC uses Reed-Solomon (RS) codes that are interleaved across the entire +device(s), allowing long bursts of corrupt or unreadable blocks to be recovered. + +dm-verity validates any FEC-corrected block against the wanted hash before using +it. Therefore, FEC doesn't affect the security properties of dm-verity. + +The integration of FEC with dm-verity provides significant benefits over a +separate error correction layer: + +- dm-verity invokes FEC only when a block's hash doesn't match the wanted hash + or the block cannot be read at all. As a result, FEC doesn't add overhead to + the common case where no error occurs. + +- dm-verity hashes are also used to identify erasure locations for RS decoding. + This allows correcting twice as many errors. + +FEC uses an RS(255, k) code where k = 255 - fec_roots. fec_roots is usually 2. +This means that each k (usually 253) message bytes have fec_roots (usually 2) +bytes of parity data added to get a 255-byte codeword. (Many external sources +call RS codewords "blocks". Since dm-verity already uses the term "block" to +mean something else, we'll use the clearer term "RS codeword".) + +FEC checks fec_blocks blocks of message data in total, consisting of: + +1. The data blocks from the data device +2. The hash blocks from the hash device +3. Optional additional metadata that follows the hash blocks on the hash device + +dm-verity assumes that the FEC parity data was computed as if the following +procedure were followed: + +1. Concatenate the message data from the above sources. +2. Zero-pad to the next multiple of k blocks. Let msg be the resulting byte + array, and msglen its length in bytes. +3. For 0 <= i < msglen / k (for each RS codeword): + a. Select msg[i + j * msglen / k] for 0 <= j < k. + Consider these to be the 'k' message bytes of an RS codeword. + b. Compute the corresponding 'fec_roots' parity bytes of the RS codeword, + and concatenate them to the FEC parity data. + +Step 3a interleaves the RS codewords across the entire device using an +interleaving degree of data_block_size * ceil(fec_blocks / k). This is the +maximal interleaving, such that the message data consists of a region containing +byte 0 of all the RS codewords, then a region containing byte 1 of all the RS +codewords, and so on up to the region for byte 'k - 1'. Note that the number of +codewords is set to a multiple of data_block_size; thus, the regions are +block-aligned, and there is an implicit zero padding of up to 'k - 1' blocks. + +This interleaving allows long bursts of errors to be corrected. It provides +much stronger error correction than storage devices typically provide, while +keeping the space overhead low. + +The cost is slow decoding: correcting a single block usually requires reading +254 extra blocks spread evenly across the device(s). However, that is +acceptable because dm-verity uses FEC only when there is actually an error. + +The list below contains additional details about the RS codes used by +dm-verity's FEC. Userspace programs that generate the parity data need to use +these parameters for the parity data to match exactly: + +- Field used is GF(256) +- Bytes are mapped to/from GF(256) elements in the natural way, where bits 0 + through 7 (low-order to high-order) map to the coefficients of x^0 through x^7 +- Field generator polynomial is x^8 + x^4 + x^3 + x^2 + 1 +- The codes used are systematic, BCH-view codes +- Primitive element alpha is 'x' +- First consecutive root of code generator polynomial is 'x^0' On-disk format ============== @@ -220,8 +318,10 @@ is available at the cryptsetup project's wiki page Status ====== -V (for Valid) is returned if every check performed so far was valid. -If any check failed, C (for Corruption) is returned. +1. V (for Valid) is returned if every check performed so far was valid. + If any check failed, C (for Corruption) is returned. +2. Number of corrected blocks by Forward Error Correction. + '-' if Forward Error Correction is not enabled. Example ======= |
