| -rw-r--r-- | Documentation/filesystems/erofs.rst | 129 |
1 files changed, 66 insertions, 63 deletions
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index fe06308e546c..e4f84ba91052 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -7,83 +7,90 @@ EROFS - Enhanced Read-Only File System
 Overview
 ========
 
-EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a
-generic read-only filesystem solution for various read-only use cases instead
-of just focusing on storage space saving without considering any side effects
-of runtime performance.
-
-It is designed to meet the needs of flexibility, feature extendability and user
-payload friendly, etc. Apart from those, it is still kept as a simple
-random-access friendly high-performance filesystem to get rid of unneeded I/O
-amplification and memory-resident overhead compared to similar approaches.
-
-It is implemented to be a better choice for the following scenarios:
-
- - read-only storage media or
-
- - part of a fully trusted read-only solution, which means it needs to be
+EROFS (Enhanced Read-Only File System) is a modern, efficient, and secure
+read-only kernel filesystem designed for various use cases including immutable
+system images, container images, application sandbox images, and dataset
+distribution.
+
+An immutable image filesystem can be regarded as an enhanced archive format
+which allows golden images to be built once and mounted everywhere -- images are
+bit-for-bit identical across all deployments and can be verified, audited, or
+shared without concerns about runtime modifications (in this model, all user
+writes should be redirected into another trusted filesystem, for example, via
+overlayfs for copy-on-write-style redirection, by design).
+
+EROFS is a dedicated implementation of the image filesystem idea above, with a
+flexible, hierarchical on-disk design so that needed features can be enabled on
+demand. Filesystem data in the core format is strictly block-aligned in order
+to perform optimally on all kinds of storage media, including block devices and
+memory-backed devices. The on-disk format is easy to parse and purposely avoids
+the unnecessary metadata redundancy found in generic writable filesystems, which
+can suffer from extra inconsistency issues -- making it ideal for security
+auditing and untrusted remote access. In addition, designs such as inline data,
+inline/shared extended attributes, and optimized (de)compression provide better
+space efficiency while maintaining high performance.
+
+In short, EROFS aims to be a better fit for the following scenarios:
+
+ - As part of a secure immutable storage solution, where it needs to be
    immutable and bit-for-bit identical to the official golden image for
-   their releases due to security or other considerations and
-
- - hope to minimize extra storage space with guaranteed end-to-end performance
-   by using compact layout, transparent file compression and direct access,
-   especially for those embedded devices with limited memory and high-density
-   hosts with numerous containers.
+   each individual copy, in order to meet security, data sharing, and/or
+   other requirements;
 
-Here are the main features of EROFS:
+ - Minimizing storage overhead with guaranteed end-to-end performance
+   by using compact (meta)data layout, optimized transparent data compression,
+   deduplication and direct access, especially for those embedded devices with
+   limited memory and high-density hosts with numerous containers.
 
- - Little endian on-disk design;
+Here is the list of highlights:
 
- - Block-based distribution and file-based distribution over fscache are
-   supported;
+ - Little endian on-disk design with 48-bit block addressing, supporting up
+   to 1 EiB filesystem capacity with 4 KiB block size;
 
- - Support multiple devices to refer to external blobs, which can be used
-   for container images;
+ - Two compact inode metadata layouts for space and performance efficiency:
 
- - 32-bit block addresses for each device, therefore 16TiB address space at
-   most with 4KiB block size for now;
+   ======================== ======== ======================================
+                            compact  extended
+   ======================== ======== ======================================
+   Inode core metadata size 32 bytes 64 bytes
+   Max file size            4 GiB    16 EiB (also limited by max. vol size)
+   Max uids/gids            65536    4294967296
+   Nanosecond timestamps    no       yes
+   Max hardlinks            65536    4294967296
+   ======================== ======== ======================================
 
- - Two inode layouts for different requirements:
+ - Support tailpacking inline data for better space efficiency and reduce
+   unneeded I/O amplification;
 
-   ===================== ============ ======================================
-                         compact (v1) extended (v2)
-   ===================== ============ ======================================
-   Inode metadata size   32 bytes     64 bytes
-   Max file size         4 GiB        16 EiB (also limited by max. vol size)
-   Max uids/gids         65536        4294967296
-   Per-inode timestamp   no           yes (64 + 32-bit timestamp)
-   Max hardlinks         65536        4294967296
-   Metadata reserved     8 bytes      18 bytes
-   ===================== ============ ======================================
+ - Block-based and file-backed distribution are both supported;
 
- - Support extended attributes as an option;
+ - Multiple devices to reference external data blobs: inode data can be
+   optionally placed into external blobs, which enables image layering and data
+   sharing among different filesystems;
 
- - Support a bloom filter that speeds up negative extended attribute lookups;
+ - Inline and shared extended attributes with an optional bloom filter that
+   speeds up negative extended attribute lookups;
 
- - Support POSIX.1e ACLs by using extended attributes;
+ - POSIX.1e ACLs by using extended attributes;
 
- - Support transparent data compression as an option:
-   LZ4, MicroLZMA, DEFLATE and Zstandard algorithms can be used on a per-file
-   basis; In addition, inplace decompression is also supported to avoid bounce
-   compressed buffers and unnecessary page cache thrashing.
+ - Transparent data compression as an option: Supported algorithms (LZ4,
+   MicroLZMA, DEFLATE and Zstandard) can be selected on a per-inode basis.
+   Both the on-disk metadata and decompression runtime have been heavily
+   optimized to minimize the overhead for better performance.
 
- - Support chunk-based data deduplication and rolling-hash compressed data
-   deduplication;
+ - Merging tail-end data into a special inode as fragments;
 
- - Support tailpacking inline compared to byte-addressed unaligned metadata
-   or smaller block size alternatives;
+ - Chunk-based deduplication and rolling-hash compressed data deduplication;
 
- - Support merging tail-end data into a special inode as fragments.
+ - Direct I/O and FSDAX support on uncompressed inodes for use cases such as
+   secure containers, loop devices, and ramdisks that do not need page caching;
 
- - Support large folios to make use of THPs (Transparent Hugepages);
+ - Page cache sharing among inodes with identical content fingerprints on
+   the same machine.
 
- - Support direct I/O on uncompressed files to avoid double caching for loop
-   devices;
+For more detailed information, please refer to our documentation site:
 
- - Support FSDAX on uncompressed images for secure containers and ramdisks in
-   order to get rid of unnecessary page cache.
-
- - Support file-based on-demand loading with the Fscache infrastructure.
+- https://erofs.docs.kernel.org
 
 The following git tree provides the file system user-space tools under
 development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
@@ -91,10 +98,6 @@ compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
 
 - git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
 
-For more information, please also refer to the documentation site:
-
-- https://erofs.docs.kernel.org
-
 Bugs and patches are welcome, please kindly help us and send to the following
 linux-erofs mailing list:
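
Note on the capacity figures in this patch: the removed text gives 16 TiB for 32-bit block addresses and the added text gives 1 EiB for 48-bit block addressing, both at a 4 KiB block size. The minimal Python sketch below checks that arithmetic, assuming capacity is simply the number of addressable blocks times the block size. The helper name max_capacity_bytes is hypothetical and exists only for illustration; it is not part of EROFS or erofs-utils.

```python
# Illustrative arithmetic only. max_capacity_bytes is a hypothetical helper,
# not an EROFS or erofs-utils API.
KIB = 1 << 10
TIB = 1 << 40
EIB = 1 << 60


def max_capacity_bytes(addr_bits: int, block_size: int) -> int:
    """Upper bound on addressable bytes: 2**addr_bits blocks of block_size bytes."""
    return (1 << addr_bits) * block_size


# Old documentation: 32-bit block addresses with 4 KiB blocks.
old_limit = max_capacity_bytes(32, 4 * KIB)
# New documentation: 48-bit block addressing with 4 KiB blocks.
new_limit = max_capacity_bytes(48, 4 * KIB)

print(old_limit // TIB, "TiB")
print(new_limit // EIB, "EiB")
```

Under that assumption the two limits differ by a factor of 2**16, which is exactly the 16 extra address bits.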
