-rw-r--r-- Documentation/filesystems/erofs.rst | 129
1 file changed, 66 insertions, 63 deletions
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index fe06308e546c..e4f84ba91052 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -7,83 +7,90 @@ EROFS - Enhanced Read-Only File System
Overview
========
-EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a
-generic read-only filesystem solution for various read-only use cases instead
-of just focusing on storage space saving without considering any side effects
-of runtime performance.
-
-It is designed to meet the needs of flexibility, feature extendability and user
-payload friendly, etc. Apart from those, it is still kept as a simple
-random-access friendly high-performance filesystem to get rid of unneeded I/O
-amplification and memory-resident overhead compared to similar approaches.
-
-It is implemented to be a better choice for the following scenarios:
-
- - read-only storage media or
-
- - part of a fully trusted read-only solution, which means it needs to be
+EROFS (Enhanced Read-Only File System) is a modern, efficient, and secure
+read-only kernel filesystem designed for various use cases including immutable
+system images, container images, application sandbox images, and dataset
+distribution.
+
+An immutable image filesystem can be regarded as an enhanced archive format
+which allows golden images to be built once and mounted everywhere: images are
+bit-for-bit identical across all deployments and can be verified, audited, or
+shared without concerns about runtime modifications. By design, all user writes
+in this model are redirected to another trusted filesystem, for example via
+overlayfs for copy-on-write-style redirection.
+
+EROFS is a dedicated implementation of the image filesystem idea above, with a
+flexible, hierarchical on-disk design so that needed features can be enabled on
+demand. Filesystem data in the core format is strictly block-aligned in order
+to perform optimally on all kinds of storage media, including block devices and
+memory-backed devices. The on-disk format is easy to parse and purposely avoids
+the metadata redundancy of generic writable filesystems, a redundancy that can
+introduce extra inconsistency issues. This makes EROFS well suited to security
+auditing and untrusted remote access. In addition, designs such as inline data,
+inline/shared extended attributes, and optimized (de)compression provide better
+space efficiency while maintaining high performance.
+
+In short, EROFS aims to be a better fit for the following scenarios:
+
+ - As part of a secure immutable storage solution, where it needs to be
immutable and bit-for-bit identical to the official golden image for
- their releases due to security or other considerations and
-
- - hope to minimize extra storage space with guaranteed end-to-end performance
- by using compact layout, transparent file compression and direct access,
- especially for those embedded devices with limited memory and high-density
- hosts with numerous containers.
+ each individual copy, in order to meet security, data sharing, and/or
+ other requirements;
-Here are the main features of EROFS:
+ - Minimizing storage overhead with guaranteed end-to-end performance
+ by using compact (meta)data layout, optimized transparent data compression,
+ deduplication and direct access, especially for those embedded devices with
+ limited memory and high-density hosts with numerous containers.
- - Little endian on-disk design;
+Here is the list of highlights:
- - Block-based distribution and file-based distribution over fscache are
- supported;
+ - Little-endian on-disk design with 48-bit block addressing, supporting up
+ to 1 EiB filesystem capacity with 4 KiB block size;
- - Support multiple devices to refer to external blobs, which can be used
- for container images;
+ - Two compact inode metadata layouts for space and performance efficiency:
- - 32-bit block addresses for each device, therefore 16TiB address space at
- most with 4KiB block size for now;
+   ======================== ======== ======================================
+                            compact  extended
+   ======================== ======== ======================================
+   Inode core metadata size 32 bytes 64 bytes
+   Max file size            4 GiB    16 EiB (also limited by max. vol size)
+   Max uids/gids            65536    4294967296
+   Nanosecond timestamps    no       yes
+   Max hardlinks            65536    4294967296
+   ======================== ======== ======================================
- - Two inode layouts for different requirements:
+ - Tailpacking inline data for better space efficiency and to reduce
+ unneeded I/O amplification;
-   ===================== ============ ======================================
-                         compact (v1) extended (v2)
-   ===================== ============ ======================================
-   Inode metadata size   32 bytes     64 bytes
-   Max file size         4 GiB        16 EiB (also limited by max. vol size)
-   Max uids/gids         65536        4294967296
-   Per-inode timestamp   no           yes (64 + 32-bit timestamp)
-   Max hardlinks         65536        4294967296
-   Metadata reserved     8 bytes      18 bytes
-   ===================== ============ ======================================
+ - Block-based and file-backed distribution are both supported;
- - Support extended attributes as an option;
+ - Multiple devices to reference external data blobs: inode data can be
+ optionally placed into external blobs, which enables image layering and data
+ sharing among different filesystems;
- - Support a bloom filter that speeds up negative extended attribute lookups;
+ - Inline and shared extended attributes with an optional bloom filter that
+ speeds up negative extended attribute lookups;
- - Support POSIX.1e ACLs by using extended attributes;
+ - POSIX.1e ACLs by using extended attributes;
- - Support transparent data compression as an option:
- LZ4, MicroLZMA, DEFLATE and Zstandard algorithms can be used on a per-file
- basis; In addition, inplace decompression is also supported to avoid bounce
- compressed buffers and unnecessary page cache thrashing.
+ - Transparent data compression as an option: supported algorithms (LZ4,
+ MicroLZMA, DEFLATE and Zstandard) can be selected on a per-inode basis.
+ Both the on-disk metadata and decompression runtime have been heavily
+ optimized to minimize the overhead for better performance;
- - Support chunk-based data deduplication and rolling-hash compressed data
- deduplication;
+ - Merging tail-end data into a special inode as fragments;
- - Support tailpacking inline compared to byte-addressed unaligned metadata
- or smaller block size alternatives;
+ - Chunk-based deduplication and rolling-hash compressed data deduplication;
- - Support merging tail-end data into a special inode as fragments.
+ - Direct I/O and FSDAX support on uncompressed inodes for use cases such as
+ secure containers, loop devices, and ramdisks that do not need page caching;
- - Support large folios to make use of THPs (Transparent Hugepages);
+ - Page cache sharing among inodes with identical content fingerprints on
+ the same machine.
- - Support direct I/O on uncompressed files to avoid double caching for loop
- devices;
+For more detailed information, please refer to our documentation site:
- - Support FSDAX on uncompressed images for secure containers and ramdisks in
- order to get rid of unnecessary page cache.
-
- - Support file-based on-demand loading with the Fscache infrastructure.
+- https://erofs.docs.kernel.org
The following git tree provides the file system user-space tools under
development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
@@ -91,10 +98,6 @@ compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
-For more information, please also refer to the documentation site:
-
-- https://erofs.docs.kernel.org
-
Bugs and patches are welcome, please kindly help us and send to the following
linux-erofs mailing list: