Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/fscrypt.rst | 164 |
-rw-r--r-- | Documentation/filesystems/idmappings.rst | 14 |
-rw-r--r-- | Documentation/filesystems/locking.rst | 61 |
-rw-r--r-- | Documentation/filesystems/porting.rst | 11 |
-rw-r--r-- | Documentation/filesystems/tmpfs.rst | 38 |
-rw-r--r-- | Documentation/filesystems/vfs.rst | 12 |
6 files changed, 226 insertions, 74 deletions
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index eccd327e6df5..a624e92f2687 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -332,54 +332,121 @@ Encryption modes and usage fscrypt allows one encryption mode to be specified for file contents and one encryption mode to be specified for filenames. Different directory trees are permitted to use different encryption modes. + +Supported modes +--------------- + Currently, the following pairs of encryption modes are supported: - AES-256-XTS for contents and AES-256-CTS-CBC for filenames -- AES-128-CBC for contents and AES-128-CTS-CBC for filenames +- AES-256-XTS for contents and AES-256-HCTR2 for filenames - Adiantum for both contents and filenames -- AES-256-XTS for contents and AES-256-HCTR2 for filenames (v2 policies only) -- SM4-XTS for contents and SM4-CTS-CBC for filenames (v2 policies only) - -If unsure, you should use the (AES-256-XTS, AES-256-CTS-CBC) pair. - -AES-128-CBC was added only for low-powered embedded devices with -crypto accelerators such as CAAM or CESA that do not support XTS. To -use AES-128-CBC, CONFIG_CRYPTO_ESSIV and CONFIG_CRYPTO_SHA256 (or -another SHA-256 implementation) must be enabled so that ESSIV can be -used. - -Adiantum is a (primarily) stream cipher-based mode that is fast even -on CPUs without dedicated crypto instructions. It's also a true -wide-block mode, unlike XTS. It can also eliminate the need to derive -per-file encryption keys. However, it depends on the security of two -primitives, XChaCha12 and AES-256, rather than just one. See the -paper "Adiantum: length-preserving encryption for entry-level -processors" (https://eprint.iacr.org/2018/720.pdf) for more details. -To use Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast -implementations of ChaCha and NHPoly1305 should be enabled, e.g. -CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM. 
- -AES-256-HCTR2 is another true wide-block encryption mode that is intended for -use on CPUs with dedicated crypto instructions. AES-256-HCTR2 has the property -that a bitflip in the plaintext changes the entire ciphertext. This property -makes it desirable for filename encryption since initialization vectors are -reused within a directory. For more details on AES-256-HCTR2, see the paper -"Length-preserving encryption with HCTR2" -(https://eprint.iacr.org/2021/1441.pdf). To use AES-256-HCTR2, -CONFIG_CRYPTO_HCTR2 must be enabled. Also, fast implementations of XCTR and -POLYVAL should be enabled, e.g. CRYPTO_POLYVAL_ARM64_CE and -CRYPTO_AES_ARM64_CE_BLK for ARM64. - -SM4 is a Chinese block cipher that is an alternative to AES. It has -not seen as much security review as AES, and it only has a 128-bit key -size. It may be useful in cases where its use is mandated. -Otherwise, it should not be used. For SM4 support to be available, it -also needs to be enabled in the kernel crypto API. - -New encryption modes can be added relatively easily, without changes -to individual filesystems. However, authenticated encryption (AE) -modes are not currently supported because of the difficulty of dealing -with ciphertext expansion. +- AES-128-CBC-ESSIV for contents and AES-128-CTS-CBC for filenames +- SM4-XTS for contents and SM4-CTS-CBC for filenames + +Authenticated encryption modes are not currently supported because of +the difficulty of dealing with ciphertext expansion. Therefore, +contents encryption uses a block cipher in `XTS mode +<https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS>`_ or +`CBC-ESSIV mode +<https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_(ESSIV)>`_, +or a wide-block cipher. Filenames encryption uses a +block cipher in `CTS-CBC mode +<https://en.wikipedia.org/wiki/Ciphertext_stealing>`_ or a wide-block +cipher. + +The (AES-256-XTS, AES-256-CTS-CBC) pair is the recommended default. 
+It is also the only option that is *guaranteed* to always be supported +if the kernel supports fscrypt at all; see `Kernel config options`_. + +The (AES-256-XTS, AES-256-HCTR2) pair is also a good choice that +upgrades the filenames encryption to use a wide-block cipher. (A +*wide-block cipher*, also called a tweakable super-pseudorandom +permutation, has the property that changing one bit scrambles the +entire result.) As described in `Filenames encryption`_, a wide-block +cipher is the ideal mode for the problem domain, though CTS-CBC is the +"least bad" choice among the alternatives. For more information about +HCTR2, see `the HCTR2 paper <https://eprint.iacr.org/2021/1441.pdf>`_. + +Adiantum is recommended on systems where AES is too slow due to lack +of hardware acceleration for AES. Adiantum is a wide-block cipher +that uses XChaCha12 and AES-256 as its underlying components. Most of +the work is done by XChaCha12, which is much faster than AES when AES +acceleration is unavailable. For more information about Adiantum, see +`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_. + +The (AES-128-CBC-ESSIV, AES-128-CTS-CBC) pair exists only to support +systems whose only form of AES acceleration is an off-CPU crypto +accelerator such as CAAM or CESA that does not support XTS. + +The remaining mode pairs are the "national pride ciphers": + +- (SM4-XTS, SM4-CTS-CBC) + +Generally speaking, these ciphers aren't "bad" per se, but they +receive limited security review compared to the usual choices such as +AES and ChaCha. They also don't bring much new to the table. It is +suggested to only use these ciphers where their use is mandated. + +Kernel config options +--------------------- + +Enabling fscrypt support (CONFIG_FS_ENCRYPTION) automatically pulls in +only the basic support from the crypto API needed to use AES-256-XTS +and AES-256-CTS-CBC encryption. 
For optimal performance, it is +strongly recommended to also enable any available platform-specific +kconfig options that provide acceleration for the algorithm(s) you +wish to use. Support for any "non-default" encryption modes typically +requires extra kconfig options as well. + +Below, some relevant options are listed by encryption mode. Note, +acceleration options not listed below may be available for your +platform; refer to the kconfig menus. File contents encryption can +also be configured to use inline encryption hardware instead of the +kernel crypto API (see `Inline encryption support`_); in that case, +the file contents mode doesn't need to be supported in the kernel crypto +API, but the filenames mode still does. + +- AES-256-XTS and AES-256-CTS-CBC + - Recommended: + - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK + - x86: CONFIG_CRYPTO_AES_NI_INTEL + +- AES-256-HCTR2 + - Mandatory: + - CONFIG_CRYPTO_HCTR2 + - Recommended: + - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK + - arm64: CONFIG_CRYPTO_POLYVAL_ARM64_CE + - x86: CONFIG_CRYPTO_AES_NI_INTEL + - x86: CONFIG_CRYPTO_POLYVAL_CLMUL_NI + +- Adiantum + - Mandatory: + - CONFIG_CRYPTO_ADIANTUM + - Recommended: + - arm32: CONFIG_CRYPTO_CHACHA20_NEON + - arm32: CONFIG_CRYPTO_NHPOLY1305_NEON + - arm64: CONFIG_CRYPTO_CHACHA20_NEON + - arm64: CONFIG_CRYPTO_NHPOLY1305_NEON + - x86: CONFIG_CRYPTO_CHACHA20_X86_64 + - x86: CONFIG_CRYPTO_NHPOLY1305_SSE2 + - x86: CONFIG_CRYPTO_NHPOLY1305_AVX2 + +- AES-128-CBC-ESSIV and AES-128-CTS-CBC: + - Mandatory: + - CONFIG_CRYPTO_ESSIV + - CONFIG_CRYPTO_SHA256 or another SHA-256 implementation + - Recommended: + - AES-CBC acceleration + +fscrypt also uses HMAC-SHA512 for key derivation, so enabling SHA-512 +acceleration is recommended: + +- SHA-512 + - Recommended: + - arm64: CONFIG_CRYPTO_SHA512_ARM64_CE + - x86: CONFIG_CRYPTO_SHA512_SSSE3 Contents encryption ------------------- @@ -493,7 +560,14 @@ This structure must be initialized as follows: be set to constants from ``<linux/fscrypt.h>``
which identify the encryption modes to use. If unsure, use FSCRYPT_MODE_AES_256_XTS (1) for ``contents_encryption_mode`` and FSCRYPT_MODE_AES_256_CTS - (4) for ``filenames_encryption_mode``. + (4) for ``filenames_encryption_mode``. For details, see `Encryption + modes and usage`_. + + v1 encryption policies only support three combinations of modes: + (FSCRYPT_MODE_AES_256_XTS, FSCRYPT_MODE_AES_256_CTS), + (FSCRYPT_MODE_AES_128_CBC, FSCRYPT_MODE_AES_128_CTS), and + (FSCRYPT_MODE_ADIANTUM, FSCRYPT_MODE_ADIANTUM). v2 policies support + all combinations documented in `Supported modes`_. - ``flags`` contains optional flags from ``<linux/fscrypt.h>``: diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst index ad6d21640576..d095c5838f94 100644 --- a/Documentation/filesystems/idmappings.rst +++ b/Documentation/filesystems/idmappings.rst @@ -146,9 +146,10 @@ For the rest of this document we will prefix all userspace ids with ``u`` and all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So an idmapping will be written as ``u0:k10000:r10000``. -For example, the id ``u1000`` is an id in the upper idmapset or "userspace -idmapset" starting with ``u1000``. And it is mapped to ``k11000`` which is a -kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``. +For example, within this idmapping, the id ``u1000`` is an id in the upper +idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to +``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset" +starting with ``k10000``. A kernel id is always created by an idmapping. Such idmappings are associated with user namespaces. Since we mainly care about how idmappings work we're not @@ -373,6 +374,13 @@ kernel maps the caller's userspace id down into a kernel id according to the caller's idmapping and then maps that kernel id up according to the filesystem's idmapping. 
+From the implementation point it's worth mentioning how idmappings are represented. +All idmappings are taken from the corresponding user namespace. + + - caller's idmapping (usually taken from ``current_user_ns()``) + - filesystem's idmapping (``sb->s_user_ns``) + - mount's idmapping (``mnt_idmap(vfsmnt)``) + Let's see some examples with caller/filesystem idmapping but without mount idmappings. This will exhibit some problems we can hit. After that we will revisit/reconsider these examples, this time using mount idmappings, to see how diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 0ca479dbb1cd..2fd01b9aaced 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -85,13 +85,14 @@ prototypes:: struct dentry *dentry, struct fileattr *fa); int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa); struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int); + struct offset_ctx *(*get_offset_ctx)(struct inode *inode); locking rules: all may block -============== ============================================= +============== ================================================== ops i_rwsem(inode) -============== ============================================= +============== ================================================== lookup: shared create: exclusive link: exclusive (both) @@ -115,7 +116,8 @@ atomic_open: shared (exclusive if O_CREAT is set in open flags) tmpfile: no fileattr_get: no or exclusive fileattr_set: exclusive -============== ============================================= +get_offset_ctx no +============== ================================================== Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem @@ -374,10 +376,17 @@ invalidate_lock before invalidating page cache in truncate / hole punch path (and thus calling into ->invalidate_folio) to block races between page cache invalidation and page cache filling functions (fault, 
read, ...). -->release_folio() is called when the kernel is about to try to drop the -buffers from the folio in preparation for freeing it. It returns false to -indicate that the buffers are (or may be) freeable. If ->release_folio is -NULL, the kernel assumes that the fs has no private interest in the buffers. +->release_folio() is called when the MM wants to make a change to the +folio that would invalidate the filesystem's private data. For example, +it may be about to be removed from the address_space or split. The folio +is locked and not under writeback. It may be dirty. The gfp parameter +is not usually used for allocation, but rather to indicate what the +filesystem may do to attempt to free the private data. The filesystem may +return false to indicate that the folio's private data cannot be freed. +If it returns true, it should have already removed the private data from +the folio. If a filesystem does not provide a ->release_folio method, +the pagecache will assume that private data is buffer_heads and call +try_to_free_buffers(). ->free_folio() is called when the kernel has dropped the folio from the page cache. 
@@ -627,26 +636,29 @@ vm_operations_struct prototypes:: - void (*open)(struct vm_area_struct*); - void (*close)(struct vm_area_struct*); - vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *); + void (*open)(struct vm_area_struct *); + void (*close)(struct vm_area_struct *); + vm_fault_t (*fault)(struct vm_fault *); + vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order); + vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end); vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: -============= ========= =========================== +============= ========== =========================== ops mmap_lock PageLocked(page) -============= ========= =========================== -open: yes -close: yes -fault: yes can return with page locked -map_pages: read -page_mkwrite: yes can return with page locked -pfn_mkwrite: yes -access: yes -============= ========= =========================== +============= ========== =========================== +open: write +close: read/write +fault: read can return with page locked +huge_fault: maybe-read +map_pages: maybe-read +page_mkwrite: read can return with page locked +pfn_mkwrite: read +access: read +============= ========== =========================== ->fault() is called when a previously not present pte is about to be faulted in. The filesystem must find and return the page associated with the passed in @@ -656,11 +668,18 @@ then ensure the page is not already truncated (invalidate_lock will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the page locked. The VM will unlock the page. +->huge_fault() is called when there is no PUD or PMD entry present. This +gives the filesystem the opportunity to install a PUD or PMD sized page. 
+Filesystems can also use the ->fault method to return a PMD sized page, +so implementing this function may not be necessary. In particular, +filesystems should not call filemap_fault() from ->huge_fault(). +The mmap_lock may not be held when this method is called. + ->map_pages() is called when VM asks to map easy accessible pages. Filesystem should find and map pages associated with offsets from "start_pgoff" till "end_pgoff". ->map_pages() is called with the RCU lock held and must not block. If it's not possible to reach a page without blocking, -filesystem should skip it. Filesystem should use do_set_pte() to setup +filesystem should skip it. Filesystem should use set_pte_range() to setup page table entry. Pointer to entry associated with the page is passed in "pte" field in vm_fault structure. Pointers to entries for other offsets should be calculated relative to "pte". diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 0f5da78ef4f9..98969d713e2e 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -938,3 +938,14 @@ file pointer instead of struct dentry pointer. d_tmpfile() is similarly changed to simplify callers. The passed file is in a non-open state and on success must be opened before returning (e.g. by calling finish_open_simple()). + +--- + +**mandatory** + +Calling convention for ->huge_fault has changed. It now takes a page +order instead of an enum page_entry_size, and it may be called without the +mmap_lock held. All in-tree users have been audited and do not seem to +depend on the mmap_lock being held, but out of tree users should verify +for themselves. If they do need it, they can return VM_FAULT_RETRY to +be called with the mmap_lock held. 
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst index 2cd8fa332feb..56a26c843dbe 100644 --- a/Documentation/filesystems/tmpfs.rst +++ b/Documentation/filesystems/tmpfs.rst @@ -21,8 +21,8 @@ explained further below, some of which can be reconfigured dynamically on the fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs filesystem can be resized but it cannot be resized to a size below its current usage. tmpfs also supports POSIX ACLs, and extended attributes for the -trusted.* and security.* namespaces. ramfs does not use swap and you cannot -modify any parameter for a ramfs filesystem. The size limit of a ramfs +trusted.*, security.* and user.* namespaces. ramfs does not use swap and you +cannot modify any parameter for a ramfs filesystem. The size limit of a ramfs filesystem is how much memory you have available, and so care must be taken if used so to not run out of memory. @@ -97,6 +97,9 @@ mount with such options, since it allows any user with write access to use up all the memory on the machine; but enhances the scalability of that instance in a system with many CPUs making intensive use of it. +If nr_inodes is not 0, that limited space for inodes is also used up by +extended attributes: "df -i"'s IUsed and IUse% increase, IFree decreases. + tmpfs blocks may be swapped out, when there is a shortage of memory. tmpfs has a mount option to disable its use of swap: @@ -123,6 +126,37 @@ sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled: which can be used to deny huge pages on all tmpfs mounts in an emergency, or to force huge pages on all tmpfs mounts for testing. +tmpfs also supports quota with the following mount options + +======================== ================================================= +quota User and group quota accounting and enforcement + is enabled on the mount. Tmpfs is using hidden + system quota files that are initialized on mount. 
+usrquota User quota accounting and enforcement is enabled + on the mount. +grpquota Group quota accounting and enforcement is enabled + on the mount. +usrquota_block_hardlimit Set global user quota block hard limit. +usrquota_inode_hardlimit Set global user quota inode hard limit. +grpquota_block_hardlimit Set global group quota block hard limit. +grpquota_inode_hardlimit Set global group quota inode hard limit. +======================== ================================================= + +None of the quota related mount options can be set or changed on remount. + +Quota limit parameters accept a suffix k, m or g for kilo, mega and giga +and can't be changed on remount. Default global quota limits are taking +effect for any and all user/group/project except root the first time the +quota entry for user/group/project id is being accessed - typically the +first time an inode with a particular id ownership is being created after +the mount. In other words, instead of the limits being initialized to zero, +they are initialized with the particular value provided with these mount +options. The limits can be changed for any user/group id at any time as they +normally can be. + +Note that tmpfs quotas do not support user namespaces so no uid/gid +translation is done if quotas are enabled inside user namespaces. + tmpfs has a mount option to set the NUMA memory allocation policy for all files in that instance (if CONFIG_NUMA is enabled) - which can be adjusted on the fly via 'mount -o remount ...' diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index cb2a97e49872..f8fe815ab1f3 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -260,9 +260,11 @@ filesystem. 
The following members are defined: void (*evict_inode) (struct inode *); void (*put_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - int (*freeze_super) (struct super_block *); + int (*freeze_super) (struct super_block *sb, + enum freeze_holder who); int (*freeze_fs) (struct super_block *); - int (*thaw_super) (struct super_block *); + int (*thaw_super) (struct super_block *sb, + enum freeze_holder who); int (*unfreeze_fs) (struct super_block *); int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); @@ -515,6 +517,7 @@ As of kernel 2.6.22, the following members are defined: int (*fileattr_set)(struct mnt_idmap *idmap, struct dentry *dentry, struct fileattr *fa); int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa); + struct offset_ctx *(*get_offset_ctx)(struct inode *inode); }; Again, all methods are called without any locks being held, unless @@ -675,7 +678,10 @@ otherwise noted. called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to change miscellaneous file flags and attributes. Callers hold i_rwsem exclusive. If unset, then fall back to f_op->ioctl(). - +``get_offset_ctx`` + called to get the offset context for a directory inode. A + filesystem must define this operation to use + simple_offset_dir_operations. The Address Space Object ========================
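
Note on the idmappings.rst hunk: the corrected sentence says that, within ``u0:k10000:r10000``, the id ``u1000`` maps to ``k11000``. The arithmetic behind that notation can be sketched as below. This is only an illustration, not kernel code: the ``Idmapping`` type and the ``map_down``/``map_up`` helpers are hypothetical names chosen for this note, and the sketch assumes an idmapping is a single contiguous range.

```python
from typing import NamedTuple, Optional


class Idmapping(NamedTuple):
    """One idmapping in the u<upper>:k<lower>:r<count> notation (hypothetical type)."""
    upper: int   # first userspace id of the range (the "u" part)
    lower: int   # first kernel id of the range (the "k" part)
    count: int   # number of ids in the range (the "r" part)


def map_down(m: Idmapping, uid: int) -> Optional[int]:
    """Map a userspace id down to a kernel id; None if the id is not mapped."""
    if m.upper <= uid < m.upper + m.count:
        return m.lower + (uid - m.upper)
    return None


def map_up(m: Idmapping, kid: int) -> Optional[int]:
    """Map a kernel id up to a userspace id; None if the id is not mapped."""
    if m.lower <= kid < m.lower + m.count:
        return m.upper + (kid - m.lower)
    return None


# u0:k10000:r10000, the idmapping used in the hunk
caller = Idmapping(upper=0, lower=10000, count=10000)

kid = map_down(caller, 1000)   # u1000 -> k11000
uid = map_up(caller, kid)      # k11000 -> u1000
```

The same two helpers cover the translation described in the hunk's context: map the caller's userspace id down with the caller's idmapping, then map the resulting kernel id up with the filesystem's idmapping. An id outside the range yields ``None``, which stands in for "unmapped" in this sketch.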