| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-02-09 15:13:05 -0800 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-02-09 15:13:05 -0800 |
| commit | 9e355113f02be17db573d579515dee63621b7c8b (patch) | |
| tree | 9d6d6dea7ccae97b5439a91d30eed16b5821cfb9 | |
| parent | 3304b3fedddfb1357c7f9e25526b5a7899ee1f13 (diff) | |
| parent | 6cbfdf89470ef3c2110f376a507d135e7a7a7378 (diff) | |
Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains a mix of VFS cleanups, performance improvements, API
fixes, documentation, and a deprecation notice.
Scalability and performance:
- Rework pid allocation to only take pidmap_lock once instead of
twice during alloc_pid(), improving thread creation/teardown
throughput by 10-16% depending on false-sharing luck. Pad the
namespace refcount to reduce false-sharing
- Track file lock presence via a flag in ->i_opflags instead of
reading ->i_flctx, avoiding false-sharing with ->i_readcount on
open/close hot paths. Measured 4-16% improvement on 24-core
open-in-a-loop benchmarks
- Use a consume fence in locks_inode_context() to match the
store-release/load-consume idiom, eliminating a hardware fence on
some architectures
- Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
false-sharing
- Remove a redundant DCACHE_MANAGED_DENTRY check in
__follow_mount_rcu() that never fires since the caller already
verifies it, eliminating a 100% mispredicted branch
- Fix a 100% mispredicted likely() in devcgroup_inode_permission()
that became wrong after a prior code reorder
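The flag-based file lock tracking and the release/consume pairing described above can be sketched in userspace C11. This is an illustration only, not the kernel code: `demo_inode`, `demo_ctx`, `DEMO_FLCTX_PRESENT`, and both helper functions are hypothetical names, a single writer is assumed (the kernel serializes writers under `inode->i_lock`), and a C11 acquire load stands in for the kernel's consume-style fence.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the inode fields involved; not kernel types. */
#define DEMO_FLCTX_PRESENT 0x1u

struct demo_ctx {
	int nr_locks;
};

struct demo_inode {
	atomic_uint opflags;     /* flag word checked on the hot path */
	struct demo_ctx *flctx;  /* only meaningful once the flag is set */
};

/*
 * Writer: store the pointer first, then publish the flag with release
 * ordering, so any reader that observes the flag also observes the pointer.
 * Assumes a single writer; the kernel takes inode->i_lock around this step.
 */
static void demo_publish_ctx(struct demo_inode *inode, struct demo_ctx *ctx)
{
	unsigned int flags;

	inode->flctx = ctx;
	flags = atomic_load_explicit(&inode->opflags, memory_order_relaxed);
	atomic_store_explicit(&inode->opflags, flags | DEMO_FLCTX_PRESENT,
			      memory_order_release);
}

/*
 * Reader: test the flag before touching the pointer. When no context has
 * been published, the pointer field is never read at all.
 * memory_order_acquire is a portable stand-in for the consume-style load.
 */
static struct demo_ctx *demo_get_ctx(struct demo_inode *inode)
{
	unsigned int flags;

	flags = atomic_load_explicit(&inode->opflags, memory_order_acquire);
	if (!(flags & DEMO_FLCTX_PRESENT))
		return NULL;
	return inode->flctx;
}
```

The point of the design is that the common "no locks on this file" case is decided from the flag word alone, so the reader never pulls in the cache line holding the context pointer.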
Bug fixes and correctness:
- Make insert_inode_locked() wait for inode destruction instead of
skipping, fixing a corner case where two matching inodes could
exist in the hash
- Move f_mode initialization before file_ref_init() in alloc_file()
to respect the SLAB_TYPESAFE_BY_RCU ordering contract
- Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
no buffers attached, preventing a null pointer dereference when
AS_RELEASE_ALWAYS is set but no release_folio op exists
- Fix select restart_block to store end_time as timespec64, avoiding
truncation of tv_sec on 32-bit architectures
- Make dump_inode() use get_kernel_nofault() to safely access inode
and superblock fields, matching the dump_mapping() pattern
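The select restart_block truncation can be illustrated with a small userspace model. The sketch assumes the old restart block kept the seconds value in a field that is only 32 bits wide on 32-bit targets, as the description above implies; the `demo_*` types and functions are hypothetical and `uint32_t` stands in for that narrow field.

```c
#include <stdint.h>

/* Hypothetical 64-bit timestamp, modelled on the shape of timespec64. */
struct demo_timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Old-style layout: separate fields, seconds narrower than 64 bits. */
struct demo_restart_old {
	uint32_t tv_sec;   /* stand-in for a 32-bit-wide seconds field */
	uint32_t tv_nsec;
};

/* New-style layout: the whole timestamp is stored as one struct. */
struct demo_restart_new {
	struct demo_timespec64 end_time;
};

/* Save and restore through the old layout; the upper 32 bits are lost. */
static int64_t demo_roundtrip_old(struct demo_timespec64 end_time)
{
	struct demo_restart_old rb;

	rb.tv_sec = (uint32_t)end_time.tv_sec;
	rb.tv_nsec = (uint32_t)end_time.tv_nsec;
	return (int64_t)rb.tv_sec;
}

/* Save and restore through the new layout; the value survives intact. */
static int64_t demo_roundtrip_new(struct demo_timespec64 end_time)
{
	struct demo_restart_new rb;

	rb.end_time = end_time;
	return rb.end_time.tv_sec;
}
```

A deadline whose seconds value does not fit in 32 bits comes back changed from the old round trip and unchanged from the new one, which is the failure a restarted poll() would otherwise see.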
API modernization:
- Make posix_acl_to_xattr() allocate the buffer internally since
every single caller was doing it anyway. Reduces boilerplate and
unnecessary error checking across ~15 filesystems
- Replace deprecated simple_strtoul() with kstrtoul() for the
ihash_entries, dhash_entries, mhash_entries, and mphash_entries
boot parameters, adding proper error handling
- Convert chardev code to use guard(mutex) and __free(kfree) cleanup
patterns
- Replace min_t() with min() or umin() in VFS code to avoid silently
truncating unsigned long to unsigned int
- Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
already check the flag
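The min_t() hazard mentioned above is easy to reproduce outside the kernel. The macro below is a deliberately simplified stand-in (the real min_t() is more elaborate and evaluates its arguments only once), and the `demo_*` names are hypothetical; what it shares with the kernel macro is that both operands are cast to the named type before the comparison.

```c
#include <stdint.h>

/*
 * Simplified stand-in for min_t(): cast both operands to 'type', then
 * compare. Arguments are evaluated more than once, so only pass plain
 * values in this sketch.
 */
#define DEMO_MIN_T(type, a, b) \
	((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Forcing a 32-bit type silently drops the upper bits of a 64-bit length. */
static uint64_t demo_min_t_u32(uint64_t len, uint64_t limit)
{
	return DEMO_MIN_T(uint32_t, len, limit);
}

/* Comparing in the operands' own width keeps the full value. */
static uint64_t demo_min_u64(uint64_t len, uint64_t limit)
{
	return len < limit ? len : limit;
}
```

With a length of 0x100000001 and a limit of 4096, the cast version compares 1 against 4096 and returns 1, while the plain comparison returns 4096. That wrong-but-plausible result is what replacing min_t() with min() or umin() is meant to rule out.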
Deprecation:
- Begin deprecating legacy BSD process accounting (acct(2)). The
interface has numerous footguns and better alternatives exist
(eBPF)
Documentation:
- Fix and complete kernel-doc for struct export_operations, removing
duplicated documentation between ReST and source
- Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()
Testing:
- Add a kunit test for initramfs cpio handling of entries with
filesize > PATH_MAX
Misc:
- Add missing <linux/init_task.h> include in fs_struct.c"
* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
posix_acl: make posix_acl_to_xattr() alloc the buffer
fs: make insert_inode_locked() wait for inode destruction
initramfs_test: kunit test for cpio.filesize > PATH_MAX
fs: improve dump_inode() to safely access inode fields
fs: add <linux/init_task.h> for 'init_fs'
docs: exportfs: Use source code struct documentation
fs: move initializing f_mode before file_ref_init()
exportfs: Complete kernel-doc for struct export_operations
exportfs: Mark struct export_operations functions at kernel-doc
exportfs: Fix kernel-doc output for get_name()
acct(2): begin the deprecation of legacy BSD process accounting
device_cgroup: remove branch hint after code refactor
VFS: fix __start_dirop() kernel-doc warnings
fs: Describe @isnew parameter in ilookup5_nowait()
fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
select: store end_time as timespec64 in restart block
chardev: Switch to guard(mutex) and __free(kfree)
namespace: Replace simple_strtoul with kstrtoul to parse boot params
dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
...
39 files changed, 352 insertions, 294 deletions
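Many of the per-filesystem hunks in this merge are the same mechanical conversion: callers stop sizing and allocating a buffer themselves and instead receive an allocated buffer plus its size from posix_acl_to_xattr(). A userspace sketch of that calling convention follows. Everything named `demo_*` is hypothetical, the serialized layout is invented for illustration, and malloc() replaces the kernel allocator and its gfp argument.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory ACL: a count plus an array of entries. */
struct demo_acl {
	size_t count;
	const unsigned int *entries;
};

/* Size of the serialized form: one header word plus one word per entry. */
static size_t demo_acl_xattr_size(size_t count)
{
	return sizeof(unsigned int) * (1 + count);
}

/*
 * Allocate-internally style: compute the size, allocate, serialize, and
 * hand back both the buffer and its size. The only failure mode is
 * allocation failure, reported as NULL, so callers need a single check.
 * The caller owns the returned buffer and must free() it.
 */
static void *demo_acl_to_xattr(const struct demo_acl *acl, size_t *sizep)
{
	size_t size = demo_acl_xattr_size(acl->count);
	unsigned int *buf = malloc(size);

	if (!buf)
		return NULL;

	buf[0] = 2;  /* made-up version word for this sketch */
	if (acl->count)
		memcpy(buf + 1, acl->entries,
		       acl->count * sizeof(unsigned int));

	*sizep = size;
	return buf;
}
```

Compared with the older "caller allocates, callee fills and returns an error code" shape, each call site loses its size computation, its allocation, and its second error branch, which matches the boilerplate removed in the ACL hunks.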
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst index de64d2d002a2..a01d9b9b5bc3 100644 --- a/Documentation/filesystems/nfs/exporting.rst +++ b/Documentation/filesystems/nfs/exporting.rst @@ -119,43 +119,11 @@ For a filesystem to be exportable it must: A file system implementation declares that instances of the filesystem are exportable by setting the s_export_op field in the struct -super_block. This field must point to a "struct export_operations" -struct which has the following members: - - encode_fh (mandatory) - Takes a dentry and creates a filehandle fragment which may later be used - to find or create a dentry for the same object. - - fh_to_dentry (mandatory) - Given a filehandle fragment, this should find the implied object and - create a dentry for it (possibly with d_obtain_alias). - - fh_to_parent (optional but strongly recommended) - Given a filehandle fragment, this should find the parent of the - implied object and create a dentry for it (possibly with - d_obtain_alias). May fail if the filehandle fragment is too small. - - get_parent (optional but strongly recommended) - When given a dentry for a directory, this should return a dentry for - the parent. Quite possibly the parent dentry will have been allocated - by d_alloc_anon. The default get_parent function just returns an error - so any filehandle lookup that requires finding a parent will fail. - ->lookup("..") is *not* used as a default as it can leave ".." entries - in the dcache which are too messy to work with. - - get_name (optional) - When given a parent dentry and a child dentry, this should find a name - in the directory identified by the parent dentry, which leads to the - object identified by the child dentry. If no get_name function is - supplied, a default implementation is provided which uses vfs_readdir - to find potential names, and matches inode numbers to find the correct - match. 
- - flags - Some filesystems may need to be handled differently than others. The - export_operations struct also includes a flags field that allows the - filesystem to communicate such information to nfsd. See the Export - Operations Flags section below for more explanation. +super_block. This field must point to a struct export_operations +which has the following members: + +.. kernel-doc:: include/linux/exportfs.h + :identifiers: struct export_operations A filehandle fragment consists of an array of 1 or more 4byte words, together with a one byte "type". diff --git a/fs/9p/acl.c b/fs/9p/acl.c index 633da5e37299..ae7e7cf7523a 100644 --- a/fs/9p/acl.c +++ b/fs/9p/acl.c @@ -167,17 +167,11 @@ int v9fs_iop_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, if (retval) goto err_out; - size = posix_acl_xattr_size(acl->a_count); - - value = kzalloc(size, GFP_NOFS); + value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS); if (!value) { retval = -ENOMEM; goto err_out; } - - retval = posix_acl_to_xattr(&init_user_ns, acl, value, size); - if (retval < 0) - goto err_out; } /* @@ -257,13 +251,10 @@ static int v9fs_set_acl(struct p9_fid *fid, int type, struct posix_acl *acl) return 0; /* Set a setxattr request to server */ - size = posix_acl_xattr_size(acl->a_count); - buffer = kmalloc(size, GFP_KERNEL); + buffer = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL); if (!buffer) return -ENOMEM; - retval = posix_acl_to_xattr(&init_user_ns, acl, buffer, size); - if (retval < 0) - goto err_free_out; + switch (type) { case ACL_TYPE_ACCESS: name = XATTR_NAME_POSIX_ACL_ACCESS; @@ -275,7 +266,6 @@ static int v9fs_set_acl(struct p9_fid *fid, int type, struct posix_acl *acl) BUG(); } retval = v9fs_fid_xattr_set(fid, name, buffer, size, 0); -err_free_out: kfree(buffer); return retval; } diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c index c336e2ab7f8a..e55b686fe1ab 100644 --- a/fs/btrfs/acl.c +++ b/fs/btrfs/acl.c @@ -57,7 +57,8 @@ struct posix_acl 
*btrfs_get_acl(struct inode *inode, int type, bool rcu) int __btrfs_set_acl(struct btrfs_trans_handle *trans, struct inode *inode, struct posix_acl *acl, int type) { - int ret, size = 0; + int ret; + size_t size = 0; const char *name; char AUTO_KFREE(value); @@ -77,20 +78,15 @@ int __btrfs_set_acl(struct btrfs_trans_handle *trans, struct inode *inode, if (acl) { unsigned int nofs_flag; - size = posix_acl_xattr_size(acl->a_count); /* * We're holding a transaction handle, so use a NOFS memory * allocation context to avoid deadlock if reclaim happens. */ nofs_flag = memalloc_nofs_save(); - value = kmalloc(size, GFP_KERNEL); + value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL); memalloc_nofs_restore(nofs_flag); if (!value) return -ENOMEM; - - ret = posix_acl_to_xattr(&init_user_ns, acl, value, size); - if (ret < 0) - return ret; } if (trans) diff --git a/fs/buffer.c b/fs/buffer.c index 838c0c571022..fd53b806ab7e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2354,7 +2354,7 @@ bool block_is_partially_uptodate(struct folio *folio, size_t from, size_t count) if (!head) return false; blocksize = head->b_size; - to = min_t(unsigned, folio_size(folio) - from, count); + to = min(folio_size(folio) - from, count); to = from + to; if (from < blocksize && to > folio_size(folio) - blocksize) return false; @@ -2948,6 +2948,10 @@ bool try_to_free_buffers(struct folio *folio) if (folio_test_writeback(folio)) return false; + /* Misconfigured folio check */ + if (WARN_ON_ONCE(!folio_buffers(folio))) + return true; + if (mapping == NULL) { /* can this still happen? 
*/ ret = drop_buffers(folio, &buffers_to_free); goto out; diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c index 1564eacc253d..85d3dd48b167 100644 --- a/fs/ceph/acl.c +++ b/fs/ceph/acl.c @@ -90,7 +90,8 @@ retry: int ceph_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, struct posix_acl *acl, int type) { - int ret = 0, size = 0; + int ret = 0; + size_t size = 0; const char *name = NULL; char *value = NULL; struct iattr newattrs; @@ -126,16 +127,11 @@ int ceph_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, } if (acl) { - size = posix_acl_xattr_size(acl->a_count); - value = kmalloc(size, GFP_NOFS); + value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS); if (!value) { ret = -ENOMEM; goto out; } - - ret = posix_acl_to_xattr(&init_user_ns, acl, value, size); - if (ret < 0) - goto out_free; } if (new_mode != old_mode) { @@ -172,7 +168,7 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode, struct posix_acl *acl, *default_acl; size_t val_size1 = 0, val_size2 = 0; struct ceph_pagelist *pagelist = NULL; - void *tmp_buf = NULL; + void *tmp_buf1 = NULL, *tmp_buf2 = NULL; int err; err = posix_acl_create(dir, mode, &default_acl, &acl); @@ -192,15 +188,7 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode, if (!default_acl && !acl) return 0; - if (acl) - val_size1 = posix_acl_xattr_size(acl->a_count); - if (default_acl) - val_size2 = posix_acl_xattr_size(default_acl->a_count); - err = -ENOMEM; - tmp_buf = kmalloc(max(val_size1, val_size2), GFP_KERNEL); - if (!tmp_buf) - goto out_err; pagelist = ceph_pagelist_alloc(GFP_KERNEL); if (!pagelist) goto out_err; @@ -213,34 +201,39 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode, if (acl) { size_t len = strlen(XATTR_NAME_POSIX_ACL_ACCESS); + + err = -ENOMEM; + tmp_buf1 = posix_acl_to_xattr(&init_user_ns, acl, + &val_size1, GFP_KERNEL); + if (!tmp_buf1) + goto out_err; err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8); if (err) goto out_err; ceph_pagelist_encode_string(pagelist, 
XATTR_NAME_POSIX_ACL_ACCESS, len); - err = posix_acl_to_xattr(&init_user_ns, acl, - tmp_buf, val_size1); - if (err < 0) - goto out_err; ceph_pagelist_encode_32(pagelist, val_size1); - ceph_pagelist_append(pagelist, tmp_buf, val_size1); + ceph_pagelist_append(pagelist, tmp_buf1, val_size1); } if (default_acl) { size_t len = strlen(XATTR_NAME_POSIX_ACL_DEFAULT); + + err = -ENOMEM; + tmp_buf2 = posix_acl_to_xattr(&init_user_ns, default_acl, + &val_size2, GFP_KERNEL); + if (!tmp_buf2) + goto out_err; err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8); if (err) goto out_err; ceph_pagelist_encode_string(pagelist, XATTR_NAME_POSIX_ACL_DEFAULT, len); - err = posix_acl_to_xattr(&init_user_ns, default_acl, - tmp_buf, val_size2); - if (err < 0) - goto out_err; ceph_pagelist_encode_32(pagelist, val_size2); - ceph_pagelist_append(pagelist, tmp_buf, val_size2); + ceph_pagelist_append(pagelist, tmp_buf2, val_size2); } - kfree(tmp_buf); + kfree(tmp_buf1); + kfree(tmp_buf2); as_ctx->acl = acl; as_ctx->default_acl = default_acl; @@ -250,7 +243,8 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode, out_err: posix_acl_release(acl); posix_acl_release(default_acl); - kfree(tmp_buf); + kfree(tmp_buf1); + kfree(tmp_buf2); if (pagelist) ceph_pagelist_release(pagelist); return err; diff --git a/fs/char_dev.c b/fs/char_dev.c index c2ddb998f3c9..bf7b32650e54 100644 --- a/fs/char_dev.c +++ b/fs/char_dev.c @@ -10,6 +10,7 @@ #include <linux/kdev_t.h> #include <linux/slab.h> #include <linux/string.h> +#include <linux/cleanup.h> #include <linux/major.h> #include <linux/errno.h> @@ -97,7 +98,8 @@ static struct char_device_struct * __register_chrdev_region(unsigned int major, unsigned int baseminor, int minorct, const char *name) { - struct char_device_struct *cd, *curr, *prev = NULL; + struct char_device_struct *cd __free(kfree) = NULL; + struct char_device_struct *curr, *prev = NULL; int ret; int i; @@ -117,14 +119,14 @@ __register_chrdev_region(unsigned int major, unsigned int 
baseminor, if (cd == NULL) return ERR_PTR(-ENOMEM); - mutex_lock(&chrdevs_lock); + guard(mutex)(&chrdevs_lock); if (major == 0) { ret = find_dynamic_major(); if (ret < 0) { pr_err("CHRDEV \"%s\" dynamic allocation region is full\n", name); - goto out; + return ERR_PTR(ret); } major = ret; } @@ -144,7 +146,7 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor, if (curr->baseminor >= baseminor + minorct) break; - goto out; + return ERR_PTR(ret); } cd->major = major; @@ -160,12 +162,7 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor, prev->next = cd; } - mutex_unlock(&chrdevs_lock); - return cd; -out: - mutex_unlock(&chrdevs_lock); - kfree(cd); - return ERR_PTR(ret); + return_ptr(cd); } static struct char_device_struct * @@ -343,7 +340,7 @@ void __unregister_chrdev(unsigned int major, unsigned int baseminor, kfree(cd); } -static DEFINE_SPINLOCK(cdev_lock); +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(cdev_lock); static struct kobject *cdev_get(struct cdev *p) { diff --git a/fs/dcache.c b/fs/dcache.c index 66dd1bb830d1..7088df2d042c 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3237,10 +3237,7 @@ EXPORT_SYMBOL(d_parent_ino); static __initdata unsigned long dhash_entries; static int __init set_dhash_entries(char *str) { - if (!str) - return 0; - dhash_entries = simple_strtoul(str, &str, 0); - return 1; + return kstrtoul(str, 0, &dhash_entries) == 0; } __setup("dhash_entries=", set_dhash_entries); diff --git a/fs/exec.c b/fs/exec.c index 9d5ebc9d15b0..d0606e53376f 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -555,7 +555,7 @@ int copy_string_kernel(const char *arg, struct linux_binprm *bprm) return -E2BIG; while (len > 0) { - unsigned int bytes_to_copy = min_t(unsigned int, len, + unsigned int bytes_to_copy = min(len, min_not_zero(offset_in_page(pos), PAGE_SIZE)); struct page *page; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 56d50fd3310b..e817a758801d 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ 
-4276,8 +4276,7 @@ void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block, * get the corresponding group metadata to work with. * For this we have goto again loop. */ - thisgrp_len = min_t(unsigned int, (unsigned int)len, - EXT4_BLOCKS_PER_GROUP(sb) - EXT4_C2B(sbi, blkoff)); + thisgrp_len = min(len, EXT4_BLOCKS_PER_GROUP(sb) - EXT4_C2B(sbi, blkoff)); clen = EXT4_NUM_B2C(sbi, thisgrp_len); if (!ext4_sb_block_valid(sb, NULL, block, thisgrp_len)) { diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 050f26168d97..76842f0957b5 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -1479,7 +1479,7 @@ static void ext4_update_super(struct super_block *sb, /* Update the global fs size fields */ sbi->s_groups_count += flex_gd->count; - sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count, + sbi->s_blockfile_groups = min(sbi->s_groups_count, (EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb))); /* Update the reserved block counts only once the new group is diff --git a/fs/ext4/super.c b/fs/ext4/super.c index a6241ffb8639..a8d2460b527a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4837,7 +4837,7 @@ static int ext4_check_geometry(struct super_block *sb, return -EINVAL; } sbi->s_groups_count = blocks_count; - sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count, + sbi->s_blockfile_groups = min(sbi->s_groups_count, (EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb))); if (((u64)sbi->s_groups_count * sbi->s_inodes_per_group) != le32_to_cpu(es->s_inodes_count)) { diff --git a/fs/fat/dir.c b/fs/fat/dir.c index 0e97ef6c2327..07d95f1442c8 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -1355,7 +1355,7 @@ found: /* Fill the long name slots. 
*/ for (i = 0; i < long_bhs; i++) { - int copy = min_t(int, sb->s_blocksize - offset, size); + int copy = umin(sb->s_blocksize - offset, size); memcpy(bhs[i]->b_data + offset, slots, copy); mark_buffer_dirty_inode(bhs[i], dir); offset = 0; @@ -1366,7 +1366,7 @@ found: err = fat_sync_bhs(bhs, long_bhs); if (!err && i < nr_bhs) { /* Fill the short name slot. */ - int copy = min_t(int, sb->s_blocksize - offset, size); + int copy = umin(sb->s_blocksize - offset, size); memcpy(bhs[i]->b_data + offset, slots, copy); mark_buffer_dirty_inode(bhs[i], dir); if (IS_DIRSYNC(dir)) diff --git a/fs/fat/file.c b/fs/fat/file.c index afc0e3ad6536..124d9c5431c8 100644 --- a/fs/fat/file.c +++ b/fs/fat/file.c @@ -141,8 +141,7 @@ static int fat_ioctl_fitrim(struct inode *inode, unsigned long arg) if (copy_from_user(&range, user_range, sizeof(range))) return -EFAULT; - range.minlen = max_t(unsigned int, range.minlen, - bdev_discard_granularity(sb->s_bdev)); + range.minlen = max(range.minlen, bdev_discard_granularity(sb->s_bdev)); err = fat_trim_fs(inode, &range); if (err < 0) diff --git a/fs/file_table.c b/fs/file_table.c index cd4a3db4659a..34244fccf2ed 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -176,6 +176,11 @@ static int init_file(struct file *f, int flags, const struct cred *cred) f->f_flags = flags; f->f_mode = OPEN_FMODE(flags); + /* + * Disable permission and pre-content events for all files by default. + * They may be enabled later by fsnotify_open_perm_and_set_mode(). + */ + file_set_fsnotify_mode(f, FMODE_NONOTIFY_PERM); f->f_op = NULL; f->f_mapping = NULL; @@ -197,11 +202,6 @@ static int init_file(struct file *f, int flags, const struct cred *cred) * refcount bumps we should reinitialize the reused file first. */ file_ref_init(&f->f_ref, 1); - /* - * Disable permission and pre-content events for all files by default. - * They may be enabled later by fsnotify_open_perm_and_set_mode(). 
- */ - file_set_fsnotify_mode(f, FMODE_NONOTIFY_PERM); return 0; } diff --git a/fs/fs_struct.c b/fs/fs_struct.c index b8c46c5a38a0..394875d06fd6 100644 --- a/fs/fs_struct.c +++ b/fs/fs_struct.c @@ -6,6 +6,7 @@ #include <linux/path.h> #include <linux/slab.h> #include <linux/fs_struct.h> +#include <linux/init_task.h> #include "internal.h" /* diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c index 8f484b105f13..cbde6ac1add3 100644 --- a/fs/fuse/acl.c +++ b/fs/fuse/acl.c @@ -122,20 +122,16 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry, * them to be refreshed the next time they are used, * and it also updates i_ctime. */ - size_t size = posix_acl_xattr_size(acl->a_count); + size_t size; void *value; - if (size > PAGE_SIZE) - return -E2BIG; - - value = kmalloc(size, GFP_KERNEL); + value = posix_acl_to_xattr(fc->user_ns, acl, &size, GFP_KERNEL); if (!value) return -ENOMEM; - ret = posix_acl_to_xattr(fc->user_ns, acl, value, size); - if (ret < 0) { + if (size > PAGE_SIZE) { kfree(value); - return ret; + return -E2BIG; } /* diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 6d59cbc877c6..a30c8b57d478 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1813,7 +1813,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size, goto out_iput; folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset; - nr_bytes = min_t(unsigned, num, folio_size(folio) - folio_offset); + nr_bytes = min(num, folio_size(folio) - folio_offset); nr_pages = (offset + nr_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT; err = fuse_copy_folio(cs, &folio, folio_offset, nr_bytes, 0); diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 78fa46cfc636..dffd454e30e2 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1323,10 +1323,8 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia, static inline unsigned int fuse_wr_pages(loff_t pos, size_t len, unsigned int max_pages) { - return min_t(unsigned int, - ((pos + len - 1) >> PAGE_SHIFT) - - (pos >> PAGE_SHIFT) + 1, - max_pages); + 
return min(((pos + len - 1) >> PAGE_SHIFT) - (pos >> PAGE_SHIFT) + 1, + max_pages); } static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii) @@ -1607,7 +1605,7 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii, struct folio *folio = page_folio(pages[i]); unsigned int offset = start + (folio_page_idx(folio, pages[i]) << PAGE_SHIFT); - unsigned int len = min_t(unsigned int, ret, PAGE_SIZE - start); + unsigned int len = umin(ret, PAGE_SIZE - start); ap->descs[ap->num_folios].offset = offset; ap->descs[ap->num_folios].length = len; diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c index 443640e6fb9c..a5b60778b91c 100644 --- a/fs/gfs2/acl.c +++ b/fs/gfs2/acl.c @@ -83,21 +83,14 @@ struct posix_acl *gfs2_get_acl(struct inode *inode, int type, bool rcu) int __gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type) { int error; - size_t len; - char *data; + size_t len = 0; + char *data = NULL; const char *name = gfs2_acl_name(type); if (acl) { - len = posix_acl_xattr_size(acl->a_count); - data = kmalloc(len, GFP_NOFS); + data = posix_acl_to_xattr(&init_user_ns, acl, &len, GFP_NOFS); if (data == NULL) return -ENOMEM; - error = posix_acl_to_xattr(&init_user_ns, acl, data, len); - if (error < 0) - goto out; - } else { - data = NULL; - len = 0; } error = __gfs2_xattr_set(inode, name, data, len, 0, GFS2_EATYPE_SYS); diff --git a/fs/inode.c b/fs/inode.c index 1d0474745e77..dae43a8de7e0 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1028,19 +1028,20 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) return freed; } -static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked); +static void __wait_on_freeing_inode(struct inode *inode, bool hash_locked, bool rcu_locked); + /* * Called with the inode lock held. 
*/ static struct inode *find_inode(struct super_block *sb, struct hlist_head *head, int (*test)(struct inode *, void *), - void *data, bool is_inode_hash_locked, + void *data, bool hash_locked, bool *isnew) { struct inode *inode = NULL; - if (is_inode_hash_locked) + if (hash_locked) lockdep_assert_held(&inode_hash_lock); else lockdep_assert_not_held(&inode_hash_lock); @@ -1054,7 +1055,7 @@ repeat: continue; spin_lock(&inode->i_lock); if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) { - __wait_on_freeing_inode(inode, is_inode_hash_locked); + __wait_on_freeing_inode(inode, hash_locked, true); goto repeat; } if (unlikely(inode_state_read(inode) & I_CREATING)) { @@ -1078,11 +1079,11 @@ repeat: */ static struct inode *find_inode_fast(struct super_block *sb, struct hlist_head *head, unsigned long ino, - bool is_inode_hash_locked, bool *isnew) + bool hash_locked, bool *isnew) { struct inode *inode = NULL; - if (is_inode_hash_locked) + if (hash_locked) lockdep_assert_held(&inode_hash_lock); else lockdep_assert_not_held(&inode_hash_lock); @@ -1096,7 +1097,7 @@ repeat: continue; spin_lock(&inode->i_lock); if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) { - __wait_on_freeing_inode(inode, is_inode_hash_locked); + __wait_on_freeing_inode(inode, hash_locked, true); goto repeat; } if (unlikely(inode_state_read(inode) & I_CREATING)) { @@ -1832,16 +1833,13 @@ int insert_inode_locked(struct inode *inode) while (1) { struct inode *old = NULL; spin_lock(&inode_hash_lock); +repeat: hlist_for_each_entry(old, head, i_hash) { if (old->i_ino != ino) continue; if (old->i_sb != sb) continue; spin_lock(&old->i_lock); - if (inode_state_read(old) & (I_FREEING | I_WILL_FREE)) { - spin_unlock(&old->i_lock); - continue; - } break; } if (likely(!old)) { @@ -1852,6 +1850,11 @@ int insert_inode_locked(struct inode *inode) spin_unlock(&inode_hash_lock); return 0; } + if (inode_state_read(old) & (I_FREEING | I_WILL_FREE)) { + __wait_on_freeing_inode(old, true, false); + old = NULL; + 
goto repeat; + } if (unlikely(inode_state_read(old) & I_CREATING)) { spin_unlock(&old->i_lock); spin_unlock(&inode_hash_lock); @@ -2522,16 +2525,18 @@ EXPORT_SYMBOL(inode_needs_sync); * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list * will DTRT. */ -static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked) +static void __wait_on_freeing_inode(struct inode *inode, bool hash_locked, bool rcu_locked) { struct wait_bit_queue_entry wqe; struct wait_queue_head *wq_head; + VFS_BUG_ON(!hash_locked && !rcu_locked); + /* * Handle racing against evict(), see that routine for more details. */ if (unlikely(inode_unhashed(inode))) { - WARN_ON(is_inode_hash_locked); + WARN_ON(hash_locked); spin_unlock(&inode->i_lock); return; } @@ -2539,23 +2544,22 @@ static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_lock wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW); prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE); spin_unlock(&inode->i_lock); - rcu_read_unlock(); - if (is_inode_hash_locked) + if (rcu_locked) + rcu_read_unlock(); + if (hash_locked) spin_unlock(&inode_hash_lock); schedule(); finish_wait(wq_head, &wqe.wq_entry); - if (is_inode_hash_locked) + if (hash_locked) spin_lock(&inode_hash_lock); - rcu_read_lock(); + if (rcu_locked) + rcu_read_lock(); } static __initdata unsigned long ihash_entries; static int __init set_ihash_entries(char *str) { - if (!str) - return 0; - ihash_entries = simple_strtoul(str, &str, 0); - return 1; + return kstrtoul(str, 0, &ihash_entries) == 0; } __setup("ihash_entries=", set_ihash_entries); @@ -3005,24 +3009,45 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap, EXPORT_SYMBOL(mode_strip_sgid); #ifdef CONFIG_DEBUG_VFS -/* - * Dump an inode. - * - * TODO: add a proper inode dumping routine, this is a stub to get debug off the - * ground. +/** + * dump_inode - dump an inode. 
+ * @inode: inode to dump + * @reason: reason for dumping * - * TODO: handle getting to fs type with get_kernel_nofault()? - * See dump_mapping() above. + * If inode is an invalid pointer, we don't want to crash accessing it, + * so probe everything depending on it carefully with get_kernel_nofault(). */ void dump_inode(struct inode *inode, const char *reason) { - struct super_block *sb = inode->i_sb; + struct super_block *sb; + struct file_system_type *s_type; + const char *fs_name_ptr; + char fs_name[32] = {}; + umode_t mode; + unsigned short opflags; + unsigned int flags; + unsigned int state; + int count; + + if (get_kernel_nofault(sb, &inode->i_sb) || + get_kernel_nofault(mode, &inode->i_mode) || + get_kernel_nofault(opflags, &inode->i_opflags) || + get_kernel_nofault(flags, &inode->i_flags)) { + pr_warn("%s: unreadable inode:%px\n", reason, inode); + return; + } - pr_warn("%s encountered for inode %px\n" - "fs %s mode %ho opflags 0x%hx flags 0x%x state 0x%x count %d\n", - reason, inode, sb->s_type->name, inode->i_mode, inode->i_opflags, - inode->i_flags, inode_state_read_once(inode), atomic_read(&inode->i_count)); -} + state = inode_state_read_once(inode); + count = atomic_read(&inode->i_count); + if (!sb || + get_kernel_nofault(s_type, &sb->s_type) || !s_type || + get_kernel_nofault(fs_name_ptr, &s_type->name) || !fs_name_ptr || + strncpy_from_kernel_nofault(fs_name, fs_name_ptr, sizeof(fs_name) - 1) < 0) + strscpy(fs_name, "<unknown, sb unreadable>"); + + pr_warn("%s: inode:%px fs:%s mode:%ho opflags:%#x flags:%#x state:%#x count:%d\n", + reason, inode, fs_name, mode, opflags, flags, state, count); +} EXPORT_SYMBOL(dump_inode); #endif diff --git a/fs/jfs/acl.c b/fs/jfs/acl.c index 1de3602c98de..16b71a23ff1e 100644 --- a/fs/jfs/acl.c +++ b/fs/jfs/acl.c @@ -61,7 +61,7 @@ static int __jfs_set_acl(tid_t tid, struct inode *inode, int type, { char *ea_name; int rc; - int size = 0; + size_t size = 0; char *value = NULL; switch (type) { @@ -76,16 +76,11 @@ static 
int __jfs_set_acl(tid_t tid, struct inode *inode, int type, } if (acl) { - size = posix_acl_xattr_size(acl->a_count); - value = kmalloc(size, GFP_KERNEL); + value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL); if (!value) return -ENOMEM; - rc = posix_acl_to_xattr(&init_user_ns, acl, value, size); - if (rc < 0) - goto out; } rc = __jfs_setxattr(tid, inode, ea_name, value, size, 0); -out: kfree(value); if (!rc) diff --git a/fs/locks.c b/fs/locks.c index cf1968b01bcb..3ea25d3a780f 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -178,7 +178,6 @@ locks_get_lock_context(struct inode *inode, int type) { struct file_lock_context *ctx; - /* paired with cmpxchg() below */ ctx = locks_inode_context(inode); if (likely(ctx) || type == F_UNLCK) goto out; @@ -196,7 +195,18 @@ locks_get_lock_context(struct inode *inode, int type) * Assign the pointer if it's not already assigned. If it is, then * free the context we just allocated. */ - if (cmpxchg(&inode->i_flctx, NULL, ctx)) { + spin_lock(&inode->i_lock); + if (!(inode->i_opflags & IOP_FLCTX)) { + VFS_BUG_ON_INODE(inode->i_flctx, inode); + WRITE_ONCE(inode->i_flctx, ctx); + /* + * Paired with locks_inode_context(). 
+ */ + smp_store_release(&inode->i_opflags, inode->i_opflags | IOP_FLCTX); + spin_unlock(&inode->i_lock); + } else { + VFS_BUG_ON_INODE(!inode->i_flctx, inode); + spin_unlock(&inode->i_lock); kmem_cache_free(flctx_cache, ctx); ctx = locks_inode_context(inode); } diff --git a/fs/namei.c b/fs/namei.c index 76bc569ace8e..b28ecb699f32 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -879,7 +879,7 @@ static bool try_to_unlazy(struct nameidata *nd) { struct dentry *parent = nd->path.dentry; - BUG_ON(!(nd->flags & LOOKUP_RCU)); + VFS_BUG_ON(!(nd->flags & LOOKUP_RCU)); if (unlikely(nd->flags & LOOKUP_CACHED)) { drop_links(nd); @@ -919,7 +919,8 @@ out: static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) { int res; - BUG_ON(!(nd->flags & LOOKUP_RCU)); + + VFS_BUG_ON(!(nd->flags & LOOKUP_RCU)); if (unlikely(nd->flags & LOOKUP_CACHED)) { drop_links(nd); @@ -1631,9 +1632,6 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path) struct dentry *dentry = path->dentry; unsigned int flags = dentry->d_flags; - if (likely(!(flags & DCACHE_MANAGED_DENTRY))) - return true; - if (unlikely(nd->flags & LOOKUP_NO_XDEV)) return false; diff --git a/fs/namespace.c b/fs/namespace.c index 1d2089ffb6ab..0cc8c2757500 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -49,20 +49,14 @@ static unsigned int mp_hash_shift __ro_after_init; static __initdata unsigned long mhash_entries; static int __init set_mhash_entries(char *str) { - if (!str) - return 0; - mhash_entries = simple_strtoul(str, &str, 0); - return 1; + return kstrtoul(str, 0, &mhash_entries) == 0; } __setup("mhash_entries=", set_mhash_entries); static __initdata unsigned long mphash_entries; static int __init set_mphash_entries(char *str) { - if (!str) - return 0; - mphash_entries = simple_strtoul(str, &str, 0); - return 1; + return kstrtoul(str, 0, &mphash_entries) == 0; } __setup("mphash_entries=", set_mphash_entries); diff --git a/fs/ntfs3/xattr.c b/fs/ntfs3/xattr.c index 
c93df55e98d0..37a69a75ce68 100644 --- a/fs/ntfs3/xattr.c +++ b/fs/ntfs3/xattr.c @@ -641,13 +641,9 @@ static noinline int ntfs_set_acl_ex(struct mnt_idmap *idmap, value = NULL; flags = XATTR_REPLACE; } else { - size = posix_acl_xattr_size(acl->a_count); - value = kmalloc(size, GFP_NOFS); + value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS); if (!value) return -ENOMEM; - err = posix_acl_to_xattr(&init_user_ns, acl, value, size); - if (err < 0) - goto out; flags = 0; } diff --git a/fs/orangefs/acl.c b/fs/orangefs/acl.c index 5aefb705bcc8..a01ef0c1b1bf 100644 --- a/fs/orangefs/acl.c +++ b/fs/orangefs/acl.c @@ -90,14 +90,9 @@ int __orangefs_set_acl(struct inode *inode, struct posix_acl *acl, int type) type); if (acl) { - size = posix_acl_xattr_size(acl->a_count); - value = kmalloc(size, GFP_KERNEL); + value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL); if (!value) return -ENOMEM; - - error = posix_acl_to_xattr(&init_user_ns, acl, value, size); - if (error < 0) - goto out; } gossip_debug(GOSSIP_ACL_DEBUG, @@ -111,7 +106,6 @@ int __orangefs_set_acl(struct inode *inode, struct posix_acl *acl, int type) */ error = orangefs_inode_setxattr(inode, name, value, size, 0); -out: kfree(value); if (!error) set_cached_acl(inode, type, acl); diff --git a/fs/posix_acl.c b/fs/posix_acl.c index 768f027c1428..4ef6f9d2b8d6 100644 --- a/fs/posix_acl.c +++ b/fs/posix_acl.c @@ -829,19 +829,19 @@ EXPORT_SYMBOL (posix_acl_from_xattr); /* * Convert from in-memory to extended attribute representation. 
*/ -int +void * posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl, - void *buffer, size_t size) + size_t *sizep, gfp_t gfp) { - struct posix_acl_xattr_header *ext_acl = buffer; + struct posix_acl_xattr_header *ext_acl; struct posix_acl_xattr_entry *ext_entry; - int real_size, n; + size_t size; + int n; - real_size = posix_acl_xattr_size(acl->a_count); - if (!buffer) - return real_size; - if (real_size > size) - return -ERANGE; + size = posix_acl_xattr_size(acl->a_count); + ext_acl = kmalloc(size, gfp); + if (!ext_acl) + return NULL; ext_entry = (void *)(ext_acl + 1); ext_acl->a_version = cpu_to_le32(POSIX_ACL_XATTR_VERSION); @@ -864,7 +864,8 @@ posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl, break; } } - return real_size; + *sizep = size; + return ext_acl; } EXPORT_SYMBOL (posix_acl_to_xattr); diff --git a/fs/select.c b/fs/select.c index 65019b8ba3f7..78a1508c84d3 100644 --- a/fs/select.c +++ b/fs/select.c @@ -1038,14 +1038,11 @@ static long do_restart_poll(struct restart_block *restart_block) { struct pollfd __user *ufds = restart_block->poll.ufds; int nfds = restart_block->poll.nfds; - struct timespec64 *to = NULL, end_time; + struct timespec64 *to = NULL; int ret; - if (restart_block->poll.has_timeout) { - end_time.tv_sec = restart_block->poll.tv_sec; - end_time.tv_nsec = restart_block->poll.tv_nsec; - to = &end_time; - } + if (restart_block->poll.has_timeout) + to = &restart_block->poll.end_time; ret = do_sys_poll(ufds, nfds, to); @@ -1077,8 +1074,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds, restart_block->poll.nfds = nfds; if (timeout_msecs >= 0) { - restart_block->poll.tv_sec = end_time.tv_sec; - restart_block->poll.tv_nsec = end_time.tv_nsec; + restart_block->poll.end_time = end_time; restart_block->poll.has_timeout = 1; } else restart_block->poll.has_timeout = 0; diff --git a/fs/splice.c b/fs/splice.c index d338fe56b50b..5fb07c01936f 100644 --- a/fs/splice.c +++ 
b/fs/splice.c @@ -1467,7 +1467,7 @@ static ssize_t iter_to_pipe(struct iov_iter *from, n = DIV_ROUND_UP(left + start, PAGE_SIZE); for (i = 0; i < n; i++) { - int size = min_t(int, left, PAGE_SIZE - start); + int size = umin(left, PAGE_SIZE - start); buf.page = pages[i]; buf.offset = start; diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h index 0864773a57e8..822085bc2d20 100644 --- a/include/linux/device_cgroup.h +++ b/include/linux/device_cgroup.h @@ -21,7 +21,7 @@ static inline int devcgroup_inode_permission(struct inode *inode, int mask) if (likely(!S_ISBLK(inode->i_mode) && !S_ISCHR(inode->i_mode))) return 0; - if (likely(!inode->i_rdev)) + if (!inode->i_rdev) return 0; if (S_ISBLK(inode->i_mode)) diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h index f0cf2714ec52..262e24d83313 100644 --- a/include/linux/exportfs.h +++ b/include/linux/exportfs.h @@ -201,9 +201,9 @@ struct handle_to_path_ctx { * @commit_metadata: commit metadata changes to stable storage * * See Documentation/filesystems/nfs/exporting.rst for details on how to use - * this interface correctly. + * this interface correctly and the definition of the flags. * - * encode_fh: + * @encode_fh: * @encode_fh should store in the file handle fragment @fh (using at most * @max_len bytes) information that can be used by @decode_fh to recover the * file referred to by the &struct dentry @de. If @flag has CONNECTABLE bit @@ -215,7 +215,7 @@ struct handle_to_path_ctx { * greater than @max_len*4 bytes). On error @max_len contains the minimum * size(in 4 byte unit) needed to encode the file handle. * - * fh_to_dentry: + * @fh_to_dentry: * @fh_to_dentry is given a &struct super_block (@sb) and a file handle * fragment (@fh, @fh_len). It should return a &struct dentry which refers * to the same file that the file handle fragment refers to. If it cannot, @@ -227,31 +227,44 @@ struct handle_to_path_ctx { * created with d_alloc_root. 
The caller can then find any other extant * dentries by following the d_alias links. * - * fh_to_parent: + * @fh_to_parent: * Same as @fh_to_dentry, except that it returns a pointer to the parent * dentry if it was encoded into the filehandle fragment by @encode_fh. * - * get_name: + * @get_name: * @get_name should find a name for the given @child in the given @parent * directory. The name should be stored in the @name (with the - * understanding that it is already pointing to a %NAME_MAX+1 sized + * understanding that it is already pointing to a %NAME_MAX + 1 sized * buffer. get_name() should return %0 on success, a negative error code * or error. @get_name will be called without @parent->i_rwsem held. * - * get_parent: + * @get_parent: * @get_parent should find the parent directory for the given @child which * is also a directory. In the event that it cannot be found, or storage * space cannot be allocated, a %ERR_PTR should be returned. * - * permission: + * @permission: * Allow filesystems to specify a custom permission function. * - * open: + * @open: * Allow filesystems to specify a custom open function. * - * commit_metadata: + * @commit_metadata: * @commit_metadata should commit metadata changes to stable storage. * + * @get_uuid: + * Get a filesystem unique signature exposed to clients. + * + * @map_blocks: + * Map and, if necessary, allocate blocks for a layout. + * + * @commit_blocks: + * Commit blocks in a layout once the client is done with them. + * + * @flags: + * Allows the filesystem to communicate to nfsd that it may want to do things + * differently when dealing with it. 
+ * * Locking rules: * get_parent is called with child->d_inode->i_rwsem down * get_name is not (which is possibly inconsistent) diff --git a/include/linux/filelock.h b/include/linux/filelock.h index 2f5e5588ee07..d2c9740e26a8 100644 --- a/include/linux/filelock.h +++ b/include/linux/filelock.h @@ -242,7 +242,14 @@ bool locks_owner_has_blockers(struct file_lock_context *flctx, static inline struct file_lock_context * locks_inode_context(const struct inode *inode) { - return smp_load_acquire(&inode->i_flctx); + /* + * Paired with smp_store_release in locks_get_lock_context(). + * + * Ensures ->i_flctx will be visible if we spotted the flag. + */ + if (likely(!(smp_load_acquire(&inode->i_opflags) & IOP_FLCTX))) + return NULL; + return READ_ONCE(inode->i_flctx); } #else /* !CONFIG_FILE_LOCKING */ @@ -469,7 +476,7 @@ static inline int break_lease(struct inode *inode, unsigned int mode) * could end up racing with tasks trying to set a new lease on this * file. */ - flctx = READ_ONCE(inode->i_flctx); + flctx = locks_inode_context(inode); if (!flctx) return 0; smp_mb(); @@ -488,7 +495,7 @@ static inline int break_deleg(struct inode *inode, unsigned int flags) * could end up racing with tasks trying to set a new lease on this * file. 
*/ - flctx = READ_ONCE(inode->i_flctx); + flctx = locks_inode_context(inode); if (!flctx) return 0; smp_mb(); @@ -533,8 +540,11 @@ static inline int break_deleg_wait(struct delegated_inode *di) static inline int break_layout(struct inode *inode, bool wait) { + struct file_lock_context *flctx; + smp_mb(); - if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) { + flctx = locks_inode_context(inode); + if (flctx && !list_empty_careful(&flctx->flc_lease)) { unsigned int flags = LEASE_BREAK_LAYOUT; if (!wait) diff --git a/include/linux/fs.h b/include/linux/fs.h index ca31bc9308a3..73911f961c7e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -631,6 +631,7 @@ is_uncached_acl(struct posix_acl *acl) #define IOP_MGTIME 0x0020 #define IOP_CACHED_LINK 0x0040 #define IOP_FASTPERM_MAY_EXEC 0x0080 +#define IOP_FLCTX 0x0100 /* * Inode state bits. Protected by inode->i_lock diff --git a/include/linux/ns/ns_common_types.h b/include/linux/ns/ns_common_types.h index b332b019b29c..0014fbc1c626 100644 --- a/include/linux/ns/ns_common_types.h +++ b/include/linux/ns/ns_common_types.h @@ -108,11 +108,13 @@ extern const struct proc_ns_operations utsns_operations; * @ns_tree: namespace tree nodes and active reference count */ struct ns_common { + struct { + refcount_t __ns_ref; /* do not use directly */ + } ____cacheline_aligned_in_smp; u32 ns_type; struct dentry *stashed; const struct proc_ns_operations *ops; unsigned int inum; - refcount_t __ns_ref; /* do not use directly */ union { struct ns_tree; struct rcu_head ns_rcu; diff --git a/include/linux/posix_acl_xattr.h b/include/linux/posix_acl_xattr.h index e86f3b731da2..9e1892525eac 100644 --- a/include/linux/posix_acl_xattr.h +++ b/include/linux/posix_acl_xattr.h @@ -44,8 +44,9 @@ posix_acl_from_xattr(struct user_namespace *user_ns, const void *value, } #endif -int posix_acl_to_xattr(struct user_namespace *user_ns, - const struct posix_acl *acl, void *buffer, size_t size); +extern void 
*posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl, + size_t *sizep, gfp_t gfp); + static inline const char *posix_acl_xattr_name(int type) { switch (type) { diff --git a/include/linux/restart_block.h b/include/linux/restart_block.h index 67d2bf579942..9b262109726d 100644 --- a/include/linux/restart_block.h +++ b/include/linux/restart_block.h @@ -6,6 +6,7 @@ #define __LINUX_RESTART_BLOCK_H #include <linux/compiler.h> +#include <linux/time64.h> #include <linux/types.h> struct __kernel_timespec; @@ -50,8 +51,7 @@ struct restart_block { struct pollfd __user *ufds; int nfds; int has_timeout; - unsigned long tv_sec; - unsigned long tv_nsec; + struct timespec64 end_time; } poll; }; }; diff --git a/init/Kconfig b/init/Kconfig index fa79feb8fe57..160c1c4ef253 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -624,8 +624,9 @@ config SCHED_HW_PRESSURE arch_update_hw_pressure() and arch_scale_thermal_pressure(). config BSD_PROCESS_ACCT - bool "BSD Process Accounting" + bool "BSD Process Accounting (DEPRECATED)" depends on MULTIUSER + default n help If you say Y here, a user level program will be able to instruct the kernel (via a special system call) to write process accounting @@ -635,7 +636,9 @@ config BSD_PROCESS_ACCT command name, memory usage, controlling terminal etc. (the complete list is in the struct acct in <file:include/linux/acct.h>). It is up to the user level program to do useful things with this - information. This is generally a good idea, so say Y. + information. This mechanism is antiquated and has significant + scalability issues. You probably want to use eBPF instead. Say + N unless you really need this. 
config BSD_PROCESS_ACCT_V3 bool "BSD Process Accounting version 3 file format" diff --git a/init/initramfs_test.c b/init/initramfs_test.c index 5d2db455e60c..beb6e3cf7808 100644 --- a/init/initramfs_test.c +++ b/init/initramfs_test.c @@ -447,6 +447,53 @@ out: kfree(tbufs); } +static void __init initramfs_test_fname_path_max(struct kunit *test) +{ + char *err; + size_t len; + struct kstat st0, st1; + char fdata[] = "this file data will not be unpacked"; + struct test_fname_path_max { + char fname_oversize[PATH_MAX + 1]; + char fname_ok[PATH_MAX]; + char cpio_src[(CPIO_HDRLEN + PATH_MAX + 3 + sizeof(fdata)) * 2]; + } *tbufs = kzalloc(sizeof(struct test_fname_path_max), GFP_KERNEL); + struct initramfs_test_cpio c[] = { { + .magic = "070701", + .ino = 1, + .mode = S_IFDIR | 0777, + .nlink = 1, + .namesize = sizeof(tbufs->fname_oversize), + .fname = tbufs->fname_oversize, + .filesize = sizeof(fdata), + .data = fdata, + }, { + .magic = "070701", + .ino = 2, + .mode = S_IFDIR | 0777, + .nlink = 1, + .namesize = sizeof(tbufs->fname_ok), + .fname = tbufs->fname_ok, + } }; + + memset(tbufs->fname_oversize, '/', sizeof(tbufs->fname_oversize) - 1); + memset(tbufs->fname_ok, '/', sizeof(tbufs->fname_ok) - 1); + memcpy(tbufs->fname_oversize, "fname_oversize", + sizeof("fname_oversize") - 1); + memcpy(tbufs->fname_ok, "fname_ok", sizeof("fname_ok") - 1); + len = fill_cpio(c, ARRAY_SIZE(c), tbufs->cpio_src); + + /* unpack skips over fname_oversize instead of returning an error */ + err = unpack_to_rootfs(tbufs->cpio_src, len); + KUNIT_EXPECT_NULL(test, err); + + KUNIT_EXPECT_EQ(test, init_stat("fname_oversize", &st0, 0), -ENOENT); + KUNIT_EXPECT_EQ(test, init_stat("fname_ok", &st1, 0), 0); + KUNIT_EXPECT_EQ(test, init_rmdir("fname_ok"), 0); + + kfree(tbufs); +} + /* * The kunit_case/_suite struct cannot be marked as __initdata as this will be * used in debugfs to retrieve results after test has run. 
@@ -459,6 +506,7 @@ static struct kunit_case __refdata initramfs_test_cases[] = { KUNIT_CASE(initramfs_test_hardlink), KUNIT_CASE(initramfs_test_many), KUNIT_CASE(initramfs_test_fname_pad), + KUNIT_CASE(initramfs_test_fname_path_max), {}, }; diff --git a/kernel/pid.c b/kernel/pid.c index a31771bc89c1..f45ae56db7da 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -159,58 +159,86 @@ void free_pids(struct pid **pids) free_pid(pids[tmp]); } -struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, - size_t set_tid_size) +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid, + size_t arg_set_tid_size) { + int set_tid[MAX_PID_NS_LEVEL + 1] = {}; + int pid_max[MAX_PID_NS_LEVEL + 1] = {}; struct pid *pid; enum pid_type type; int i, nr; struct pid_namespace *tmp; struct upid *upid; int retval = -ENOMEM; + bool retried_preload; /* - * set_tid_size contains the size of the set_tid array. Starting at + * arg_set_tid_size contains the size of the arg_set_tid array. Starting at * the most nested currently active PID namespace it tells alloc_pid() * which PID to set for a process in that most nested PID namespace - * up to set_tid_size PID namespaces. It does not have to set the PID - * for a process in all nested PID namespaces but set_tid_size must + * up to arg_set_tid_size PID namespaces. It does not have to set the PID + * for a process in all nested PID namespaces but arg_set_tid_size must * never be greater than the current ns->level + 1. */ - if (set_tid_size > ns->level + 1) + if (arg_set_tid_size > ns->level + 1) return ERR_PTR(-EINVAL); + /* + * Prep before we take locks: + * + * 1. 
allocate and fill in pid struct + */ pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); if (!pid) return ERR_PTR(retval); - tmp = ns; + get_pid_ns(ns); pid->level = ns->level; + refcount_set(&pid->count, 1); + spin_lock_init(&pid->lock); + for (type = 0; type < PIDTYPE_MAX; ++type) + INIT_HLIST_HEAD(&pid->tasks[type]); + init_waitqueue_head(&pid->wait_pidfd); + INIT_HLIST_HEAD(&pid->inodes); - for (i = ns->level; i >= 0; i--) { - int tid = 0; - int pid_max = READ_ONCE(tmp->pid_max); + /* + * 2. perm check checkpoint_restore_ns_capable() + * + * This stores found pid_max to make sure the used value is the same should + * later code need it. + */ + for (tmp = ns, i = ns->level; i >= 0; i--) { + pid_max[ns->level - i] = READ_ONCE(tmp->pid_max); - if (set_tid_size) { - tid = set_tid[ns->level - i]; + if (arg_set_tid_size) { + int tid = set_tid[ns->level - i] = arg_set_tid[ns->level - i]; retval = -EINVAL; - if (tid < 1 || tid >= pid_max) - goto out_free; + if (tid < 1 || tid >= pid_max[ns->level - i]) + goto out_abort; /* * Also fail if a PID != 1 is requested and * no PID 1 exists. */ if (tid != 1 && !tmp->child_reaper) - goto out_free; + goto out_abort; retval = -EPERM; if (!checkpoint_restore_ns_capable(tmp->user_ns)) - goto out_free; - set_tid_size--; + goto out_abort; + arg_set_tid_size--; } - idr_preload(GFP_KERNEL); - spin_lock(&pidmap_lock); + tmp = tmp->parent; + } + + /* + * Prep is done, id allocation goes here: + */ + retried_preload = false; + idr_preload(GFP_KERNEL); + spin_lock(&pidmap_lock); + for (tmp = ns, i = ns->level; i >= 0;) { + int tid = set_tid[ns->level - i]; if (tid) { nr = idr_alloc(&tmp->idr, NULL, tid, @@ -220,6 +248,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, * alreay in use. Return EEXIST in that case. */ if (nr == -ENOSPC) + nr = -EEXIST; } else { int pid_min = 1; @@ -235,19 +264,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, * a partially initialized PID (see below). 
*/ nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, - pid_max, GFP_ATOMIC); + pid_max[ns->level - i], GFP_ATOMIC); + if (nr == -ENOSPC) + nr = -EAGAIN; } - spin_unlock(&pidmap_lock); - idr_preload_end(); - if (nr < 0) { - retval = (nr == -ENOSPC) ? -EAGAIN : nr; + if (unlikely(nr < 0)) { + /* + * Preload more memory if idr_alloc{,cyclic} failed with -ENOMEM. + * + * The IDR API only allows us to preload memory for one call, while we may end + * up doing several under pidmap_lock with GFP_ATOMIC. The situation may be + * salvageable with GFP_KERNEL. But make sure to not loop indefinitely if preload + * did not help (the routine unfortunately returns void, so we have no idea + * if it got anywhere). + * + * The lock can be safely dropped and picked up as historically pid allocation + * for different namespaces was *not* atomic -- we try to hold on to it the + * entire time only for performance reasons. + */ + if (nr == -ENOMEM && !retried_preload) { + spin_unlock(&pidmap_lock); + idr_preload_end(); + retried_preload = true; + idr_preload(GFP_KERNEL); + spin_lock(&pidmap_lock); + continue; + } + retval = nr; goto out_free; } pid->numbers[i].nr = nr; pid->numbers[i].ns = tmp; tmp = tmp->parent; + i--; + retried_preload = false; } /* @@ -257,25 +309,15 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, * is what we have exposed to userspace for a long time and it is * documented behavior for pid namespaces. So we can't easily * change it even if there were an error code better suited. + * + * This can't be done earlier because we need to preserve other + * error conditions. 
*/ retval = -ENOMEM; - - get_pid_ns(ns); - refcount_set(&pid->count, 1); - spin_lock_init(&pid->lock); - for (type = 0; type < PIDTYPE_MAX; ++type) - INIT_HLIST_HEAD(&pid->tasks[type]); - - init_waitqueue_head(&pid->wait_pidfd); - INIT_HLIST_HEAD(&pid->inodes); - - upid = pid->numbers + ns->level; - idr_preload(GFP_KERNEL); - spin_lock(&pidmap_lock); - if (!(ns->pid_allocated & PIDNS_ADDING)) - goto out_unlock; + if (unlikely(!(ns->pid_allocated & PIDNS_ADDING))) + goto out_free; pidfs_add_pid(pid); - for ( ; upid >= pid->numbers; --upid) { + for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) { /* Make the PID visible to find_pid_ns. */ idr_replace(&upid->ns->idr, pid, upid->nr); upid->ns->pid_allocated++; @@ -286,13 +328,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, return pid; -out_unlock: - spin_unlock(&pidmap_lock); - idr_preload_end(); - put_pid_ns(ns); - out_free: - spin_lock(&pidmap_lock); while (++i <= ns->level) { upid = pid->numbers + i; idr_remove(&upid->ns->idr, upid->nr); @@ -303,7 +339,10 @@ out_free: idr_set_cursor(&ns->idr, 0); spin_unlock(&pidmap_lock); + idr_preload_end(); +out_abort: + put_pid_ns(ns); kmem_cache_free(ns->pid_cachep, pid); return ERR_PTR(retval); } |
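
Note on the fs/locks.c and include/linux/filelock.h hunks: the writer stores ->i_flctx under ->i_lock and then sets IOP_FLCTX in ->i_opflags with smp_store_release(). The reader does smp_load_acquire() on ->i_opflags and touches ->i_flctx only once it has seen the flag. Below is a minimal user-space sketch of that publish pattern in C11 atomics. It is not kernel code, and every demo_* / DEMO_* name is a hypothetical stand-in: the atomic_flag spinlock stands in for ->i_lock, and the atomic_uint for ->i_opflags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_FLCTX 0x0100u	/* hypothetical stand-in for IOP_FLCTX */

struct demo_ctx {
	int nleases;
};

struct demo_inode {
	atomic_flag lock;	/* stand-in for inode->i_lock */
	atomic_uint opflags;	/* stand-in for inode->i_opflags */
	struct demo_ctx *ctx;	/* stand-in for inode->i_flctx */
};

static void demo_lock(atomic_flag *l)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;
}

static void demo_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Reader: load the flags with acquire ordering and read ctx only if the
 * flag is set. The acquire load pairs with the release store below, so a
 * reader that sees the flag also sees the ctx pointer written before it.
 */
static struct demo_ctx *demo_get_ctx(struct demo_inode *ino)
{
	unsigned int flags;

	flags = atomic_load_explicit(&ino->opflags, memory_order_acquire);
	if (!(flags & DEMO_FLCTX))
		return NULL;
	return ino->ctx;
}

/*
 * Writer: allocate outside the lock, then install under the lock if
 * nobody else did. The loser of the race frees its own allocation.
 */
static struct demo_ctx *demo_install_ctx(struct demo_inode *ino)
{
	struct demo_ctx *ctx = demo_get_ctx(ino);
	unsigned int flags;

	if (ctx)
		return ctx;

	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return NULL;

	demo_lock(&ino->lock);
	flags = atomic_load_explicit(&ino->opflags, memory_order_relaxed);
	if (!(flags & DEMO_FLCTX)) {
		assert(ino->ctx == NULL);
		ino->ctx = ctx;
		/* Publish: the pointer store above is ordered before the flag. */
		atomic_store_explicit(&ino->opflags, flags | DEMO_FLCTX,
				      memory_order_release);
		demo_unlock(&ino->lock);
	} else {
		assert(ino->ctx != NULL);
		demo_unlock(&ino->lock);
		free(ctx);
		ctx = demo_get_ctx(ino);
	}
	return ctx;
}
```

In this sketch a reader that finds no context loads only the flags word and never reads the ctx field. That mirrors the motivation given in the pull message, which is to keep open/close hot paths from reading ->i_flctx. Whether a user-space version shows a comparable benefit depends on struct layout and is not measured here.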
