| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-07-03 05:48:05 -1000 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-07-03 05:48:05 -1000 |
| commit | 71dfdfb0209b43dfd6f494f84f5548e4cfd18cb5 (patch) | |
| tree | cfe70d8de248fc18924b14f05d6315282d6febc7 | |
| parent | 025d0d6221d9b060bce251427c671cd0080d9dae (diff) | |
| parent | 5c6ce05e406520290c1d89da97fb3cd70c09137d (diff) | |
Pull vfs fixes from Christian Brauner:
- netfs:
- Fix the decision on when to disallow write-streaming with fscache in
use, handling of asynchronous cache object creation, a double fput
in cachefiles, clearing S_KERNEL_FILE without the inode lock held,
page extraction bugs in the iov_iter helpers (a potential
underflow, a missing allocation failure check, a memory leak, and
a folio offset miscalculation), writeback error and ENOMEM
handling, DIO write retry for filesystems without a
->prepare_write() method, and the replacement of the wb_lock mutex
with a bit lock plus writethrough collection offload so that
multiple asynchronous writebacks don't interfere with each other.
- Fix the barriering when walking the netfs subrequest list during
retries as it was possible to see a subrequest that was just added
by the application thread.
- iomap:
- Change iomap to submit read bios after each extent instead of
building them up across extents. The old behavior had been considered
problematic for a while and has now caused an actual erofs bug.
- Guard the ioend io_size EOF trim in iomap against underflow when a
concurrent truncate moves EOF below the start of the ioend,
wrapping io_size to a huge value.
- overlayfs:
- Fix a stale overlayfs comment about the locking order.
- Store the linked-in upper dentry instead of the disconnected
O_TMPFILE dentry during overlayfs tmpfile copy-up. With a FUSE or
virtiofs upper layer ->d_revalidate() would try to look up "/" in
the workdir and fail, causing persistent ESTALE errors that broke
dpkg and apt.
- vfs-bpf:
Have the bpf_real_data_inode() kfunc take a struct file instead of a
dentry so it is usable from the bprm_check_security, mmap_file, and
file_mprotect hooks, and rename it from bpf_real_inode() to make the
data-inode semantics explicit. The kfunc landed this cycle so the
change is safe.
- afs:
Fix NULL pointer dereferences in the callback service and in
afs_get_tree(), several memory and refcount leaks, missing locking
around the dynamic root inode numbers and premature cell exposure
through /afs, a netns destruction hang caused by a misplaced
increment of net->cells_outstanding, a bulk lookup malfunction caused
by the dir_emit() API change, inode (re)initialisation issues, and
assorted smaller fixes to error codes, seqlock handling, and debug
output.
- vfs:
Refuse O_TMPFILE creation with an unmapped fsuid or fsgid and add a
selftest for it.
- vboxsf:
Add Jori Koolstra as vboxsf maintainer, taking over from Hans de
Goede.
- dio:
Release the pages attached to a short atomic dio bio; the REQ_ATOMIC
size check error path leaked them.
- procfs:
Only bump the parent directory link count when registering
directories in procfs. Registering regular files inflated the count
and leaked a link on every create and remove cycle.
- minix:
Avoid an unsigned overflow in the minix bitmap block count
calculation that let crafted images with huge inode or zone counts
pass superblock validation and crash the kernel during mount.
- cachefiles:
Fix a double unlock in the cachefiles nomem_d_alloc error path left
over from the start_creating() conversion.
- fat:
Stop fat from reading directory entries past the 0x00
end-of-directory marker. If the trailing on-disk slots aren't
zero-filled the driver surfaced arbitrary garbage as directory
entries.
- freevxfs:
Don't BUG() on unknown typed-extent types in freevxfs, reachable via
ioctl(FIBMAP) on a crafted image; fail with an I/O error instead.
- orangefs:
Keep the readdir entry size 64-bit in orangefs fill_from_part().
Truncating it to __u32 bypassed the bounds check and led to
out-of-bounds reads triggerable by the userspace client.
- xfs:
Fix the error unwind in xfs_open_devices() which released the rt
device file twice and left dangling buftarg pointers behind that were
freed again when the failed mount was torn down.
- exec:
Fix an off-by-one in the comment documenting the maximum binfmt
rewrite depth in exec_binprm(). The code allows five rewrites, not
four; restricting the code would break userspace so the comment is
fixed instead.
- file handles:
Reject detached mounts in capable_wrt_mount(). A detached mount can
be dissolved concurrently, leaving a NULL mount namespace that
open_by_handle_at() would dereference.
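
The minix item above comes down to a round-up division that wraps in
unsigned arithmetic. The following is a minimal userspace sketch of that
class of bug, not the kernel's code: the helper names
blocks_needed_naive() and blocks_needed_safe() and the 32-bit types are
assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round-up division written as (n + d - 1) / d.
 * The addition wraps when 'bits' is within d - 1 of UINT32_MAX, so a
 * huge count can yield a tiny block count.
 */
static uint32_t blocks_needed_naive(uint32_t bits, uint32_t bits_per_block)
{
	assert(bits_per_block != 0);
	return (uint32_t)(bits + bits_per_block - 1) / bits_per_block;
}

/*
 * Hypothetical helper: the same round-up division without the
 * intermediate sum, so it cannot wrap.
 */
static uint32_t blocks_needed_safe(uint32_t bits, uint32_t bits_per_block)
{
	assert(bits_per_block != 0);
	return bits / bits_per_block + (bits % bits_per_block != 0);
}
```

With 8192 bits per block, a count of UINT32_MAX makes the naive form
return 0 blocks, while the safe form returns 524288. A validation check
that compares the naive result against the on-disk block count would
therefore accept a crafted superblock that it should reject.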
* tag 'vfs-7.2-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (57 commits)
netfs: Fix barriering when walking subrequest list
iomap: submit read bio after each extent
fuse: call fuse_send_readpages explicitly from fuse_readahead
iomap: consolidate bio submission
fhandle: reject detached mounts in capable_wrt_mount()
netfs: Fix DIO write retry for filesystems without a ->prepare_write()
netfs: Fix folio state after ENOMEM whilst under writeback iteration
netfs: Fix writeback error handling
netfs: Fix writethrough to use collection offload
netfs: Replace wb_lock with a bit lock for asynchronicity
netfs: Fix kdoc warning
scatterlist: Fix offset in folio calc in extract_xarray_to_sg()
iov_iter: Remove unused variable in kunit_iov_iter.c
iov_iter: Fix a memory leak in iov_iter_extract_user_pages()
iov_iter: Fix missing alloc fail check in iov_iter_extract_bvec_pages()
iov_iter: Fix potential underflow in iov_iter_extract_xarray_pages()
cachefiles: Fix file burial to take lock when unsetting S_KERNEL_FILE
cachefiles: Fix double fput
netfs: Fix netfs_create_write_req() to handle async cache object creation
netfs: Fix decision whether to disallow write-streaming due to fscache use
...
52 files changed, 592 insertions, 178 deletions
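
The exec.c comment fix in this pull concerns how many binfmt rewrites
the depth check permits. As a rough userspace model of the `depth > 5`
loop in exec_binprm(), the count can be checked as below. The function
try_exec() and its parameter are hypothetical, and the real loop does
considerably more work per iteration.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model: 'rewrites_needed' is how many times a binfmt
 * handler would swap in a new interpreter before one finally loads
 * the image. Returns 0 on success, -ELOOP if the depth limit is hit.
 */
static int try_exec(int rewrites_needed)
{
	int depth;

	assert(rewrites_needed >= 0);
	for (depth = 0;; depth++) {
		if (depth > 5)
			return -ELOOP;
		/* A handler runs here; either it loads the image... */
		if (rewrites_needed == 0)
			break;
		/* ...or it requests a rewrite and the loop goes around. */
		rewrites_needed--;
	}
	return 0;
}
```

In this model try_exec(5) succeeds and try_exec(6) fails with -ELOOP,
which matches the corrected comment: five rewrites, not four.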
diff --git a/MAINTAINERS b/MAINTAINERS index 25453040dffb..7cc4bca5a2c5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -28725,7 +28725,7 @@ F: include/linux/vbox_utils.h F: include/uapi/linux/vbox*.h VIRTUAL BOX SHARED FOLDER VFS DRIVER -M: Hans de Goede <hansg@kernel.org> +M: Jori Koolstra <jkoolstra@xs4all.nl> L: linux-fsdevel@vger.kernel.org S: Maintained F: fs/vboxsf/* diff --git a/fs/afs/callback.c b/fs/afs/callback.c index 894d2bad6b6c..61354003c006 100644 --- a/fs/afs/callback.c +++ b/fs/afs/callback.c @@ -113,16 +113,12 @@ static struct afs_volume *afs_lookup_volume_rcu(struct afs_cell *cell, { struct afs_volume *volume = NULL; struct rb_node *p; - int seq = 1; - for (;;) { + scoped_seqlock_read(&cell->volume_lock, ss_lock) { /* Unfortunately, rbtree walking doesn't give reliable results * under just the RCU read lock, so we have to check for * changes. */ - seq++; /* 2 on the 1st/lockless path, otherwise odd */ - read_seqbegin_or_lock(&cell->volume_lock, &seq); - p = rcu_dereference_raw(cell->volumes.rb_node); while (p) { volume = rb_entry(p, struct afs_volume, cell_node); @@ -138,12 +134,9 @@ static struct afs_volume *afs_lookup_volume_rcu(struct afs_cell *cell, if (volume && afs_try_get_volume(volume, afs_volume_trace_get_callback)) break; - if (!need_seqretry(&cell->volume_lock, seq)) - break; - seq |= 1; /* Want a lock next time */ + volume = NULL; } - done_seqretry(&cell->volume_lock, seq); return volume; } @@ -221,7 +214,11 @@ static void afs_break_some_callbacks(struct afs_server *server, rcu_read_lock(); volume = afs_lookup_volume_rcu(server->cell, vid); - if (cbb->fid.vnode == 0 && cbb->fid.unique == 0) { + if (!volume) { + /* Ignore breaks on unknown volumes. 
*/ + rcu_read_unlock(); + *_count = 0; + } else if (cbb->fid.vnode == 0 && cbb->fid.unique == 0) { afs_break_volume_callback(server, volume); *_count -= 1; if (*_count) diff --git a/fs/afs/cell.c b/fs/afs/cell.c index 9738684dbdd2..47a2645768d7 100644 --- a/fs/afs/cell.c +++ b/fs/afs/cell.c @@ -206,11 +206,6 @@ static struct afs_cell *afs_alloc_cell(struct afs_net *net, cell->dns_status = vllist->status; smp_store_release(&cell->dns_lookup_count, 1); /* vs source/status */ atomic_inc(&net->cells_outstanding); - ret = idr_alloc_cyclic(&net->cells_dyn_ino, cell, - 2, INT_MAX / 2, GFP_KERNEL); - if (ret < 0) - goto error; - cell->dynroot_ino = ret; cell->debug_id = atomic_inc_return(&cell_debug_id); trace_afs_cell(cell->debug_id, 1, 0, afs_cell_trace_alloc); @@ -304,6 +299,13 @@ struct afs_cell *afs_lookup_cell(struct afs_net *net, goto cell_already_exists; } + ret = idr_alloc_cyclic(&net->cells_dyn_ino, candidate, + 2, INT_MAX / 2, GFP_KERNEL); + if (ret < 0) + goto cant_alloc_ino; + candidate->dynroot_ino = ret; + set_bit(AFS_CELL_FL_HAVE_INO, &candidate->flags); + cell = candidate; candidate = NULL; afs_use_cell(cell, trace); @@ -378,6 +380,11 @@ no_wait: _leave(" = %p [cell]", cell); return cell; +cant_alloc_ino: + up_write(&net->cells_lock); + afs_put_cell(candidate, afs_cell_trace_put_candidate); + goto error_noput; + cell_already_exists: _debug("cell exists"); cell = cursor; @@ -547,6 +554,8 @@ static int afs_update_cell(struct afs_cell *cell) rcu_assign_pointer(cell->vl_servers, vllist); cell->dns_source = vllist->source; old = p; + } else { + old = vllist; } write_unlock(&cell->vl_servers_lock); afs_put_vlserverlist(cell->net, old); @@ -577,7 +586,6 @@ static void afs_cell_destroy(struct rcu_head *rcu) afs_put_vlserverlist(net, rcu_access_pointer(cell->vl_servers)); afs_unuse_cell(cell->alias_of, afs_cell_trace_unuse_alias); key_put(cell->anonymous_key); - idr_remove(&net->cells_dyn_ino, cell->dynroot_ino); kfree(cell->name - 1); kfree(cell); @@ -592,6 
+600,13 @@ static void afs_destroy_cell_work(struct work_struct *work) afs_see_cell(cell, afs_cell_trace_destroy); timer_delete_sync(&cell->management_timer); cancel_work_sync(&cell->manager); + + if (test_bit(AFS_CELL_FL_HAVE_INO, &cell->flags)) { + down_write(&cell->net->cells_lock); + idr_remove(&cell->net->cells_dyn_ino, cell->dynroot_ino); + up_write(&cell->net->cells_lock); + } + call_rcu(&cell->rcu, afs_cell_destroy); } diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c index 5540ae1cad59..db394f101fc6 100644 --- a/fs/afs/cmservice.c +++ b/fs/afs/cmservice.c @@ -334,7 +334,6 @@ static int afs_deliver_cb_init_call_back_state3(struct afs_call *call) ret = afs_extract_data(call, false); switch (ret) { case 0: break; - case -EAGAIN: return 0; default: return ret; } @@ -364,6 +363,11 @@ static int afs_deliver_cb_init_call_back_state3(struct afs_call *call) if (!afs_check_call_state(call, AFS_CALL_SV_REPLYING)) return afs_io_error(call, afs_io_error_cm_reply); + if (!call->server) { + trace_afs_cm_no_server_u(call, call->request); + return 0; + } + if (memcmp(call->request, &call->server->_uuid, sizeof(call->server->_uuid)) != 0) { pr_notice("Callback UUID does not match fileserver UUID\n"); trace_afs_cm_no_server_u(call, call->request); @@ -451,7 +455,6 @@ static int afs_deliver_cb_probe_uuid(struct afs_call *call) ret = afs_extract_data(call, false); switch (ret) { case 0: break; - case -EAGAIN: return 0; default: return ret; } diff --git a/fs/afs/dir.c b/fs/afs/dir.c index 498b99ccdf0e..6df56fe9163f 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -28,9 +28,11 @@ static int afs_d_revalidate(struct inode *dir, const struct qstr *name, static int afs_d_delete(const struct dentry *dentry); static void afs_d_iput(struct dentry *dentry, struct inode *inode); static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, int nlen, - loff_t fpos, u64 ino, unsigned dtype); + u64 ino, u32 uniquifier); +#define AFS_LOOKUP_ONE ((filldir_t)0x123UL) static 
bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen, - loff_t fpos, u64 ino, unsigned dtype); + u64 ino, u32 uniquifier); +#define AFS_LOOKUP ((filldir_t)0x137UL) static int afs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode, bool excl); static struct dentry *afs_mkdir(struct mnt_idmap *idmap, struct inode *dir, @@ -421,11 +423,18 @@ static int afs_dir_iterate_block(struct afs_vnode *dvnode, } /* found the next entry */ - if (!dir_emit(ctx, dire->u.name, nlen, - ntohl(dire->u.vnode), - (ctx->actor == afs_lookup_filldir || - ctx->actor == afs_lookup_one_filldir)? - ntohl(dire->u.unique) : DT_UNKNOWN)) { + if (ctx->actor == AFS_LOOKUP) { + if (!afs_lookup_filldir(ctx, dire->u.name, nlen, + ntohl(dire->u.vnode), + ntohl(dire->u.unique))) + return 0; + } else if (ctx->actor == AFS_LOOKUP_ONE) { + if (!afs_lookup_one_filldir(ctx, dire->u.name, nlen, + ntohl(dire->u.vnode), + ntohl(dire->u.unique))) + return 0; + } else if (!dir_emit(ctx, dire->u.name, nlen, + ntohl(dire->u.vnode), DT_UNKNOWN)) { _leave(" = 0 [full]"); return 0; } @@ -545,6 +554,7 @@ static int afs_readdir(struct file *file, struct dir_context *ctx) { afs_dataversion_t dir_version; + ctx->dt_flags_mask = UINT_MAX; return afs_dir_iterate(file_inode(file), ctx, file, &dir_version); } @@ -554,14 +564,14 @@ static int afs_readdir(struct file *file, struct dir_context *ctx) * uniquifier through dtype */ static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, - int nlen, loff_t fpos, u64 ino, unsigned dtype) + int nlen, u64 ino, u32 uniquifier) { struct afs_lookup_one_cookie *cookie = container_of(ctx, struct afs_lookup_one_cookie, ctx); _enter("{%s,%u},%s,%u,,%llu,%u", cookie->name.name, cookie->name.len, name, nlen, - (unsigned long long) ino, dtype); + (unsigned long long) ino, uniquifier); /* insanity checks first */ BUILD_BUG_ON(sizeof(union afs_xdr_dir_block) != 2048); @@ -574,7 +584,7 @@ static bool 
afs_lookup_one_filldir(struct dir_context *ctx, const char *name, } cookie->fid.vnode = ino; - cookie->fid.unique = dtype; + cookie->fid.unique = uniquifier; cookie->found = 1; _leave(" = false [found]"); @@ -591,7 +601,7 @@ static int afs_do_lookup_one(struct inode *dir, const struct qstr *name, { struct afs_super_info *as = dir->i_sb->s_fs_info; struct afs_lookup_one_cookie cookie = { - .ctx.actor = afs_lookup_one_filldir, + .ctx.actor = AFS_LOOKUP_ONE, .name = *name, .fid.vid = as->volume->vid }; @@ -622,14 +632,14 @@ static int afs_do_lookup_one(struct inode *dir, const struct qstr *name, * uniquifier through dtype */ static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, - int nlen, loff_t fpos, u64 ino, unsigned dtype) + int nlen, u64 ino, u32 uniquifier) { struct afs_lookup_cookie *cookie = container_of(ctx, struct afs_lookup_cookie, ctx); _enter("{%s,%u},%s,%u,,%llu,%u", cookie->name.name, cookie->name.len, name, nlen, - (unsigned long long) ino, dtype); + (unsigned long long) ino, uniquifier); /* insanity checks first */ BUILD_BUG_ON(sizeof(union afs_xdr_dir_block) != 2048); @@ -637,7 +647,7 @@ static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, if (cookie->nr_fids < 50) { cookie->fids[cookie->nr_fids].vnode = ino; - cookie->fids[cookie->nr_fids].unique = dtype; + cookie->fids[cookie->nr_fids].unique = uniquifier; cookie->nr_fids++; } @@ -778,7 +788,7 @@ static struct inode *afs_do_lookup(struct inode *dir, struct dentry *dentry) for (i = 0; i < ARRAY_SIZE(cookie->fids); i++) cookie->fids[i].vid = dvnode->fid.vid; - cookie->ctx.actor = afs_lookup_filldir; + cookie->ctx.actor = AFS_LOOKUP; cookie->name = dentry->d_name; cookie->nr_fids = 2; /* slot 1 is saved for the fid we actually want * and slot 0 for the directory */ diff --git a/fs/afs/dynroot.c b/fs/afs/dynroot.c index 1d5e33bc7502..6e3c8c691ba9 100644 --- a/fs/afs/dynroot.c +++ b/fs/afs/dynroot.c @@ -278,7 +278,7 @@ static struct dentry 
*afs_lookup_atcell(struct inode *dir, struct dentry *dentry } /* - * Transcribe the cell database into readdir content under the RCU read lock. + * Transcribe the cell database into readdir content under net->cells_lock. * Each cell produces two entries, one prefixed with a dot and one not. */ static int afs_dynroot_readdir_cells(struct afs_net *net, struct dir_context *ctx) diff --git a/fs/afs/fs_operation.c b/fs/afs/fs_operation.c index c0dbbc6d3716..20801b29521d 100644 --- a/fs/afs/fs_operation.c +++ b/fs/afs/fs_operation.c @@ -348,7 +348,7 @@ int afs_put_operation(struct afs_operation *op) for (i = 0; i < op->nr_files - 2; i++) if (op->more_files[i].put_vnode) iput(&op->more_files[i].vnode->netfs.inode); - kfree(op->more_files); + kvfree(op->more_files); } if (op->estate) { diff --git a/fs/afs/inode.c b/fs/afs/inode.c index 3f48458694ba..14f39a9bea6c 100644 --- a/fs/afs/inode.c +++ b/fs/afs/inode.c @@ -52,9 +52,9 @@ static noinline void dump_vnode(struct afs_vnode *vnode, struct afs_vnode *paren /* * Set parameters for the netfs library */ -static void afs_set_netfs_context(struct afs_vnode *vnode) +static void afs_set_netfs_context(struct afs_vnode *vnode, bool is_file) { - netfs_inode_init(&vnode->netfs, &afs_req_ops, true); + netfs_inode_init(&vnode->netfs, &afs_req_ops, is_file); } /* @@ -93,6 +93,10 @@ static int afs_inode_init_from_status(struct afs_operation *op, inode->i_gid = make_kgid(&init_user_ns, status->group); set_nlink(&vnode->netfs.inode, status->nlink); + i_size_write(inode, status->size); + inode_set_bytes(inode, status->size); + afs_set_netfs_context(vnode, status->type == AFS_FTYPE_FILE); + switch (status->type) { case AFS_FTYPE_FILE: inode->i_mode = S_IFREG | (status->mode & S_IALLUGO); @@ -126,7 +130,6 @@ static int afs_inode_init_from_status(struct afs_operation *op, } inode->i_mapping->a_ops = &afs_symlink_aops; inode_nohighmem(inode); - mapping_set_release_always(inode->i_mapping); break; default: dump_vnode(vnode, op->file[0].vnode != 
vnode ? op->file[0].vnode : NULL); @@ -134,10 +137,6 @@ static int afs_inode_init_from_status(struct afs_operation *op, return afs_protocol_error(NULL, afs_eproto_file_type); } - i_size_write(inode, status->size); - inode_set_bytes(inode, status->size); - afs_set_netfs_context(vnode); - vnode->invalid_before = status->data_version; trace_afs_set_dv(vnode, status->data_version); inode_set_iversion_raw(&vnode->netfs.inode, status->data_version); @@ -566,7 +565,6 @@ struct inode *afs_root_iget(struct super_block *sb, struct key *key) vnode = AFS_FS_I(inode); vnode->cb_v_check = atomic_read(&as->volume->cb_v_break); - afs_set_netfs_context(vnode); op = afs_alloc_operation(key, as->volume); if (IS_ERR(op)) { @@ -682,6 +680,7 @@ void afs_evict_inode(struct inode *inode) inode->i_mapping->a_ops->writepages(inode->i_mapping, &wbc); } + flush_delayed_work(&vnode->lock_work); netfs_wait_for_outstanding_io(inode); truncate_inode_pages_final(&inode->i_data); netfs_free_folioq_buffer(vnode->directory); diff --git a/fs/afs/internal.h b/fs/afs/internal.h index 0b72a8566299..601f01e5c15f 100644 --- a/fs/afs/internal.h +++ b/fs/afs/internal.h @@ -388,6 +388,7 @@ struct afs_cell { #define AFS_CELL_FL_NO_GC 0 /* The cell was added manually, don't auto-gc */ #define AFS_CELL_FL_DO_LOOKUP 1 /* DNS lookup requested */ #define AFS_CELL_FL_CHECK_ALIAS 2 /* Need to check for aliases */ +#define AFS_CELL_FL_HAVE_INO 3 /* Have dynroot_ino */ enum afs_cell_state state; short error; enum dns_record_source dns_source:8; /* Latest source of data from lookup */ @@ -750,8 +751,6 @@ static inline void afs_vnode_set_cache(struct afs_vnode *vnode, { #ifdef CONFIG_AFS_FSCACHE vnode->netfs.cache = cookie; - if (cookie) - mapping_set_release_always(vnode->netfs.inode.i_mapping); #endif } diff --git a/fs/afs/super.c b/fs/afs/super.c index 942f3e9800d7..82bb713825a0 100644 --- a/fs/afs/super.c +++ b/fs/afs/super.c @@ -587,7 +587,8 @@ static int afs_get_tree(struct fs_context *fc) } fc->root = 
dget(sb->s_root); - trace_afs_get_tree(as->cell, as->volume); + if (!ctx->dyn_root) + trace_afs_get_tree(as->cell, as->volume); _leave(" = 0 [%p]", sb); return 0; @@ -659,7 +660,6 @@ static void afs_i_init_once(void *_vnode) INIT_LIST_HEAD(&vnode->wb_keys); INIT_LIST_HEAD(&vnode->pending_locks); INIT_LIST_HEAD(&vnode->granted_locks); - INIT_DELAYED_WORK(&vnode->lock_work, afs_lock_work); INIT_LIST_HEAD(&vnode->cb_mmap_link); seqlock_init(&vnode->cb_lock); } @@ -693,6 +693,7 @@ static struct inode *afs_alloc_inode(struct super_block *sb) init_rwsem(&vnode->rmdir_lock); INIT_WORK(&vnode->cb_work, afs_invalidate_mmap_work); + INIT_DELAYED_WORK(&vnode->lock_work, afs_lock_work); _leave(" = %p", &vnode->netfs.inode); return &vnode->netfs.inode; diff --git a/fs/afs/symlink.c b/fs/afs/symlink.c index ed5868369f37..16b4823cb7b7 100644 --- a/fs/afs/symlink.c +++ b/fs/afs/symlink.c @@ -255,11 +255,11 @@ int afs_symlink_writepages(struct address_space *mapping, } if (ret == 0) { - mutex_lock(&vnode->netfs.wb_lock); + netfs_wb_begin(&vnode->netfs, false); netfs_free_folioq_buffer(vnode->directory); vnode->directory = NULL; vnode->directory_size = 0; - mutex_unlock(&vnode->netfs.wb_lock); + netfs_wb_end(&vnode->netfs); } else if (ret == 1) { ret = 0; /* Skipped write due to lock conflict. 
*/ } diff --git a/fs/afs/vl_list.c b/fs/afs/vl_list.c index 3e4966915ea4..c1dac5dbed0d 100644 --- a/fs/afs/vl_list.c +++ b/fs/afs/vl_list.c @@ -92,7 +92,7 @@ static struct afs_addr_list *afs_extract_vl_addrs(struct afs_net *net, { struct afs_addr_list *alist; const u8 *b = *_b; - int ret = -EINVAL; + int ret; alist = afs_alloc_addrlist(nr_addrs); if (!alist) @@ -110,6 +110,7 @@ static struct afs_addr_list *afs_extract_vl_addrs(struct afs_net *net, case DNS_ADDRESS_IS_IPV4: if (end - b < 4) { _leave(" = -EINVAL [short inet]"); + ret = -EINVAL; goto error; } memcpy(x, b, 4); @@ -122,6 +123,7 @@ static struct afs_addr_list *afs_extract_vl_addrs(struct afs_net *net, case DNS_ADDRESS_IS_IPV6: if (end - b < 16) { _leave(" = -EINVAL [short inet6]"); + ret = -EINVAL; goto error; } memcpy(x, b, 16); @@ -198,6 +200,8 @@ struct afs_vlserver_list *afs_extract_vlserver_list(struct afs_cell *cell, b += sizeof(*hdr); while (end - b >= sizeof(bs)) { + int nlen; + bs.name_len = afs_extract_le16(&b); bs.priority = afs_extract_le16(&b); bs.weight = afs_extract_le16(&b); @@ -207,10 +211,12 @@ struct afs_vlserver_list *afs_extract_vlserver_list(struct afs_cell *cell, bs.protocol = *b++; bs.nr_addrs = *b++; + nlen = min3(bs.name_len, end - b, 255); + _debug("extract %u %u %u %u %u %u %*.*s", bs.name_len, bs.priority, bs.weight, bs.port, bs.protocol, bs.nr_addrs, - bs.name_len, bs.name_len, b); + bs.name_len, nlen, b); if (end - b < bs.name_len) break; @@ -287,8 +293,20 @@ struct afs_vlserver_list *afs_extract_vlserver_list(struct afs_cell *cell, afs_put_addrlist(old, afs_alist_trace_put_vlserver_old); } + /* Check for duplicates in the server list */ + for (j = 0; j < vllist->nr_servers; j++) { + struct afs_vlserver *s = vllist->servers[j].server; - /* TODO: Might want to check for duplicates */ + if (s->name_len == server->name_len && + s->port == server->port && + strncasecmp(s->name, server->name, server->name_len) == 0) { + afs_put_vlserver(cell->net, server); + server = NULL; + 
break; + } + } + if (!server) + continue; /* Insertion-sort by priority and weight */ for (j = 0; j < vllist->nr_servers; j++) { diff --git a/fs/afs/volume.c b/fs/afs/volume.c index 9ae5c8ad2e04..4f79d25ec37f 100644 --- a/fs/afs/volume.c +++ b/fs/afs/volume.c @@ -40,7 +40,7 @@ static struct afs_volume *afs_insert_volume_into_cell(struct afs_cell *cell, goto found; } - set_bit(AFS_VOLUME_RM_TREE, &volume->flags); + set_bit(AFS_VOLUME_RM_TREE, &p->flags); rb_replace_node_rcu(&p->cell_node, &volume->cell_node, &cell->volumes); } } diff --git a/fs/bpf_fs_kfuncs.c b/fs/bpf_fs_kfuncs.c index 768aca2dc0f0..f1863a891db6 100644 --- a/fs/bpf_fs_kfuncs.c +++ b/fs/bpf_fs_kfuncs.c @@ -360,18 +360,23 @@ __bpf_kfunc int bpf_cgroup_read_xattr(struct cgroup *cgroup, const char *name__s #endif /* CONFIG_CGROUPS */ /** - * bpf_real_inode - get the real inode backing a dentry - * @dentry: dentry to resolve + * bpf_real_data_inode - get the real inode hosting a file's data + * @file: file to resolve * - * If the dentry is on a union/overlay filesystem, return the underlying, real - * inode that hosts the data. Otherwise return the inode attached to the - * dentry itself. + * Resolve @file to the inode that hosts its data. For a regular file on a + * union/overlay filesystem this is the underlying (upper or lower) inode that + * stores the data, not the overlay inode. * - * Return: The real inode backing the dentry, or NULL for a negative dentry. + * Data resolution only applies to regular files. For a non-regular file (e.g. + * a device node, fifo or socket) on a union/overlay filesystem the overlay + * inode itself is returned; for any file on a non-union filesystem the inode + * attached to @file is returned. + * + * Return: The inode hosting @file's data, or NULL. 
*/ -__bpf_kfunc struct inode *bpf_real_inode(struct dentry *dentry) +__bpf_kfunc struct inode *bpf_real_data_inode(struct file *file) { - return d_real_inode(dentry); + return d_real_inode(file_dentry(file)); } __bpf_kfunc_end_defs(); @@ -384,7 +389,7 @@ BTF_ID_FLAGS(func, bpf_get_dentry_xattr, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_get_file_xattr, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_set_dentry_xattr, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_remove_dentry_xattr, KF_SLEEPABLE) -BTF_ID_FLAGS(func, bpf_real_inode, KF_SLEEPABLE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_real_data_inode, KF_SLEEPABLE | KF_RET_NULL) BTF_KFUNCS_END(bpf_fs_kfunc_set_ids) static int bpf_fs_kfuncs_filter(const struct bpf_prog *prog, u32 kfunc_id) diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c index 2937db690b40..8a9f6be15828 100644 --- a/fs/cachefiles/namei.c +++ b/fs/cachefiles/namei.c @@ -209,7 +209,6 @@ lookup_error: return ERR_PTR(ret); nomem_d_alloc: - inode_unlock(d_inode(dir)); _leave(" = -ENOMEM"); return ERR_PTR(-ENOMEM); } @@ -375,7 +374,7 @@ try_again: "Rename failed with error %d", ret); } - __cachefiles_unmark_inode_in_use(object, d_inode(rep)); + cachefiles_do_unmark_inode_in_use(object, d_inode(rep)); end_renaming(&rd); _leave(" = 0"); return 0; @@ -467,7 +466,6 @@ struct file *cachefiles_create_tmpfile(struct cachefiles_object *object) ret = -EINVAL; if (unlikely(!file->f_op->read_iter) || unlikely(!file->f_op->write_iter)) { - fput(file); pr_notice("Cache does not support read_iter and write_iter\n"); goto err_unuse; } diff --git a/fs/exec.c b/fs/exec.c index b92fe7db176c..d5993cedc829 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1717,7 +1717,7 @@ static int exec_binprm(struct linux_binprm *bprm) old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent)); rcu_read_unlock(); - /* This allows 4 levels of binfmt rewrites before failing hard. */ + /* This allows 5 levels of binfmt rewrites before failing hard. 
*/ for (depth = 0;; depth++) { struct file *exec; if (depth > 5) diff --git a/fs/exfat/iomap.c b/fs/exfat/iomap.c index 1aac38e63fe6..190fc6471f84 100644 --- a/fs/exfat/iomap.c +++ b/fs/exfat/iomap.c @@ -253,10 +253,7 @@ static void exfat_iomap_read_end_io(struct bio *bio) static void exfat_iomap_bio_submit_read(const struct iomap_iter *iter, struct iomap_read_folio_ctx *ctx) { - struct bio *bio = ctx->read_ctx; - - bio->bi_end_io = exfat_iomap_read_end_io; - submit_bio(bio); + iomap_bio_submit_read_endio(iter, ctx, exfat_iomap_read_end_io); } const struct iomap_read_ops exfat_iomap_bio_read_ops = { diff --git a/fs/fat/dir.c b/fs/fat/dir.c index 4f6f42f33613..c6cca5d00ffd 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -131,6 +131,31 @@ static inline int fat_get_entry(struct inode *dir, loff_t *pos, } /* + * Like fat_get_entry(), but honour the FAT end-of-directory marker: + * a dirent whose first name byte is NUL terminates iteration per the + * spec, which also guarantees that every following slot is zeroed. + * Skip straight to the end of the directory so the next call returns + * -1 from fat_bmap() without re-reading the trailing zero slots, and + * so callers that persist *pos across invocations (e.g. readdir's + * ctx->pos) keep reporting EOD. Release *bh and set it to NULL to + * match fat_get_entry()'s contract that *bh is NULL on the -1 return. + */ +static int fat_get_entry_eod(struct inode *dir, loff_t *pos, + struct buffer_head **bh, + struct msdos_dir_entry **de) +{ + int err = fat_get_entry(dir, pos, bh, de); + + if (err == 0 && (*de)->name[0] == 0) { + brelse(*bh); + *bh = NULL; + *pos = dir->i_size; + return -1; + } + return err; +} + +/* * Convert Unicode 16 to UTF-8, translated Unicode, or ASCII. 
  * If uni_xlate is enabled and we can't get a 1:1 conversion, use a
  * colon as an escape character since it is normally invalid on the vfat
@@ -327,7 +352,7 @@ parse_long:
 		if (ds->id & 0x40)
 			(*unicode)[offset + 13] = 0;
-		if (fat_get_entry(dir, pos, bh, de) < 0)
+		if (fat_get_entry_eod(dir, pos, bh, de) < 0)
 			return PARSE_EOF;
 		if (slot == 0)
 			break;
@@ -489,7 +514,7 @@ int fat_search_long(struct inode *inode, const unsigned char *name,
 	err = -ENOENT;
 	while (1) {
-		if (fat_get_entry(inode, &cpos, &bh, &de) == -1)
+		if (fat_get_entry_eod(inode, &cpos, &bh, &de) == -1)
 			goto end_of_dir;
 parse_record:
 		nr_slots = 0;
@@ -601,7 +626,7 @@ static int __fat_readdir(struct inode *inode, struct file *file,
 	bh = NULL;
 get_new:
-	if (fat_get_entry(inode, &cpos, &bh, &de) == -1)
+	if (fat_get_entry_eod(inode, &cpos, &bh, &de) == -1)
 		goto end_of_dir;
 parse_record:
 	nr_slots = 0;
@@ -885,7 +910,7 @@ static int fat_get_short_entry(struct inode *dir, loff_t *pos,
 			       struct buffer_head **bh,
 			       struct msdos_dir_entry **de)
 {
-	while (fat_get_entry(dir, pos, bh, de) >= 0) {
+	while (fat_get_entry_eod(dir, pos, bh, de) >= 0) {
 		/* free entry or long name entry or volume label */
 		if (!IS_FREE((*de)->name) && !((*de)->attr & ATTR_VOLUME))
 			return 0;
@@ -1302,6 +1327,7 @@ int fat_add_entries(struct inode *dir, void *slots, int nr_slots,
 	struct msdos_dir_entry *de;
 	int err, free_slots, i, nr_bhs;
 	loff_t pos;
+	bool saw_eod;
 	sinfo->nr_slots = nr_slots;
@@ -1310,12 +1336,15 @@ int fat_add_entries(struct inode *dir, void *slots, int nr_slots,
 	bh = prev = NULL;
 	pos = 0;
 	err = -ENOSPC;
+	saw_eod = false;
 	while (fat_get_entry(dir, &pos, &bh, &de) > -1) {
 		/* check the maximum size of directory */
 		if (pos >= FAT_MAX_DIR_SIZE)
 			goto error;
 		if (IS_FREE(de->name)) {
+			if (de->name[0] == 0)
+				saw_eod = true;
 			if (prev != bh) {
 				get_bh(bh);
 				bhs[nr_bhs] = prev = bh;
@@ -1325,6 +1354,13 @@ int fat_add_entries(struct inode *dir, void *slots, int nr_slots,
 			if (free_slots == nr_slots)
 				goto found;
 		} else {
+			if (saw_eod) {
+				fat_fs_error_ratelimit(sb,
+					"allocated dir entry found after end-of-directory marker (i_pos %lld)",
+					MSDOS_I(dir)->i_pos);
+				err = -EIO;
+				goto error;
+			}
 			for (i = 0; i < nr_bhs; i++)
 				brelse(bhs[i]);
 			prev = NULL;
diff --git a/fs/fhandle.c b/fs/fhandle.c
index 1ca7eb3a6cb5..f8829231e3d7 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -295,7 +295,7 @@ static bool capable_wrt_mount(struct mount *mount)
 	 */
 	guard(rcu)();
 	mnt_ns = READ_ONCE(mount->mnt_ns);
-	return ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN);
+	return mnt_ns && ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN);
 }
 static inline int may_decode_fh(struct handle_to_path_ctx *ctx,
diff --git a/fs/freevxfs/vxfs_bmap.c b/fs/freevxfs/vxfs_bmap.c
index e85222892038..1b8216eb1d90 100644
--- a/fs/freevxfs/vxfs_bmap.c
+++ b/fs/freevxfs/vxfs_bmap.c
@@ -227,7 +227,8 @@ vxfs_bmap_typed(struct inode *ip, long iblock)
 			return 0;
 		}
 		default:
-			BUG();
+			WARN_ON_ONCE(1);
+			return 0;
 		}
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e052a0d44dee..ceada75310b8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -981,19 +981,8 @@ static int fuse_iomap_read_folio_range_async(const struct iomap_iter *iter,
 	return ret;
 }
-static void fuse_iomap_submit_read(const struct iomap_iter *iter,
-				   struct iomap_read_folio_ctx *ctx)
-{
-	struct fuse_fill_read_data *data = ctx->read_ctx;
-
-	if (data->ia)
-		fuse_send_readpages(data->ia, data->file, data->nr_bytes,
-				    data->fc->async_read);
-}
-
 static const struct iomap_read_ops fuse_iomap_read_ops = {
 	.read_folio_range = fuse_iomap_read_folio_range_async,
-	.submit_read = fuse_iomap_submit_read,
 };
 static int fuse_read_folio(struct file *file, struct folio *folio)
@@ -1116,6 +1105,9 @@ static void fuse_readahead(struct readahead_control *rac)
 		return;
 	iomap_readahead(&fuse_iomap_ops, &ctx, NULL);
+	if (data.ia)
+		fuse_send_readpages(data.ia, data.file, data.nr_bytes,
+				    fc->async_read);
 }
 static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
diff --git a/fs/iomap/bio.c b/fs/iomap/bio.c
index 4504f4633f17..dc8ac7e370a5 100644
--- a/fs/iomap/bio.c
+++ b/fs/iomap/bio.c
@@ -78,14 +78,24 @@ u32 iomap_finish_ioend_buffered_read(struct iomap_ioend *ioend)
 	return __iomap_read_end_io(&ioend->io_bio, ioend->io_error);
 }
-static void iomap_bio_submit_read(const struct iomap_iter *iter,
-		struct iomap_read_folio_ctx *ctx)
+void iomap_bio_submit_read_endio(const struct iomap_iter *iter,
+		struct iomap_read_folio_ctx *ctx, bio_end_io_t end_io)
 {
 	struct bio *bio = ctx->read_ctx;
+	bio->bi_end_io = end_io;
 	if (iter->iomap.flags & IOMAP_F_INTEGRITY)
 		fs_bio_integrity_alloc(bio);
 	submit_bio(bio);
+
+	ctx->read_ctx = NULL;
+}
+EXPORT_SYMBOL_GPL(iomap_bio_submit_read_endio);
+
+static void iomap_bio_submit_read(const struct iomap_iter *iter,
+		struct iomap_read_folio_ctx *ctx)
+{
+	return iomap_bio_submit_read_endio(iter, ctx, iomap_read_end_io);
 }
 static struct bio_set *iomap_read_bio_set(struct iomap_read_folio_ctx *ctx)
@@ -127,7 +137,6 @@ static void iomap_read_alloc_bio(const struct iomap_iter *iter,
 	if (ctx->rac)
 		bio->bi_opf |= REQ_RAHEAD;
 	bio->bi_iter.bi_sector = iomap_sector(iomap, iter->pos);
-	bio->bi_end_io = iomap_read_end_io;
 	bio_add_folio_nofail(bio, folio, plen,
 			offset_in_folio(folio, iter->pos));
 	ctx->read_ctx = bio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8d4806dc46d4..276720bc18dc 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -642,12 +642,12 @@ void iomap_read_folio(const struct iomap_ops *ops,
 		fsverity_readahead(ctx->vi, folio->index,
 				   folio_nr_pages(folio));
-	while ((ret = iomap_iter(&iter, ops)) > 0)
+	while ((ret = iomap_iter(&iter, ops)) > 0) {
 		iter.status = iomap_read_folio_iter(&iter, ctx,
 				&bytes_submitted);
-
-	if (ctx->read_ctx && ctx->ops->submit_read)
-		ctx->ops->submit_read(&iter, ctx);
+		if (ctx->read_ctx && ctx->ops->submit_read)
+			ctx->ops->submit_read(&iter, ctx);
+	}
 	if (ctx->cur_folio)
 		iomap_read_end(ctx->cur_folio, bytes_submitted);
@@ -718,12 +718,12 @@ void iomap_readahead(const struct iomap_ops *ops,
 		fsverity_readahead(ctx->vi, readahead_index(rac),
 				   readahead_count(rac));
-	while (iomap_iter(&iter, ops) > 0)
+	while (iomap_iter(&iter, ops) > 0) {
 		iter.status = iomap_readahead_iter(&iter, ctx,
 				&cur_bytes_submitted);
-
-	if (ctx->read_ctx && ctx->ops->submit_read)
-		ctx->ops->submit_read(&iter, ctx);
+		if (ctx->read_ctx && ctx->ops->submit_read)
+			ctx->ops->submit_read(&iter, ctx);
+	}
 	if (ctx->cur_folio)
 		iomap_read_end(ctx->cur_folio, cur_bytes_submitted);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b485e3b191da..e2cd5f92babe 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -369,7 +369,7 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 	 */
 	if ((op & REQ_ATOMIC) && WARN_ON_ONCE(ret != iomap_length(iter))) {
 		ret = -EINVAL;
-		goto out_put_bio;
+		goto out_bio_release_pages;
 	}
 	if (iter->iomap.flags & IOMAP_F_INTEGRITY) {
@@ -393,6 +393,11 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 	return ret;
+out_bio_release_pages:
+	if (dio->flags & IOMAP_DIO_BOUNCE)
+		bio_iov_iter_unbounce(bio, true, false);
+	else
+		bio_release_pages(bio, false);
 out_put_bio:
 	bio_put(bio);
 	return ret;
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index f7c3e0c70fd7..0565328764c1 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -298,8 +298,12 @@ new_ioend:
 	 * appending writes.
 	 */
 	ioend->io_size += map_len;
-	if (ioend->io_offset + ioend->io_size > end_pos)
-		ioend->io_size = end_pos - ioend->io_offset;
+	if (ioend->io_offset + ioend->io_size > end_pos) {
+		if (ioend->io_offset >= end_pos)
+			ioend->io_size = 0;
+		else
+			ioend->io_size = end_pos - ioend->io_offset;
+	}
 	wbc_account_cgroup_owner(wpc->wbc, folio, map_len);
 	return map_len;
diff --git a/fs/minix/minix.h b/fs/minix/minix.h
index f2025c9b5825..9e52d4302f0d 100644
--- a/fs/minix/minix.h
+++ b/fs/minix/minix.h
@@ -97,7 +97,7 @@ static inline struct minix_inode_info *minix_i(struct inode *inode)
 static inline unsigned minix_blocks_needed(unsigned bits, unsigned blocksize)
 {
-	return DIV_ROUND_UP(bits, blocksize * 8);
+	return DIV_ROUND_UP_POW2(bits, blocksize * 8);
 }
 #if defined(CONFIG_MINIX_FS_NATIVE_ENDIAN) && \
diff --git a/fs/namei.c b/fs/namei.c
index 5cc9f0f466b8..19ce43c9a6e6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4736,6 +4736,10 @@ int vfs_tmpfile(struct mnt_idmap *idmap,
 	int error;
 	int open_flag = file->f_flags;
+	/* A tmpfile is I_LINKABLE, so guard its owner like may_o_create(). */
+	if (!fsuidgid_has_mapping(dir->i_sb, idmap))
+		return -EOVERFLOW;
+
 	/* we want directory to be writable */
 	error = inode_permission(idmap, dir, MAY_WRITE | MAY_EXEC);
 	if (error)
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 76d0f6a29aba..24a8a5418e31 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -659,7 +659,7 @@ retry:
 	 * within the cache granule containing the EOF, in which case we need
 	 * to preload the granule.
 	 */
-	if (!netfs_is_cache_enabled(ctx) &&
+	if (!netfs_is_cache_maybe_enabled(ctx) &&
 	    netfs_skip_folio_read(folio, pos, len, false)) {
 		netfs_stat(&netfs_n_rh_write_zskip);
 		goto have_folio_no_wait;
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 6bde3320bcec..2cdb68e6b16f 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -277,7 +277,7 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 		 * caching service temporarily because the backing store got
 		 * culled.
 		 */
-		if (netfs_is_cache_enabled(ctx)) {
+		if (netfs_is_cache_maybe_enabled(ctx)) {
 			if (finfo) {
 				netfs_stat(&netfs_n_wh_wstream_conflict);
 				goto flush_content;
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index 25f8ceb15fad..c16fbad286a1 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -166,13 +166,16 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 		 */
 		subreq->error = -EAGAIN;
 		trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
-		if (subreq->transferred > 0)
+		if (subreq->transferred > 0) {
 			iov_iter_advance(&wreq->buffer.iter, subreq->transferred);
+			wreq->transferred += subreq->transferred;
+		}
 		if (stream->source == NETFS_UPLOAD_TO_SERVER &&
 		    wreq->netfs_ops->retry_request)
 			wreq->netfs_ops->retry_request(wreq, stream);
+		__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
 		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
 		__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
 		__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
@@ -186,17 +189,10 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 		netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-		if (stream->prepare_write) {
+		if (stream->prepare_write)
 			stream->prepare_write(subreq);
-			__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-			netfs_stat(&netfs_n_wh_retry_write_subreq);
-		} else {
-			struct iov_iter source;
-
-			netfs_reset_iter(subreq);
-			source = subreq->io_iter;
-			netfs_reissue_write(stream, subreq, &source);
-		}
+		__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+		netfs_stat(&netfs_n_wh_retry_write_subreq);
 	}
 	netfs_unbuffered_write_done(wreq);
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 645996ecfc80..d889caa401dc 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -239,6 +239,18 @@ static inline bool netfs_is_cache_enabled(struct netfs_inode *ctx)
 #endif
 }
+static inline bool netfs_is_cache_maybe_enabled(struct netfs_inode *ctx)
+{
+#if IS_ENABLED(CONFIG_FSCACHE)
+	struct fscache_cookie *cookie = ctx->cache;
+
+	return fscache_cookie_valid(cookie) &&
+		test_bit(FSCACHE_COOKIE_IS_CACHING, &cookie->flags);
+#else
+	return false;
+#endif
+}
+
 /*
  * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
  */
diff --git a/fs/netfs/locking.c b/fs/netfs/locking.c
index 2249ecd09d0a..4e3be2b81504 100644
--- a/fs/netfs/locking.c
+++ b/fs/netfs/locking.c
@@ -9,6 +9,11 @@
 #include <linux/netfs.h>
 #include "internal.h"
+struct netfs_wb_waiter {
+	struct list_head link; /* Link in ictx->wb_queue */
+	struct task_struct *waiter; /* Waiter task; cleared when lock granted */
+};
+
 /*
  * inode_dio_wait_interruptible - wait for outstanding DIO requests to finish
  * @inode: inode to wait for
@@ -203,3 +208,93 @@ void netfs_end_io_direct(struct inode *inode)
 	up_read(&inode->i_rwsem);
 }
 EXPORT_SYMBOL(netfs_end_io_direct);
+
+/*
+ * Wait to have exclusive access to writeback.
+ */
+static bool netfs_wb_begin_wait(struct netfs_inode *ictx)
+{
+	struct netfs_wb_waiter waiter = {};
+	struct task_struct *tsk = current;
+	bool got = false;
+
+	spin_lock(&ictx->lock);
+
+	if (test_and_set_bit_lock(NETFS_ICTX_WB_LOCK, &ictx->flags)) {
+		get_task_struct(tsk);
+		waiter.waiter = tsk;
+		list_add_tail(&waiter.link, &ictx->wb_queue);
+	} else {
+		got = true;
+	}
+	spin_unlock(&ictx->lock);
+
+	if (!got) {
+		for (;;) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			/* Read waiter before accessing inode state. */
+			if (smp_load_acquire(&waiter.waiter) == NULL)
+				break;
+			schedule();
+		}
+	}
+	__set_current_state(TASK_RUNNING);
+	return true;
+}
+
+/**
+ * netfs_wb_begin - Begin writeback, waiting if need be
+ * @ictx: The inode to get writeback access on
+ * @nowait: Return failure immediately rather than waiting if true
+ *
+ * Begin writeback to an inode, waiting for exclusive access if @nowait is
+ * false. This prevents collection from being done out of order with respect
+ * to the issuance of write subrequests.
+ *
+ * Note that writeback may be ended in a different process (e.g. the collection
+ * function on a workqueue) than started it.
+ *
+ * Return: True if can proceed, false if denied.
+ */
+bool netfs_wb_begin(struct netfs_inode *ictx, bool nowait)
+{
+	if (!test_and_set_bit_lock(NETFS_ICTX_WB_LOCK, &ictx->flags))
+		return true;
+	if (nowait) {
+		netfs_stat(&netfs_n_wb_lock_skip);
+		return false;
+	}
+	netfs_stat(&netfs_n_wb_lock_wait);
+	return netfs_wb_begin_wait(ictx);
+}
+EXPORT_SYMBOL(netfs_wb_begin);
+
+/* netfs_wb_end - End writeback
+ * @ictx: The inode we have writeback access to
+ *
+ * End writeback access on an inode, waking up the next writeback request.
+ */
+void netfs_wb_end(struct netfs_inode *ictx)
+{
+	struct netfs_wb_waiter *waiter;
+	struct task_struct *tsk;
+
+	WARN_ON_ONCE(!test_bit(NETFS_ICTX_WB_LOCK, &ictx->flags));
+
+	spin_lock(&ictx->lock);
+
+	waiter = list_first_entry_or_null(&ictx->wb_queue, struct netfs_wb_waiter, link);
+	if (waiter) {
+		list_del(&waiter->link);
+		tsk = waiter->waiter;
+		/* Write inode state before clearing waiter. */
+		smp_store_release(&waiter->waiter, NULL);
+		wake_up_process(tsk);
+		put_task_struct(tsk);
+	} else {
+		clear_bit_unlock(NETFS_ICTX_WB_LOCK, &ictx->flags);
+	}
+
+	spin_unlock(&ictx->lock);
+}
+EXPORT_SYMBOL(netfs_wb_end);
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index f59a70f3a086..2b42758e01ec 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -98,7 +98,12 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 			goto abandon;
 		}
-		list_for_each_continue(next, &stream->subrequests) {
+		for (;;) {
+			/* Read pointer to subreq before reading subreq state. */
+			next = smp_load_acquire(&next->next);
+			if (next == &stream->subrequests)
+				break;
+
 			subreq = list_entry(next, struct netfs_io_subrequest, rreq_link);
 			if (subreq->start + subreq->transferred != start + len ||
 			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index 24fc2bb2f8a4..210eb8f3958d 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -408,6 +408,16 @@ bool netfs_write_collection(struct netfs_io_request *wreq)
 	netfs_wake_rreq_flag(wreq, NETFS_RREQ_IN_PROGRESS, netfs_rreq_trace_wake_ip);
 	/* As we cleared NETFS_RREQ_IN_PROGRESS, we acquired its ref. */
+	switch (wreq->origin) {
+	case NETFS_WRITEBACK:
+	case NETFS_WRITEBACK_SINGLE:
+	case NETFS_WRITETHROUGH:
+		netfs_wb_end(ictx);
+		break;
+	default:
+		break;
+	}
+
 	if (wreq->iocb) {
 		size_t written = min(wreq->transferred, wreq->len);
 		wreq->iocb->ki_pos += written;
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index c03c7cc45e47..f2761c99795a 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -106,7 +106,7 @@ struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
 	_enter("R=%x", wreq->debug_id);
 	ictx = netfs_inode(wreq->inode);
-	if (is_cacheable && netfs_is_cache_enabled(ictx))
+	if (is_cacheable)
 		fscache_begin_write_operation(&wreq->cache_resources, netfs_i_cookie(ictx));
 	if (rolling_buffer_init(&wreq->buffer, wreq->debug_id, ITER_SOURCE) < 0)
 		goto nomem;
@@ -551,14 +551,8 @@ int netfs_writepages(struct address_space *mapping,
 	struct folio *folio;
 	int error = 0;
-	if (!mutex_trylock(&ictx->wb_lock)) {
-		if (wbc->sync_mode == WB_SYNC_NONE) {
-			netfs_stat(&netfs_n_wb_lock_skip);
-			return 0;
-		}
-		netfs_stat(&netfs_n_wb_lock_wait);
-		mutex_lock(&ictx->wb_lock);
-	}
+	if (!netfs_wb_begin(ictx, wbc->sync_mode == WB_SYNC_NONE))
+		return 0;
 	/* Need the first folio to be able to set up the op. */
 	folio = writeback_iter(mapping, wbc, NULL, &error);
@@ -588,13 +582,13 @@ int netfs_writepages(struct address_space *mapping,
 		}
 		error = netfs_write_folio(wreq, wbc, folio);
-		if (error < 0)
-			break;
+		if (error == -ENOMEM) {
+			folio_redirty_for_writepage(wbc, folio);
+			folio_unlock(folio);
+		}
 	} while ((folio = writeback_iter(mapping, wbc, folio, &error)));
 	netfs_end_issue_write(wreq);
-
-	mutex_unlock(&ictx->wb_lock);
 	netfs_wake_collector(wreq);
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
@@ -602,9 +596,16 @@ int netfs_writepages(struct address_space *mapping,
 	return error;
 couldnt_start:
-	netfs_kill_dirty_pages(mapping, wbc, folio);
+	if (error == -ENOMEM) {
+		folio_redirty_for_writepage(wbc, folio);
+		folio_unlock(folio);
+		folio = writeback_iter(mapping, wbc, folio, &error);
+		WARN_ON_ONCE(folio != NULL);
+	} else {
+		netfs_kill_dirty_pages(mapping, wbc, folio);
+	}
 out:
-	mutex_unlock(&ictx->wb_lock);
+	netfs_wb_end(ictx);
 	_leave(" = %d", error);
 	return error;
 }
@@ -618,16 +619,17 @@ struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len
 	struct netfs_io_request *wreq = NULL;
 	struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
-	mutex_lock(&ictx->wb_lock);
+	netfs_wb_begin(ictx, false);
 	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
 				      iocb->ki_pos, NETFS_WRITETHROUGH);
 	if (IS_ERR(wreq)) {
-		mutex_unlock(&ictx->wb_lock);
+		netfs_wb_end(ictx);
 		return wreq;
 	}
 	wreq->io_streams[0].avail = true;
+	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
 	trace_netfs_write(wreq, netfs_write_trace_writethrough);
 	return wreq;
 }
@@ -685,7 +687,6 @@ int netfs_advance_writethrough(struct netfs_io_request *wreq, struct writeback_c
 ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct writeback_control *wbc,
 			       struct folio *writethrough_cache)
 {
-	struct netfs_inode *ictx = netfs_inode(wreq->inode);
 	ssize_t ret;
 	_enter("R=%x", wreq->debug_id);
@@ -699,8 +700,6 @@ ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct writeback_c
 	netfs_end_issue_write(wreq);
-	mutex_unlock(&ictx->wb_lock);
-
 	if (wreq->iocb)
 		ret = -EIOCBQUEUED;
 	else
@@ -847,15 +846,10 @@ int netfs_writeback_single(struct address_space *mapping,
 	if (WARN_ON_ONCE(!iov_iter_is_folioq(iter)))
 		return -EIO;
-	if (!mutex_trylock(&ictx->wb_lock)) {
-		if (wbc->sync_mode == WB_SYNC_NONE) {
-			/* The VFS will have undirtied the inode. */
-			netfs_single_mark_inode_dirty(&ictx->inode);
-			netfs_stat(&netfs_n_wb_lock_skip);
-			return 1;
-		}
-		netfs_stat(&netfs_n_wb_lock_wait);
-		mutex_lock(&ictx->wb_lock);
+	if (!netfs_wb_begin(ictx, wbc->sync_mode == WB_SYNC_NONE)) {
+		/* The VFS will have undirtied the inode. */
+		netfs_single_mark_inode_dirty(&ictx->inode);
+		return 1;
 	}
 	wreq = netfs_create_write_req(mapping, NULL, 0, NETFS_WRITEBACK_SINGLE);
@@ -893,7 +887,6 @@ stop:
 	smp_wmb(); /* Write lists before ALL_QUEUED. */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags);
-	mutex_unlock(&ictx->wb_lock);
 	netfs_wake_collector(wreq);
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
@@ -901,7 +894,7 @@ stop:
 	return ret;
 couldnt_start:
-	mutex_unlock(&ictx->wb_lock);
+	netfs_wb_end(ictx);
 	_leave(" = %d", ret);
 	return ret;
 }
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index 32735abfa03f..058bc7a166a5 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -72,7 +72,12 @@ static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 		    !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags))
 			return;
-		list_for_each_continue(next, &stream->subrequests) {
+		for (;;) {
+			/* Read pointer to subreq before reading subreq state. */
+			next = smp_load_acquire(&next->next);
+			if (next == &stream->subrequests)
+				break;
+
 			subreq = list_entry(next, struct netfs_io_subrequest, rreq_link);
 			if (subreq->start + subreq->transferred != start + len ||
 			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index 1fbf832ad165..f2bb56506046 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -38,11 +38,9 @@ static void ntfs_iomap_read_end_io(struct bio *bio)
 }
 static void ntfs_iomap_bio_submit_read(const struct iomap_iter *iter,
-		struct iomap_read_folio_ctx *ctx)
+				       struct iomap_read_folio_ctx *ctx)
 {
-	struct bio *bio = ctx->read_ctx;
-	bio->bi_end_io = ntfs_iomap_read_end_io;
-	submit_bio(bio);
+	iomap_bio_submit_read_endio(iter, ctx, ntfs_iomap_read_end_io);
 }
 static const struct iomap_read_ops ntfs_iomap_bio_read_ops = {
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index c43101cc064d..0c9bd669117d 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -608,10 +608,7 @@ static void ntfs_iomap_read_end_io(struct bio *bio)
 static void ntfs_iomap_bio_submit_read(const struct iomap_iter *iter,
 				       struct iomap_read_folio_ctx *ctx)
 {
-	struct bio *bio = ctx->read_ctx;
-
-	bio->bi_end_io = ntfs_iomap_read_end_io;
-	submit_bio(bio);
+	iomap_bio_submit_read_endio(iter, ctx, ntfs_iomap_read_end_io);
 }
 static const struct iomap_read_ops ntfs_iomap_bio_read_ops = {
diff --git a/fs/orangefs/dir.c b/fs/orangefs/dir.c
index 6e2ebc8b9867..115b2c2f5269 100644
--- a/fs/orangefs/dir.c
+++ b/fs/orangefs/dir.c
@@ -191,7 +191,8 @@ static int fill_from_part(struct orangefs_dir_part *part,
 {
 	const int offset = sizeof(struct orangefs_readdir_response_s);
 	struct orangefs_khandle *khandle;
-	__u32 *len, padlen;
+	__u32 *len;
+	u64 padlen;
 	loff_t i;
 	char *s;
 	i = ctx->pos & ~PART_MASK;
@@ -215,8 +216,8 @@ static int fill_from_part(struct orangefs_dir_part *part,
 		 * len is the size of the string itself. padlen is the
 		 * total size of the encoded string.
 		 */
-		padlen = (sizeof *len + *len + 1) +
-		    (8 - (sizeof *len + *len + 1)%8)%8;
+		padlen = (u64)sizeof *len + *len + 1;
+		padlen += (8 - padlen % 8) % 8;
 		if (part->len < i + padlen + sizeof *khandle)
 			goto next;
 		s = (void *)part + offset + i + sizeof *len;
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 13cb60b52bd6..e963701b4c87 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -853,7 +853,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 {
 	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
 	struct inode *udir = d_inode(c->destdir);
-	struct dentry *temp, *upper;
+	struct dentry *temp, *upper, *newdentry = NULL;
 	struct file *tmpfile;
 	int err;
@@ -889,6 +889,14 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 	err = PTR_ERR(upper);
 	if (!IS_ERR(upper)) {
 		err = ovl_do_link(ofs, temp, udir, upper);
+		if (!err) {
+			/*
+			 * Record the linked dentry -- not the disconnected
+			 * O_TMPFILE dentry -- so that ->d_revalidate() on
+			 * the upper fs sees the real parent/name.
+			 */
+			newdentry = dget(upper);
+		}
 		end_creating(upper);
 	}
@@ -903,7 +911,7 @@ static int ovl_copy_up_tmpfile(struct ovl_copy_up_ctx *c)
 	if (!c->metacopy)
 		ovl_set_upperdata(d_inode(c->dentry));
-	ovl_inode_update(d_inode(c->dentry), dget(temp));
+	ovl_inode_update(d_inode(c->dentry), newdentry);
 out:
 	ovl_end_write(c->dentry);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 00c69707bda9..bc71231cad53 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -783,8 +783,8 @@ static const struct address_space_operations ovl_aops = {
  *
  * This chain is valid:
  * - inode->i_rwsem (inode_lock[2])
- * - upper_mnt->mnt_sb->s_writers (ovl_want_write[0])
  * - OVL_I(inode)->lock (ovl_inode_lock[2])
+ * - upper_mnt->mnt_sb->s_writers (ovl_want_write[0])
  * - OVL_I(lowerinode)->lock (ovl_inode_lock[1])
  *
  * And this chain is valid:
@@ -797,8 +797,8 @@ static const struct address_space_operations ovl_aops = {
  * held, because it is in reverse order of the non-nested case using the same
  * upper fs:
  * - inode->i_rwsem (inode_lock[1])
- * - upper_mnt->mnt_sb->s_writers (ovl_want_write[0])
  * - OVL_I(inode)->lock (ovl_inode_lock[1])
+ * - upper_mnt->mnt_sb->s_writers (ovl_want_write[0])
  */
 #define OVL_MAX_NESTING FILESYSTEM_MAX_STACK_DEPTH
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index adc9b9a092b0..26086a283672 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -112,6 +112,8 @@ static bool pde_subdir_insert(struct proc_dir_entry *dir,
 	/* Add new node and rebalance tree. */
 	rb_link_node(&de->subdir_node, parent, new);
 	rb_insert_color(&de->subdir_node, root);
+	if (S_ISDIR(de->mode))
+		dir->nlink++;
 	return true;
 }
@@ -404,7 +406,6 @@ struct proc_dir_entry *proc_register(struct proc_dir_entry *dir,
 		write_unlock(&proc_subdir_lock);
 		goto out_free_inum;
 	}
-	dir->nlink++;
 	write_unlock(&proc_subdir_lock);
 	return dp;
@@ -706,6 +707,8 @@ static void pde_erase(struct proc_dir_entry *pde, struct proc_dir_entry *parent)
 {
 	rb_erase(&pde->subdir_node, &parent->subdir);
 	RB_CLEAR_NODE(&pde->subdir_node);
+	if (S_ISDIR(pde->mode))
+		parent->nlink--;
 }
 /*
@@ -731,8 +734,6 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
 			de = NULL;
 		} else {
 			pde_erase(de, parent);
-			if (S_ISDIR(de->mode))
-				parent->nlink--;
 		}
 	}
 	write_unlock(&proc_subdir_lock);
@@ -791,8 +792,6 @@ int remove_proc_subtree(const char *name, struct proc_dir_entry *parent)
 			continue;
 		}
 		next = de->parent;
-		if (S_ISDIR(de->mode))
-			next->nlink--;
 		write_unlock(&proc_subdir_lock);
 		proc_entry_rundown(de);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2a0c54256e93..51293b6f331f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -764,8 +764,7 @@ xfs_bio_submit_read(
 	/* defer read completions to the ioend workqueue */
 	iomap_init_ioend(iter->inode, bio, ctx->read_ctx_file_offset, 0);
-	bio->bi_end_io = xfs_end_bio;
-	submit_bio(bio);
+	iomap_bio_submit_read_endio(iter, ctx, xfs_end_bio);
 }
 static const struct iomap_read_ops xfs_iomap_read_ops = {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index eac7f9503805..8531d526fc44 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -534,8 +534,11 @@ xfs_open_devices(
 out_free_rtdev_targ:
 	if (mp->m_rtdev_targp)
 		xfs_free_buftarg(mp->m_rtdev_targp);
+	mp->m_rtdev_targp = NULL;
+	rtdev_file = NULL; /* released by xfs_free_buftarg() */
 out_free_ddev_targ:
 	xfs_free_buftarg(mp->m_ddev_targp);
+	mp->m_ddev_targp = NULL;
 out_close_rtdev:
 	if (rtdev_file)
 		bdev_fput(rtdev_file);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 3582ed1fe236..56b43d594e6e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -622,6 +622,8 @@ extern struct bio_set iomap_ioend_bioset;
 #ifdef CONFIG_BLOCK
 int iomap_bio_read_folio_range(const struct iomap_iter *iter,
 		struct iomap_read_folio_ctx *ctx, size_t plen);
+void iomap_bio_submit_read_endio(const struct iomap_iter *iter,
+		struct iomap_read_folio_ctx *ctx, bio_end_io_t end_io);
 extern const struct iomap_read_ops iomap_bio_read_ops;
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 243c0f737938..1bc120d61c5b 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -61,14 +61,16 @@ struct netfs_inode {
 #if IS_ENABLED(CONFIG_FSCACHE)
 	struct fscache_cookie *cache;
 #endif
-	struct mutex wb_lock; /* Writeback serialisation */
+	struct list_head wb_queue; /* Queue of processes wanting to do writeback */
 	loff_t _remote_i_size; /* Size of the remote file */
 	loff_t _zero_point; /* Size after which we assume there's no data
 			     * on the server */
+	spinlock_t lock; /* Lock covering wb_queue */
 	atomic_t io_count; /* Number of outstanding reqs */
 	unsigned long flags;
 #define NETFS_ICTX_ODIRECT 0 /* The file has DIO in progress */
 #define NETFS_ICTX_UNBUFFERED 1 /* I/O should not use the pagecache */
+#define NETFS_ICTX_WB_LOCK 2 /* Writeback serialisation lock */
 #define NETFS_ICTX_MODIFIED_ATTR 3 /* Indicate change in mtime/ctime */
 #define NETFS_ICTX_SINGLE_NO_UPLOAD 4 /* Monolithic payload, cache but no upload */
 };
@@ -462,6 +464,10 @@ int netfs_alloc_folioq_buffer(struct address_space *mapping,
 			      size_t *_cur_size, ssize_t size, gfp_t gfp);
 void netfs_free_folioq_buffer(struct folio_queue *fq);
+/* Writeback exclusion API. */
+bool netfs_wb_begin(struct netfs_inode *ictx, bool nowait);
+void netfs_wb_end(struct netfs_inode *ictx);
+
 /**
  * netfs_inode - Get the netfs inode context from the inode
  * @inode: The inode to query
@@ -743,7 +749,8 @@ static inline void netfs_inode_init(struct netfs_inode *ctx,
 #if IS_ENABLED(CONFIG_FSCACHE)
 	ctx->cache = NULL;
 #endif
-	mutex_init(&ctx->wb_lock);
+	INIT_LIST_HEAD(&ctx->wb_queue);
+	spin_lock_init(&ctx->lock);
 	/* ->releasepage() drives zero_point */
 	if (use_zero_point) {
 		ctx->_zero_point = ctx->_remote_i_size;
@@ -753,7 +760,7 @@ static inline void netfs_inode_init(struct netfs_inode *ctx,
 /**
  * netfs_resize_file - Note that a file got resized
- * @ctx: The netfs inode being resized
+ * @ictx: The netfs inode being resized
  * @new_i_size: The new file size
  * @changed_on_server: The change was applied to the server
  *
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 273919b16161..c2484551a4e8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1568,6 +1568,7 @@ static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
 	struct folio *folio;
 	unsigned int nr = 0, offset;
 	loff_t pos = i->xarray_start + i->iov_offset;
+	bool will_alloc = !*pages;
 	XA_STATE(xas, i->xarray, pos >> PAGE_SHIFT);
 	offset = pos & ~PAGE_MASK;
@@ -1595,6 +1596,14 @@ static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
 	}
 	rcu_read_unlock();
+	if (!nr) {
+		if (will_alloc) {
+			kvfree(*pages);
+			*pages = NULL;
+		}
+		return 0;
+	}
+
 	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
 	iov_iter_advance(i, maxsize);
 	return maxsize;
@@ -1628,6 +1637,8 @@ static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
 	bi.bi_bvec_done = skip;
 	maxpages = want_pages_array(pages, maxsize, skip, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
 	while (bi.bi_size && bi.bi_idx < i->nr_segs) {
 		struct bio_vec bv = bvec_iter_bvec(i->bvec, bi);
@@ -1745,6 +1756,7 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
 	unsigned long addr;
 	unsigned int gup_flags = 0;
size_t offset; + bool will_alloc = !*pages; int res; if (i->data_source == ITER_DEST) @@ -1761,8 +1773,14 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i, if (!maxpages) return -ENOMEM; res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages); - if (unlikely(res <= 0)) + if (unlikely(res <= 0)) { + if (will_alloc) { + kvfree(*pages); + *pages = NULL; + } return res; + } + maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset); iov_iter_advance(i, maxsize); return maxsize; diff --git a/lib/scatterlist.c b/lib/scatterlist.c index b7fe91ef35b8..6ea40d2e6247 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1366,6 +1366,7 @@ static ssize_t extract_xarray_to_sg(struct iov_iter *iter, sg_max--; maxsize -= len; + start += len; ret += len; if (maxsize <= 0 || sg_max == 0) break; diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c index 1e6fce9cb255..d9690ba1db88 100644 --- a/lib/tests/kunit_iov_iter.c +++ b/lib/tests/kunit_iov_iter.c @@ -283,7 +283,7 @@ static void __init iov_kunit_copy_to_bvec(struct kunit *test) struct page **spages, **bpages; u8 *scratch, *buffer; size_t bufsize, npages, size, copied; - int i, b, patt; + int i, patt; bufsize = 0x100000; npages = bufsize / PAGE_SIZE; @@ -306,10 +306,9 @@ static void __init iov_kunit_copy_to_bvec(struct kunit *test) KUNIT_EXPECT_EQ(test, iter.nr_segs, 0); /* Build the expected image in the scratch buffer. 
*/
-	b = 0;
 	patt = 0;
 	memset(scratch, 0, bufsize);
-	for (pr = bvec_test_ranges; pr->from >= 0; pr++, b++) {
+	for (pr = bvec_test_ranges; pr->from >= 0; pr++) {
 		u8 *p = scratch + pr->page * PAGE_SIZE;
 
 		for (i = pr->from; i < pr->to; i++)
diff --git a/tools/testing/selftests/filesystems/.gitignore b/tools/testing/selftests/filesystems/.gitignore
index 64ac0dfa46b7..a78f894157de 100644
--- a/tools/testing/selftests/filesystems/.gitignore
+++ b/tools/testing/selftests/filesystems/.gitignore
@@ -5,3 +5,4 @@ fclog
 file_stressor
 anon_inode_test
 kernfs_test
+idmapped_tmpfile
diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 85427d7f19b9..a7ec2ba2dd83 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -2,6 +2,10 @@
 
 CFLAGS += $(KHDR_INCLUDES)
 TEST_GEN_PROGS := devpts_pts file_stressor anon_inode_test kernfs_test fclog
+TEST_GEN_PROGS += idmapped_tmpfile
 TEST_GEN_PROGS_EXTENDED := dnotify_test
 
 include ../lib.mk
+
+$(OUTPUT)/idmapped_tmpfile: LDLIBS += -lcap
+$(OUTPUT)/idmapped_tmpfile: utils.c
diff --git a/tools/testing/selftests/filesystems/idmapped_tmpfile.c b/tools/testing/selftests/filesystems/idmapped_tmpfile.c
new file mode 100644
index 000000000000..bc411ab8281e
--- /dev/null
+++ b/tools/testing/selftests/filesystems/idmapped_tmpfile.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sched.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/fsuid.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+
+#include <linux/mount.h>
+#include <linux/types.h>
+
+#include "kselftest_harness.h"
+#include "wrappers.h"
+#include "utils.h"
+
+/*
+ * The test mount maps caller-visible ids [0, MAP_RANGE) onto the on-disk range
+ * [MAP_HOST, MAP_HOST + MAP_RANGE). An id outside [0, MAP_RANGE) therefore has
+ * no mapping in the mount and is not representable in the filesystem.
+ */
+#define MAP_HOST 10000
+#define MAP_RANGE 10000
+#define UNMAPPED 50000
+
+#ifndef MOUNT_ATTR_IDMAP
+#define MOUNT_ATTR_IDMAP 0x00100000
+#endif
+
+#ifndef __NR_mount_setattr
+#define __NR_mount_setattr 442
+#endif
+
+static inline int sys_mount_setattr(int dfd, const char *path,
+				    unsigned int flags,
+				    struct mount_attr *attr, size_t size)
+{
+	return syscall(__NR_mount_setattr, dfd, path, flags, attr, size);
+}
+
+/*
+ * Clone @path into a detached mount idmapped so that caller-visible ids
+ * [0, MAP_RANGE) map onto the on-disk ids [MAP_HOST, MAP_HOST + MAP_RANGE).
+ * Returns the mount fd, or -1 if idmapped mounts are not available.
+ */
+static int idmapped_clone(const char *path)
+{
+	struct mount_attr attr = {
+		.attr_set = MOUNT_ATTR_IDMAP,
+	};
+	int fd_tree, userns_fd, ret;
+
+	fd_tree = sys_open_tree(AT_FDCWD, path,
+				OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	if (fd_tree < 0)
+		return -1;
+
+	userns_fd = get_userns_fd(MAP_HOST, 0, MAP_RANGE);
+	if (userns_fd < 0) {
+		close(fd_tree);
+		return -1;
+	}
+
+	attr.userns_fd = userns_fd;
+	ret = sys_mount_setattr(fd_tree, "", AT_EMPTY_PATH, &attr, sizeof(attr));
+	close(userns_fd);
+	if (ret) {
+		close(fd_tree);
+		return -1;
+	}
+
+	return fd_tree;
+}
+
+FIXTURE(idmapped_tmpfile) {
+	char dir[64]; /* non-idmapped path to the layer directory */
+};
+
+FIXTURE_SETUP(idmapped_tmpfile)
+{
+	/* Private mount namespace so test mounts need no cleanup. */
+	ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+	ASSERT_EQ(sys_mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL), 0);
+	ASSERT_EQ(sys_mount("tmpfs", "/tmp", "tmpfs", 0, NULL), 0);
+
+	snprintf(self->dir, sizeof(self->dir), "/tmp/d");
+	ASSERT_EQ(mkdir(self->dir, 0777), 0);
+	/* World-writable so an unmapped caller still passes permission(). */
+	ASSERT_EQ(chmod(self->dir, 0777), 0);
+}
+
+FIXTURE_TEARDOWN(idmapped_tmpfile)
+{
+}
+
+/*
+ * A caller whose fsuid/fsgid have no mapping in the idmapped mount must not be
+ * able to create an O_TMPFILE. Without the check in vfs_tmpfile() the inode
+ * would be created owned by (uid_t)-1 and could then be linked into the
+ * namespace.
+ */
+TEST_F(idmapped_tmpfile, unmapped_caller_is_refused)
+{
+	int mfd, fd;
+
+	mfd = idmapped_clone(self->dir);
+	if (mfd < 0)
+		SKIP(return, "idmapped mounts not supported");
+
+	/* Become a caller outside the mount's [0, MAP_RANGE) range. */
+	setfsgid(UNMAPPED);
+	setfsuid(UNMAPPED);
+	ASSERT_EQ(setfsuid(-1), UNMAPPED);
+
+	fd = openat(mfd, ".", O_TMPFILE | O_WRONLY, 0644);
+	ASSERT_LT(fd, 0);
+	EXPECT_EQ(errno, EOVERFLOW);
+	if (fd >= 0)
+		close(fd);
+
+	EXPECT_EQ(close(mfd), 0);
+}
+
+/*
+ * A mapped caller can create an O_TMPFILE and link it into the namespace; the
+ * ownership round-trips through the mount idmap. This is what makes refusing
+ * the unmapped case above necessary in the first place.
+ */
+TEST_F(idmapped_tmpfile, mapped_caller_creates_and_links)
+{
+	char path[PATH_MAX];
+	struct stat st;
+	int mfd, fd;
+
+	mfd = idmapped_clone(self->dir);
+	if (mfd < 0)
+		SKIP(return, "idmapped mounts not supported");
+
+	/* Caller is uid/gid 0, which maps to MAP_HOST through the mount. */
+	fd = openat(mfd, ".", O_TMPFILE | O_RDWR, 0600);
+	ASSERT_GE(fd, 0);
+
+	ASSERT_EQ(fstat(fd, &st), 0);
+	EXPECT_EQ(st.st_uid, 0);
+	EXPECT_EQ(st.st_gid, 0);
+
+	/* The tmpfile is linkable: splice it into the directory. */
+	ASSERT_EQ(linkat(fd, "", mfd, "linked", AT_EMPTY_PATH), 0);
+	EXPECT_EQ(close(fd), 0);
+
+	ASSERT_EQ(fstatat(mfd, "linked", &st, 0), 0);
+	EXPECT_EQ(st.st_uid, 0);
+	EXPECT_EQ(st.st_gid, 0);
+
+	/* On the underlying, non-idmapped tmpfs it is stored as MAP_HOST. */
+	snprintf(path, sizeof(path), "%s/linked", self->dir);
+	ASSERT_EQ(stat(path, &st), 0);
+	EXPECT_EQ(st.st_uid, MAP_HOST);
+	EXPECT_EQ(st.st_gid, MAP_HOST);
+
+	EXPECT_EQ(close(mfd), 0);
+}
+
+TEST_HARNESS_MAIN
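
The id arithmetic that the idmapped_tmpfile.c comments describe can be sketched in plain user-space C. This is only an illustration of the range mapping the test relies on, not the kernel's idmapping code: the helper names (sketch_to_disk, sketch_from_disk, sketch_self_check) and the INVALID_ID sentinel are hypothetical, and only MAP_HOST, MAP_RANGE and UNMAPPED are taken from the selftest.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the selftest. */
#define MAP_HOST   10000u
#define MAP_RANGE  10000u
#define UNMAPPED   50000u

/* Hypothetical sentinel for "no mapping", mirroring (uid_t)-1. */
#define INVALID_ID ((uint32_t)-1)

/* Hypothetical helper: caller-visible id -> on-disk id. */
uint32_t sketch_to_disk(uint32_t id)
{
	return id < MAP_RANGE ? MAP_HOST + id : INVALID_ID;
}

/* Hypothetical helper: on-disk id -> caller-visible id. */
uint32_t sketch_from_disk(uint32_t id)
{
	if (id < MAP_HOST || id - MAP_HOST >= MAP_RANGE)
		return INVALID_ID;
	return id - MAP_HOST;
}

/* Self-check mirroring what the two tests expect from the mapping. */
void sketch_self_check(void)
{
	/* uid 0 through the mount is stored as MAP_HOST and reads back as 0. */
	assert(sketch_to_disk(0) == MAP_HOST);
	assert(sketch_from_disk(MAP_HOST) == 0);
	/* An id outside [0, MAP_RANGE) has no on-disk representation. */
	assert(sketch_to_disk(UNMAPPED) == INVALID_ID);
}
```

Under these assumptions, the mapped test corresponds to the first two assertions (owner 0 through the mount, MAP_HOST on the underlying tmpfs), and the unmapped test corresponds to the last one, which is the case the selftest expects openat() to reject with EOVERFLOW.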
