diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-10-30 09:14:19 -1000 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-10-30 09:14:19 -1000 |
commit | 3b3f874cc1d074bdcffc224d683925fd11808fe7 (patch) | |
tree | c08647fd96aac4e9c16ca01e0aa2e47329259845 /include | |
parent | 0d63d8b2294b228147bf58def506dde35e57daef (diff) | |
parent | 61d4fb0b349ec1b33119913c3b0bd109de30142c (diff) | |
download | lwn-3b3f874cc1d074bdcffc224d683925fd11808fe7.tar.gz lwn-3b3f874cc1d074bdcffc224d683925fd11808fe7.zip |
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get
the i_version"). This commit uses direct referencing to f_path in
IMA code that otherwise uses file_inode() and file_dentry() to
reference the filesystem objects that it is measuring.
This contains work to switch things around: instead of having
filesystem code opt-in to get the "real" path, have generic code
opt-in for the "fake" path in the few places that it is needed.
Is it far more likely that new filesystems code that does not use
the file_dentry() and file_real_path() helpers will end up causing
crashes or averting LSM/audit rules if we keep the "fake" path
exposed by default.
This change already makes file_dentry() moot, but for now we did
not change this helper just added a WARN_ON() in ovl_d_real() to
catch if we have made any wrong assumptions.
After the dust settles on this change, we can make file_dentry() a
plain accessor and we can drop the inode argument to ->d_real().
- Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
change but it really isn't and I would like to see everyone on
their tippie toes for any possible bugs from this work.
Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
files since a very long time because of the nasty interactions
between the SCM_RIGHTS file descriptor garbage collection. So
extending it makes a lot of sense but it is a subtle change. There
are almost no places that fiddle with file rcu semantics directly
and the ones that did mess around with struct file internal under
rcu have been made to stop doing that because it really was always
dodgy.
I forgot to put in the link tag for this change and the discussion
in the commit so adding it into the merge message:
https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Cleanups:
- Various smaller pipe cleanups including the removal of a spin lock
that was only used to protect against writes without pipe_lock()
from O_NOTIFICATION_PIPE aka watch queues. As that was never
implemented remove the additional locking from pipe_write().
- Annotate struct watch_filter with the new __counted_by attribute.
- Clarify do_unlinkat() cleanup so that it doesn't look like an extra
iput() is done that would cause issues.
- Simplify file cleanup when the file has never been opened.
- Use module helper instead of open-coding it.
- Predict error unlikely for stale retry.
- Use WRITE_ONCE() for mount expiry field instead of just commenting
that one hopes the compiler doesn't get smart.
Fixes:
- Fix readahead on block devices.
- Fix writeback when layztime is enabled and inodes whose timestamp
is the only thing that changed reside on wb->b_dirty_time. This
caused excessively large zombie memory cgroup when lazytime was
enabled as such inodes weren't handled fast enough.
- Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
file, i915: fix file reference for mmap_singleton()
vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
chardev: Simplify usage of try_module_get()
ovl: rely on SB_I_NOUMASK
fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
fs: store real path instead of fake path in backing file f_path
fs: create helper file_user_path() for user displayed mapped file path
fs: get mnt_writers count for an open backing file's real path
vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
vfs: predict the error in retry_estale as unlikely
backing file: free directly
vfs: fix readahead(2) on block devices
io_uring: use files_lookup_fd_locked()
file: convert to SLAB_TYPESAFE_BY_RCU
vfs: shave work on failed file open
fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
watch_queue: Annotate struct watch_filter with __counted_by
fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
fs/pipe: remove unnecessary spinlock from pipe_write()
...
Diffstat (limited to 'include')
-rw-r--r-- | include/linux/fdtable.h | 17 | ||||
-rw-r--r-- | include/linux/fs.h | 35 | ||||
-rw-r--r-- | include/linux/fsnotify.h | 3 | ||||
-rw-r--r-- | include/linux/mount.h | 4 | ||||
-rw-r--r-- | include/linux/namei.h | 26 | ||||
-rw-r--r-- | include/linux/pipe_fs_i.h | 22 | ||||
-rw-r--r-- | include/linux/watch_queue.h | 2 |
7 files changed, 73 insertions, 36 deletions
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index e066816f3519..bc4c3287a65e 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -98,20 +98,9 @@ static inline struct file *files_lookup_fd_locked(struct files_struct *files, un return files_lookup_fd_raw(files, fd); } -static inline struct file *files_lookup_fd_rcu(struct files_struct *files, unsigned int fd) -{ - RCU_LOCKDEP_WARN(!rcu_read_lock_held(), - "suspicious rcu_dereference_check() usage"); - return files_lookup_fd_raw(files, fd); -} - -static inline struct file *lookup_fd_rcu(unsigned int fd) -{ - return files_lookup_fd_rcu(current->files, fd); -} - -struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd); -struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *fd); +struct file *lookup_fdget_rcu(unsigned int fd); +struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd); +struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *fd); struct task_struct; diff --git a/include/linux/fs.h b/include/linux/fs.h index fb5accebdcdf..4ba1833854c6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1042,7 +1042,10 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) + +struct file *get_file_rcu(struct file __rcu **f); +struct file *get_file_active(struct file **f); + #define file_count(x) atomic_long_read(&(x)->f_count) #define MAX_NON_LFS ((1UL<<31) - 1) @@ -1119,7 +1122,7 @@ extern int send_sigurg(struct fown_struct *fown); #define SB_NOATIME BIT(10) /* Do not update access times. */ #define SB_NODIRATIME BIT(11) /* Do not update directory access times */ #define SB_SILENT BIT(15) -#define SB_POSIXACL BIT(16) /* VFS does not apply the umask */ +#define SB_POSIXACL BIT(16) /* Supports POSIX ACLs */ #define SB_INLINECRYPT BIT(17) /* Use blk-crypto for encrypted files */ #define SB_KERNMOUNT BIT(22) /* this is a kern_mount call */ #define SB_I_VERSION BIT(23) /* Update inode I_version field */ @@ -1166,6 +1169,7 @@ extern int send_sigurg(struct fown_struct *fown); #define SB_I_PERSB_BDI 0x00000200 /* has a per-sb bdi */ #define SB_I_TS_EXPIRY_WARNED 0x00000400 /* warned about timestamp range expiry */ #define SB_I_RETIRED 0x00000800 /* superblock shouldn't be reused */ +#define SB_I_NOUMASK 0x00001000 /* VFS does not apply umask */ /* Possible states of 'frozen' field */ enum { @@ -2082,7 +2086,12 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags #define IS_NOQUOTA(inode) ((inode)->i_flags & S_NOQUOTA) #define IS_APPEND(inode) ((inode)->i_flags & S_APPEND) #define IS_IMMUTABLE(inode) ((inode)->i_flags & S_IMMUTABLE) + +#ifdef CONFIG_FS_POSIX_ACL #define IS_POSIXACL(inode) __IS_FLG(inode, SB_POSIXACL) +#else +#define IS_POSIXACL(inode) 0 +#endif #define IS_DEADDIR(inode) ((inode)->i_flags & S_DEAD) #define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME) @@ -2449,24 +2458,24 @@ struct file *dentry_open(const struct path *path, int flags, const struct cred *creds); struct file *dentry_create(const struct path *path, int flags, umode_t mode, const struct cred *cred); -struct file *backing_file_open(const struct path *path, int flags, +struct file *backing_file_open(const struct path *user_path, int flags, const struct path *real_path, const struct cred *cred); -struct path *backing_file_real_path(struct file *f); +struct path *backing_file_user_path(struct file *f); /* - * file_real_path - get the path corresponding to f_inode + * file_user_path - get the path to display for memory mapped file * - * When opening a backing file for a stackable filesystem (e.g., - * overlayfs) f_path may be on the stackable filesystem and f_inode on - * the underlying filesystem. When the path associated with f_inode is - * needed, this helper should be used instead of accessing f_path - * directly. -*/ -static inline const struct path *file_real_path(struct file *f) + * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file + * stored in ->vm_file is a backing file whose f_inode is on the underlying + * filesystem. When the mapped file path is displayed to user (e.g. via + * /proc/<pid>/maps), this helper should be used to get the path to display + * to the user, which is the path of the fd that user has requested to map. + */ +static inline const struct path *file_user_path(struct file *f) { if (unlikely(f->f_mode & FMODE_BACKING)) - return backing_file_real_path(f); + return backing_file_user_path(f); return &f->f_path; } diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h index ed48e4f1e755..bcb6609b54b3 100644 --- a/include/linux/fsnotify.h +++ b/include/linux/fsnotify.h @@ -96,8 +96,7 @@ static inline int fsnotify_file(struct file *file, __u32 mask) if (file->f_mode & FMODE_NONOTIFY) return 0; - /* Overlayfs internal files have fake f_path */ - path = file_real_path(file); + path = &file->f_path; return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH); } diff --git a/include/linux/mount.h b/include/linux/mount.h index 4f40b40306d0..ac3dd2876197 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -92,8 +92,8 @@ extern bool __mnt_is_readonly(struct vfsmount *mnt); extern bool mnt_may_suid(struct vfsmount *mnt); extern struct vfsmount *clone_private_mount(const struct path *path); -extern int __mnt_want_write(struct vfsmount *); -extern void __mnt_drop_write(struct vfsmount *); +int mnt_get_write_access(struct vfsmount *mnt); +void mnt_put_write_access(struct vfsmount *mnt); extern struct vfsmount *fc_mount(struct fs_context *fc); extern struct vfsmount *vfs_create_mount(struct fs_context *fc); diff --git a/include/linux/namei.h b/include/linux/namei.h index 1463cbda4888..3100371b5e32 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -92,6 +92,30 @@ extern struct dentry *lock_rename(struct dentry *, struct dentry *); extern struct dentry *lock_rename_child(struct dentry *, struct dentry *); extern void unlock_rename(struct dentry *, struct dentry *); +/** + * mode_strip_umask - handle vfs umask stripping + * @dir: parent directory of the new inode + * @mode: mode of the new inode to be created in @dir + * + * In most filesystems, umask stripping depends on whether or not the + * filesystem supports POSIX ACLs. If the filesystem doesn't support it umask + * stripping is done directly in here. If the filesystem does support POSIX + * ACLs umask stripping is deferred until the filesystem calls + * posix_acl_create(). + * + * Some filesystems (like NFSv4) also want to avoid umask stripping by the + * VFS, but don't support POSIX ACLs. Those filesystems can set SB_I_NOUMASK + * to get this effect without declaring that they support POSIX ACLs. + * + * Returns: mode + */ +static inline umode_t __must_check mode_strip_umask(const struct inode *dir, umode_t mode) +{ + if (!IS_POSIXACL(dir) && !(dir->i_sb->s_iflags & SB_I_NOUMASK)) + mode &= ~current_umask(); + return mode; +} + extern int __must_check nd_jump_link(const struct path *path); static inline void nd_terminate_link(void *name, size_t len, size_t maxlen) @@ -112,7 +136,7 @@ static inline void nd_terminate_link(void *name, size_t len, size_t maxlen) static inline bool retry_estale(const long error, const unsigned int flags) { - return error == -ESTALE && !(flags & LOOKUP_REVAL); + return unlikely(error == -ESTALE && !(flags & LOOKUP_REVAL)); } #endif /* _LINUX_NAMEI_H */ diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 608a9eb86bff..8ff23bf5a819 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -62,9 +62,6 @@ struct pipe_inode_info { unsigned int tail; unsigned int max_usage; unsigned int ring_size; -#ifdef CONFIG_WATCH_QUEUE - bool note_loss; -#endif unsigned int nr_accounted; unsigned int readers; unsigned int writers; @@ -72,6 +69,9 @@ struct pipe_inode_info { unsigned int r_counter; unsigned int w_counter; bool poll_usage; +#ifdef CONFIG_WATCH_QUEUE + bool note_loss; +#endif struct page *tmp_page; struct fasync_struct *fasync_readers; struct fasync_struct *fasync_writers; @@ -125,6 +125,22 @@ struct pipe_buf_operations { }; /** + * pipe_has_watch_queue - Check whether the pipe is a watch_queue, + * i.e. it was created with O_NOTIFICATION_PIPE + * @pipe: The pipe to check + * + * Return: true if pipe is a watch queue, false otherwise. + */ +static inline bool pipe_has_watch_queue(const struct pipe_inode_info *pipe) +{ +#ifdef CONFIG_WATCH_QUEUE + return pipe->watch_queue != NULL; +#else + return false; +#endif +} + +/** * pipe_empty - Return true if the pipe is empty * @head: The pipe ring head pointer * @tail: The pipe ring tail pointer diff --git a/include/linux/watch_queue.h b/include/linux/watch_queue.h index 45cd42f55d49..429c7b6afead 100644 --- a/include/linux/watch_queue.h +++ b/include/linux/watch_queue.h @@ -32,7 +32,7 @@ struct watch_filter { DECLARE_BITMAP(type_filter, WATCH_TYPE__NR); }; u32 nr_filters; /* Number of filters */ - struct watch_type_filter filters[]; + struct watch_type_filter filters[] __counted_by(nr_filters); }; struct watch_queue { |