summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2023-08-28Merge tag 'pstore-v6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: - Greatly simplify compression support (Ard Biesheuvel) - Avoid crashes for corrupted offsets when prz size is 0 (Enlin Mu) - Expand range of usable record sizes (Yuxiao Zhang) - Fix kernel-doc warning (Matthew Wilcox) * tag 'pstore-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Fix kernel-doc warning pstore: Support record sizes larger than kmalloc() limit pstore/ram: Check start of empty przs during init pstore: Replace crypto API compression with zlib_deflate library calls pstore: Remove worst-case compression size logic
2023-08-28Merge tag 'for-6.6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "No new features, the bulk of the changes are fixes, refactoring and cleanups. The notable fix is the scrub performance restoration after rewrite in 6.4, though still only partial. Fixes: - scrub performance drop due to rewrite in 6.4 partially restored: - do IO grouping by blg_plug/blk_unplug again - avoid unnecessary tree searches when processing stripes, in extent and checksum trees - the drop is noticeable on fast PCIe devices, -66% and restored to -33% of the original - backports to 6.4 planned - handle more corner cases of transaction commit during orphan cleanup or delayed ref processing - use correct fsid/metadata_uuid when validating super block - copy directory permissions and time when creating a stub subvolume Core: - debugging feature integrity checker deprecated, to be removed in 6.7 - in zoned mode, zones are activated just before the write, making error handling easier, now the overcommit mechanism can be enabled again which improves performance by avoiding more frequent flushing - v0 extent handling completely removed, deprecated long time ago - error handling improvements - tests: - extent buffer bitmap tests - pinned extent splitting tests - cleanups and refactoring: - compression writeback - extent buffer bitmap - space flushing, ENOSPC handling" * tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits) btrfs: zoned: skip splitting and logical rewriting on pre-alloc write btrfs: tests: test invalid splitting when skipping pinned drop extent_map btrfs: tests: add a test for btrfs_add_extent_mapping btrfs: tests: add extent_map tests for dropping with odd layouts btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() btrfs: scrub: don't go ordered workqueue for dev-replace btrfs: scrub: fix grouping of read IO btrfs: scrub: avoid unnecessary csum tree search preparing stripes btrfs: scrub: avoid unnecessary extent tree search preparing stripes btrfs: copy dir permission and time when creating a stub subvolume btrfs: remove pointless empty list check when reading delayed dir indexes btrfs: drop redundant check to use fs_devices::metadata_uuid btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super btrfs: use the correct superblock to compare fsid in btrfs_validate_super btrfs: simplify memcpy either of metadata_uuid or fsid btrfs: add a helper to read the superblock metadata_uuid btrfs: remove v0 extent handling btrfs: output extra debug info if we failed to find an inline backref btrfs: move the !zoned assert into run_delalloc_cow btrfs: consolidate the error handling in run_delalloc_nocow ...
2023-08-28Merge tag 'affs-for-6.6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull affs updates from David Sterba: "Two minor updates for AFFS: - reimplement writepage() address space callback on top of migrate_folio() - fix a build warning, local parameters 'toupper' collide with the standard ctype.h name" * tag 'affs-for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: rename local toupper() to fn() to avoid confusion affs: remove writepage implementation
2023-08-28Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds
Pull fsverity updates from Eric Biggers: "Several cleanups for fs/verity/, including two commits that make the builtin signature support more cleanly separated from the base feature" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: skip PKCS#7 parser when keyring is empty fsverity: move sysctl registration out of signature.c fsverity: simplify handling of errors during initcall fsverity: explicitly check that there is no algorithm 0
2023-08-28Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap updates from Darrick Wong: "We've got some big changes for this release -- I'm very happy to be landing willy's work to enable large folios for the page cache for general read and write IOs when the fs can make contiguous space allocations, and Ritesh's work to track sub-folio dirty state to eliminate the write amplification problems inherent in using large folios. As a bonus, io_uring can now process write completions in the caller's context instead of bouncing through a workqueue, which should reduce io latency dramatically. IOWs, XFS should see a nice performance bump for both IO paths. Summary: - Make large writes to the page cache fill sparse parts of the cache with large folios, then use large memcpy calls for the large folio. - Track the per-block dirty state of each large folio so that a buffered write to a single byte on a large folio does not result in a (potentially) multi-megabyte writeback IO. - Allow some directio completions to be performed in the initiating task's context instead of punting through a workqueue. This will reduce latency for some io_uring requests" * tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) iomap: support IOCB_DIO_CALLER_COMP io_uring/rw: add write support for IOCB_DIO_CALLER_COMP fs: add IOCB flags related to passing back dio completions iomap: add IOMAP_DIO_INLINE_COMP iomap: only set iocb->private for polled bio iomap: treat a write through cache the same as FUA iomap: use an unsigned type for IOMAP_DIO_* defines iomap: cleanup up iomap_dio_bio_end_io() iomap: Add per-block dirty state tracking to improve performance iomap: Allocate ifs in ->write_begin() early iomap: Refactor iomap_write_delalloc_punch() function out iomap: Use iomap_punch_t typedef iomap: Fix possible overflow condition in iomap_write_delalloc_scan iomap: Add some uptodate state handling helpers for ifs state bitmap iomap: Drop ifs argument from iomap_set_range_uptodate() iomap: Rename iomap_page to iomap_folio_state and others iomap: Copy larger chunks from userspace iomap: Create large folios in the buffered write path filemap: Allow __filemap_get_folio to allocate large folios filemap: Add fgf_t typedef ...
2023-08-28Merge tag 'erofs-for-6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, a xattr bloom filter feature is introduced to speed up negative xattr lookups, which was originally suggested by Alexander for Composefs use cases. Additionally, the DEFLATE algorithm is now supported, which can be used together with hardware accelerators for our cloud workloads. Each supported compression algorithm can be selected on a per-file basis for specific access patterns too. There are also some random fixes and cleanups as usual: - Support xattr bloom filter to optimize negative xattr lookups - Support DEFLATE compression algorithm as an alternative - Fix a regression that ztailpacking pclusters don't release properly - Avoid warning dedupe and fragments features anymore - Some folio conversions and cleanups" * tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: release ztailpacking pclusters properly erofs: don't warn dedupe and fragments features anymore erofs: adapt folios for z_erofs_read_folio() erofs: adapt folios for z_erofs_readahead() erofs: get rid of fe->backmost for cache decompression erofs: drop z_erofs_page_mark_eio() erofs: tidy up z_erofs_do_read_page() erofs: move preparation logic into z_erofs_pcluster_begin() erofs: avoid obsolete {collector,collection} terms erofs: simplify z_erofs_read_fragment() erofs: remove redundant erofs_fs_type declaration in super.c erofs: add necessary kmem_cache_create flags for erofs inode cache erofs: clean up redundant comment and adjust code alignment erofs: refine warning messages for zdata I/Os erofs: boost negative xattr lookup with bloom filter erofs: update on-disk format for xattr name filter erofs: DEFLATE compression support
2023-08-28Merge tag 'filelock-v6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: - new functionality for F_OFD_GETLK: requesting a type of F_UNLCK will find info about whatever lock happens to be first in the given range, regardless of type. - an OFD lock selftest - bugfix involving a UAF in a tracepoint - comment typo fix * tag 'filelock-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock fs/locks: Fix typo selftests: add OFD lock tests fs/locks: F_UNLCK extension for F_OFD_GETLK
2023-08-28Merge tag 'v6.6-fs.proc.uapi' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs fixes from Christian Brauner: "Mode changes to files under /proc/<pid>/ aren't supported ever since commit 6d76fa58b050 ("Don't allow chmod() on the /proc/<pid>/ files"). Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD, mode changes on /proc/thread-self/comm were accidently allowed. Similar, mode changes for all files beneath /proc/<pid>/net/ are blocked but mode changes on /proc/<pid>/net itself were accidently allowed. Both issues come down to not using the generic proc_setattr() helper which blocks all mode changes. This is rectified with this pull request. This also removes a strange nolibc test that abused /proc/<pid>/net for testing mode changes. Using procfs for this test never made a lot of sense given procfs has special semantics for almost everything anway. Both changes are minor user-visible changes. It is however very unlikely that mode changes on proc/<pid>/net and /proc/thread-self/comm are something that userspace relies on" * tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: procfs: block chmod on /proc/thread-self/comm proc: use generic setattr() for /proc/$PID/net selftests/nolibc: drop test chmod_net
2023-08-28Merge tag 'v6.6-vfs.autofs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull autofs fixes from Christian Brauner: "This fixes a memory leak in autofs reported by syzkaller and a missing conversion from uninterruptible to interruptible wake up when autofs is in catatonic mode" * tag 'v6.6-vfs.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: autofs: use wake_up() instead of wake_up_interruptible(() autofs: fix memory leak of waitqueues in autofs_catatonic_mode
2023-08-28Merge tag 'v6.6-vfs.fchmodat2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fchmodat2 system call from Christian Brauner: "This adds the fchmodat2() system call. It is a revised version of the fchmodat() system call, adding a missing flag argument. Support for both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included. Adding this system call revision has been a longstanding request but so far has always fallen through the cracks. While the kernel implementation of fchmodat() does not have a flag argument the libc provided POSIX-compliant fchmodat(3) version does. Both glibc and musl have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW (see [1] and [2]). The workaround is brittle because it relies not just on O_PATH and O_NOFOLLOW semantics and procfs magic links but also on our rather inconsistent symlink semantics. This gives userspace a proper fchmodat2() system call that libcs can use to properly implement fchmodat(3) and allows them to get rid of their hacks. In this case it will immediately benefit them as the current workaround is already defunct because of aformentioned inconsistencies. In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use AT_EMPTY_PATH with fchmodat2(). This is already possible with fchownat() so there's no reason to not also support it for fchmodat2(). The implementation is simple and comes with selftests. Implementation of the system call and wiring up the system call are done as separate patches even though they could arguably be one patch. But in case there are merge conflicts from other system call additions it can be beneficial to have separate patches" Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2] * tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: fchmodat2: remove duplicate unneeded defines fchmodat2: add support for AT_EMPTY_PATH selftests: Add fchmodat2 selftest arch: Register fchmodat2, usually as syscall 452 fs: Add fchmodat2() Non-functional cleanup of a "__user * filename"
2023-08-28Merge tag 'v6.6-vfs.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are rougly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() if fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...
2023-08-28Merge tag 'v6.6-vfs.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows to timeout and reboot instead of always waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace it all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...
2023-08-28Merge tag 'v6.6-vfs.tmpfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull libfs and tmpfs updates from Christian Brauner: "This cycle saw a lot of work for tmpfs that required changes to the vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this cycle. Things will go back to mm next cycle. Features ======== - By far the biggest work is the quota support for tmpfs. New tmpfs quota infrastructure is added to support it and a new QFMT_SHMEM uapi option is exposed. This offers user and group quotas to tmpfs (project quotas will be added later). Similar to other filesystems tmpfs quota are not supported within user namespaces yet. - Add support for user xattrs. While tmpfs already supports security xattrs (security.*) and POSIX ACLs for a long time it lacked support for user xattrs (user.*). With this pull request tmpfs will be able to support a limited number of user xattrs. This is accompanied by a fix (see below) to limit persistent simple xattr allocations. - Add support for stable directory offsets. Currently tmpfs relies on the libfs provided cursor-based mechanism for readdir. This causes issues when a tmpfs filesystem is exported via NFS. NFS clients do not open directories. Instead, each server-side readdir operation opens the directory, reads it, and then closes it. Since the cursor state for that directory is associated with the opened file it is discarded after each readdir operation. Such directory offsets are not just cached by NFS clients but also various userspace libraries based on these clients. As it stands there is no way to invalidate the caches when directory offsets have changed and the whole application depends on unchanging directory offsets. At LSFMM we discussed how to solve this problem and decided to support stable directory offsets. libfs now allows filesystems like tmpfs to use an xarrary to map a directory offset to a dentry. This mechanism is currently only used by tmpfs but can be supported by others as well. Fixes ===== - Change persistent simple xattrs allocations in libfs from GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory cgroup limits. Since this is a change to libfs it affects both tmpfs and kernfs. - Correctly verify {g,u}id mount options. A new filesystem context is created via fsopen() which records the namespace that becomes the owning namespace of the superblock when fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are mountable in namespaces. However, fsconfig() calls can occur in a namespace different from the namespace where fsopen() has been called. Currently, when fsconfig() is called to set {g,u}id mount options the requested {g,u}id is mapped into a k{g,u}id according to the namespace where fsconfig() was called from. The resulting k{g,u}id is not guaranteed to be resolvable in the namespace of the filesystem (the one that fsopen() was called in). This means it's possible for an unprivileged user to create files owned by any group in a tmpfs mount since it's possible to set the setid bits on the tmpfs directory. The contract for {g,u}id mount options and {g,u}id values in general set from userspace has always been that they are translated according to the caller's idmapping. In so far, tmpfs has been doing the correct thing. But since tmpfs is mountable in unprivileged contexts it is also necessary to verify that the resulting {k,g}uid is representable in the namespace of the superblock to avoid such bugs. The new mount api's cross-namespace delegation abilities are already widely used. Having talked to a bunch of userspace this is the most faithful solution with minimal regression risks" * tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs mm: invalidation check mapping before folio_contains tmpfs: trivial support for direct IO tmpfs,xattr: enable limited user extended attributes tmpfs: track free_ispace instead of free_inodes xattr: simple_xattr_set() return old_xattr to be freed tmpfs: verify {g,u}id mount options correctly shmem: move spinlock into shmem_recalc_inode() to fix quota support libfs: Remove parent dentry locking in offset_iterate_dir() libfs: Add a lock class for the offset map's xa_lock shmem: stable directory offsets shmem: Refactor shmem_symlink() libfs: Add directory operations for stable offsets shmem: fix quota lock nesting in huge hole handling shmem: Add default quota limit mount options shmem: quota support shmem: prepare shmem quota infrastructure quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks shmem: make shmem_get_inode() return ERR_PTR instead of NULL shmem: make shmem_inode_acct_block() return error
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-28Merge tag 'v6.6-vfs.fs_context' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mount API updates from Christian Brauner: "This introduces FSCONFIG_CMD_CREATE_EXCL which allows userspace to implement something like $ mount -t ext4 --exclusive /dev/sda /B which fails if a superblock for the requested filesystem does already exist instead of silently reusing an existing superblock. Without it, in the sequence $ move-mount -f xfs -o source=/dev/sda4 /A $ move-mount -f xfs -o noacl,source=/dev/sda4 /B the initial mounter will create a superblock. The second mounter will reuse the existing superblock, creating a bind-mount (see [1] for the source of the move-mount binary). The problem is that reusing an existing superblock means all mount options other than read-only and read-write will be silently ignored even if they are incompatible requests. For example, the second mount has requested no POSIX ACL support but since the existing superblock is reused POSIX ACL support will remain enabled. Such silent superblock reuse can easily become a security issue. After adding support for FSCONFIG_CMD_CREATE_EXCL to mount(8) in util-linux this can be fixed: $ move-mount -f xfs --exclusive -o source=/dev/sda4 /A $ move-mount -f xfs --exclusive -o noacl,source=/dev/sda4 /B Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed This requires the new mount api. With the old mount api it would be necessary to plumb this through every legacy filesystem's file_system_type->mount() method. If they want this feature they are most welcome to switch to the new mount api" Link: https://github.com/brauner/move-mount-beneath [1] Link: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner Link: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner Link: https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner Link: https://lore.kernel.org/lkml/20230824-anzog-allheilmittel-e8c63e429a79@brauner/ * tag 'v6.6-vfs.fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add FSCONFIG_CMD_CREATE_EXCL fs: add vfs_cmd_reconfigure() fs: add vfs_cmd_create() super: remove get_tree_single_reconf()
2023-08-25Merge tag 'mm-hotfixes-stable-2023-08-25-11-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4 issues or aren't considered suitable for a -stable backport" * tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: shmem: fix smaps BUG sleeping while atomic selftests: cachestat: catch failing fsync test on tmpfs selftests: cachestat: test for cachestat availability maple_tree: disable mas_wr_append() when other readers are possible madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check mm: multi-gen LRU: don't spin during memcg release mm: memory-failure: fix unexpected return value in soft_offline_page() radix tree: remove unused variable mm: add a call to flush_cache_vmap() in vmap_pfn() selftests/mm: FOLL_LONGTERM need to be updated to 0x100 nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers() mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast selftests: cgroup: fix test_kmem_basic less than error mm: enable page walking API to lock vmas during the walk smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
2023-08-24Merge tag 'nfsd-6.5-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Two last-minute one-liners for v6.5-rc. One got lost in the shuffle, and the other was reported just this morning" - Close race window when handling FREE_STATEID operations - Fix regression in /proc/fs/nfsd/v4_end_grace introduced in v6.5-rc" * tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix a thinko introduced by recent trace point changes nfsd: Fix race to FREE_STATEID and cl_revoked
2023-08-24NFSD: Fix a thinko introduced by recent trace point changesChuck Lever
The fixed commit erroneously removed a call to nfsd_end_grace(), which makes calls to write_v4_end_grace() a no-op. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202308241229.68396422-oliver.sang@intel.com Fixes: 39d432fc7630 ("NFSD: trace nfsctl operations") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-24locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lockWill Shiu
As following backtrace, the struct file_lock request , in posix_lock_inode is free before ftrace function using. Replace the ftrace function ahead free flow could fix the use-after-free issue. [name:report&]=============================================== BUG:KASAN: use-after-free in trace_event_raw_event_filelock_lock+0x80/0x12c [name:report&]Read at addr f6ffff8025622620 by task NativeThread/16753 [name:report_hw_tags&]Pointer tag: [f6], memory tag: [fe] [name:report&] BT: Hardware name: MT6897 (DT) Call trace: dump_backtrace+0xf8/0x148 show_stack+0x18/0x24 dump_stack_lvl+0x60/0x7c print_report+0x2c8/0xa08 kasan_report+0xb0/0x120 __do_kernel_fault+0xc8/0x248 do_bad_area+0x30/0xdc do_tag_check_fault+0x1c/0x30 do_mem_abort+0x58/0xbc el1_abort+0x3c/0x5c el1h_64_sync_handler+0x54/0x90 el1h_64_sync+0x68/0x6c trace_event_raw_event_filelock_lock+0x80/0x12c posix_lock_inode+0xd0c/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 ... [name:report&] [name:report&]Allocated by task 16752: ... slab_post_alloc_hook+0x74/0x340 kmem_cache_alloc+0x1b0/0x2f0 posix_lock_inode+0xb0/0xd60 ... [name:report&] [name:report&]Freed by task 16752: ... kmem_cache_free+0x274/0x5b0 locks_dispose_list+0x3c/0x148 posix_lock_inode+0xc40/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 do_fcntl+0x150/0xc18 ... Signed-off-by: Will Shiu <Will.Shiu@mediatek.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-08-24fs/locks: Fix typoJakub Wilk
Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-08-23erofs: release ztailpacking pclusters properlyJingbo Xu
Currently ztailpacking pclusters are chained with FOLLOWED_NOINPLACE and not recorded into the managed_pslots XArray. After commit 7674a42f35ea ("erofs: use struct lockref to replace handcrafted approach"), ztailpacking pclusters won't be freed with erofs_workgroup_put() anymore, which will cause the following issue: BUG erofs_pcluster-1 (Tainted: G OE ): Objects remaining in erofs_pcluster-1 on __kmem_cache_shutdown() Use z_erofs_free_pcluster() directly to free ztailpacking pclusters. Fixes: 7674a42f35ea ("erofs: use struct lockref to replace handcrafted approach") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230822110530.96831-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23erofs: don't warn dedupe and fragments features anymoresunshijie
The `dedupe` and `fragments` features have been merged for a year. They are mostly stable now. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: sunshijie <sunshijie@xiaomi.com> Link: https://lore.kernel.org/r/20230821041737.2673401-1-sunshijie@xiaomi.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23erofs: adapt folios for z_erofs_read_folio()Gao Xiang
It's a straight-forward conversion and no logic changes (except that it renames the corresponding tracepoint.) Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817083942.103303-1-hsiangkao@linux.alibaba.com
2023-08-23erofs: adapt folios for z_erofs_readahead()Gao Xiang
It's a straight-forward conversion except that readahead_folio() will do folio_put() in advance but it doesn't matter since folios are still locked. As before, since file-backed folios (pages for now) are locked, so we could temporarily use folio->private as an internal counter to indicate split parts of each folio for the corresponding pclusters to decompress. When such counter becomes zero, the folio will be finally unlocked (see compress.h and z_erofs_onlinepage_endio()). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-7-hsiangkao@linux.alibaba.com
2023-08-23erofs: get rid of fe->backmost for cache decompressionGao Xiang
EROFS_MAP_FULL_MAPPED is more accurate to decide if caching the last incomplete pcluster for later read or not. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-6-hsiangkao@linux.alibaba.com
2023-08-23erofs: drop z_erofs_page_mark_eio()Gao Xiang
It can be folded into z_erofs_onlinepage_endio() to simplify the code. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-5-hsiangkao@linux.alibaba.com
2023-08-23erofs: tidy up z_erofs_do_read_page()Gao Xiang
- Fix a typo: spiltted => split; - Move !EROFS_MAP_MAPPED and EROFS_MAP_FRAGMENT upwards; - Increase `split` in advance to avoid unnecessary repeats. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-4-hsiangkao@linux.alibaba.com
2023-08-23erofs: move preparation logic into z_erofs_pcluster_begin()Gao Xiang
Some preparation logic should be part of z_erofs_pcluster_begin() instead of z_erofs_do_read_page(). Let's move now. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-3-hsiangkao@linux.alibaba.com
2023-08-23erofs: avoid obsolete {collector,collection} termsGao Xiang
{collector,collection} were once reserved in order to indicate different runtime logical extent instance of multi-reference pclusters. However, de-duplicated decompression has been landed in a more flexable way, thus `struct z_erofs_collection` was formally removed in commit 87ca34a7065d ("erofs: get rid of `struct z_erofs_collection'"). Let's handle the remaining leftovers, for example: `z_erofs_collector_begin` => `z_erofs_pcluster_begin` `z_erofs_collector_end` => `z_erofs_pcluster_end` as well as some comments. No logic changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-2-hsiangkao@linux.alibaba.com
2023-08-23erofs: simplify z_erofs_read_fragment()Gao Xiang
A trivial cleanup to make the fragment handling logic more clear. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-1-hsiangkao@linux.alibaba.com
2023-08-23erofs: remove redundant erofs_fs_type declaration in super.cFerry Meng
As erofs_fs_type has been declared in internal.h, there is no use to declare repeatedly in super.c. Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> eviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230815094849.53249-3-mengferry@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23erofs: add necessary kmem_cache_create flags for erofs inode cacheFerry Meng
To improve memory access efficiency and enable statistics functionality, add SLAB_MEM_SPREAD and SLAB_ACCOUNT flag during erofs_inode_cachep's allocation time. Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230815094849.53249-2-mengferry@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23erofs: clean up redundant comment and adjust code alignmentFerry Meng
Remove some redundant comments in erofs/super.c, and avoid unncessary line breaks for cleanup. Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230815094849.53249-1-mengferry@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23erofs: refine warning messages for zdata I/OsFerry Meng
Don't warn users since -EINTR can be returned due to user interruption. Also suppress warning messages of readmore. Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230809060637.21311-1-mengferry@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23Merge tag 'vfs-6.6-merge-3' of ↵Christian Brauner
ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs online fsck update from Darrick Wong: New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. * Use kernel-initated fsfreeze to fix some longstanding false negatives in online fsck of the free space and inode counters. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Message-Id: <20230822182604.GB11286@frogsfrogsfrogs> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-23Merge tag 'vfs-6.6-merge-2' of ↵Christian Brauner
ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux Pull filesystem freezing updates from Darrick Wong: New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Message-Id: <20230822182604.GB11286@frogsfrogsfrogs> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-22Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: - fix a use after free in nfs_direct_join_group() (Cc: stable) - fix sysfs server name memory leak - fix lock recovery hang in NFSv4.0 - fix page free in the error path for nfs42_proc_getxattr() and __nfs4_get_acl_uncached() - SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect * tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: xprtrdma: Remap Receive buffers after a reconnect NFSv4: fix out path in __nfs4_get_acl_uncached NFSv4.2: fix error handling in nfs42_proc_getxattr NFS: Fix sysfs server name memory leak NFS: Fix a use after free in nfs_direct_join_group() NFSv4: Fix dropped lock for racing OPEN and delegation return
2023-08-22affs: rename local toupper() to fn() to avoid confusionAndy Shevchenko
A compiler may see the collision with the toupper() defined in ctype.h: fs/affs/namei.c:159:19: warning: unused variable 'toupper' [-Wunused-variable] 159 | toupper_t toupper = affs_get_toupper(sb); To prevent this from happening, rename toupper local variable to fn. Initially this had been introduced by 24579a881513 ("v2.4.3.5 -> v2.4.3.6") in the history.git by history group. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-22affs: remove writepage implementationMatthew Wilcox (Oracle)
If the filesystem implements migrate_folio and writepages, there is no need for a writepage implementation. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-22btrfs: zoned: skip splitting and logical rewriting on pre-alloc writeNaohiro Aota
When doing a relocation, there is a chance that at the time of btrfs_reloc_clone_csums(), there is no checksum for the corresponding region. In this case, btrfs_finish_ordered_zoned()'s sum points to an invalid item and so ordered_extent's logical is set to some invalid value. Then, btrfs_lookup_block_group() in btrfs_zone_finish_endio() failed to find a block group and will hit an assert or a null pointer dereference as following. This can be reprodcued by running btrfs/028 several times (e.g, 4 to 16 times) with a null_blk setup. The device's zone size and capacity is set to 32 MB and the storage size is set to 5 GB on my setup. KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f] CPU: 6 PID: 3105720 Comm: kworker/u16:13 Tainted: G W 6.5.0-rc6-kts+ #1 Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs] Code: 41 54 49 89 fc 55 48 89 f5 53 e8 57 7d fc ff 48 8d b8 88 00 00 00 48 89 c3 48 b8 00 00 00 00 00 > 3c 02 00 0f 85 02 01 00 00 f6 83 88 00 00 00 01 0f 84 a8 00 00 RSP: 0018:ffff88833cf87b08 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088 RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102877b827 R10: ffff888143bdc13b R11: ffff888125b1cbc0 R12: ffff888143bdc000 R13: 0000000000007000 R14: ffff888125b1cba8 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88881e500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3ed85223d5 CR3: 00000001519b4005 CR4: 00000000001706e0 Call Trace: <TASK> ? die_addr+0x3c/0xa0 ? exc_general_protection+0x148/0x220 ? asm_exc_general_protection+0x22/0x30 ? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs] ? btrfs_zone_finish_endio.part.0+0x19/0x160 [btrfs] btrfs_finish_one_ordered+0x7b8/0x1de0 [btrfs] ? rcu_is_watching+0x11/0xb0 ? lock_release+0x47a/0x620 ? btrfs_finish_ordered_zoned+0x59b/0x800 [btrfs] ? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs] ? btrfs_finish_ordered_zoned+0x358/0x800 [btrfs] ? __smp_call_single_queue+0x124/0x350 ? rcu_is_watching+0x11/0xb0 btrfs_work_helper+0x19f/0xc60 [btrfs] ? __pfx_try_to_wake_up+0x10/0x10 ? _raw_spin_unlock_irq+0x24/0x50 ? rcu_is_watching+0x11/0xb0 process_one_work+0x8c1/0x1430 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_process_one_work+0x10/0x10 ? __pfx_do_raw_spin_lock+0x10/0x10 ? _raw_spin_lock_irq+0x52/0x60 worker_thread+0x100/0x12c0 ? __kthread_parkme+0xc1/0x1f0 ? __pfx_worker_thread+0x10/0x10 kthread+0x2ea/0x3c0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> On the zoned mode, writing to pre-allocated region means data relocation write. Such write always uses WRITE command so there is no need of splitting and rewriting logical address. Thus, we can just skip the function for the case. Fixes: cbfce4c7fbde ("btrfs: optimize the logical to physical mapping for zoned writes") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-22super: use higher-level helper for {freeze,thaw}Christian Brauner
It's not necessary to use low-level locking helpers here. Use the higher-level locking helpers and log if the superblock is dying. Since the caller is assumed to already hold an active reference it isn't possible to observe a dying superblock. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-22tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrsHugh Dickins
It is particularly important for the userns mount case (when a sensible nr_inodes maximum may not be enforced) that tmpfs user xattrs be subject to memory cgroup limiting. Leave temporary buffer allocations as is, but change the persistent simple xattr allocations from GFP_KERNEL to GFP_KERNEL_ACCOUNT. This limits kernfs's cgroupfs too, but that's good. (I had intended to send this change earlier, but had been confused by shmem_alloc_inode() using GFP_KERNEL, and thought a discussion would be needed to change that too: no, I was forgetting the SLAB_ACCOUNT on that kmem_cache, which implicitly adds __GFP_ACCOUNT to all its allocations.) Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <f6953e5a-4183-8314-38f2-40be60998615@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()Ryusuke Konishi
A syzbot stress test reported that create_empty_buffers() called from nilfs_lookup_dirty_data_buffers() can cause a general protection fault. Analysis using its reproducer revealed that the back reference "mapping" from a page/folio has been changed to NULL after dirty page/folio gang lookup in nilfs_lookup_dirty_data_buffers(). Fix this issue by excluding pages/folios from being collected if, after acquiring a lock on each page/folio, its back reference "mapping" differs from the pointer to the address space struct that held the page/folio. Link: https://lkml.kernel.org/r/20230805132038.6435-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0ad741797f4565e7e2d2@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/0000000000002930a705fc32b231@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: enable page walking API to lock vmas during the walkSuren Baghdasaryan
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()David Hildenbrand
We shouldn't be using a GUP-internal helper if it can be avoided. Similar to smaps_pte_entry() that uses vm_normal_page(), let's use vm_normal_page_pmd() that similarly refuses to return the huge zeropage. In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd(): (1) Will always return the head page, not a tail page of a THP. If we'd ever call smaps_account with a tail page while setting "compound = true", we could be in trouble, because smaps_account() would look at the memmap of unrelated pages. If we're unlucky, that memmap does not exist at all. Before we removed PG_doublemap, we could have triggered something similar as in commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for migration entry"). This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock"): (a) We're in show_smaps_rollup() and processed a VMA (b) We release the mmap lock in show_smaps_rollup() because it is contended (c) We merged that VMA with another VMA (d) We collapsed a THP in that merged VMA at that position If the end address of the original VMA falls into the middle of a THP area, we would call smap_gather_stats() with a start address that falls into a PMD-mapped THP. It's probably very rare to trigger when not really forced. (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page() Treat such PMDs here just like smaps_pte_entry() would treat such PTEs. If such pages would be anonymous, we most certainly would want to account them. (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap() As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE pages. So just like smaps_pte_entry(), we'll now also ignore such PMD entries. Especially, follow_pmd_mask() never ends up calling follow_trans_huge_pmd() on pmd_devmap(). Instead it calls follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN is set. So skipping pmd_devmap() pages seems to be the right thing to do. (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page() We won't be returning a memmap that should be ignored by core-mm, or worse, a memmap that does not even exist. Note that while walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't. Most probably this case doesn't currently really happen on the PMD level, otherwise we'd already be able to trigger kernel crashes when reading smaps / smaps_rollup. So most probably only (1) is relevant in practice as of now, but could only cause trouble in extreme corner cases. Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future reuse in wrong context. Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: liubo <liubo254@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21super: wait until we passed kill superChristian Brauner
Recent rework moved block device closing out of sb->put_super() and into sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and blkdev_put() can end up taking s_umount again. That means we need to move the removal of the superblock from @fs_supers out of generic_shutdown_super() and into deactivate_locked_super() to ensure that concurrent mounters don't fail to open block devices that are still in use because blkdev_put() in sb->kill_sb() hasn't been called yet. We can now do this as we can make iterators through @fs_super and @super_blocks wait without holding s_umount. Concurrent mounts will wait until a dying superblock is fully dead so until sb->kill_sb() has been called and SB_DEAD been set. Concurrent iterators can already discard any SB_DYING superblock. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-4-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21super: wait for nascent superblocksChristian Brauner
Recent patches experiment with making it possible to allocate a new superblock before opening the relevant block device. Naturally this has intricate side-effects that we get to learn about while developing this. Superblock allocators such as sget{_fc}() return with s_umount of the new superblock held and lock ordering currently requires that block level locks such as bdev_lock and open_mutex rank above s_umount. Before aca740cecbe5 ("fs: open block device after superblock creation") ordering was guaranteed to be correct as block devices were opened prior to superblock allocation and thus s_umount wasn't held. But now s_umount must be dropped before opening block devices to avoid locking violations. This has consequences. The main one being that iterators over @super_blocks and @fs_supers that grab a temporary reference to the superblock can now also grab s_umount before the caller has managed to open block devices and called fill_super(). So whereas before such iterators or concurrent mounts would have simply slept on s_umount until SB_BORN was set or the superblock was discard due to initalization failure they can now needlessly spin through sget{_fc}(). If the caller is sleeping on bdev_lock or open_mutex one caller waiting on SB_BORN will always spin somewhere and potentially this can go on for quite a while. It should be possible to drop s_umount while allowing iterators to wait on a nascent superblock to either be born or discarded. This patch implements a wait_var_event() mechanism allowing iterators to sleep until they are woken when the superblock is born or discarded. This also allows us to avoid relooping through @fs_supers and @super_blocks if a superblock isn't yet born or dying. Link: aca740cecbe5 ("fs: open block device after superblock creation") Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-3-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21cachefiles: use kiocb_{start,end}_write() helpersAmir Goldstein
Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-8-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21ovl: use kiocb_{start,end}_write() helpersAmir Goldstein
Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-7-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21aio: use kiocb_{start,end}_write() helpersAmir Goldstein
Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-6-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>