| Age | Commit message | Author |
|
The CONFIG_FILE_LOCKING=n stub for break_lease() takes a 'bool wait'
argument, whereas the CONFIG_FILE_LOCKING=y version and every caller pass
an openmode as an 'unsigned int mode'. The mismatch was introduced when
__break_lease() was reworked to use flags: only the stub was switched to
'bool wait', a stray leftover from the neighbouring break_layout()
helper. The real prototype kept 'unsigned int mode'.
This was harmless until O_WRONLY changed from the octal literal 00000001
to (1 << 0). clang's -Wtautological-constant-compare then fires on the
implicit shift-to-bool conversion at the first FILE_LOCKING=n caller:
fs/open.c:112:29: warning: converting the result of '<<' to a boolean
always evaluates to true [-Wtautological-constant-compare]
112 | error = break_lease(inode, O_WRONLY);
Restore the stub's parameter to 'unsigned int mode' so it matches the
real prototype and every caller. The stub still just returns 0, so there
is no functional change; it removes the type inconsistency and silences
the warning.
Root cause diagnosed by Nathan Chancellor.
Fixes: 4be9f3cc582a ("filelock: rework the __break_lease API to use flags")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606071029.DKCs8WOs-lkp@intel.com/
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
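A minimal user-space sketch of the type mismatch described in the break_lease() message above. The names demo_stub_bool(), demo_stub_mode(), demo_stubs_disagree() and the DEMO_O_* values are hypothetical stand-ins, not the kernel's definitions. It only shows what a 'bool' parameter does to a mode value: any nonzero mode collapses to 1, whereas an 'unsigned int' parameter receives the caller's value unchanged.

```c
#include <stdbool.h>

/* Hypothetical flag values in the (1 << n) style mentioned in the message. */
#define DEMO_O_WRONLY (1u << 0)
#define DEMO_O_RDWR   (1u << 1)

/* Stub with the mismatched 'bool wait' parameter: nonzero becomes 1. */
static inline unsigned int demo_stub_bool(bool wait)
{
	return wait;
}

/* Stub with 'unsigned int mode': the caller's value arrives unchanged. */
static inline unsigned int demo_stub_mode(unsigned int mode)
{
	return mode;
}

/* Returns 1 when the two stubs see different values for the same mode. */
static inline int demo_stubs_disagree(unsigned int mode)
{
	return demo_stub_bool(mode) != demo_stub_mode(mode);
}
```

Since the real stub ignores its argument and returns 0, the collapse is invisible at run time, which is consistent with the message calling the fix a type cleanup with no functional change.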
|
|
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
After commit 0652a3daa787 ("tracing: Fix CFI violation in probestub
being called by tprobes"), there are many build errors when building
ARCH=arm multi_v7_defconfig + CONFIG_CFI=y like:
In file included from drivers/base/devres.c:17:
In file included from drivers/base/trace.h:16:
In file included from include/linux/tracepoint.h:23:
include/linux/cfi.h:44:6: error: call to undeclared function 'get_kernel_nofault'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
44 | if (get_kernel_nofault(hash, func - cfi_get_offset()))
| ^
1 error generated.
get_kernel_nofault() is called in the generic version of
cfi_get_func_hash() but nothing ensures uaccess.h is always included for
a proper expansion and prototype. Include uaccess.h in cfi.h to clear
up the errors.
Cc: stable@vger.kernel.org
Fixes: 0652a3daa787 ("tracing: Fix CFI violation in probestub being called by tprobes")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add support for extended attributes on bpffs inodes so that user space
and BPF LSM programs can attach metadata (for example, a content hash
or a security label) to a pinned object or directory. BPF LSM or user
space tooling can then uniformly look at this (e.g. security.bpf.*) in
a similar way to other filesystems. The store is in-memory and non-persistent:
it lives only for the lifetime of the mount, like everything else in
bpffs. The modelling is similar to tmpfs.
bpffs serves the trusted.* and security.* namespaces; user.* is left
unsupported. As bpffs is FS_USERNS_MOUNT, security.* is reachable by
the unprivileged mounter in a user namespace, and thus we are using
the simple_xattr_set_limited infra there (trusted.* needs global
CAP_SYS_ADMIN).
bpf_fill_super() is open-coded instead of using simple_fill_super(),
because the root inode must now be allocated through bpf_fs_alloc_inode(),
i.e. carry the bpf_fs_inode wrapper and come from the right cache,
which requires s_op (and s_xattr) to be installed before the first
inode is created. While at it, also harden s_iflags with SB_I_NOEXEC
and SB_I_NODEV.
bpf_fs_listxattr() is only reachable through the filesystem via
i_op->listxattr, so the BPF token inode is left untouched. Name-based
fsetxattr()/fgetxattr() on a token fd still work since the get/set
handlers are installed at the superblock.
For security.* namespace, we use simple_xattr_set_limited() but
there was no simple_xattr_add_limited() API yet which was needed
in bpf_fs_initxattrs() to avoid underflows in the accounting. The
symlink target is freed in bpf_free_inode() rather than in
bpf_destroy_inode() so that it is released only after an RCU grace
period, as an RCU path walk following the symlink may still
dereference inode->i_link in security_inode_follow_link(). Lastly,
the symlink target allocated in bpf_symlink() is switched to
GFP_KERNEL_ACCOUNT, so the string is charged to the caller's memcg.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260602074012.416289-1-daniel@iogearbox.net
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Move the hash table to the super block to remove excessive overhead in the
case of a small number of xattrs per inode.
Add linked list to the inode, used for listxattr and eviction. Listxattr
uses rcu protection to iterate the list of xattrs.
Before being made per-sb, lazy allocation was protected by inode lock. Now
inode lock no longer provides sufficient exclusion, so use cmpxchg() to
ensure atomicity.
Though I haven't found a description of this pattern, after some research
it seems that cmpxchg_release() and READ_ONCE() should provide the
necessary memory barriers.
Use simple_xattr_free_rcu() in simple_xattrs_free(). This is needed because
the hash table is now shared between inodes and lookup on a different inode
might be running the compare function on the just freed element within the
RCU grace period.
Following stats are based on slabinfo diff, after creating 100k empty
files, then adding a "user.test=foo" xattr to each:
v7.0 (no rhashtable):
  File creation:   993.40 bytes/file
  Xattr addition:   79.99 bytes/file
v7.1-rc2 (per-inode rhashtable):
  File creation:   939.73 bytes/file
  Xattr addition: 1296.08 bytes/file
v7.1-rc2 + this patch (per-sb rhashtable):
  File creation:   946.84 bytes/file
  Xattr addition:  111.86 bytes/file
The overhead of a single xattr is reduced to nearly v7.0 levels. The per
xattr overhead is slightly larger due to the addition of three pointers to
struct simple_xattr.
Fixes: b32c4a213698 ("xattr: add rhashtable-based simple_xattr infrastructure")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260605135322.2632068-5-mszeredi@redhat.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
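The lazy-allocation pattern mentioned in the per-sb rhashtable message above can be sketched in user space with C11 atomics. This is an analogue under stated assumptions, not the kernel code: struct demo_xattrs and demo_xattrs_get() are hypothetical, and an acquire load plus an acq_rel compare-exchange stand in for the READ_ONCE() and cmpxchg_release() pairing the message describes.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the lazily allocated per-inode structure. */
struct demo_xattrs {
	unsigned int nr;
};

/*
 * Return the structure stored in *slot, allocating it on first use.
 * Several threads may race here; exactly one allocation is published
 * and the losers free theirs and adopt the winner's pointer.
 */
static inline struct demo_xattrs *
demo_xattrs_get(_Atomic(struct demo_xattrs *) *slot)
{
	struct demo_xattrs *cur, *new;

	cur = atomic_load_explicit(slot, memory_order_acquire);
	if (cur)
		return cur;

	new = calloc(1, sizeof(*new));
	if (!new)
		return NULL;

	/* cur is NULL here and doubles as the expected old value. */
	if (atomic_compare_exchange_strong_explicit(slot, &cur, new,
						    memory_order_acq_rel,
						    memory_order_acquire))
		return new;

	/* Lost the race: cur now holds the pointer installed by the winner. */
	free(new);
	return cur;
}
```

The release half of the exchange orders the initialisation of the new object before its publication, which is the property the message is asking about.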
|
|
Change the simple_xattr API to accept pointer-to-pointer (struct
simple_xattrs **) instead of pointer. This allows the functions to handle
lazy allocation internally without requiring callers to use
simple_xattrs_lazy_alloc().
The simple_xattr_set(), simple_xattr_set_limited() and simple_xattr_add()
functions now handle allocation when xattrs is NULL. simple_xattrs_free()
now also frees the xattrs structure itself and sets the pointer to NULL.
This simplifies callers and removes the need for most callers to explicitly
manage xattrs allocation and lifetime.
In shmem_initxattrs(), the total required space for all initial xattrs
(ispace) is pre-calculated and deducted from sbinfo->free_ispace.
Since this patch modifies the function to add new xattrs directly to the
inode's &info->xattrs list rather than using a local temporary variable, a
failure means that the partially populated info->xattrs list remains
attached to the inode.
When the VFS caller handles the -ENOMEM error, it drops the newly created
inode via iput(), and shmem_free_inode() then adds freed to
sbinfo->free_ispace a second time, permanently inflating the tmpfs free
space quota.
Fix by subtracting already added xattrs from ispace.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260605135322.2632068-4-mszeredi@redhat.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
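The double-accounting problem in the shmem_initxattrs() message above can be illustrated with a toy model. Everything here (struct demo_sb, struct demo_inode, demo_initxattrs(), demo_evict() and the fail_at parameter that simulates -ENOMEM) is hypothetical. The sketch assumes the fix works as the message states: space for xattrs already attached to the inode is subtracted from ispace, so the error path returns only the remainder and eviction returns the rest.

```c
#include <errno.h>
#include <stddef.h>

struct demo_sb {
	size_t free_ispace;	/* space still available on the superblock */
};

struct demo_inode {
	size_t xattr_space;	/* space held by xattrs attached so far */
};

/* Eviction gives back whatever is attached to the inode. */
static inline void demo_evict(struct demo_sb *sb, struct demo_inode *inode)
{
	sb->free_ispace += inode->xattr_space;
	inode->xattr_space = 0;
}

/* Attach n xattrs; fail_at simulates an allocation failure (use n for none). */
static inline int demo_initxattrs(struct demo_sb *sb, struct demo_inode *inode,
				  const size_t *sizes, size_t n, size_t fail_at)
{
	size_t ispace = 0;
	size_t i;

	for (i = 0; i < n; i++)
		ispace += sizes[i];
	if (ispace > sb->free_ispace)
		return -ENOSPC;
	sb->free_ispace -= ispace;	/* reserve everything up front */

	for (i = 0; i < n; i++) {
		if (i == fail_at) {
			/* Return only what was not attached yet. */
			sb->free_ispace += ispace;
			return -ENOMEM;
		}
		inode->xattr_space += sizes[i];
		ispace -= sizes[i];	/* now owned by the inode */
	}
	return 0;
}
```

After a failed demo_initxattrs() followed by demo_evict(), free_ispace is back at its starting value; without the "ispace -= sizes[i]" step the attached portion would be credited twice.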
|
|
Multiple superblocks with different namespaces can share the same
kernfs_node when kernfs_test_super() finds a matching root but
different namespace. This means multiple inodes from different
superblocks can reference the same kernfs_node->iattr->xattrs
structure.
The VFS layer only holds per-inode locks during xattr operations,
which is insufficient to serialize concurrent xattr modifications on
the shared kernfs_node. This can lead to race conditions in
simple_xattr_set() where the lookup->replace/remove sequence is not
atomic with respect to operations from other superblocks.
Fix this by protecting xattr operations with the existing hashed
kernfs_locks->open_file_mutex[] array, which is already used to
protect per-node open file data. The hashed mutex array provides
scalable per-node serialization (scaled by CPU count, up to 1024 locks
on 32+ CPU systems) with zero memory overhead.
Changes:
- Rename open_file_mutex[] to node_mutex[] to reflect dual purpose
- Add kernfs_node_lock_ptr() and kernfs_node_lock() helpers
- Protect simple_xattr_set() calls in kernfs_xattr_set() and
kernfs_vfs_user_xattr_set() with the hashed mutex
- Update file.c to use new helpers via compatibility wrappers
- Update documentation to explain the extended lock usage
Fixes: b32c4a213698 ("xattr: add rhashtable-based simple_xattr infrastructure")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260601162454.2116375-1-mszeredi%40redhat.com
Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260605135322.2632068-2-mszeredi@redhat.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
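A rough user-space sketch of the hashed-mutex idea from the kernfs message above, using POSIX mutexes. The names demo_node_mutex[], demo_node_lock_ptr() and demo_node_lock() only mirror the helper names listed in the message; the fixed array size and the pointer-mixing step are assumptions (per the message, the kernel scales the array by CPU count).

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_NODE_LOCKS 64	/* hypothetical fixed size */

static pthread_mutex_t demo_node_mutex[DEMO_NR_NODE_LOCKS];

static inline void demo_node_locks_init(void)
{
	for (size_t i = 0; i < DEMO_NR_NODE_LOCKS; i++)
		pthread_mutex_init(&demo_node_mutex[i], NULL);
}

/* Map a node address to one of the shared mutexes. */
static inline pthread_mutex_t *demo_node_lock_ptr(const void *node)
{
	uintptr_t v = (uintptr_t)node;

	v ^= v >> 16;	/* simple mixing step, not the kernel's hash */
	return &demo_node_mutex[(v >> 4) % DEMO_NR_NODE_LOCKS];
}

/* Lock the mutex covering this node and return it for the later unlock. */
static inline pthread_mutex_t *demo_node_lock(const void *node)
{
	pthread_mutex_t *lock = demo_node_lock_ptr(node);

	pthread_mutex_lock(lock);
	return lock;
}
```

Because the lock is chosen from the node address alone, two superblocks that share a node pick the same mutex, and the nodes themselves need no extra field. Unrelated nodes may occasionally share a lock, which costs contention but not correctness.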
|
|
BPF_PROG_LOAD verifies the loader signature but does not record the
outcome on the BPF program. [BPF] LSMs and audit can read attr->signature
and attr->keyring_id to infer "was this signed, and if so, against which
keyring".
Add prog->aux->sig (verdict + keyring_{type,serial}), populated by
bpf_prog_load before the LSM hook. keyring_type classifies the keyring
the load referenced (builtin, secondary, platform or user), while
keyring_serial records the serial of the keyring the signature was
actually validated against. System keyrings carry a pseudo key pointer
with no user-visible serial and are reported as 0, as are unsigned loads.
Failed verifications reject the load before the hook runs, so it observes
only either UNSIGNED or VERIFIED.
Signed-off-by: KP Singh <kpsingh@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260605213518.544262-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
We want to use atomic operations for lockless p->flags changes
and reads.
Add definitions for bits in addition to masks so that we can use
test_bit(), clear_bit() and set_bit() in subsequent patches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260604141343.2124500-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
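To illustrate why bit numbers are needed alongside masks, here is a small C11 sketch. The DEMO_* names and the demo_set_bit(), demo_clear_bit() and demo_test_bit() helpers are hypothetical user-space stand-ins for the kernel's set_bit(), clear_bit() and test_bit(), which take a bit index rather than a mask.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers first; the masks are derived from them. */
enum {
	DEMO_LEARNING_BIT,
	DEMO_FLOOD_BIT,
};
#define DEMO_LEARNING (1UL << DEMO_LEARNING_BIT)
#define DEMO_FLOOD    (1UL << DEMO_FLOOD_BIT)

/* Atomically set bit nr in *addr. */
static inline void demo_set_bit(unsigned int nr, atomic_ulong *addr)
{
	atomic_fetch_or(addr, 1UL << nr);
}

/* Atomically clear bit nr in *addr. */
static inline void demo_clear_bit(unsigned int nr, atomic_ulong *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

/* Read bit nr without taking any lock. */
static inline bool demo_test_bit(unsigned int nr, atomic_ulong *addr)
{
	return (atomic_load(addr) >> nr) & 1UL;
}
```

Existing code that tests or ORs masks keeps working, since each mask is defined from its bit number.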
|
|
The four callback functions in blk_holder_ops all release the
bd_holder_lock. Annotate these functions accordingly.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/be51cf81110f691ebd5868ac2f15ceb847805bc8.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Let the thread-safety checker verify whether every start of a queue
limits update is followed by a call to a function that finishes a queue
limits update.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/8f71062b6d0fcf2b80bc8cda701c453224755439.1780682325.git.bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
apply_range_set_cb() maps the pages for a new arena allocation and returns
-EBUSY when the target PTE was already populated. Kernel-fault recovery
leaves the per-arena scratch page in unallocated arena PTEs, so a later
bpf_arena_alloc_pages() over such a page hits that -EBUSY, and every
subsequent allocation of it fails the same way. Allocation must install the
real page over scratch instead.
Overwriting the scratch PTE in place is a valid->valid change, which arm64
forbids without break-before-make. Route through an invalid entry instead:
ptep_try_set() fills only a none slot, so the PTE goes scratch->none->page.
On finding scratch, clear it and flush_tlb_before_set() before retrying. The
new flush_tlb_before_set() is a no-op except on arches like arm64 that need
the break-before-make TLB invalidate. The loop also copes with a concurrent
fault re-scratching the slot.
Arches without ptep_try_set() never install the scratch page, so keep the
must-be-empty check and set_pte_at() for them.
Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page")
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260601183728.1800490-1-tj@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
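The scratch->none->page sequence from the arena message above can be modelled with a single atomic word standing in for a PTE. This is only a sketch: the DEMO_PTE_* values, demo_try_set(), demo_flush_before_set() and demo_install_page() are hypothetical, and the flush is an empty placeholder for the TLB invalidate that the message says only some architectures need.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_PTE_NONE    0UL	/* empty slot */
#define DEMO_PTE_SCRATCH 1UL	/* placeholder left by fault recovery */

/* Fill the slot only if it is currently empty. */
static inline bool demo_try_set(atomic_ulong *pte, unsigned long val)
{
	unsigned long expected = DEMO_PTE_NONE;

	return atomic_compare_exchange_strong(pte, &expected, val);
}

/* Placeholder for the break-before-make TLB invalidate. */
static inline void demo_flush_before_set(void)
{
}

/* Install page, replacing a scratch entry via an intermediate empty state. */
static inline int demo_install_page(atomic_ulong *pte, unsigned long page)
{
	for (;;) {
		unsigned long cur;

		if (demo_try_set(pte, page))
			return 0;

		cur = atomic_load(pte);
		if (cur == DEMO_PTE_NONE)
			continue;	/* cleared concurrently: retry */
		if (cur != DEMO_PTE_SCRATCH)
			return -EBUSY;	/* a real page is already mapped */

		/* scratch -> none, then loop back to none -> page */
		atomic_compare_exchange_strong(pte, &cur, DEMO_PTE_NONE);
		demo_flush_before_set();
	}
}
```

The loop never writes a valid entry over another valid entry, and it also tolerates a concurrent fault putting the scratch value back between the clear and the retry.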
|
|
Use rhashtable_lookup_likely() for lookups, rhashtable_remove_fast()
for deletes, and rhashtable_lookup_get_insert_fast() for inserts.
Updates modify values in place under RCU rather than allocating a
new element and swapping the pointer (as regular htab does). This
trades read consistency for performance: concurrent readers may
see partial updates. BPF_F_LOCK support and special-field
handling (timers, kptrs, etc.) follow in a later commit.
Initialize rhashtable with bpf_mem_alloc element cache. Require
BPF_F_NO_PREALLOC. Limit max_entries to 2^31. Free elements via
rhashtable_free_and_destroy().
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260605-rhash-v7-4-5b8e05f8630d@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Use irq work for automatic shrinking so that this may be called in NMI
context.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260605-rhash-v7-3-5b8e05f8630d@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.
void *rhashtable_next_key(struct rhashtable *ht,
const void *prev_key);
Caller holds RCU; passes NULL prev_key for the first element or
the previously returned key to advance. Walks tbl->future_tbl
chain so in-flight rehashes are observed.
Best-effort: in case of concurrent resize, provides no guarantees:
- may produce duplicate elements
- may skip any amount of elements
- termination of the loop is not guaranteed in case of
sustained rehash. Callers are advised to bound loop externally
or avoid inserting new elements during such loop.
Returns ERR_PTR(-ENOENT) if prev_key is not found.
Behavior on tables with duplicate keys is undefined.
rhltable is not supported: returns ERR_PTR(-EOPNOTSUPP).
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20260605-rhash-v7-1-5b8e05f8630d@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
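The "continue from the previous key" idea in the rhashtable_next_key() message above can be shown on a toy chained hash table. This is not the rhashtable implementation: struct demo_table, demo_insert() and demo_next_key() are hypothetical, there is no resizing or RCU, and a -1 return plays the role of ERR_PTR(-ENOENT).

```c
#include <stddef.h>

#define DEMO_NBUCKETS 4

struct demo_elem {
	int key;
	struct demo_elem *next;
};

struct demo_table {
	struct demo_elem *buckets[DEMO_NBUCKETS];
};

static inline unsigned int demo_hash(int key)
{
	return (unsigned int)key % DEMO_NBUCKETS;
}

/* Insert at the head of the element's bucket chain. */
static inline void demo_insert(struct demo_table *t, struct demo_elem *e)
{
	unsigned int b = demo_hash(e->key);

	e->next = t->buckets[b];
	t->buckets[b] = e;
}

/* First element found in bucket 'start' or any later bucket. */
static inline struct demo_elem *demo_first_from(const struct demo_table *t,
						unsigned int start)
{
	for (unsigned int b = start; b < DEMO_NBUCKETS; b++)
		if (t->buckets[b])
			return t->buckets[b];
	return NULL;
}

/*
 * Store the element following *prev_key in *out (NULL at the end).
 * A NULL prev_key starts the walk. Returns -1 if prev_key is not found.
 */
static inline int demo_next_key(const struct demo_table *t,
				const int *prev_key, struct demo_elem **out)
{
	unsigned int b;

	if (!prev_key) {
		*out = demo_first_from(t, 0);
		return 0;
	}

	b = demo_hash(*prev_key);
	for (struct demo_elem *e = t->buckets[b]; e; e = e->next) {
		if (e->key != *prev_key)
			continue;
		*out = e->next ? e->next : demo_first_from(t, b + 1);
		return 0;
	}
	return -1;
}
```

The iterator state is just the last key, so the caller can stop and resume at will; the price is a fresh lookup on every step and, in the real table, the weak guarantees under concurrent resize that the message lists.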
|
|
The guard constructors were annotated with an empty __nonnull_args(),
relying on __nonnull__() marking every pointer parameter as non-NULL.
Sparse cannot parse the empty argument list.
Both constructors take the lock pointer as their first parameter, so
specify the index explicitly: __nonnull_args(1).
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/all/aiJi0WcYE8FZt-jO@stanley.mountain/
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/aiKpH3cLBEj3TF2Q@shell.ilvokhin.com
|
|
This is just a wrapper around iomap_file_buffered_write() to create the
necessary iterator over metadata.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-10-aalbersh@kernel.org
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Obtain fsverity info for folios with file data and fsverity metadata.
The filesystem can pass vi down to the ioend and then to fsverity for
verification. This differs from the other filesystems supporting fsverity
(ext4, f2fs, btrfs), which don't need fsverity_info for reading fsverity
metadata. When reading the merkle tree, iomap requires fsverity info to
synthesize hashes for zeroed data blocks.
fsverity metadata has two kinds of holes - ones in merkle tree and one
after fsverity descriptor.
Merkle tree holes are blocks full of hashes of zeroed data blocks. These
are not stored on the disk but synthesized on the fly. This saves a bit
of space for sparse files. Because of this, iomap also needs to look up
fsverity_info for folios with fsverity metadata. ->vi has a hash of the
zeroed data block which will be used to fill the merkle tree block.
The hole past descriptor is interpreted as end of metadata region. As we
don't have EOF here we use this hole as an indication that rest of the
folio is empty. This patch marks rest of the folio beyond fsverity
descriptor as uptodate.
For file data, fsverity needs to verify consistency of the whole file
against the root hash, hashes of holes are included in the merkle tree.
Verify them too.
Issue reading of fsverity merkle tree on the fsverity inodes. This way
metadata will be available at I/O completion time.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-9-aalbersh@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
This flag indicates that I/O is for fsverity metadata.
In the write path skip i_size check and i_size updates as metadata is
past EOF. In writeback don't update i_size and continue writeback even
if the folio is beyond EOF. In read path don't zero fsverity folios, again
they are past EOF.
The iomap_block_needs_zeroing() is also called from write path. For
folios of larger order we don't want to zero out pages in the folio as
these could contain other merkle tree blocks. For fsverity, filesystem
will request to read PAGE_SIZE memory regions. For data folios, iomap
will zero the rest of the folio for anything which is beyond EOF. We
don't want this for fsverity folios.
Christian Brauner <brauner@kernel.org> says:
Changed IOMAP_F_FSVERITY from (1U << 10) to (1U << 11) to avoid colliding
with IOMAP_F_ZERO_TAIL, which already uses (1U << 10).
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-8-aalbersh@kernel.org
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Compute the hash of one filesystem block's worth of zeros. A filesystem
implementation can decide to elide merkle tree blocks containing only
this hash and synthesize the contents at read time.
Let's pretend that there's a file containing 131 data blocks and whose
merkle tree looks roughly like this:
    root
     +--leaf0
     |   +--data0
     |   +--data1
     |   +--...
     |   `--data128
     `--leaf1
         +--data129
         +--data130
         `--data131
If data[0-128] are sparse holes, then leaf0 will contain a repeating
sequence of @zero_digest. Therefore, leaf0 need not be written to disk
because its contents can be synthesized.
A subsequent xfs patch will use this to reduce the size of the merkle
tree when dealing with sparse gold master disk images and the like.
Note that this works only on the first level (data holes). fsverity
doesn't store/generate zero_digest for any higher levels.
Add a helper to pre-fill folio with hashes of empty blocks. This will be
used by iomap to synthesize blocks full of zero hashes on the fly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-5-aalbersh@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
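A small sketch of the elision idea in the zero-digest message above. It does not compute any real hash: the caller supplies a zero_digest, and the hypothetical helpers demo_fill_zero_hashes() and demo_block_is_elidable() only show how a merkle tree block made up entirely of that digest can be regenerated instead of stored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Fill a merkle tree block with back-to-back copies of zero_digest. */
static inline void demo_fill_zero_hashes(unsigned char *block,
					 size_t block_size,
					 const unsigned char *zero_digest,
					 size_t digest_size)
{
	for (size_t off = 0; off + digest_size <= block_size; off += digest_size)
		memcpy(block + off, zero_digest, digest_size);
}

/* True if every digest slot in the block equals zero_digest. */
static inline bool demo_block_is_elidable(const unsigned char *block,
					  size_t block_size,
					  const unsigned char *zero_digest,
					  size_t digest_size)
{
	for (size_t off = 0; off + digest_size <= block_size; off += digest_size)
		if (memcmp(block + off, zero_digest, digest_size) != 0)
			return false;
	return true;
}
```

A filesystem could skip writing a block for which the check holds and call the fill helper at read time; the sketch assumes the block size is a multiple of the digest size.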
|
|
Make sure that all KASAN page tables are emitted into the .pgtbl section
(provided that the arch has one - otherwise, fall back to page aligned
BSS).
This is needed because BSS itself is no longer accessible via the linear
map on arm64.
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: kasan-dev@googlegroups.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In filesystems that maintain a separate Valid Data Length, such as exFAT
and NTFS, a partial write may start at or beyond the current valid_size and
extend it. In this case, the region after the previous valid_size but
within the same filesystem block is considered unwritten.
This patch introduces IOMAP_F_ZERO_TAIL. When this flag is set in iomap,
__iomap_write_begin() will zero only the tail portion while preserving any
valid data before it in the same block.
Without this tail zeroing, stale data in the unwritten portion of the block
can remain in the page cache. Subsequent reads can then return incorrect
contents from that region.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Link: https://patch.msgid.link/20260518114705.9601-2-linkinjeon@kernel.org
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
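The effect of tail zeroing described in the IOMAP_F_ZERO_TAIL message above can be sketched on a plain buffer. The helper demo_zero_block_tail() is hypothetical, and the sketch assumes the tail begins at the previous valid length within the block; the message does not spell out the exact offsets __iomap_write_begin() uses.

```c
#include <stddef.h>
#include <string.h>

/*
 * Zero everything in the block after valid_len while leaving the
 * first valid_len bytes untouched. valid_len is the offset of the
 * old valid-data boundary inside this block (an assumption here).
 */
static inline void demo_zero_block_tail(unsigned char *block,
					size_t block_size, size_t valid_len)
{
	if (valid_len >= block_size)
		return;		/* whole block is valid: nothing to zero */
	memset(block + valid_len, 0, block_size - valid_len);
}
```

Zeroing the whole block instead would destroy valid data before the boundary, and zeroing nothing would leave stale bytes readable, which is the failure the message describes.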
|
|
shrink_dcache_for_umount() is supposed to handle the possibility of
some of the dentries to be evicted being in other threads' shrink
lists; it either kills them, leaving an empty husk to be freed by
the owner of shrink list whenever it gets around to that, or it
waits for the eviction in progress to get completed.
That relies upon dentry remaining attached to the tree until the
eviction reaches dentry_unlist() and its ->d_sib gets removed
from the list. Unfortunately, the secondary roots are linked
via ->d_hash, rather than ->d_sib and they become removed from
that list before their inode references are dropped.
If shrink_dentry_list() from another thread ends up evicting
one of the secondary roots and gets to that point in dentry_kill()
when shrink_dcache_for_umount() is looking for secondary roots,
the latter will *not* notice anything, possibly leading to
warnings about busy inodes at umount time and all kinds of breakage
after that.
Moreover, shrink_dcache_for_umount() walks the list of secondary
roots with no protection whatsoever, so it might end up calling
dget() on a dentry that already passed through
lockref_mark_dead(&dentry->d_lockref);
ending up with corrupted refcount and possible UAF.
AFAICS, the most straightforward way to deal with that would be
to have secondary roots linked via ->d_sib rather than ->d_hash;
then they would remain on the list until killed, and we could
use d_add_waiter() machinery to wait for eviction in progress.
Changes:
* secondary roots look the same as ->s_root from d_unhashed()
and d_unlinked() POV now.
* secondary roots are represented as "no parent, but on ->d_sib"
instead of "no parent, but on ->d_hash".
* since ->d_sib is a plain hlist, we protect it with per-superblock
spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for
non-root dentries it would be protected by ->d_lock of parent).
* __d_obtain_alias() uses ->d_sib for linkage when allocating
a secondary root.
* d_splice_alias_ops() detects splicing of a secondary root and
removes it from the list before calling __d_move().
* dentry_unlist() detects eviction of a secondary root and
removes it from the list; no need to play the games for d_walk() sake,
since the latter is not going to look for the next sibling of those
anyway.
* ___d_drop() doesn't care about ->s_roots anymore.
* shrink_dcache_for_umount() uses proper locking for access to
the list of secondary roots and if it runs into one that is in the middle
of eviction waits for that to finish.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Rename to_shrink_list() into __move_to_shrink_list(), document and
export it. Switch d_dispose_if_unused() users to that and kill
d_dispose_if_unused() itself.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Refcount of a NORCU dentry must not be incremented after having dropped
to zero. Otherwise we might end up with the following race:
CPU1: in fast_dput(d), rcu_read_lock();
CPU1: decrements refcount of d to 0
CPU1: notice that it's unhashed
CPU2: grab a reference to d
CPU2: dput(d), freeing d
CPU1: ... looks like we need to evict d, let's grab ->d_lock, recheck
the refcount, etc.
and that spin_lock(&d->d_lock) ends up a UAF, despite still being in
an RCU read-side critical area started back when the refcount had been
positive. If not for DCACHE_NORCU in d->d_flags freeing would've been
RCU-delayed, so we'd have grabbed ->d_lock, noticed the negative value
stored into refcount by __dentry_kill(), dropped the locks and that would
be it. For NORCU dentries freeing is _not_ delayed, though.
Most of the non-counting references are excluded for NORCU dentries -
they are not allowed to be hashed, they never get placed on LRU, they
never get placed into anyone's list of children and while dput_to_list()
might put them into a shrink list, nobody bumps refcount of something
that had been reached that way.
However, inode's list of aliases can be a problem - it does not contribute
to dentry refcount (for obvious reasons) and we *do* have places that
grab references to something found on that list - that's precisely what
d_find_alias() is. In case of d_find_alias() we are safe - it skips
unhashed aliases, so all NORCU ones are ignored there. d_find_any_alias()
is *not* limited to hashed ones, though, and while it's usually called
for directories (which never get NORCU dentries), there are callers that
use it to get something for non-directories with no hashed aliases.
Having d_find_any_alias() hit a NORCU dentry is not impossible - it can
be easily arranged if you have CAP_DAC_READ_SEARCH (memfd_create() + mmap()
+ name_to_handle_at() for /proc/self/map_files/<...> + munmap() +
open_by_handle_at() will do that, and adding a second memfd_create() for
mount_fd makes it possible to do that without having memfd pinned).
The race window is narrow, and it's probably not feasible on bare hardware,
but...
It's not hard to fix, fortunately:
* separate __d_find_dir_alias() (== current __d_find_any_alias()) to
be used for directory inodes.
* provide dget_alias_ilocked() that would return false for NORCU
dentries with zero refcount and return true incrementing refcount otherwise
* make __d_find_any_alias() go over the list of aliases, using
dget_alias_ilocked() and returning the alias it succeeds on (normally the
first one). Any NORCU alias with zero refcount is going to be evicted by
the thread that had dropped the final reference; this makes __d_find_any_alias()
pretend it had lost the race with eviction.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
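The rule proposed in the NORCU message above (never resurrect a NORCU dentry whose refcount has dropped to zero) can be shown with a simplified model. struct demo_dentry, demo_dget_alias() and demo_find_any_alias() are hypothetical; real dentries use a lockref and the alias list is walked under the inode lock, both of which are left out here.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_dentry {
	bool norcu;	/* freeing is not RCU-delayed */
	int count;	/* reference count */
};

/*
 * Take a reference unless this is a NORCU dentry at refcount zero,
 * which is about to be evicted by whoever dropped the last reference.
 * The caller is assumed to hold the lock protecting the alias list.
 */
static inline bool demo_dget_alias(struct demo_dentry *d)
{
	if (d->norcu && d->count == 0)
		return false;
	d->count++;
	return true;
}

/* Return the first alias a reference could be taken on, or NULL. */
static inline struct demo_dentry *
demo_find_any_alias(struct demo_dentry **aliases, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (demo_dget_alias(aliases[i]))
			return aliases[i];
	return NULL;
}
```

Skipping the dying alias makes the lookup behave as if it had lost the race with eviction, which matches the description in the message.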
|
|
Parallel lookup starts with a call of d_alloc_parallel(). That primitive
either returns a matching hashed dentry or allocates a new one in the
in-lookup state and returns it to the caller. Once the caller is done
with lookup, it indicates so either by call of d_{splice_alias,add}()
or by call of d_lookup_done(); at that point dentry leaves the in-lookup
state.
If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for
that dentry to leave the in-lookup state, one way or another. Currently
that is done by supplying wait_queue_head to d_alloc_parallel(). If d_alloc_parallel()
creates a new in-lookup dentry, the address of that wait_queue_head is stored
in ->d_wait of new dentry and stays there while it's in the in-lookup state;
subsequent d_alloc_parallel() will wait on the queue found in the matching
in-lookup dentry. Transition out of in-lookup state wakes waiters on that
queue (if any).
That works, but the calling conventions are inconvenient - the caller must
supply wait_queue_head and make sure that it survives at least until the new
in-lookup dentry leaves the in-lookup state. That amounts to boilerplate
in the d_alloc_parallel() callers that are followed by a call of d_lookup_done()
in the same function; in cases like nfs asynchronous unlink it gets worse than
that.
This patch changes d_alloc_parallel() to use wake_up_var_locked() to
wake up waiters, and wait_var_event_spinlock() to wait. dentry->d_lock
is used for synchronisation as it is already held at the relevant
times.
That eliminates the need of caller-supplied wait_queue_head, simplifying
the calling conventions. Better yet, we only need one bit of information
stored in dentry itself: whether there are any waiters to be woken up,
and that can be easily stored in ->d_flags; ->d_wait goes away.
The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with wait_var
machinery the queues are shared with all kinds of stuff and there's
no way to tell if any of the waiters have anything to do with our dentry;
most of the time none of them will be relevant, so we need to avoid the
pointless wakeups.
Another benefit of the new scheme comes from the fact that wakeups
have to be done outside of write-side critical areas of ->i_dir_seq;
with the old scheme we need to carry the value picked from ->d_wait from
__d_lookup_unhash() to the place where we actually wake the waiters up.
Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get
to doing wakeups - that's done within the same ->d_lock scope, so we
are fine; new bit is accessed only under ->d_lock and it's seen only
on dentries with DCACHE_PAR_LOOKUP in ->d_flags.
__d_lookup_unhash() no longer needs to re-init ->d_lru. That was
previously shared (in a union) with ->d_wait but ->d_wait is now gone
so it no longer corrupts ->d_lru.
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
https://gitlab.freedesktop.org/drm/rust/kernel into drm-next
DRM Rust changes for v7.2-rc1
- Driver Core (shared via signed tag dd-lifetimes-7.2-rc1):
- Introduce Higher-Ranked Lifetime Types (HRT) for Rust device
drivers, allowing driver structs to hold device resources like
pci::Bar and IoMem directly with a lifetime tied to the binding
scope, removing the need for Devres indirection and ARef<Device>.
- Replace drvdata() with scoped registration data on the auxiliary
bus, using the new ForLt trait to thread lifetimes through
registrations. Remove drvdata() and driver_type.
- DRM:
- Add GPUVM immediate mode abstraction for Rust GPU drivers:
- In immediate mode, GPU virtual address space state is updated
during job execution (in the DMA fence signalling critical path),
keeping the GPUVM and the GPU's address space always in sync.
- Provide GpuVm, GpuVa, and GpuVmBo types for managing address
spaces, virtual mappings, and GEM object backing respectively.
- Provide split-merge map/unmap operations that handle partial
overlaps with existing mappings.
- drm_exec integration for dma_resv locking and GEM object
validation based on the external/evicted object lists are not
yet covered and are planned as follow-up work.
- Introduce DeviceContext type state for drm::Device, allowing
drivers to restrict operations to contexts where the device is
guaranteed to be registered (or not yet registered) with userspace.
- Add FEAT_RENDER flag to the Driver trait for render node support.
- Nova:
- Hopper/Blackwell enablement:
- Add GPU identification and architecture-based HAL selection for
Hopper (GH100) and Blackwell (GB100, GB202).
- Implement the FSP (Foundation Security Processor) boot path used by
Hopper and Blackwell, including FSP falcon engine support, EMEM
operations, MCTP/NVDM message infrastructure, and FSP Chain of
Trust boot with GSP lockdown release.
- Add support for 32-bit firmware images and auto-detection of
firmware image format.
- Add architecture-specific framebuffer, sysmem flush, PCI config
mirror, DMA mask, and WPR/non-WPR heap sizing.
- GSP boot and unload:
- Refactor the GSP boot process into a chipset-specific HAL,
keeping the SEC2 and FSP boot paths separated cleanly.
- Implement proper driver unload: send UNLOADING_GUEST_DRIVER
command, run Booter Unloader and FWSEC-SB upon unbinding, and run
the unload bundle on Gsp::boot() failure. This removes the need
for a manual GPU reset between driver unbind and re-probe.
- GA100 support:
- Add support for the GA100 GPU, including IFR header detection and
skipping, correct fwsignature selection, conditional FRTS boot,
and documentation of the IFR header layout.
- VBIOS hardening and refactoring:
- Harden VBIOS parsing with checked arithmetic, bounds-checked
accesses, and FromBytes-based structure reads throughout the FWSEC
and Falcon data paths. Simplify the overall VBIOS module
structure.
- HRT adoption:
- Use lifetime-parameterized pci::Bar directly, replacing the
Arc<Devres<Bar0>> indirection. Replace ARef<Device> with &'bound
Device in SysmemFlush and the GSP sequencer. Separate the driver
type from driver data.
- Misc:
- Rename module names to kebab-case (nova-drm, nova-core).
- Require little-endian in Kconfig, making the existing assumption
explicit.
- Tyr:
- Define comprehensive typed register blocks for GPU_CONTROL,
JOB_CONTROL, MMU_CONTROL (including per-address-space registers),
and DOORBELL_BLOCK using the kernel register!() macro. This replaces
manual bit manipulation with typed register and field accessors.
- Add shmem-backed GEM objects and set DMA mask based on GPU physical
address width.
- Adopt HRT: separate driver type from driver data, and use IoMem
directly instead of Devres for register access during probe.
- Move clock cleanup into a Drop implementation.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: "Danilo Krummrich" <dakr@kernel.org>
Link: https://patch.msgid.link/DJ0IF39U9ETK.PCCUO7ZEQ4S0@kernel.org
|
|
Cross-merge networking fixes after downstream PR (net-7.1-rc7).
Silent conflicts:
net/wireless/nl80211.c
cb9959ab5f99 ("wifi: cfg80211: enforce HE/EHT cap/oper consistency")
a384ae969902 ("wifi: cfg80211: move AP HT/VHT/... operation to beacon info")
https://lore.kernel.org/aiGJDaHV4UlCexIQ@sirena.org.uk
Conflicts:
drivers/net/wireless/intel/iwlwifi/mld/ap.c
a342c99cb70d ("wifi: iwlwifi: mld: honor BSS_CHANGED_BEACON_ENABLED")
9bf1b409afc7 ("wifi: iwlwifi: mld: send tx power constraints before link activation")
https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk
drivers/net/wireless/intel/iwlwifi/pcie/drv.c
093305d801fa ("wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used")
e2323929a68a ("wifi: iwlwifi: pcie: add debug print for resume flow if powered off")
https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk
Adjacent changes:
drivers/net/ethernet/airoha/airoha_eth.c
b38cae85d1c4 ("net: airoha: Fix use-after-free in metadata dst teardown")
ec6c391bcca7 ("net: airoha: Introduce airoha_gdm_dev struct")
drivers/net/ethernet/microchip/lan743x_main.c
8173d22b211f ("net: lan743x: permit VLAN-tagged packets up to configured MTU")
e3c6508a46f5 ("net: lan743x: avoid netdev-based logging before netdev registration")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
__ethtool_get_link_ksettings() is exported and called from sysfs
and many drivers. It invokes ethtool_ops->get_link_ksettings
so by our own docs it should be holding netdev lock for ops locked
devices. Looks like commit 2bcf4772e45a ("net: ethtool:
try to protect all callback with netdev instance lock")
missed adding the ops lock here.
There's a number of callers we need to fix up so let's add the
netif_get_link_ksettings() helper first, without any actual
locking changes (this commit is a nop).
Not treating this as a fix because I don't think any driver cares
at this point, but if we want to remove the rtnl_lock protection
this will become critical.
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260603012840.2254293-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix CFI violation in probestub function
The probestub is a function to allow tprobes to hook to a tracepoint
to gain access to its parameters.
The function itself is only referenced by the tracepoint structure
which lives in the __tracepoint section. objtool explicitly ignores
that section and when processing functions in the kernel, if it
detects one that has no references it will seal it to have its ENDBR
stripped on boot up.
This means the probestub function will have its ENDBR stripped and if
a tprobe is attached to it with IBT enabled, it will go *boom*.
* tag 'trace-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix CFI violation in probestub being called by tprobes
|
|
Allow callers to easily reference these symbols in code that is built
even when the generic datastore is disabled.
As there are no good default no-op variants of these symbols, do not
provide stubs but require users to have their own fallback handling
using IS_ENABLED(CONFIG_HAVE_GENERIC_VDSO).
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260521-vdso-mips-kconfig-v1-2-2f79dcd6c78f@linutronix.de
|
|
Add a read_snapshot() callback to struct clocksource which returns the
derived clocksource value while also providing the underlying hardware
counter reading and the related clocksource ID.
This allows ktime_get_snapshot_id() to populate new hw_cycles and hw_csid
fields in struct system_time_snapshot.
For clocksources that are derived from an underlying counter (e.g., Hyper-V
TSC page scales TSC to 10MHz, kvmclock scales TSC to 1GHz), this provides
atomic access to both the derived value needed for timekeeping
calculations, and the raw hardware counter needed by consumers like KVM's
master clock and the vmclock PTP driver.
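The idea can be illustrated with a small user-space model. This is only a sketch: the struct name, the callback signature, the ID enum and the clock rates are assumptions made for illustration and are not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clocksource IDs; the kernel's real enum differs. */
enum model_csid { MODEL_CSID_NONE = 0, MODEL_CSID_TSC = 1 };

/* Simplified stand-in for a clocksource with a read_snapshot() hook. */
struct model_clocksource {
	uint64_t raw_counter;	/* pretend hardware counter, e.g. the TSC */
	uint64_t raw_hz;	/* assumed rate of the raw counter */
	uint64_t derived_hz;	/* rate of the derived clock, e.g. 10 MHz */
	uint64_t (*read_snapshot)(struct model_clocksource *cs,
				  uint64_t *hw_cycles,
				  enum model_csid *hw_csid);
};

/*
 * Read the raw counter once and return the derived value together with
 * the raw reading and its ID, so both describe the same instant.
 * Simplification: raw_hz must be an integer multiple of derived_hz.
 */
static uint64_t model_read_snapshot(struct model_clocksource *cs,
				    uint64_t *hw_cycles,
				    enum model_csid *hw_csid)
{
	uint64_t raw = cs->raw_counter;

	assert(cs->derived_hz > 0 && cs->raw_hz % cs->derived_hz == 0);

	*hw_cycles = raw;
	*hw_csid = MODEL_CSID_TSC;
	return raw / (cs->raw_hz / cs->derived_hz);
}
```

The point of the single callback is that the derived value and the raw counter come from one read, rather than from two reads that could straddle a counter update.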
[ tglx: Reworked it slightly ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Assisted-by: Kiro:claude-opus-4.6-1m
Link: https://patch.msgid.link/20260526230635.136914-1-dwmw2@infradead.org
Link: https://patch.msgid.link/20260529195558.202568489@kernel.org
|
|
To prepare for a new PTP IOCTL, which exposes the raw counter value along
with the requested system time snapshot, switch the pre/post time stamp
sampling over to use ktime_get_snapshot_id() and fix up all usage sites.
No functional change intended.
The ptp_vmclock conversion was simplified by David Woodhouse.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260529195558.149589566@kernel.org
|
|
All users are converted to sys_systime.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195558.046694580@kernel.org
|
|
PTP device system crosstime stamps support only CLOCK_REALTIME, which is
meaningless for AUX clocks. The PTP core hands in the clock ID already, so
prepare the core code to honor it.
- Add a new sys_systime field to struct system_device_crosststamp which
aliases the sys_realtime field. Once all users are converted
sys_realtime can be removed.
- Prepare get_device_system_crosststamp() and the related code for it by
switching to sys_systime and providing the initial changes to utilize
different time keepers.
No functional change intended.
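One way such an alias can be expressed is with an anonymous union, shown here as a user-space sketch. The struct below is a simplified model, the helper names are hypothetical, and whether the actual patch uses a union is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t model_ktime_t;	/* stand-in for ktime_t */

/* Simplified model of a cross timestamp with an aliased field. */
struct model_crosststamp {
	model_ktime_t device;
	union {
		model_ktime_t sys_realtime;	/* legacy name, to be removed */
		model_ktime_t sys_systime;	/* new name, same storage */
	};
	model_ktime_t sys_monoraw;
};

/* Both names must refer to the same bytes for the transition to be safe. */
_Static_assert(offsetof(struct model_crosststamp, sys_realtime) ==
	       offsetof(struct model_crosststamp, sys_systime),
	       "sys_systime must alias sys_realtime");

/* A not-yet-converted producer still writes the old name ... */
static void model_legacy_producer(struct model_crosststamp *x,
				  model_ktime_t now)
{
	assert(x != NULL);
	x->sys_realtime = now;
}

/* ... while an already converted consumer reads the new one. */
static model_ktime_t model_converted_consumer(const struct model_crosststamp *x)
{
	assert(x != NULL);
	return x->sys_systime;
}
```

With this layout, users can be converted one at a time without a flag day, and the old name can be dropped once the last user is gone.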
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.846634842@kernel.org
|
|
All users have been converted to ktime_get_snapshot_id().
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.795510496@kernel.org
|
|
The normal capture for system/device cross timestamps is CLOCK_REALTIME,
but that's meaningless for AUX clocks.
Add a clock_id field to struct system_device_crosststamp and initialize it
with CLOCK_REALTIME at the two places which prepare for cross
timestamps.
After the related code has been cleaned up, the core code will honor the
clock_id field when calculating the system time from the system counter
snapshot.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.482153523@kernel.org
|
|
An upcoming extension to the PTP IOCTL needs to return the system counter
value and the clocksource ID to user space. get_device_system_crosststamp() has
this information already.
Extend struct system_device_crosststamp with a system_counterval_t member
and fill in the data.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.429406675@kernel.org
|
|
All users are converted over to ktime_get_snapshot_id() and
system_time_snapshot::systime and ::monoraw.
Remove the leftovers.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.330029635@kernel.org
|
|
The probestub is a function to allow tprobes to hook to a tracepoint to
gain access to its parameters. The function itself is only referenced by
the tracepoint structure which lives in the __tracepoint section. objtool
explicitly ignores that section and when processing functions in the
kernel, if it detects one that has no references it will seal it to have
its ENDBR stripped on boot up.
This means that when a tprobe is attached to the sched_wakeup tracepoint and
the tracepoint is triggered, it will call __probestub_sched_wakeup and, due to
the missing ENDBR, take a #CP exception on a CFI-enabled machine.
Fix this by adding a CFI_NOSEAL annotation to the probestub declaration.
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260603153147.573589-1-eva.kurchatova@virtuozzo.com
Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
[ Updated change log ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
!CONFIG_HOTPLUG_CPU
lockdep_is_cpus_held() and lockdep_is_cpus_write_held() are undefined when
!CONFIG_HOTPLUG_CPU. This is ok because their few callers protect the calls
with an "if (IS_ENABLED(CONFIG_HOTPLUG_CPU) ..." check.
It is error prone to require callers to protect lockdep_is_cpus_held()
and lockdep_is_cpus_write_held() with an IS_ENABLED(CONFIG_HOTPLUG_CPU)
check while the custom for equivalent functions, for example the more
prevalent lockdep_is_held(), is to not require similar protection.
It is also inconsistent with the CPU hotplug lockdep code itself, since the
related call lockdep_assert_cpus_held() does not require protection.
Create stubs for lockdep_is_cpus_held() and lockdep_is_cpus_write_held()
that return 1 (LOCK_STATE_UNKNOWN/LOCK_STATE_HELD) when !CONFIG_HOTPLUG_CPU.
This makes the CPU hotplug lockdep checks consistent while following
existing lockdep custom. Drop the "extern" from the function declaration
as part of the move to match kernel coding style.
Keep the IS_ENABLED(CONFIG_HOTPLUG_CPU) checks in existing users since
removing them would change the logic of these expressions.
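A user-space sketch of why the existing guards are kept. The stub mirrors the description above; the two caller expressions and the MODEL_HOTPLUG_CPU_ENABLED macro (a simplified stand-in for IS_ENABLED(CONFIG_HOTPLUG_CPU)) are assumptions for illustration, not the real call sites.

```c
#include <assert.h>

#ifdef CONFIG_HOTPLUG_CPU
#define MODEL_HOTPLUG_CPU_ENABLED 1
int lockdep_is_cpus_held(void);		/* real implementation elsewhere */
#else
#define MODEL_HOTPLUG_CPU_ENABLED 0
/* Stub: report "held/unknown" so lockdep-style checks never complain. */
static inline int lockdep_is_cpus_held(void)
{
	return 1;
}
#endif

/* Guard kept: evaluates to 0 when CPU hotplug is disabled. */
static inline int model_guarded_check(void)
{
	return MODEL_HOTPLUG_CPU_ENABLED && lockdep_is_cpus_held();
}

/* Guard dropped: the stub makes this 1 when CPU hotplug is disabled. */
static inline int model_unguarded_check(void)
{
	return lockdep_is_cpus_held();
}
```

Built without CONFIG_HOTPLUG_CPU, the guarded expression yields 0 and the unguarded one yields 1, which is the change in logic the commit avoids by leaving existing users alone.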
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/7484f0b58fd86153d445819cc4e172adba16cff9.1780543665.git.reinette.chatre@intel.com
Closes: https://sashiko.dev/#/patchset/cover.1780456704.git.reinette.chatre%40intel.com?part=1
|
|
It has no callers left, so delete it. Inline __end_buffer_write_sync()
into bh_end_write().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20260528173150.1093780-35-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
This shrinks buffer_head by 8 bytes, letting us pack more buffer heads
per slab. With a Debian config, it shrinks from 104 bytes to 96 bytes
which is 42 objects per 4KiB page rather than 39, a 7% reduction in the
amount of memory used.
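The numbers above can be checked with a few lines of C. This is back-of-the-envelope arithmetic only: it assumes one 4096-byte page per slab and ignores any slab metadata or higher-order slabs, and the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* How many objects of obj_size fit into one page, ignoring slab overhead. */
static size_t objs_per_page(size_t page_size, size_t obj_size)
{
	assert(obj_size > 0);
	return page_size / obj_size;
}

/*
 * Memory saved for a fixed number of objects, in tenths of a percent
 * (rounded to nearest), when a page holds objs_after instead of
 * objs_before objects.
 */
static unsigned int saving_permille(size_t objs_before, size_t objs_after)
{
	assert(objs_after > 0 && objs_after >= objs_before);
	return (unsigned int)((1000 * (objs_after - objs_before) +
			       objs_after / 2) / objs_after);
}
```

Here objs_per_page(4096, 104) is 39 and objs_per_page(4096, 96) is 42, and saving_permille(39, 42) is 71, i.e. about 7.1%, consistent with the 7% quoted.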
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20260528173150.1093780-33-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
No users are left; remove this API. Also remove/fix comments mentioning
it, and end_bio_bh_io_sync() as it's now unused.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20260528173150.1093780-32-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
There are no more callers of this function, so delete it.
end_buffer_async_write() then has only one caller left, so
inline it into bh_end_async_write().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20260528173150.1093780-27-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
These are the bio_end_io_t versions of end_buffer_read_sync(),
end_buffer_write_sync() and end_buffer_async_write(). They do not
contain a put_bh() call as it is no longer necessary.
Also add the helper function bio_endio_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20260528173150.1093780-5-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
bh_submit() takes a bio_end_io allowing users to avoid the indirect
function call through bh->b_end_io, and eventually allowing us to remove
bh->b_end_io.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20260528173150.1093780-3-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context. Perf lock contention profiling shows the
performance problem is not lock contention but the writeback submission
work itself — walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).
Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background. This moves writeback submission
completely off the writer's hot path.
To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back. The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.
Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations. Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.
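The coalescing behaviour can be modelled in user space with C11 atomics. This is a sketch under assumptions: the bit position, struct and function names are hypothetical, and waking only when the bit was newly set is modelled on the description of WB_start_all rather than taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_WB_START_DONTCACHE (1u << 0)	/* hypothetical bit position */

struct model_wb {
	atomic_uint state;	/* request bits */
	atomic_uint wakeups;	/* how often the flusher was woken */
};

/* Writer side: set the request bit, wake the flusher only if it was clear. */
static bool model_kick(struct model_wb *wb)
{
	unsigned int old = atomic_fetch_or(&wb->state, MODEL_WB_START_DONTCACHE);

	if (old & MODEL_WB_START_DONTCACHE)
		return false;		/* already pending: coalesced */
	atomic_fetch_add(&wb->wakeups, 1);
	return true;
}

/*
 * Flusher side: atomically consume the request (test-and-clear) before
 * looking at the dirty counter, so a kick arriving during writeback
 * sets the bit again and triggers another run.
 */
static bool model_consume_kick(struct model_wb *wb)
{
	unsigned int old = atomic_fetch_and(&wb->state, ~MODEL_WB_START_DONTCACHE);

	return (old & MODEL_WB_START_DONTCACHE) != 0;
}
```

Three back-to-back kicks produce one wakeup, and a kick that arrives after the flusher consumed the bit produces a second one.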
Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.
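The difference between the two reads can be seen in a simplified, single-threaded model of a batched per-CPU counter. The CPU count, batch size and all names below are assumptions for illustration; the kernel's percpu_counter is more involved.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS	4
#define MODEL_BATCH	32	/* hypothetical batch threshold */

struct model_percpu_counter {
	int64_t count;			/* global, cheap to read */
	int32_t pcpu[MODEL_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Add on one CPU; fold into the global count once the delta is large. */
static void model_add(struct model_percpu_counter *c, int cpu, int32_t amount)
{
	int32_t v;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	v = c->pcpu[cpu] + amount;
	if (v >= MODEL_BATCH || v <= -MODEL_BATCH) {
		c->count += v;
		c->pcpu[cpu] = 0;
	} else {
		c->pcpu[cpu] = v;
	}
}

/* Cheap read: global count only (the wb_stat() style of read). */
static int64_t model_read(const struct model_percpu_counter *c)
{
	return c->count;
}

/* Precise read: global count plus all per-CPU deltas (wb_stat_sum() style). */
static int64_t model_sum(const struct model_percpu_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return sum;
}
```

After a few small additions the cheap read still reports 0 while the summed read reports the true total, which is why a flusher relying on the cheap read could miss small DONTCACHE writes.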
In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch. Pin the wb with
wb_get() inside the RCU critical section before calling wb_wakeup()
outside it, since cgroup bdi_writeback structures are RCU-freed and the
wb pointer could become invalid after unlocked_inode_to_wb_end() drops
the RCU read lock.
Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.
dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):
Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:
Single-stream throughput (MB/s):
                          Before   After   Change
  seq-write/dontcache        298     897    +201%
  rand-write/dontcache       131     236     +80%
Tail latency improvements (seq-write/dontcache):
p99: 135,266 us -> 23,986 us (-82%)
p99.9: 8,925,479 us -> 28,443 us (-99.7%)
Multi-writer (4 jobs, sequential write):
                                Before   After   Change
  dontcache aggregate (MB/s)     2,529   4,532     +79%
  dontcache p99 (us)             8,553   1,002     -88%
  dontcache p99.9 (us)         109,314   1,057     -99%
Dontcache multi-writer throughput now matches buffered (4,532 vs
4,616 MB/s).
32-file write (Axboe test):
                                Before   After   Change
  dontcache aggregate (MB/s)     1,548   3,499    +126%
  dontcache p99 (us)            10,170     602     -94%
  Peak dirty pages (MB)          1,837     213     -88%
Dontcache now reaches 81% of buffered throughput (was 35%).
Competing writers (dontcache vs buffered, separate files):
                      Before   After
  buffered writer        868     433 MB/s
  dontcache writer       415     433 MB/s
  Aggregate            1,284     866 MB/s
Previously the buffered writer starved the dontcache writer 2:1.
With per-bdi_writeback tracking, both writers now receive equal
bandwidth. The aggregate matches the buffered-vs-buffered baseline
(863 MB/s), indicating fair sharing regardless of I/O mode.
The dontcache writer's p99.9 latency collapsed from 119 ms to
33 ms (-73%), eliminating the severe periodic stalls seen in the
baseline. Both writers now share identical latency profiles,
matching the buffered-vs-buffered pattern.
The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
writes).
Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
to prevent concurrent lookups from double-decrementing the counter, and
guarding the decrement with mapping_can_writeback() to match the increment
path.
Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
that the stat is properly migrated when an inode switches cgroup writeback
domains.
The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.
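The double-decrement protection can be sketched in user space with C11 atomics. All types and function names here are hypothetical models; the mapping_can_writeback() guard and the other decrement paths are deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_folio {
	atomic_bool dropbehind;	/* set for RWF_DONTCACHE-style writes */
	bool dirty;
};

struct model_wb_stats {
	atomic_long dontcache_dirty;	/* models WB_DONTCACHE_DIRTY */
};

/* Dirtying a dropbehind folio bumps the counter. */
static void model_account_dirtied(struct model_folio *folio,
				  struct model_wb_stats *stats)
{
	folio->dirty = true;
	if (atomic_load(&folio->dropbehind))
		atomic_fetch_add(&stats->dontcache_dirty, 1);
}

/*
 * A non-DONTCACHE lookup clears the flag. atomic_exchange() plays the
 * role of a test-and-clear: only the caller that actually saw the flag
 * set performs the decrement, so racing lookups cannot decrement twice.
 */
static void model_non_dontcache_lookup(struct model_folio *folio,
				       struct model_wb_stats *stats)
{
	if (atomic_exchange(&folio->dropbehind, false) && folio->dirty)
		atomic_fetch_sub(&stats->dontcache_dirty, 1);
}
```

Calling the lookup twice on the same dirty folio leaves the counter at 0 rather than -1, because the second call no longer sees the flag.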
Suggested-by: Jan Kara <jack@suse.cz>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260511-dontcache-v7-2-2848ddce8090@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
soc/drivers
arm64: Xilinx SOC changes for 7.2
firmware:
- Add CSU register discovery with sysfs interface
zynqmp_power:
- Fix race condition in event registration
- Fix shutdown and free rx mailbox channel
* tag 'zynqmp-soc-for-7.2' of https://github.com/Xilinx/linux-xlnx:
firmware: zynqmp: Add dynamic CSU register discovery and sysfs interface
Documentation: ABI: add sysfs interface for ZynqMP CSU registers
soc: xilinx: Shutdown and free rx mailbox channel
soc: xilinx: Fix race condition in event registration
Signed-off-by: Linus Walleij <linusw@kernel.org>
|