summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-05-24Merge tag 'trace-tracefs-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracefs/eventfs updates from Steven Rostedt: "Bug fixes: - The eventfs directories need to have unique inode numbers. Make sure that they do not get the default file inode number. - Update the inode uid and gid fields on remount. When a remount happens where a uid and/or gid is specified, all the tracefs files and directories should get the specified uid and/or gid. But this can be sporadic when some uids were assigned already. There's already a list of inodes that are allocated. Just update their uid and gid fields at the time of remount. - Update the eventfs_inodes on remount from the top level "events" descriptor. There was a bug where not all the eventfs files or directories where getting updated on remount. One fix was to clear the SAVED_UID/GID flags from the inode list during the iteration of the inodes during the remount. But because the eventfs inodes can be freed when the last referenced is released, not all the eventfs_inodes were being updated. This lead to the ownership selftest to fail if it was run a second time (the first time would leave eventfs_inodes with no corresponding tracefs_inode). Instead, for eventfs_inodes, only process the "events" eventfs_inode from the list iteration, as it is guaranteed to have a tracefs_inode (it's never freed while the "events" directory exists). As it has a list of its children, and the children have a list of their children, just iterate all the eventfs_inodes from the "events" descriptor and it is guaranteed to get all of them. - Clear the EVENT_INODE flag from the tracefs_drop_inode() callback. Currently the EVENTFS_INODE FLAG is cleared in the tracefs_d_iput() callback. But this is the wrong location. The iput() callback is called when the last reference to the dentry inode is hit. There could be a case where two dentry's have the same inode, and the flag will be cleared prematurely. The flag needs to be cleared when the last reference of the inode is dropped and that happens in the inode's drop_inode() callback handler. Cleanups: - Consolidate the creation of a tracefs_inode for an eventfs_inode A tracefs_inode is created for both files and directories of the eventfs system. It is open coded. Instead, consolidate it into a single eventfs_get_inode() function call. - Remove the eventfs getattr and permission callbacks. The permissions for the eventfs files and directories are updated when the inodes are created, on remount, and when the user sets them (via setattr). The inodes hold the current permissions so there is no need to have custom getattr or permissions callbacks as they will more likely cause them to be incorrect. The inode's permissions are updated when they should be updated. Remove the getattr and permissions inode callbacks. - Do not update eventfs_inode attributes on creation of inodes. The eventfs_inodes attribute field is used to store the permissions of the directories and files for when their corresponding inodes are freed and are created again. But when the creation of the inodes happen, the eventfs_inode attributes are recalculated. The recalculation should only happen when the permissions change for a given file or directory. Currently, the attribute changes are just being set to their current files so this is not a bug, but it's unnecessary and error prone. Stop doing that. - The events directory inode is created once when the events directory is created and deleted when it is deleted. It is now updated on remount and when the user changes the permissions. There's no need to use the eventfs_inode of the events directory to store the events directory permissions. But using it to store the default permissions for the files within the directory that have not been updated by the user can simplify the code" * tag 'trace-tracefs-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Do not use attributes for events directory eventfs: Cleanup permissions in creation of inodes eventfs: Remove getattr and permission callbacks eventfs: Consolidate the eventfs_inode update in eventfs_get_inode() tracefs: Clear EVENT_INODE flag in tracefs_drop_inode() eventfs: Update all the eventfs_inodes from the events descriptor tracefs: Update inode permissions on remount eventfs: Keep the directories from having the same inode number as files
2024-05-23Merge tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Stable fixes: - nfs: fix undefined behavior in nfs_block_bits() - NFSv4.2: Fix READ_PLUS when server doesn't support OP_READ_PLUS Bugfixes: - Fix mixing of the lock/nolock and local_lock mount options - NFSv4: Fixup smatch warning for ambiguous return - NFSv3: Fix remount when using the legacy binary mount api - SUNRPC: Fix the handling of expired RPCSEC_GSS contexts - SUNRPC: fix the NFSACL RPC retries when soft mounts are enabled - rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL Features and cleanups: - NFSv3: Use the atomic_open API to fix open(O_CREAT|O_TRUNC) - pNFS/filelayout: S layout segment range in LAYOUTGET - pNFS: rework pnfs_generic_pg_check_layout to check IO range - NFSv2: Turn off enabling of NFS v2 by default" * tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: fix undefined behavior in nfs_block_bits() pNFS: rework pnfs_generic_pg_check_layout to check IO range pNFS/filelayout: check layout segment range pNFS/filelayout: fixup pNfs allocation modes rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL NFS: Don't enable NFS v2 by default NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUS sunrpc: fix NFSACL RPC retry on soft mount SUNRPC: fix handling expired GSS context nfs: keep server info for remounts NFSv4: Fixup smatch warning for ambiguous return NFS: make sure lock/nolock overriding local_lock mount option NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly. pNFS/filelayout: Specify the layout segment range in LAYOUTGET pNFS/filelayout: Remove the whole file layout requirement
2024-05-23Merge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
2024-05-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2024-05-23eventfs: Do not use attributes for events directorySteven Rostedt (Google)
The top "events" directory has a static inode (it's created when it is and removed when the directory is removed). There's no need to use the events ei->attr to determine its permissions. But it is used for saving the permissions of the "events" directory for when it is created, as that is needed for the default permissions for the files and directories underneath it. For example: # cd /sys/kernel/tracing # mkdir instances/foo # chown 1001 instances/foo/events The files under instances/foo/events should still have the same owner as instances/foo (which the instances/foo/events ei->attr will hold), but the events directory now has owner 1001. Link: https://lore.kernel.org/lkml/20240522165032.104981011@goodmis.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23eventfs: Cleanup permissions in creation of inodesSteven Rostedt (Google)
The permissions being set during the creation of the inodes was updating eventfs_inode attributes as well. Those attributes should only be touched by the setattr or remount operations, not during the creation of inodes. The eventfs_inode attributes should only be used to set the inodes and should not be modified during the inode creation. Simplify the code and fix the situation by: 1) Removing the eventfs_find_events() and doing a simple lookup for the events descriptor in eventfs_get_inode() 2) Remove update_events_attr() as the attributes should only be used to update the inode and should not be modified here. 3) Add update_inode_attr() that uses the attributes to determine what the inode permissions should be. 4) As the parent_inode of the eventfs_root_inode structure is no longer needed, remove it. Now on creation, the inode gets the proper permissions without causing side effects to the ei->attr field. Link: https://lore.kernel.org/lkml/20240522165031.944088388@goodmis.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23eventfs: Remove getattr and permission callbacksSteven Rostedt (Google)
Now that inodes have their permissions updated on remount, the only other places to update the inode permissions are when they are created and in the setattr callback. The getattr and permission callbacks are not needed as the inodes should already be set at their proper settings. Remove the callbacks, as it not only simplifies the code, but also allows more flexibility to fix the inconsistencies with various corner cases (like changing the permission of an instance directory). Link: https://lore.kernel.org/lkml/20240522165031.782066021@goodmis.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23eventfs: Consolidate the eventfs_inode update in eventfs_get_inode()Steven Rostedt (Google)
To simplify the code, create a eventfs_get_inode() that is used when an eventfs file or directory is created. Have the internal tracefs_inode updated the appropriate flags in this function and update the inode's mode as well. Link: https://lore.kernel.org/lkml/20240522165031.624864160@goodmis.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23tracefs: Clear EVENT_INODE flag in tracefs_drop_inode()Steven Rostedt (Google)
When the inode is being dropped from the dentry, the TRACEFS_EVENT_INODE flag needs to be cleared to prevent a remount from calling eventfs_remount() on the tracefs_inode private data. There's a race between the inode is dropped (and the dentry freed) to where the inode is actually freed. If a remount happens between the two, the eventfs_inode could be accessed after it is freed (only the dentry keeps a ref count on it). Currently the TRACEFS_EVENT_INODE flag is cleared from the dentry iput() function. But this is incorrect, as it is possible that the inode has another reference to it. The flag should only be cleared when the inode is really being dropped and has no more references. That happens in the drop_inode callback of the inode, as that gets called when the last reference of the inode is released. Remove the tracefs_d_iput() function and move its logic to the more appropriate tracefs_drop_inode() callback function. Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.908205106@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23eventfs: Update all the eventfs_inodes from the events descriptorSteven Rostedt (Google)
The change to update the permissions of the eventfs_inode had the misconception that using the tracefs_inode would find all the eventfs_inodes that have been updated and reset them on remount. The problem with this approach is that the eventfs_inodes are freed when they are no longer used (basically the reason the eventfs system exists). When they are freed, the updated eventfs_inodes are not reset on a remount because their tracefs_inodes have been freed. Instead, since the events directory eventfs_inode always has a tracefs_inode pointing to it (it is not freed when finished), and the events directory has a link to all its children, have the eventfs_remount() function only operate on the events eventfs_inode and have it descend into its children updating their uid and gids. Link: https://lore.kernel.org/all/CAK7LNARXgaWw3kH9JgrnH4vK6fr8LDkNKf3wq8NhMWJrVwJyVQ@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.754424703@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Reported-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23tracefs: Update inode permissions on remountSteven Rostedt (Google)
When a remount happens, if a gid or uid is specified update the inodes to have the same gid and uid. This will allow the simplification of the permissions logic for the dynamically created files and directories. Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.592429986@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23eventfs: Keep the directories from having the same inode number as filesSteven Rostedt (Google)
The directories require unique inode numbers but all the eventfs files have the same inode number. Prevent the directories from having the same inode numbers as the files as that can confuse some tooling. Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.428826685@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Fixes: 834bf76add3e6 ("eventfs: Save directory inodes in the eventfs_inode structure") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-22Merge tag 'mm-nonmm-stable-2024-05-22-17-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more non-mm updates from Andrew Morton: - A series ("kbuild: enable more warnings by default") from Arnd Bergmann which enables a number of additional build-time warnings. We fixed all the fallout which we could find, there may still be a few stragglers. - Samuel Holland has developed the series "Unified cross-architecture kernel-mode FPU API". This does a lot of consolidation of per-architecture kernel-mode FPU usage and enables the use of newer AMD GPUs on RISC-V. - Tao Su has fixed some selftests build warnings in the series "Selftests: Fix compilation warnings due to missing _GNU_SOURCE definition". - This pull also includes a nilfs2 fixup from Ryusuke Konishi. * tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits) nilfs2: make block erasure safe in nilfs_finish_roll_forward() selftests/harness: use 1024 in place of LINE_MAX Revert "selftests/harness: remove use of LINE_MAX" selftests/fpu: allow building on other architectures selftests/fpu: move FP code to a separate translation unit drm/amd/display: use ARCH_HAS_KERNEL_FPU_SUPPORT drm/amd/display: only use hard-float, not altivec on powerpc riscv: add support for kernel-mode FPU x86: implement ARCH_HAS_KERNEL_FPU_SUPPORT powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORT LoongArch: implement ARCH_HAS_KERNEL_FPU_SUPPORT lib/raid6: use CC_FLAGS_FPU for NEON CFLAGS arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS arm64: implement ARCH_HAS_KERNEL_FPU_SUPPORT ARM: crypto: use CC_FLAGS_FPU for NEON CFLAGS ARM: implement ARCH_HAS_KERNEL_FPU_SUPPORT arch: add ARCH_HAS_KERNEL_FPU_SUPPORT x86/fpu: fix asm/fpu/types.h include guard kbuild: enable -Wcast-function-type-strict unconditionally kbuild: enable -Wformat-truncation on clang ...
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22Merge tag 'driver-core-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the small set of driver core and kernfs changes for 6.10-rc1. Nothing major here at all, just a small set of changes for some driver core apis, and minor fixups. Included in here are: - sysfs_bin_attr_simple_read() helper added and used - device_show_string() helper added and used All usages of these were acked by the various maintainers. Also in here are: - kernfs minor cleanup - removed unused functions - typo fix in documentation - pay attention to sysfs_create_link() failures in module.c finally All of these have been in linux-next for a very long time with no reported problems" * tag 'driver-core-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: device property: Fix a typo in the description of device_get_child_node_count() kernfs: mount: Remove unnecessary ‘NULL’ values from knparent scsi: Use device_show_string() helper for sysfs attributes platform/x86: Use device_show_string() helper for sysfs attributes perf: Use device_show_string() helper for sysfs attributes IB/qib: Use device_show_string() helper for sysfs attributes hwmon: Use device_show_string() helper for sysfs attributes driver core: Add device_show_string() helper for sysfs attributes treewide: Use sysfs_bin_attr_simple_read() helper sysfs: Add sysfs_bin_attr_simple_read() helper module: don't ignore sysfs_create_link() failures driver core: Remove unused platform_notify, platform_notify_remove
2024-05-22Merge tag 'ovl-update-6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs updates from Miklos Szeredi: - Add tmpfile support - Clean up include * tag 'ovl-update-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: remove duplicate included header ovl: remove upper umask handling from ovl_create_upper() ovl: implement tmpfile
2024-05-22Merge tag 'fuse-update-6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Add fs-verity support (Richard Fung) - Add multi-queue support to virtio-fs (Peter-Jan Gootzen) - Fix a bug in NOTIFY_RESEND handling (Hou Tao) - page -> folio cleanup (Matthew Wilcox) * tag 'fuse-update-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtio-fs: add multi-queue support virtio-fs: limit number of request queues fuse: clear FR_SENT when re-adding requests into pending list fuse: set FR_PENDING atomically in fuse_resend() fuse: Add initial support for fs-verity fuse: Convert fuse_readpages_end() to use folio_end_read()
2024-05-22vfs: Delete the associated dentry when deleting a fileYafang Shao
Our applications, built on Elasticsearch[0], frequently create and delete files. These applications operate within containers, some with a memory limit exceeding 100GB. Over prolonged periods, the accumulation of negative dentries within these containers can amount to tens of gigabytes. Upon container exit, directories are deleted. However, due to the numerous associated dentries, this process can be time-consuming. Our users have expressed frustration with this prolonged exit duration, which constitutes our first issue. Simultaneously, other processes may attempt to access the parent directory of the Elasticsearch directories. Since the task responsible for deleting the dentries holds the inode lock, processes attempting directory lookup experience significant delays. This issue, our second problem, is easily demonstrated: - Task 1 generates negative dentries: $ pwd ~/test $ mkdir es && cd es/ && ./create_and_delete_files.sh [ After generating tens of GB dentries ] $ cd ~/test && rm -rf es [ It will take a long duration to finish ] - Task 2 attempts to lookup the 'test/' directory $ pwd ~/test $ ls The 'ls' command in Task 2 experiences prolonged execution as Task 1 is deleting the dentries. We've devised a solution to address both issues by deleting associated dentry when removing a file. Interestingly, we've noted that a similar patch was proposed years ago[1], although it was rejected citing the absence of tangible issues caused by negative dentries. Given our current challenges, we're resubmitting the proposal. All relevant stakeholders from previous discussions have been included for reference. Some alternative solutions are also under discussion[2][3], such as shrinking child dentries outside of the parent inode lock or even asynchronously shrinking child dentries. However, given the straightforward nature of the current solution, I believe this approach is still necessary. [ NOTE! This is a pretty fundamental change in how we deal with unlinking dentries, and it doesn't change the fact that you can have lots of negative dentries from just doing negative lookups. But the kernel test robot is at least initially happy with this from a performance angle, so I'm applying this ASAP just to get more testing and as a "known fix for an issue people hit in real life". Put another way: we should still look at the alternatives, and this patch may get reverted if somebody finds a performance regression on some other load. - Linus ] Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://github.com/elastic/elasticsearch [0] Link: https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com [1] Link: https://lore.kernel.org/linux-fsdevel/20240511200240.6354-2-torvalds@linux-foundation.org/ [2] Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/ [3] Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Wangkai <wangkai86@huawei.com> Cc: Colin Walters <walters@verbum.org> Tested-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/all/202405221518.ecea2810-oliver.sang@intel.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-22fuse: virtio: drop owner assignmentKrzysztof Kozlowski
virtio core already sets the .owner, so driver does not need to. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-24-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-05-22kernel: Remove signal hacks for vhost_tasksMike Christie
This removes the signal/coredump hacks added for vhost_tasks in: Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") When that patch was added vhost_tasks did not handle SIGKILL and would try to ignore/clear the signal and continue on until the device's close function was called. In the previous patches vhost_tasks and the vhost drivers were converted to support SIGKILL by cleaning themselves up and exiting. The hacks are no longer needed so this removes them. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20240316004707.45557-10-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-21Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs updates from Al Viro: "Assorted commits that had missed the last merge window..." * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: remove call_{read,write}_iter() functions do_dentry_open(): kill inode argument kernel_file_open(): get rid of inode argument get_file_rcu(): no need to check for NULL separately fd_is_open(): move to fs/file.c close_on_exec(): pass files_struct instead of fdtable
2024-05-21Merge tag 'pull-bd_inode-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number
2024-05-21Merge tag 'pull-set_blocksize' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()
2024-05-21fs/pidfs: make 'lsof' happy with our inode changesLinus Torvalds
pidfs started using much saner inodes in commit b28ddcc32d8f ("pidfs: convert to path_from_stashed() helper"), but that exposed the fact that lsof had some knowledge of just how odd our old anon_inode usage was. For example, legacy anon_inodes hadn't even initialized the inode type in the inode mode, so everything had a type of zero. So sane tools like 'stat' would report these files as "weird file", but 'lsof' instead used that (together with the name of the link in proc) to notice that it's an anonymous inode, and used it to detect pidfd files. Let's keep our internal new sane inode model, but mask the file type bits at 'stat()' time in the getattr() function we already have, and by making the dentry name match what lsof expects too. This keeps our internal models sane, but should make user space see the same old odd behavior. Reported-by: Jiri Slaby <jirislaby@kernel.org> Link: https://lore.kernel.org/all/a15b1050-4b52-4740-a122-a4d055c17f11@kernel.org/ Link: https://github.com/lsof-org/lsof/issues/317 Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Seth Forshee <sforshee@kernel.org> Cc: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-21nfs: fix undefined behavior in nfs_block_bits()Sergey Shtylyov
Shifting *signed int* typed constant 1 left by 31 bits causes undefined behavior. Specify the correct *unsigned long* type by using 1UL instead. Found by Linux Verification Center (linuxtesting.org) with the Svace static analysis tool. Cc: stable@vger.kernel.org Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-21pNFS: rework pnfs_generic_pg_check_layout to check IO rangeOlga Kornievskaia
All callers of pnfs_generic_pg_check_layout() also want to do a call to check that the layout's range covers the IO range. Merge the functionality of the pnfs_generic_pg_check_range() into that of pnfs_generic_pg_check_layout(). Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-21pNFS/filelayout: check layout segment rangeOlga Kornievskaia
Before doing the IO, check that we have the layout covering the range of IO. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-21pNFS/filelayout: fixup pNfs allocation modesOlga Kornievskaia
Change left over allocation flags. Fixes: a245832aaa99 ("pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20Merge tag 'f2fs-for-6.10.rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've tried to address some performance issues on zoned storage such as direct IO and write_hints. In addition, we've migrated some IO paths using folio. Meanwhile, there are multiple bug fixes in the compression paths, sanity check conditions, and error handlers. Enhancements: - allow direct io of pinned files for zoned storage - assign the write hint per stream by default - convert read paths and test_writeback to folio - avoid allocating WARM_DATA segment for direct IO Bug fixes: - fix false alarm on invalid block address - fix to add missing iput() in gc_data_segment() - fix to release node block count in error path of f2fs_new_node_page() - compress: - don't allow unaligned truncation on released compress inode - cover {reserve,release}_compress_blocks() w/ cp_rwsem lock - fix error path of inc_valid_block_count() - fix to update i_compr_blocks correctly - fix block migration when section is not aligned to pow2 - don't trigger OPU on pinfile for direct IO - fix to do sanity check on i_xattr_nid in sanity_check_inode() - write missing last sum blk of file pinning section - clear writeback when compression failed - fix to adjust appropirate defragment pg_end As usual, there are several minor code clean-ups, and fixes to manage missing corner cases in the error paths" * tag 'f2fs-for-6.10.rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (50 commits) f2fs: initialize last_block_in_bio variable f2fs: Add inline to f2fs_build_fault_attr() stub f2fs: fix some ambiguous comments f2fs: fix to add missing iput() in gc_data_segment() f2fs: allow dirty sections with zero valid block for checkpoint disabled f2fs: compress: don't allow unaligned truncation on released compress inode f2fs: fix to release node block count in error path of f2fs_new_node_page() f2fs: compress: fix to cover {reserve,release}_compress_blocks() w/ cp_rwsem lock f2fs: compress: fix error path of inc_valid_block_count() f2fs: compress: fix typo in f2fs_reserve_compress_blocks() f2fs: compress: fix to update i_compr_blocks correctly f2fs: check validation of fault attrs in f2fs_build_fault_attr() f2fs: fix to limit gc_pin_file_threshold f2fs: remove unused GC_FAILURE_PIN f2fs: use f2fs_{err,info}_ratelimited() for cleanup f2fs: fix block migration when section is not aligned to pow2 f2fs: zone: fix to don't trigger OPU on pinfile for direct IO f2fs: fix to do sanity check on i_xattr_nid in sanity_check_inode() f2fs: fix to avoid allocating WARM_DATA segment for direct IO f2fs: remove redundant parameter in is_next_segment_free() ...
2024-05-20Merge tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: "Online repair feature continues to be expanded. Also, we now support delayed allocation for realtime devices which have an extent size that is equal to filesystem's block size. New code: - Introduce Parent Pointer extended attribute for inodes - Bring back delalloc support for realtime devices which have an extent size that is equal to filesystem's block size - Improve performance of log incompat feature handling Online Repair: - Implement atomic file content exchanges i.e. exchange ranges of bytes between two files atomically - Create temporary files to repair file-based metadata. This uses atomic file content exchange facility to swap file fork mappings between the temporary file and the metadata inode - Allow callers of directory/xattr code to set an explicit owner number to be written into the header fields of any new blocks that are created. This is required to avoid walking every block of the new structure and modify their ownership during online repair - Repair more data structures: - Extended attributes - Inode unlinked state - Directories - Symbolic links - AGI's unlinked inode list - Parent pointers - Move Orphan files to lost and found directory - Fixes for Inode repair functionality - Introduce a new sub-AG FITRIM implementation to reduce the duration for which the AGF lock is held - Updates for the design documentation - Use Parent Pointers to assist in checking directories, parent pointers, extended attributes, and link counts Fixes: - Prevent userspace from reading invalid file data due to incorrect. updation of file size when performing a non-atomic clone operation - Minor fixes to online repair - Fix confusing return values from xfs_bmapi_write() - Fix an out of bounds access due to incorrect h_size during log recovery - Defer upgrading the extent counters in xfs_reflink_end_cow_extent() until we know we are going to modify the extent mapping - Remove racy access to if_bytes check in xfs_reflink_end_cow_extent() - Fix sparse warnings Cleanups: - Hold inode locks on all files involved in a rename until the completion of the operation. This is in preparation for the parent pointers patchset where parent pointers are applied in a separate chained update from the actual directory update - Compile out v4 support when disabled - Cleanup xfs_extent_busy_clear() - Remove unused flags and fields from struct xfs_da_args - Remove definitions of unused functions - Improve extended attribute validation - Add higher level directory operations helpers to remove duplication of code - Cleanup quota (un)reservation interfaces" * tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (221 commits) xfs: simplify iext overflow checking and upgrade xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extent xfs: upgrade the extent counters in xfs_reflink_end_cow_extent later xfs: xfs_quota_unreserve_blkres can't fail xfs: consolidate the xfs_quota_reserve_blkres definitions xfs: clean up buffer allocation in xlog_do_recovery_pass xfs: fix log recovery buffer allocation for the legacy h_size fixup xfs: widen flags argument to the xfs_iflags_* helpers xfs: minor cleanups of xfs_attr3_rmt_blocks xfs: create a helper to compute the blockcount of a max sized remote value xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c xfs: do not allocate the entire delalloc extent in xfs_bmapi_write xfs: fix xfs_bmap_add_extent_delay_real for partial conversions xfs: remove the xfs_iext_peek_prev_extent call in xfs_bmapi_allocate xfs: pass the actual offset and len to allocate to xfs_bmapi_allocate xfs: don't open code XFS_FILBLKS_MIN in xfs_bmapi_write xfs: lift a xfs_valid_startblock into xfs_bmapi_allocate xfs: remove the unusued tmp_logflags variable in xfs_bmapi_allocate xfs: fix error returns from xfs_bmapi_write ...
2024-05-20Merge tag 'fs_for_v6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull isofs, udf, quota, ext2, and reiserfs updates from Jan Kara: - convert isofs to the new mount API - cleanup isofs Makefile - udf conversion to folios - some other small udf cleanups and fixes - ext2 cleanups - removal of reiserfs .writepage method - update reiserfs README file * tag 'fs_for_v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: isofs: Use *-y instead of *-objs in Makefile ext2: Remove LEGACY_DIRECT_IO dependency isofs: Remove calls to set/clear the error flag ext2: Remove call to folio_set_error() udf: Use a folio in udf_write_end() udf: Convert udf_page_mkwrite() to use a folio udf: Convert udf_symlink_getattr() to use a folio udf: Convert udf_adinicb_readpage() to udf_adinicb_read_folio() udf: Convert udf_expand_file_adinicb() to use a folio udf: Convert udf_write_begin() to use a folio udf: Convert udf_symlink_filler() to use a folio reiserfs: Trim some README bits quota: fix to propagate error of mark_dquot_dirty() to caller reiserfs: Convert to writepages udf: udftime: prevent overflow in udf_disk_stamp_to_time() ext2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method udf: replace deprecated strncpy/strcpy with strscpy udf: Remove second semicolon isofs: convert isofs to use the new mount API fs: quota: use group allocation of per-cpu counters API
2024-05-20Revert "fanotify: remove unneeded sub-zero check for unsigned value"Linus Torvalds
This reverts commit e6595224464b692ddae193d783402130d1625147. These kinds of patches are only making the code worse. Compilers don't care about the unnecessary check, but removing it makes the code less obvious to a human. The declaration of 'len' is more than 80 lines earlier, so a human won't easily see that 'len' is of an unsigned type, so to a human the range check that checks against zero is much more explicit and obvious. Any tool that complains about a range check like this just because the variable is unsigned is actively detrimental, and should be ignored. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-20Merge tag 'fsnotify_for_v6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - reduce overhead of fsnotify infrastructure when no permission events are in use - a few small cleanups * tag 'fsnotify_for_v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix UAF from FS_ERROR event on a shutting down filesystem fsnotify: optimize the case of no permission event watchers fsnotify: use an enum for group priority constants fsnotify: move s_fsnotify_connectors into fsnotify_sb_info fsnotify: lazy attach fsnotify_sb_info state to sb fsnotify: create helper fsnotify_update_sb_watchers() fsnotify: pass object pointer and type to fsnotify mark helpers fanotify: merge two checks regarding add of ignore mark fsnotify: create a wrapper fsnotify_find_inode_mark() fsnotify: create helpers to get sb and connp from object fsnotify: rename fsnotify_{get,put}_sb_connectors() fsnotify: Avoid -Wflex-array-member-not-at-end warning fanotify: remove unneeded sub-zero check for unsigned value
2024-05-20NFS: Don't enable NFS v2 by defaultAnna Schumaker
This came up during one of the Bake-a-thon discussions. NFS v2 support was dropped from nfs-utils/mount.nfs in December 2021. Let's turn it off by default in the kernel too, since this means there isn't a way to mount and test it. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Jeffrey Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUSAnna Schumaker
Olga showed me a case where the client was sending multiple READ_PLUS calls to the server in parallel, and the server replied NFS4ERR_OPNOTSUPP to each. The client would fall back to READ for the first reply, but fail to retry the other calls. I fix this by removing the test for NFS_CAP_READ_PLUS in nfs4_read_plus_not_supported(). This allows us to reschedule any READ_PLUS call that has a NFS4ERR_OPNOTSUPP return value, even after the capability has been cleared. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: c567552612ec ("NFS: Add READ_PLUS data segment support") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20nfs: keep server info for remountsMartin Kaiser
With newer kernels that use fs_context for nfs mounts, remounts fail with -EINVAL. $ mount -t nfs -o nolock 10.0.0.1:/tmp/test /mnt/test/ $ mount -t nfs -o remount /mnt/test/ mount: mounting 10.0.0.1:/tmp/test on /mnt/test failed: Invalid argument For remounts, the nfs server address and port are populated by nfs_init_fs_context and later overwritten with 0x00 bytes by nfs23_parse_monolithic. The remount then fails as the server address is invalid. Fix this by not overwriting nfs server info in nfs23_parse_monolithic if we're doing a remount. Fixes: f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Martin Kaiser <martin@kaiser.cx> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFSv4: Fixup smatch warning for ambiguous returnBenjamin Coddington
Dan Carpenter reports smatch warning for nfs4_try_migration() when a memory allocation failure results in a zero return value. In this case, a transient allocation failure error will likely be retried the next time the server responds with NFS4ERR_MOVED. We can fixup the smatch warning with a small refactor: attempt all three allocations before testing and returning on a failure. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: c3ed222745d9 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFS: make sure lock/nolock overriding local_lock mount optionChen Hanxiao
Currently, mount option lock/nolock and local_lock option may override NFS_MOUNT_LOCAL_FLOCK NFS_MOUNT_LOCAL_FCNTL flags when passing in different order: mount -o vers=3,local_lock=all,lock: local_lock=none mount -o vers=3,lock,local_lock=all: local_lock=all This patch will let lock/nolock override local_lock option as nfs(5) suggested. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.NeilBrown
With two clients, each with NFSv3 mounts of the same directory, the sequence: client1 client2 ls -l afile echo hello there > afile echo HELLO > afile cat afile will show HELLO there because the O_TRUNC requested in the final 'echo' doesn't take effect. This is because the "Negative dentry, just create a file" section in lookup_open() assumes that the file *does* get created since the dentry was negative, so it sets FMODE_CREATED, and this causes do_open() to clear O_TRUNC and so the file doesn't get truncated. Even mounting with -o lookupcache=none does not help as nfs_neg_need_reval() always returns false if LOOKUP_CREATE is set. This patch fixes the problem by providing an atomic_open inode operation for NFSv3 (and v2). The code is largely the code from the branch in lookup_open() when atomic_open is not provided. The significant change is that the O_TRUNC flag is passed a new nfs_do_create() which add 'trunc' handling to nfs_create(). With this change we also optimise away an unnecessary LOOKUP before the file is created. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20pNFS/filelayout: Specify the layout segment range in LAYOUTGETAnna Schumaker
Move from only requesting full file layout segments to requesting layout segments that match our I/O size. This means the server is still free to return a full file layout if it wants, but partial layouts will no longer cause an error. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20pNFS/filelayout: Remove the whole file layout requirementAnna Schumaker
Layout segments have been supported in pNFS for years, so remove the requirement that the server always sends whole file layouts. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-19nilfs2: make block erasure safe in nilfs_finish_roll_forward()Ryusuke Konishi
The implementation of writing a zero-fill block in nilfs_finish_roll_forward() is not safe. The buffer is being cleared without acquiring a lock or setting the uptodate flag, so theoretically, between the time the buffer's data is cleared and the time it is written back to the block device using sync_dirty_buffer(), that zero data can be undone by concurrent block device reads. Since this buffer points to a location that has been read from disk once, the uptodate flag will most likely remain, but since it was obtained with __getblk(), that is not guaranteed. In other words, this is exceptional, and this function itself is not normally called (only once when mounting after a specific pattern of unclean shutdown), so it is highly unlikely that this will actually cause a problem. Anyway, eliminate this potential race issue by protecting the clearing of buffer data with a buffer lock and setting the buffer's uptodate flag within the protected section. Link: https://lkml.kernel.org/r/20240511002942.9608-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-19Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...
2024-05-19Merge tag 'bcachefs-2024-05-19' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull bcachefs updates from Kent Overstreet: - More safety fixes, primarily found by syzbot - Run the upgrade/downgrade paths in nochnages mode. Nochanges mode is primarily for testing fsck/recovery in dry run mode, so it shouldn't change anything besides disabling writes and holding dirty metadata in memory. The idea here was to reduce the amount of activity if we can't write anything out, so that bringing up a filesystem in "super ro" mode would be more lilkely to work for data recovery - but norecovery is the correct option for this. - btree_trans->locked; we now track whether a btree_trans has any btree nodes locked, and this is used for improved assertions related to trans_unlock() and trans_relock(). We'll also be using it for improving how we work with lockdep in the future: we don't want lockdep to be tracking individual btree node locks because we take too many for lockdep to track, and it's not necessary since we have a cycle detector. - Trigger improvements that are prep work for online fsck - BTREE_TRIGGER_check_repair; this regularizes how we do some repair work for extents that goes with running triggers in fsck, and fixes some subtle issues with transaction restarts there. - bch2_snapshot_equiv() has now been ripped out of fsck.c; snapshot equivalence classes are for when snapshot deletion leaves behind redundant snapshot nodes, but snapshot deletion now cleans this up right away, so the abstraction doesn't need to leak. - Improvements to how we resume writing to the journal in recovery. The code for picking the new place to write when reading the journal is greatly simplified and we also store the position in the superblock for when we don't read the journal; this means that we preserve more of the journal for list_journal debugging. - Improvements to sysfs btree_cache and btree_node_cache, for debugging memory reclaim. - We now detect when we've blocked for 10 seconds on the allocator in the write path and dump some useful info. - Safety fixes for devices references: this is a big series that changes almost all device lookups to properly check if the device exists and take a reference to it. Previously we assumed that if a bkey exists that references a device then the device must exist, and this was enforced in .invalid methods, but this was incorrect because it meant device removal relied on accounting being correct to not leave keys pointing to invalid devices, and that's not something we can assume. Getting the "pointer to invalid device" checks out of our .invalid() methods fixes some long standing device removal bugs; the only outstanding bug with device removal now is a race between the discard path and deleting alloc info, which should be easily fixed. - The allocator now prefers not to expand the new member_info.btree_allocated bitmap, meaning if repair ever requires scanning for btree nodes (because of a corrupt interior nodes) we won't have to scan the whole device(s). - New coding style document, which among other things talks about the correct usage of assertions * tag 'bcachefs-2024-05-19' of https://evilpiepirate.org/git/bcachefs: (155 commits) bcachefs: add no_invalid_checks flag bcachefs: add counters for failed shrinker reclaim bcachefs: Fix sb_field_downgrade validation bcachefs: Plumb bch_validate_flags to sb_field_ops.validate() bcachefs: s/bkey_invalid_flags/bch_validate_flags bcachefs: fsync() should not return -EROFS bcachefs: Invalid devices are now checked for by fsck, not .invalid methods bcachefs: kill bch2_dev_bkey_exists() in bch2_check_fix_ptrs() bcachefs: kill bch2_dev_bkey_exists() in bch2_read_endio() bcachefs: bch2_dev_get_ioref() checks for device not present bcachefs: bch2_dev_get_ioref2(); io_read.c bcachefs: bch2_dev_get_ioref2(); debug.c bcachefs: bch2_dev_get_ioref2(); journal_io.c bcachefs: bch2_dev_get_ioref2(); io_write.c bcachefs: bch2_dev_get_ioref2(); btree_io.c bcachefs: bch2_dev_get_ioref2(); backpointers.c bcachefs: bch2_dev_get_ioref2(); alloc_background.c bcachefs: for_each_bset() declares loop iter bcachefs: Move BCACHEFS_STATFS_MAGIC value to UAPI magic.h bcachefs: Improve sysfs internal/btree_cache ...
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-18Merge tag '6.10-rc-smb-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fix from Steve French: "An important fix to address recent netfs regression (data corruption)" * tag '6.10-rc-smb-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix data corruption in read after invalidate
2024-05-18Merge tag 'ext4_for_linus-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - more folio conversion patches - add support for FS_IOC_GETFSSYSFSPATH - mballoc cleaups and add more kunit tests - sysfs cleanups and bug fixes - miscellaneous bug fixes and cleanups * tag 'ext4_for_linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: fix error pointer dereference in ext4_mb_load_buddy_gfp() jbd2: add prefix 'jbd2' for 'shrink_type' jbd2: use shrink_type type instead of bool type for __jbd2_journal_clean_checkpoint_list() ext4: fix uninitialized ratelimit_state->lock access in __ext4_fill_super() ext4: remove calls to to set/clear the folio error flag ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find() ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find() jbd2: remove redundant assignement to variable err ext4: remove the redundant folio_wait_stable() ext4: fix potential unnitialized variable ext4: convert ac_buddy_page to ac_buddy_folio ext4: convert ac_bitmap_page to ac_bitmap_folio ext4: convert ext4_mb_init_cache() to take a folio ext4: convert bd_buddy_page to bd_buddy_folio ext4: convert bd_bitmap_page to bd_bitmap_folio ext4: open coding repeated check in next_linear_group ext4: use correct criteria name instead stale integer number in comment ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunk ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_used ext4: keep "prefetch_grp" and "nr" consistent ...
2024-05-18Merge tag 'nfsd-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "This is a light release containing mostly optimizations, code clean- ups, and minor bug fixes. This development cycle has focused on non- upstream kernel work: 1. Continuing to build upstream CI for NFSD, based on kdevops 2. Backporting NFSD filecache-related fixes to selected LTS kernels One notable new feature in v6.10 NFSD is the addition of a new netlink protocol dedicated to configuring NFSD. A new user space tool, nfsdctl, is to be added to nfs-utils. Lots more to come here. As always I am very grateful to NFSD contributors, reviewers, testers, and bug reporters who participated during this cycle" * tag 'nfsd-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (29 commits) NFSD: Force all NFSv4.2 COPY requests to be synchronous SUNRPC: Fix gss_free_in_token_pages() NFS/knfsd: Remove the invalid NFS error 'NFSERR_OPNOTSUPP' knfsd: LOOKUP can return an illegal error value nfsd: set security label during create operations NFSD: Add COPY status code to OFFLOAD_STATUS response NFSD: Record status of async copy operation in struct nfsd4_copy SUNRPC: Remove comment for sp_lock NFSD: add listener-{set,get} netlink command SUNRPC: add a new svc_find_listener helper SUNRPC: introduce svc_xprt_create_from_sa utility routine NFSD: add write_version to netlink command NFSD: convert write_threads to netlink command NFSD: allow callers to pass in scope string to nfsd_svc NFSD: move nfsd_mutex handling into nfsd_svc callers lockd: host: Remove unnecessary statements'host = NULL;' nfsd: don't create nfsv4recoverydir in nfsdfs when not used. nfsd: optimise recalculate_deny_mode() for a common case nfsd: add tracepoint in mark_client_expired_locked nfsd: new tracepoint for check_slot_seqid ...
2024-05-18Merge tag 'kbuild-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...
2024-05-18Merge tag 'landlock-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux Pull landlock updates from Mickaël Salaün: "This brings ioctl control to Landlock, contributed by Günther Noack. This also adds him as a Landlock reviewer, and fixes an issue in the sample" * tag 'landlock-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: MAINTAINERS: Add Günther Noack as Landlock reviewer fs/ioctl: Add a comment to keep the logic in sync with LSM policies MAINTAINERS: Notify Landlock maintainers about changes to fs/ioctl.c landlock: Document IOCTL support samples/landlock: Add support for LANDLOCK_ACCESS_FS_IOCTL_DEV selftests/landlock: Exhaustive test for the IOCTL allow-list selftests/landlock: Check IOCTL restrictions for named UNIX domain sockets selftests/landlock: Test IOCTLs on named pipes selftests/landlock: Test ioctl(2) and ftruncate(2) with open(O_PATH) selftests/landlock: Test IOCTL with memfds selftests/landlock: Test IOCTL support landlock: Add IOCTL access right for character and block devices samples/landlock: Fix incorrect free in populate_ruleset_net