summaryrefslogtreecommitdiff
path: root/fs/netfs
AgeCommit message (Collapse)Author
2024-09-12netfs: Use new folio_queue data type and iterator instead of xarray iterDavid Howells
Make the netfs write-side routines use the new folio_queue struct to hold a rolling buffer of folios, with the issuer adding folios at the tail and the collector removing them from the head as they're processed instead of using an xarray. This will allow a subsequent patch to simplify the write collector. The primary mark (as tested by folioq_is_marked()) is used to note if the corresponding folio needs putting. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-16-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-06Merge tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - fix potential mount hang - fix retry problem in two types of compound operations - important netfs integration fix in SMB1 read paths - fix potential uninitialized zero point of inode - minor patch to improve debugging for potential crediting problems * tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: netfs, cifs: Improve some debugging bits cifs: Fix SMB1 readv/writev callback in the same way as SMB2/3 cifs: Fix zero_point init on inode initialisation smb: client: fix double put of @cfile in smb2_set_path_size() smb: client: fix double put of @cfile in smb2_rename_path() smb: client: fix hang in wait_for_response() for negproto
2024-09-05netfs: Use bh-disabling spinlocks for rreq->lockDavid Howells
Use bh-disabling spinlocks when accessing rreq->lock because, in the future, it may be twiddled from softirq context when cleanup is driven from cache backend DIO completion. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-12-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05netfs: Set the request work function upon allocationDavid Howells
Set the work function in the netfs_io_request work_struct when we allocate the request rather than doing this later. This reduces the number of places we need to set it in future code. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-11-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05netfs: Remove NETFS_COPY_TO_CACHEDavid Howells
Remove NETFS_COPY_TO_CACHE as it isn't used anymore. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-10-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_streamDavid Howells
Move max_len/max_nr_segs from struct netfs_io_subrequest to struct netfs_io_stream as we only issue one subreq at a time and then don't need these values again for that subreq unless and until we have to retry it - in which case we want to renegotiate them. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-8-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05netfs, cifs: Move CIFS_INO_MODIFIED_ATTR to netfs_inodeDavid Howells
Move CIFS_INO_MODIFIED_ATTR to netfs_inode as NETFS_ICTX_MODIFIED_ATTR and then make netfs_perform_write() set it. This means that cifs doesn't need to implement the ->post_modify() hook. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-7-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05netfs: Reduce number of conditional branches in netfs_perform_write()David Howells
Reduce the number of conditional branches in netfs_perform_write() by merging in netfs_how_to_modify() and then creating a separate if-statement for each way we might modify a folio. Note that this means replicating the data copy in each path. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-6-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05netfs: Record contention stats for writeback lockDavid Howells
Record statistics for contention upon the writeback serialisation lock that prevents racing writeback calls from causing each other to interleave their writebacks. These can be viewed in /proc/fs/netfs/stats on the WbLock line, with skip=N indicating the number of non-SYNC writebacks skipped and wait=N indicating the number of SYNC writebacks that waited. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-5-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05netfs: Adjust labels in /proc/fs/netfs/statsDavid Howells
Adjust the labels in /proc/fs/netfs/stats that refer to netfs-specific counters. These currently all begin with "Netfs", but change them to begin with more specific labels. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240814203850.2240469-4-dhowells@redhat.com/ # v2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-04Merge tag 'vfs-6.11-rc7.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "Two netfs fixes for this merge window: - Ensure that fscache_cookie_lru_time is deleted when the fscache module is removed to prevent UAF - Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range() Before it used truncate_inode_pages_partial() which causes copy_file_range() to fail on cifs" * tag 'vfs-6.11-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fscache: delete fscache_cookie_lru_timer when fscache exits to avoid UAF mm: Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()
2024-09-03netfs, cifs: Improve some debugging bitsDavid Howells
Improve some debugging bits: (1) The netfslib _debug() macro doesn't need a newline in its format string. (2) Display the request debug ID and subrequest index in messages emitted in smb2_adjust_credits() to make it easier to reference in traces. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-01fscache: delete fscache_cookie_lru_timer when fscache exits to avoid UAFBaokun Li
The fscache_cookie_lru_timer is initialized when the fscache module is inserted, but is not deleted when the fscache module is removed. If timer_reduce() is called before removing the fscache module, the fscache_cookie_lru_timer will be added to the timer list of the current cpu. Afterwards, a use-after-free will be triggered in the softIRQ after removing the fscache module, as follows: ================================================================== BUG: unable to handle page fault for address: fffffbfff803c9e9 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 21ffea067 P4D 21ffea067 PUD 21ffe6067 PMD 110a7c067 PTE 0 Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G W 6.11.0-rc3 #855 Tainted: [W]=WARN RIP: 0010:__run_timer_base.part.0+0x254/0x8a0 Call Trace: <IRQ> tmigr_handle_remote_up+0x627/0x810 __walk_groups.isra.0+0x47/0x140 tmigr_handle_remote+0x1fa/0x2f0 handle_softirqs+0x180/0x590 irq_exit_rcu+0x84/0xb0 sysvec_apic_timer_interrupt+0x6e/0x90 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:default_idle+0xf/0x20 default_idle_call+0x38/0x60 do_idle+0x2b5/0x300 cpu_startup_entry+0x54/0x60 start_secondary+0x20d/0x280 common_startup_64+0x13e/0x148 </TASK> Modules linked in: [last unloaded: netfs] ================================================================== Therefore delete fscache_cookie_lru_timer when removing the fscahe module. Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240826112056.2458299-1-libaokun@huaweicloud.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30netfs: Delete subtree of 'fs/netfs' when netfs module exitsBaokun Li
In netfs_init() or fscache_proc_init(), we create dentry under 'fs/netfs', but in netfs_exit(), we only delete the proc entry of 'fs/netfs' without deleting its subtree. This triggers the following WARNING: ================================================================== remove_proc_entry: removing non-empty directory 'fs/netfs', leaking at least 'requests' WARNING: CPU: 4 PID: 566 at fs/proc/generic.c:717 remove_proc_entry+0x160/0x1c0 Modules linked in: netfs(-) CPU: 4 UID: 0 PID: 566 Comm: rmmod Not tainted 6.11.0-rc3 #860 RIP: 0010:remove_proc_entry+0x160/0x1c0 Call Trace: <TASK> netfs_exit+0x12/0x620 [netfs] __do_sys_delete_module.isra.0+0x14c/0x2e0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== Therefore use remove_proc_subtree() instead of remove_proc_entry() to fix the above problem. Fixes: 7eb5b3e3a0a5 ("netfs, fscache: Move /proc/fs/fscache to /proc/fs/netfs and put in a symlink") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240826113404.3214786-1-libaokun@huaweicloud.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30inode: remove __I_DIO_WAKEUPChristian Brauner
Afaict, we can just rely on inode->i_dio_count for waiting instead of this awkward indirection through __I_DIO_WAKEUP. This survives LTP dio and xfstests dio tests. Link: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28netfs, cifs: Fix handling of short DIO readDavid Howells
Short DIO reads, particularly in relation to cifs, are not being handled correctly by cifs and netfslib. This can be tested by doing a DIO read of a file where the size of read is larger than the size of the file. When it crosses the EOF, it gets a short read and this gets retried, and in the case of cifs, the retry read fails, with the failure being translated to ENODATA. Fix this by the following means: (1) Add a flag, NETFS_SREQ_HIT_EOF, for the filesystem to set when it detects that the read did hit the EOF. (2) Make the netfslib read assessment stop processing subrequests when it encounters one with that flag set. (3) Return rreq->transferred, the accumulated contiguous amount read to that point, to userspace for a DIO read. (4) Make cifs set the flag and clear the error if the read RPC returned ENODATA. (5) Make cifs set the flag and clear the error if a short read occurred without error and the read-to file position is now at the remote inode size. Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-28cifs: Fix lack of credit renegotiation on read retryDavid Howells
When netfslib asks cifs to issue a read operation, it prefaces this with a call to ->clamp_length() which cifs uses to negotiate credits, providing receive capacity on the server; however, in the event that a read op needs reissuing, netfslib doesn't call ->clamp_length() again as that could shorten the subrequest, leaving a gap. This causes the retried read to be done with zero credits which causes the server to reject it with STATUS_INVALID_PARAMETER. This is a problem for a DIO read that is requested that would go over the EOF. The short read will be retried, causing EINVAL to be returned to the user when it fails. Fix this by making cifs_req_issue_read() negotiate new credits if retrying (NETFS_SREQ_RETRYING now gets set in the read side as well as the write side in this instance). This isn't sufficient, however: the new credits might not be sufficient to complete the remainder of the read, so also add an additional field, rreq->actual_len, that holds the actual size of the op we want to perform without having to alter subreq->len. We then rely on repeated short reads being retried until we finish the read or reach the end of file and make a zero-length read. Also fix a couple of places where the subrequest start and length need to be altered by the amount so far transferred when being used. Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-24netfs: Fix interaction of streaming writes with zero-point trackerDavid Howells
When a folio that is marked for streaming write (dirty, but not uptodate, with partial content specified in the private data) is written back, the folio is effectively switched to the blank state upon completion of the write. This means that if we want to read it in future, we need to reread the whole folio. However, if the folio is above the zero_point position, when it is read back, it will just be cleared and the read skipped, leading to apparent local corruption. Fix this by increasing the zero_point to the end of the dirty data in the folio when clearing the folio state after writeback. This is analogous to the folio having ->release_folio() called upon it. This was causing the config.log generated by configuring a cpython tree on a cifs share to get corrupted because the scripts involved were appending text to the file in small pieces. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/563286.1724500613@warthog.procyon.org.uk cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-24netfs: Fix missing iterator reset on retry of short readDavid Howells
Fix netfs_rreq_perform_resubmissions() to reset before retrying a short read, otherwise the wrong part of the output buffer will be used. Fixes: 92b6cc5d1e7c ("netfs: Add iov_iters to (sub)requests to describe various buffers") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240823200819.532106-6-dhowells@redhat.com cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-24netfs: Fix trimming of streaming-write folios in netfs_inval_folio()David Howells
When netfslib writes to a folio that it doesn't have data for, but that data exists on the server, it will make a 'streaming write' whereby it stores data in a folio that is marked dirty, but not uptodate. When it does this, it attaches a record to folio->private to track the dirty region. When truncate() or fallocate() wants to invalidate part of such a folio, it will call into ->invalidate_folio(), specifying the part of the folio that is to be invalidated. netfs_invalidate_folio(), on behalf of the filesystem, must then determine how to trim the streaming write record. In a couple of cases, however, it does this incorrectly (the reduce-length and move-start cases are switched over and don't, in any case, calculate the value correctly). Fix this by making the logic tree more obvious and fixing the cases. Fixes: 9ebff83e6481 ("netfs: Prep to use folio->private for write grouping and streaming write") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240823200819.532106-5-dhowells@redhat.com cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Pankaj Raghav <p.raghav@samsung.com> cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-24netfs: Fix netfs_release_folio() to say no if folio dirtyDavid Howells
Fix netfs_release_folio() to say no (ie. return false) if the folio is dirty (analogous with iomap's behaviour). Without this, it will say yes to the release of a dirty page by split_huge_page_to_list_to_order(), which will result in the loss of untruncated data in the folio. Without this, the generic/075 and generic/112 xfstests (both fsx-based tests) fail with minimum folio size patches applied[1]. Fixes: c1ec4d7c2e13 ("netfs: Provide invalidate_folio and release_folio calls") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240815090849.972355-1-kernel@pankajraghav.com/ [1] Link: https://lore.kernel.org/r/20240823200819.532106-4-dhowells@redhat.com cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Pankaj Raghav <p.raghav@samsung.com> cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-21netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting ↵David Howells
folio->private and marking dirty" This partially reverts commit 2ff1e97587f4d398686f52c07afde3faf3da4e5c. In addition to reverting the removal of PG_private_2 wrangling from the buffered read code[1][2], the removal of the waits for PG_private_2 from netfs_release_folio() and netfs_invalidate_folio() need reverting too. It also adds a wait into ceph_evict_inode() to wait for netfs read and copy-to-cache ops to complete. Fixes: 2ff1e97587f4 ("netfs: Replace PG_fscache by setting folio->private and marking dirty") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk [1] Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e5ced7804cb9184c4a23f8054551240562a8eda [2] Link: https://lore.kernel.org/r/20240814203850.2240469-2-dhowells@redhat.com cc: Max Kellermann <max.kellermann@ionos.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-139p: Fix DIO read through netfsDominique Martinet
If a program is watching a file on a 9p mount, it won't see any change in size if the file being exported by the server is changed directly in the source filesystem, presumably because 9p doesn't have change notifications, and because netfs skips the reads if the file is empty. Fix this by attempting to read the full size specified when a DIO read is requested (such as when 9p is operating in unbuffered mode) and dealing with a short read if the EOF was less than the expected read. To make this work, filesystems using netfslib must not set NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF. I don't want to mandatorily clear this flag in netfslib for DIO because, say, ceph might make a read from an object that is not completely filled, but does not reside at the end of file - and so we need to clear the excess. This can be tested by watching an empty file over 9p within a VM (such as in the ktest framework): while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo then writing something into the empty file. The watcher should immediately display the file content and break out of the loop. Without this fix, it remains in the loop indefinitely. Fixes: 80105ed2fd27 ("9p: Use netfslib read/write_iter") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916 Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flagsDavid Howells
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used correctly. The problem is that we try to set them up in the request initialisation, but we the cache may be in the process of setting up still, and so the state may not be correct. Further, we secondarily sample the cache state and make contradictory decisions later. The issue arises because we set up the cache resources, which allows the cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which triggers cache writing even if we didn't set the flags when allocating. Fix this in the following way: (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in ->init_request() rather than trying to juggle that in netfs_alloc_request(). (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is to be done, then PG_private_2 is to be used rather than only setting it if we decide to cache and then having netfs_rreq_unlock_folios() set the non-PG_private_2 writeback-to-cache if it wasn't set. (3) Split netfs_rreq_unlock_folios() into two functions, one of which contains the deprecated code for using PG_private_2 to avoid accidentally doing the writeback path - and always use it if USE_PGPRIV2 is set. (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always wait for PG_private_2. This function is deprecated and only used by ceph anyway, and so label it so. (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use fscache_operation_valid() on the cache_resources instead. This has the advantage of picking up the result of netfs_begin_cache_read() and fscache_begin_write_operation() - which are called after the object is initialised and will wait for the cache to come to a usable state. Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be applied on top of that. Without this as well, things like: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { and: WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386 may happen, along with some UAFs due to PG_private_2 not getting used to wait on writeback completion. Fixes: 2ff1e97587f4 ("netfs: Replace PG_fscache by setting folio->private and marking dirty") Reported-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Hristo Venev <hristo@venev.name> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a ↵David Howells
second writeback flag" This reverts commit ae678317b95e760607c7b20b97c9cd4ca9ed6e1a. Revert the patch that removes the deprecated use of PG_private_2 in netfslib for the moment as Ceph is actually still using this to track data copied to the cache. Fixes: ae678317b95e ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag") Reported-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org https: //lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12netfs: clean up after renaming FSCACHE_DEBUG configLukas Bulwahn
Commit 6b8e61472529 ("netfs: Rename CONFIG_FSCACHE_DEBUG to CONFIG_NETFS_DEBUG") renames the config, but introduces two issues: First, NETFS_DEBUG mistakenly depends on the non-existing config NETFS, whereas the actual intended config is called NETFS_SUPPORT. Second, the config renaming misses to adjust the documentation of the functionality of this config. Clean up those two points. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Link: https://lore.kernel.org/r/20240731073902.69262-1-lukas.bulwahn@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12fs/netfs/fscache_cookie: add missing "n_accesses" checkMax Kellermann
This fixes a NULL pointer dereference bug due to a data race which looks like this: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 33 PID: 16573 Comm: kworker/u97:799 Not tainted 6.8.7-cm4all1-hp+ #43 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018 Workqueue: events_unbound netfs_rreq_write_to_cache_work RIP: 0010:cachefiles_prepare_write+0x30/0xa0 Code: 57 41 56 45 89 ce 41 55 49 89 cd 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 89 34 24 48 8b 68 20 <48> 8b 45 08 4c 8b 38 74 45 49 8b 7f 50 e8 4e a9 b0 ff 48 8b 73 10 RSP: 0018:ffffb4e78113bde0 EFLAGS: 00010286 RAX: ffff976126be6d10 RBX: ffff97615cdb8438 RCX: 0000000000020000 RDX: ffff97605e6c4c68 RSI: ffff97605e6c4c60 RDI: ffff97615cdb8438 RBP: 0000000000000000 R08: 0000000000278333 R09: 0000000000000001 R10: ffff97605e6c4600 R11: 0000000000000001 R12: ffff97605e6c4c68 R13: 0000000000020000 R14: 0000000000000001 R15: ffff976064fe2c00 FS: 0000000000000000(0000) GS:ffff9776dfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000005942c002 CR4: 00000000001706f0 Call Trace: <TASK> ? __die+0x1f/0x70 ? page_fault_oops+0x15d/0x440 ? search_module_extables+0xe/0x40 ? fixup_exception+0x22/0x2f0 ? exc_page_fault+0x5f/0x100 ? asm_exc_page_fault+0x22/0x30 ? cachefiles_prepare_write+0x30/0xa0 netfs_rreq_write_to_cache_work+0x135/0x2e0 process_one_work+0x137/0x2c0 worker_thread+0x2e9/0x400 ? __pfx_worker_thread+0x10/0x10 kthread+0xcc/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Modules linked in: CR2: 0000000000000008 ---[ end trace 0000000000000000 ]--- This happened because fscache_cookie_state_machine() was slow and was still running while another process invoked fscache_unuse_cookie(); this led to a fscache_cookie_lru_do_one() call, setting the FSCACHE_COOKIE_DO_LRU_DISCARD flag, which was picked up by fscache_cookie_state_machine(), withdrawing the cookie via cachefiles_withdraw_cookie(), clearing cookie->cache_priv. At the same time, yet another process invoked cachefiles_prepare_write(), which found a NULL pointer in this code line: struct cachefiles_object *object = cachefiles_cres_object(cres); The next line crashes, obviously: struct cachefiles_cache *cache = object->volume->cache; During cachefiles_prepare_write(), the "n_accesses" counter is non-zero (via fscache_begin_operation()). The cookie must not be withdrawn until it drops to zero. The counter is checked by fscache_cookie_state_machine() before switching to FSCACHE_COOKIE_STATE_RELINQUISHING and FSCACHE_COOKIE_STATE_WITHDRAWING (in "case FSCACHE_COOKIE_STATE_FAILED"), but not for FSCACHE_COOKIE_STATE_LRU_DISCARDING ("case FSCACHE_COOKIE_STATE_ACTIVE"). This patch adds the missing check. With a non-zero access counter, the function returns and the next fscache_end_cookie_access() call will queue another fscache_cookie_state_machine() call to handle the still-pending FSCACHE_COOKIE_DO_LRU_DISCARD. Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240729162002.3436763-2-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12netfs: Fault in smaller chunks for non-large folio mappingsMatthew Wilcox (Oracle)
As in commit 4e527d5841e2 ("iomap: fault in smaller chunks for non-large folio mappings"), we can see a performance loss for filesystems which have not yet been converted to large folios. Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240527201735.1898381-1-willy@infradead.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-24netfs: Fix writeback that needs to go to both server and cacheDavid Howells
When netfslib is performing writeback (ie. ->writepages), it maintains two parallel streams of writes, one to the server and one to the cache, but it doesn't mark either stream of writes as active until it gets some data that needs to be written to that stream. This is done because some folios will only be written to the cache (e.g. copying to the cache on read is done by marking the folios and letting writeback do the actual work) and sometimes we'll only be writing to the server (e.g. if there's no cache). Now, since we don't actually dispatch uploads and cache writes in parallel, but rather flip between the streams, depending on which has the lowest so-far-issued offset, and don't wait for the subreqs to finish before flipping, we can end up in a situation where, say, we issue a write to the server and this completes before we start the write to the cache. But because we only activate a stream when we first add a subreq to it, the result collection code may run before we manage to activate the stream - resulting in the folio being cleaned and having the writeback-in-progress mark removed. At this point, the folio no longer belongs to us. This is only really a problem for folios that need to be written to both streams - and in that case, the upload to the server is started first, followed by the write to the cache - and the cache write may see a bad folio. Fix this by activating the cache stream up front if there's a cache available. If there's a cache, then all data is going to be written to it. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1599053.1721398818@warthog.procyon.org.uk cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-24netfs: Rename CONFIG_FSCACHE_DEBUG to CONFIG_NETFS_DEBUGDavid Howells
CONFIG_FSCACHE_DEBUG should have been renamed to CONFIG_NETFS_DEBUG, so do that now. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1410796.1721333406@warthog.procyon.org.uk cc: Uwe Kleine-König <ukleinek@kernel.org> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-24netfs: Revert "netfs: Switch debug logging to pr_debug()"David Howells
Revert commit 163eae0fb0d4c610c59a8de38040f8e12f89fd43 to get back the original operation of the debugging macros. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240608151352.22860-2-ukleinek@kernel.org Link: https://lore.kernel.org/r/1410685.1721333252@warthog.procyon.org.uk cc: Uwe Kleine-König <ukleinek@kernel.org> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-21Merge tag 'mm-stable-2024-07-21-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targetted at -stable kernels. - Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad. - More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()" - Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback" - Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index". - In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainablity thing. - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch". - The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code". - Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection". - Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull. - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying. - Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions". - In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out. - In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code. - In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in vis sysfs controls. Dramatic improvements in pagefault latency are realized. - David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h". - David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually". - Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"". - Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them". - Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to aother CPU but I guess that's a win unless all CPUs are pegged. - hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes". - Miaohe Lin's series "Some cleanups for memory-failure" does just that thing. - Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM. - DAMON user API centralization and simplificatio work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function". - In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields. - Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()". - More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks. - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin. - Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying. - The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park. - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that. - David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". - Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks". - SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code. - Gavin Shan has found a userspace-triggerable WARN in the pagecache code. The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable. - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM. - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1" - Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file. - The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors. In order to permit userspace to monitor and handle this situation. - Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicing. - SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things. - In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization. - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments. So these paes can first be moved aside if they reside in the movable zone or a CMA block. - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps". - In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting. - Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes. - In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection. * tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits) mm/mglru: fix ineffective protection calculation mm/zswap: fix a white space issue mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio mm/hugetlb: fix possible recursive locking detected warning mm/gup: clear the LRU flag of a page before adding to LRU batch mm/numa_balancing: teach mpol_to_str about the balancing mode mm: memcg1: convert charge move flags to unsigned long long alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting lib: reuse page_ext_data() to obtain codetag_ref lib: add missing newline character in the warning message mm/mglru: fix overshooting shrinker memory mm/mglru: fix div-by-zero in vmpressure_calc_level() mm/kmemleak: replace strncpy() with strscpy() mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB mm: ignore data-race in __swap_writepage hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr mm: shmem: rename mTHP shmem counters mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async() mm/migrate: putback split folios when numa hint migration fails ...
2024-07-11Merge tag 'vfs-6.10-rc8.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "cachefiles: - Export an existing and add a new cachefile helper to be used in filesystems to fix reference count bugs - Use the newly added fscache_ty_get_volume() helper to get a reference count on an fscache_volume to handle volumes that are about to be removed cleanly - After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN wait for all ongoing cookie lookups to complete and for the object count to reach zero - Propagate errors from vfs_getxattr() to avoid an infinite loop in cachefiles_check_volume_xattr() because it keeps seeing ESTALE - Don't send new requests when an object is dropped by raising CACHEFILES_ONDEMAND_OJBSTATE_DROPPING - Cancel all requests for an object that is about to be dropped - Wait for the ondemand_boject_worker to finish before dropping a cachefiles object to prevent use-after-free - Use cyclic allocation for message ids to better handle id recycling - Add missing lock protection when iterating through the xarray when polling netfs: - Use standard logging helpers for debug logging VFS: - Fix potential use-after-free in file locks during trace_posix_lock_inode(). The tracepoint could fire while another task raced it and freed the lock that was requested to be traced - Only increment the nr_dentry_negative counter for dentries that are present on the superblock LRU. Currently, DCACHE_LRU_LIST list is used to detect this case. However, the flag is also raised in combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru is used. So checking only DCACHE_LRU_LIST will lead to wrong nr_dentry_negative count. Fix the check to not count dentries that are on a shrink related list Misc: - hfsplus: fix an uninitialized value issue in copy_name - minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even though we switched it to kmap_local_page() a while ago" * tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: minixfs: Fix minixfs_rename with HIGHMEM hfsplus: fix uninit-value in copy_name vfs: don't mod negative dentry count when on shrinker list filelock: fix potential use-after-free in posix_lock_inode cachefiles: add missing lock protection when polling cachefiles: cyclic allocation of msg_id to avoid reuse cachefiles: wait for ondemand_object_worker to finish when dropping object cachefiles: cancel all requests for the object that is being dropped cachefiles: stop sending new request when dropping object cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie() cachefiles: fix slab-use-after-free in fscache_withdraw_volume() netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume() netfs: Switch debug logging to pr_debug()
2024-07-05Merge patch series "cachefiles: random bugfixes"Christian Brauner
libaokun@huaweicloud.com <libaokun@huaweicloud.com> says: This is the third version of this patch series, in which another patch set is subsumed into this one to avoid confusing the two patch sets. (https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914) We've been testing ondemand mode for cachefiles since January, and we're almost done. We hit a lot of issues during the testing period, and this patch series fixes some of the issues. The patches have passed internal testing without regression. The following is a brief overview of the patches, see the patches for more details. Patch 1-2: Add fscache_try_get_volume() helper function to avoid fscache_volume use-after-free on cache withdrawal. Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache() concurrency causing cachefiles_volume use-after-free. Patch 4: Propagate error codes returned by vfs_getxattr() to avoid endless loops. Patch 5-7: A read request waiting for reopen could be closed maliciously before the reopen worker is executing or waiting to be scheduled. So ondemand_object_worker() may be called after the info and object and even the cache have been freed and trigger use-after-free. So use cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the reopen worker or wait for it to finish. Since it makes no sense to wait for the daemon to complete the reopen request, to avoid this pointless operation blocking cancel_work_sync(), Patch 1 avoids request generation by the DROPPING state when the request has not been sent, and Patch 2 flushes the requests of the current object before cancel_work_sync(). Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading the daemon to cause hung. Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing use-after-free. This issue was triggered frequently in our tests, and we found that anolis 5.10 had fixed it. So to avoid failing the test, this patch is pushed upstream as well. Baokun Li (7): netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume() cachefiles: fix slab-use-after-free in fscache_withdraw_volume() cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie() cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop cachefiles: stop sending new request when dropping object cachefiles: cancel all requests for the object that is being dropped cachefiles: cyclic allocation of msg_id to avoid reuse Hou Tao (1): cachefiles: wait for ondemand_object_worker to finish when dropping object Jingbo Xu (1): cachefiles: add missing lock protection when polling fs/cachefiles/cache.c | 45 ++++++++++++++++++++++++++++- fs/cachefiles/daemon.c | 4 +-- fs/cachefiles/internal.h | 3 ++ fs/cachefiles/ondemand.c | 52 ++++++++++++++++++++++++++++++---- fs/cachefiles/volume.c | 1 - fs/cachefiles/xattr.c | 5 +++- fs/netfs/fscache_volume.c | 14 +++++++++ fs/netfs/internal.h | 2 -- include/linux/fscache-cache.h | 6 ++++ include/trace/events/fscache.h | 4 +++ 10 files changed, 123 insertions(+), 13 deletions(-) Link: https://lore.kernel.org/r/20240628062930.2467993-1-libaokun@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03netfs: drop usage of folio_file_posKairui Song
folio_file_pos is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio_pos instead. It can't be a swap cache page here. Swap mapping may only call into fs through swap_rw and that is not supported for netfs. So just drop it and use folio_pos instead. Link: https://lkml.kernel.org/r/20240521175854.96038-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()Baokun Li
Export fscache_put_volume() and add fscache_try_get_volume() helper function to allow cachefiles to get/put fscache_volume via linux/fscache-cache.h. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-2-libaokun@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26netfs: Fix netfs_page_mkwrite() to flush conflicting data, not waitDavid Howells
Fix netfs_page_mkwrite() to use filemap_fdatawrite_range(), not filemap_fdatawait_range() to flush conflicting data. Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/614300.1719228243@warthog.procyon.org.uk cc: Matthew Wilcox <willy@infradead.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26netfs: Fix netfs_page_mkwrite() to check folio->mapping is validDavid Howells
Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has taken the folio lock (as filemap_page_mkwrite() does). Without this, generic/247 occasionally oopses with something like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page RIP: 0010:trace_event_raw_event_netfs_folio+0x61/0xc0 ... Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x6e/0xa0 ? exc_page_fault+0xc2/0xe0 ? asm_exc_page_fault+0x22/0x30 ? trace_event_raw_event_netfs_folio+0x61/0xc0 trace_netfs_folio+0x39/0x40 netfs_page_mkwrite+0x14c/0x1d0 do_page_mkwrite+0x50/0x90 do_pte_missing+0x184/0x200 __handle_mm_fault+0x42d/0x500 handle_mm_fault+0x121/0x1f0 do_user_addr_fault+0x23e/0x3c0 exc_page_fault+0xc2/0xe0 asm_exc_page_fault+0x22/0x30 This is due to the invalidate_inode_pages2_range() issued at the end of the DIO write interfering with the mmap'd writes. Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/780211.1719318546@warthog.procyon.org.uk Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26netfs: Delete some xarray-wangling functions that aren't usedDavid Howells
Delete some xarray-based buffer wangling functions that are intended for use with bounce buffering, but aren't used because bounce-buffering got deferred to a later patch series. Now, however, the intention is to use something other than an xarray to do this. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-9-dhowells@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26netfs: Fix early issue of write op on partial write to folio tailDavid Howells
During the writeback procedure, at the end of netfs_write_folio(), pending write operations are flushed if the amount of write-streaming data stored in a page is less than the size of the folio because if we haven't modified a folio to the end, it cannot be contiguous with the following folio... except if the dirty region of the folio is right at the end of the folio space. Fix the test to take the offset into the folio into account as well, such that if the dirty region runs right up to the end of the folio, we leave the flushing for later. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> (DFS, global name space) cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-4-dhowells@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-26netfs: Fix io_uring based write-throughDavid Howells
[This was included in v2 of 9b038d004ce95551cb35381c49fe896c5bc11ffe, but v1 got pushed instead] Fix netfs_unbuffered_write_iter_locked() to set the total request length in the netfs_io_request struct rather than leaving it as zero. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <stfrench@microsoft.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: Christian Brauner <brauner@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-2-dhowells@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-12netfs: Switch debug logging to pr_debug()Uwe Kleine-König
Instead of inventing a custom way to conditionally enable debugging, just make use of pr_debug(), which also has dynamic debugging facilities and is more likely known to someone who hunts a problem in the netfs code. Also drop the module parameter netfs_debug which didn't have any effect without further source changes. (The variable netfs_debug was only used in #ifdef blocks for cpp vars that don't exist; Note that CONFIG_NETFS_DEBUG isn't settable via kconfig, a variable with that name never existed in the mainline and is probably just taken over (and renamed) from similar custom debug logging implementations.) Signed-off-by: Uwe Kleine-König <ukleinek@kernel.org> Link: https://lore.kernel.org/r/20240608151352.22860-2-ukleinek@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-27Merge tag 'vfs-6.10-rc2.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix io_uring based write-through after converting cifs to use the netfs library - Fix aio error handling when doing write-through via netfs library - Fix performance regression in iomap when used with non-large folio mappings - Fix signalfd error code - Remove obsolete comment in signalfd code - Fix async request indication in netfs_perform_write() by raising BDP_ASYNC when IOCB_NOWAIT is set - Yield swap device immediately to prevent spurious EBUSY errors - Don't cross a .backup mountpoint from backup volumes in afs to avoid infinite loops - Fix a race between umount and async request completion in 9p after 9p was converted to use the netfs library * tag 'vfs-6.10-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs, 9p: Fix race between umount and async request completion afs: Don't cross .backup mountpoint from backup volume swap: yield device immediately netfs: Fix setting of BDP_ASYNC from iocb flags signalfd: drop an obsolete comment signalfd: fix error return code iomap: fault in smaller chunks for non-large folio mappings filemap: add helper mapping_max_folio_size() netfs: Fix AIO error handling when doing write-through netfs: Fix io_uring based write-through
2024-05-27netfs, 9p: Fix race between umount and async request completionDavid Howells
There's a problem in 9p's interaction with netfslib whereby a crash occurs because the 9p_fid structs get forcibly destroyed during client teardown (without paying attention to their refcounts) before netfslib has finished with them. However, it's not a simple case of deferring the clunking that p9_fid_put() does as that requires the p9_client record to still be present. The problem is that netfslib has to unlock pages and clear the IN_PROGRESS flag before destroying the objects involved - including the fid - and, in any case, nothing checks to see if writeback completed barring looking at the page flags. Fix this by keeping a count of outstanding I/O requests (of any type) and waiting for it to quiesce during inode eviction. Reported-by: syzbot+df038d463cca332e8414@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/0000000000005be0aa061846f8d6@google.com/ Reported-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000b86c5e06130da9c6@google.com/ Reported-by: syzbot+1527696d41a634cc1819@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000041f960618206d7e@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/755891.1716560771@warthog.procyon.org.uk Tested-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com Reviewed-by: Dominique Martinet <asmadeus@codewreck.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Hillf Danton <hdanton@sina.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Reported-and-tested-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-24netfs: Fix setting of BDP_ASYNC from iocb flagsDavid Howells
Fix netfs_perform_write() to set BDP_ASYNC if IOCB_NOWAIT is set rather than if IOCB_SYNC is not set. It reflects asynchronicity in the sense of not waiting rather than synchronicity in the sense of not returning until the op is complete. Without this, generic/590 fails on cifs in strict caching mode with a complaint that one of the writes fails with EAGAIN. The test can be distilled down to: mount -t cifs /my/share /mnt -ostuff xfs_io -i -c 'falloc 0 8191M -c fsync -f /mnt/file xfs_io -i -c 'pwrite -b 1M -W 0 8191M' /mnt/file Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/316306.1716306586@warthog.procyon.org.uk Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Jeff Layton <jlayton@kernel.org> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-24netfs: Fix AIO error handling when doing write-throughDavid Howells
If an error occurs whilst we're doing an AIO write in write-through mode, we may end up calling ->ki_complete() *and* returning an error from ->write_iter(). This can result in either a UAF (the ->ki_complete() func pointer may get overwritten, for example) or a refcount underflow in io_submit() as ->ki_complete is called twice. Fix this by making netfs_end_writethrough() - and thus netfs_perform_write() - unconditionally return -EIOCBQUEUED if we're doing an AIO write and wait for completion if we're not. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/295052.1716298587@warthog.procyon.org.uk cc: Jeff Layton <jlayton@kernel.org> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-24netfs: Fix io_uring based write-throughDavid Howells
This can be triggered by mounting a cifs filesystem with a cache=strict mount option and then, using the fsx program from xfstests, doing: ltp/fsx -A -d -N 1000 -S 11463 -P /tmp /cifs-mount/foo \ --replay-ops=gen112-fsxops Where gen112-fsxops holds: fallocate 0x6be7 0x8fc5 0x377d3 copy_range 0x9c71 0x77e8 0x2edaf 0x377d3 write 0x2776d 0x8f65 0x377d3 The problem is that netfs_io_request::len is being used for two purposes and ends up getting set to the amount of data we transferred, not the amount of data the caller asked to be transferred (for various reasons, such as mmap'd writes, we might end up rounding out the data written to the server to include the entire folio at each end). Fix this by keeping the amount we were asked to write in ->len and using ->submitted to track what we issued ops for. Then, when we come to calling ->ki_complete(), ->len is the right size. This also required netfs_cleanup_dio_write() to change since we're no longer advancing wreq->len. Use wreq->transferred instead as we might have done a short read. With this, the generic/112 xfstest passes if cifs is forced to put all non-DIO opens into write-through mode. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/295086.1716298663@warthog.procyon.org.uk cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <stfrench@microsoft.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-21smb3: reenable swapfiles over SMB3 mountsSteve French
With the changes to folios/netfs it is now easier to reenable swapfile support over SMB3 which fixes various xfstests Reviewed-by: David Howells <dhowells@redhat.com> Suggested-by: David Howells <dhowells@redhat.com> Fixes: e1209d3a7a67 ("mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space") Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-13cifs: Fix locking in cifs_strict_readv()Steve French
Fix to take the i_rwsem (through the netfs locking wrappers) before taking cinode->lock_sem. Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib") Reported-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-01cifs: Cut over to using netfslibDavid Howells
Make the cifs filesystem use netfslib to handle reading and writing on behalf of cifs. The changes include: (1) Various read_iter/write_iter type functions are turned into wrappers around netfslib API functions or are pointed directly at those functions: cifs_file_direct{,_nobrl}_ops switch to use netfs_unbuffered_read_iter and netfs_unbuffered_write_iter. Large pieces of code that will be removed are #if'd out and will be removed in subsequent patches. [?] Why does cifs mark the page dirty in the destination buffer of a DIO read? Should that happen automatically? Does netfs need to do that? Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org