lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
2024-10-29	bcachefs: Fix unhandled transaction restart in fallocate	Kent Overstreet
	This used to not matter, but now we're being more strict. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20	bcachefs: Don't use wait_event_interruptible() in recovery	Kent Overstreet
	Fix a bug where mount was failing with -ERESTARTSYS: https://github.com/koverstreet/bcachefs/issues/741 We only want the interruptible wait when called from fsync. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: quota_reserve_range() -> for_each_btree_key_in_subvolume_upto	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: range_has_data() -> for_each_btree_key_in_subvolume_upto	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: bch2_seek_hole() -> for_each_btree_key_in_subvolume_upto	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: bch2_seek_data() -> for_each_btree_key_in_subvolume_upto	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09	bcachefs: switch to rhashtable for vfs inodes hash	Kent Overstreet
	the standard vfs inode hash table suffers from painful lock contention - this is long overdue Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: support REMAP_FILE_DEDUP in bch2_remap_file_range	Reed Riley
	By removing the early-exit when REMAP_FILE_DEDUP is set, we should be able to support the fideduperange ioctl, albeit less efficiently than if we handled some of the extent locking and comparison logic inside bcachefs. Extent comparison logic already exists inside of `__generic_remap_file_range_prep`. Signed-off-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: Add tracepoints for bch2_sync_fs() and bch2_fsync()	Youling Tang
	Add trace_bch2_sync_fs() and trace_bch2_fsync() implementations. The output in trace is as follows: sync-29779 [000] ..... 193.700935: bch2_sync_fs: dev 254,16 wait 1 <...>-40027 [002] ..... 342.535227: bch2_fsync: dev 254,32 ino 4099 parent 4096 datasync 1 Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: track writeback errors using the generic tracking infrastructure	Youling Tang
	We already using mapping_set_error() in bch2_writepage_io_done(), so all we need to do is to use file_check_and_advance_wb_err() when handling fsync() requests in bch2_fsync(). Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14	bcachefs: uninline fallocate functions	Kent Overstreet
	better stack traces Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09	bcachefs: fsync() should not return -EROFS	Kent Overstreet
	fsync has a slightly odd usage of -EROFS, where it means "does not support fsync". I didn't choose it... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08	bcachefs: iter/update/trigger/str_hash flag cleanup	Kent Overstreet
	Combine iter/update/trigger/str_hash flags into a single enum, and x-macroize them for a to_text() function later. These flags are all for a specific iter/key/update context, so it makes sense to group them together - iter/update/trigger flags were already given distinct bits, this cleans up and unifies that handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13	bcachefs: Fix missing write refs in fs fio paths	Kent Overstreet
	bch2_journal_flush_seq requires us to have a write ref Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-22	bcachefs: fix incorrect usage of REQ_OP_FLUSH	Christoph Hellwig
	REQ_OP_FLUSH is only for internal use in the blk-mq and request based drivers. File systems and other block layer consumers must use REQ_OP_WRITE \| REQ_PREFLUSH as documented in Documentation/block/writeback_cache_control.rst. While REQ_OP_FLUSH appears to work for blk-mq drivers it does not get the proper flush state machine handling, and completely fails for any bio based drivers, including all the stacking drivers. The block layer will also get a check in 6.8 to reject this use case entirely. [Note: completely untested, but as this never got fixed since the original bug report in November: https://bugzilla.kernel.org/show_bug.cgi?id=218184 and the the discussion in December: https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/ this seems to be best way to force it] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21	bcachefs: Fix excess transaction restarts in __bchfs_fallocate()	Kent Overstreet
	drop_locks_do() should not be used in a fastpath without first trying the do in nonblocking mode - the unlock and relock will cause excessive transaction restarts and potentially livelocking with other threads that are contending for the same locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: return from fsync on writeback error to avoid early shutdown	Brian Foster
	When investigating transient failures of generic/441 on bcachefs, it was determined that the cause of the failure was a combination of unconditional emergency shutdown and racing between background journal activity and the test switchover from a working device mapper table to an error injecting table. Part of the reason for this sequence of events is that bcachefs aggressively flushes as much as possible during fsync(), regardless of errors. While this is reasonable behavior, it is technically unnecessary because once an error is returned from fsync(), the caller cannot make any assumptions about the resilience of data. Tweak the bch2_fsync() logic to return an error on failure of any of the steps involved in the flush. Note that this change alone does not prevent generic/441 failure, but in combination with a test tweak to avoid racing during the dm-error table switchover it avoids the unnecessary shutdowns and allows the test to pass reliably on bcachefs. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: kill INODE_LOCK, use lock_two_nondirectories()	Kent Overstreet
	In an ideal world, we'd have a common helper that could be used for sorting a list of inodes into the correct lock order, and then the same lock ordering could be used for any type of inode lock, not just i_rwsem. But the lock ordering rules for i_rwsem are a bit complicated, so - abandon that dream for now and do it the more standard way. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Heap allocate btree_trans	Kent Overstreet
	We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocatation to initialize it anyways, so there's no real downside to heap allocating the entire thing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: remove redundant initializations of variables start_offset and ↵	Colin Ian King
	end_offset The variables start_offset and end_offset are being initialized with values that are never read, they being re-assigned later on. The initializations are redundant and can be removed. Cleans up clang-scan build warnings: fs/bcachefs/fs-io.c:243:11: warning: Value stored to 'start_offset' during its initialization is never read [deadcode.DeadStores] fs/bcachefs/fs-io.c:244:11: warning: Value stored to 'end_offset' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: New io_misc.c helpers	Kent Overstreet
	This pulls the non vfs specific parts of truncate and finsert/fcollapse out of fs-io.c, and moves them to io_misc.c. This is prep work for logging these operations, to make them atomic in the event of a crash. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Break up io.c	Kent Overstreet
	More reorganization, this splits up io.c into - io_read.c - io_misc.c - fallocate, fpunch, truncate - io_write.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Add btree_trans* to inode_set_fn	Joshua Ashton
	This will be used when we need to re-hash a directory tree when setting flags. It is not possible to have concurrent btree_trans on a thread. Signed-off-by: Joshua Ashton <joshua@froggi.es> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Split up fs-io.[ch]	Kent Overstreet
	fs-io.c is too big - time for some reorganization - fs-dio.c: direct io - fs-pagecache.c: pagecache data structures (bch_folio), utility code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix assorted checkpatch nits	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix lock thrashing in __bchfs_fallocate()	Kent Overstreet
	We've observed significant lock thrashing on fstests generic/083 in fallocate, due to dropping and retaking btree locks when checking the pagecache for data. This adds a nonblocking mode to bch2_clamp_data_hole(), where we only use folio_trylock(), and can thus be used safely while btree locks are held - thus we only have to drop btree locks as a fallback, on actual lock contention. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix folio leak in folio_hole_offset()	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fallocate now checks page cache	Kent Overstreet
	Previously, fallocate would only check the state of the extents btree when determining if we need to create a reservation. But the page cache might already have dirty data or a disk reservation. This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to check for this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Delete redundant log messages	Kent Overstreet
	Now that we have distinct error codes for different memory allocation failures, the early init log messages are no longer needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Assorted sparse fixes	Kent Overstreet
	- endianness fixes - mark some things static - fix a few __percpu annotations - fix silent enum conversions Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Check for ERR_PTR() from filemap_lock_folio()	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: fs-io: Eliminate GFP_NOFS usage	Kent Overstreet
	GFP_NOFS doesn't ever make sense. If we're allocatingc memory it should be GFP_NOWAIT if btree locks are held, GFP_KERNEL otherwise. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Avoid __GFP_NOFAIL	Kent Overstreet
	We've been using __GFP_NOFAIL for allocating struct bch_folio, our private per-folio state. However, that struct is variable size - it holds state for each sector in the folio, and folios can be quite large now, which means it's possible for bch_folio to be larger than PAGE_SIZE now. __GFP_NOFAIL allocations are undesirable in normal circumstances, but particularly so at >= PAGE_SIZE, and warnings are emitted for that. So, this patch adds proper error paths and eliminates most uses of __GFP_NOFAIL. Also, do some more cleanup of gfp flags w.r.t. btree node locks: we can use GFP_KERNEL, but only if we're not holding btree locks, and if we are holding btree locks we should be using GFP_NOWAIT. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix quotas + snapshots	Kent Overstreet
	Now that we can reliably designate and find the master subvolume out of a tree of snapshots, we can finally make quotas work with snapshots: That is - quotas will now _ignore_ snapshot subvolumes, and only be in effect for the master (non snapshot) subvolume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: folio pos to bch_folio_sector index helper	Brian Foster
	Create a small helper to translate from file offset to the associated bch_folio_sector index in the underlying bch_folio. The helper assumes the file offset is covered by the passed folio. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: use u64 for folio end pos to avoid overflows	Brian Foster
	Some of the folio_end_() helpers are prone to overflow of signed 64-bit types because the mapping is only limited by the max value of loff_t and the associated helpers return the start offset of the next folio. Therefore, a folio_end_pos() of the max allowable folio in a mapping returns a value that overflows loff_t. This makes it hard to rely on such values when doing folio processing across a range of a file, as bcachefs attempts to do with the recent folio changes. For example, generic/564 causes problems in the buffered write path when testing writes at max boundary conditions. The current understanding is that the pagecache historically limited the mapping to one less page to avoid this problem and this was dropped with some of the folio conversions, but may be reinstated to properly address the problem. In the meantime, update the internal folio_end_() helpers in bcachefs to return a u64, and all of the associated code to use or cast to u64 to avoid overflow problems. This allows generic/564 to pass and can be reverted back to using loff_t if at any point the pagecache subsystem can guarantee these boundary conditions will not overflow. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: clean up post-eof folios on -ENOSPC	Brian Foster
	The buffered write path batches folio creations in the file mapping based on the requested size of the write. Under low free space conditions, it is possible to add a bunch of folios to the mapping and then return a short write or -ENOSPC due to lack of space. If this occurs on an extending write, the file size is updated based on the amount of data successfully written to the file. If folios were added beyond the final i_size, they may hang around until reclaimed, truncated or encountered unexpectedly by another operation. For example, generic/083 reproduces a sequence of events where a short write leaves around one or more post-EOF folios on an inode, a subsequent zero range request extends beyond i_size and overlaps with an aforementioned folio, and __bch2_truncate_folio() happens across it and complains. Update __bch2_buffered_write() to keep track of the start offset of the last folio added to the mapping for a prospective write. After i_size is updated, check whether this offset starts beyond EOF. If so, truncate pagecache beyond the latest EOF to clean up any folios that don't reside at least partially within EOF upon completion of the write. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: fix truncate overflow if folio is beyond EOF	Brian Foster
	generic/083 occasionally reproduces a panic caused by an overflow when accessing the bch_folio_sector array of the folio being processed by __bch2_truncate_folio(). The immediate cause of the overflow is that the folio offset is beyond i_size, and therefore the sector index calculation underflows on subtraction of the folio offset. One cause of this is mainly observed on nocow mounts. When nocow is enabled, fallocate performs physical block allocation (as opposed to block reservation in cow mode), which range_has_data() then interprets as valid data that requires partial zeroing on truncate. Therefore, if a post-eof zero range request lands across post-eof preallocated blocks, __bch2_truncate_folio() may actually create a post-eof folio in order to perform zeroing. To avoid this problem, update range_has_data() to filter out unwritten blocks from folio creation and partial zeroing. Even though we should never create folios beyond EOF like this, the mere existence of such folios is not necessarily a fatal error. Fix up the truncate code to warn about this condition and not overflow the sector array and possibly crash the system. The addition of this warning without the corresponding unwritten extent fix has shown that various other fstests are able to reproduce this problem fairly frequently, but often in ways that doesn't necessarily result in a kernel panic or a change in user observable behavior, and therefore the problem goes undetected. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Check for folios that don't have bch_folio attached	Kent Overstreet
	With large folios, it's now incidentally possible to end up with a clean, uptodate folio in the page cache that doesn't have a bch_folio attached, if a folio has to be split. This patch fixes __bch2_truncate_folio() to check for this; other code paths appear to handle it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_readahead() large folio conversion	Kent Overstreet
	Readahead now uses the new filemap_get_contig_folios_d() helper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: filemap_get_contig_folios_d()	Kent Overstreet
	Add a new helper for getting a range of contiguous folios and returning them in a darray. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch_folio_sector_state improvements	Kent Overstreet
	- X-macro-ize the bch_folio_sector_state enum: this means we can easily generate strings, which is helpful for debugging. - Add helpers for state transitions: folio_sector_dirty(), folio_sector_undirty(), folio_sector_reserve() - Add folio_sector_set(), a single helper for changing folio sector state just so that we have a single place to instrument when we're debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_truncate_page() large folio conversion	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_buffered_write large folio conversion	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch_folio can now handle multi-order folios	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: More assorted large folio conversion	Kent Overstreet
	Various misc small conversions in fs-io.c for large folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_seek_pagecache_data() folio conversion	Kent Overstreet
	This converts bch2_seek_pagecache_data() to handle large folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_seek_pagecache_hole() folio conversion	Kent Overstreet
	This converts bch2_seek_pagecache_hole() to handle large folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bio_for_each_segment_all() -> bio_for_each_folio_all()	Kent Overstreet
	This converts the writepage end_io path to folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Initial folio conversion	Kent Overstreet
	This converts fs-io.c to pass folios, not pages. We're not handling large folios yet, there's no functional changes in this patch - just a lot of churn doing the initial type conversions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>