lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message	Author
2023-10-22	bcachefs: Improve trans_restart_split_race tracepoint	Kent Overstreet
	Seeing occasional test failures where we get stuck in a livelock that involves this event - this will help track it down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Data update path no longer leaves cached replicas	Kent Overstreet
	It turns out that it's currently impossible to invalidate buckets containing only cached data if they're part of a stripe. The normal bucket invalidate path can't do it because we have to be able to increment the bucket's gen, which isn't correct because it's still a member of the stripe - and the bucket invalidate path makes the bucket available for reuse right away, which also isn't correct for buckets in stripes. What would work is invalidating cached data by following backpointers, except that cached replicas don't currently get backpointers - because they would be awkward for the existing bucket invalidate path to delete and they haven't been needed elsewhere. So for the time being, to prevent running out of space in stripes, switch the data update path to not leave cached replicas; we may revisit this in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Rhashtable based buckets_in_flight for copygc	Kent Overstreet
	Previously, copygc used a fifo for tracking buckets in flight - this had the disadvantage of being fixed size, since we pass references to elements into the move code. This restructures it to be a hash table and linked list, since with erasure coding we need to be able to pipeline across an arbitrary number of buckets. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Use BTREE_ITER_INTENT in ec_stripe_update_extent()	Kent Overstreet
	This adds a flags param to bch2_backpointer_get_key() so that we can pass BTREE_ITER_INTENT, since ec_stripe_update_extent() is updating the extent immediately. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: move snapshot_t to subvolume_types.h	Kent Overstreet
	This doesn't need to be in bcachefs.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix bch2_get_key_or_hole()	Kent Overstreet
	This fixes an off by one error, due to confusing closed vs. half open intervals. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
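	The class of bug described above can be illustrated with a generic sketch. This is not the bcachefs code: the helper names below are hypothetical and only show how the same pair of endpoints yields different sizes depending on whether the end is inclusive or exclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not bcachefs functions. */

/* Positions in the closed interval [start, end]: end is included. */
static uint64_t count_closed(uint64_t start, uint64_t end)
{
	assert(start <= end);
	return end - start + 1;
}

/* Positions in the half-open interval [start, end): end is excluded. */
static uint64_t count_half_open(uint64_t start, uint64_t end)
{
	assert(start <= end);
	return end - start;
}

/* Convert an inclusive end into the exclusive end of the same range. */
static uint64_t closed_end_to_half_open(uint64_t end)
{
	return end + 1;
}
```

	Mixing the two conventions without the conversion step is what produces an off-by-one result.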
2023-10-22	bcachefs: Check return code from need_whiteout_for_snapshot()	Kent Overstreet
	This could return a transaction restart; we need to check for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_dev_freespace_init() Print out status every 10 seconds	Kent Overstreet
	It appears freespace init can still take a while, and we've had a report or two of it getting stuck - let's have it print out where it's at every 10 seconds. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
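	A rate-limited progress report of this kind could look roughly like the sketch below. It is a userspace illustration only: `should_report()` and `REPORT_INTERVAL_SEC` are hypothetical names, and kernel code would normally work in jiffies rather than `time_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define REPORT_INTERVAL_SEC 10	/* assumed interval, matching the commit text */

/*
 * Hypothetical helper: returns true at most once per interval, and
 * records the time of the report so the next interval starts from it.
 */
static bool should_report(time_t now, time_t *last_report)
{
	if (now - *last_report < REPORT_INTERVAL_SEC)
		return false;

	*last_report = now;
	return true;
}
```

	A long-running loop would call `should_report(time(NULL), &last)` on each iteration and print its current position only when it returns true.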
2023-10-22	bcachefs: Run freespace init in device hot add path	Kent Overstreet
	As in the recovery and device add paths, we have to check whether devices have the freespace btree initialized - this was missed in the device hot add path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Improved copygc wait debugging	Kent Overstreet
	This just adds a line for how long copygc has been waiting to sysfs copygc_wait, helpful for debugging why copygc isn't running. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Call bch2_path_put_nokeep() before bch2_path_put()	Kent Overstreet
	bch2_path_put_nokeep() is sketchy, and we should consider removing it: it unconditionally frees btree_paths once their ref hits 0. The assumption is that we only use it for paths that have never been visible outside the core btree code; i.e. higher level code will never be making assumptions about locking based on these paths. However, there's subtle brokenness with this approach: if we call bch2_path_put(), then bch2_path_put_nokeep(), bch2_path_put() may free the first path on the assumption that we have another path keeping a node locked - but then bch2_path_put_nokeep() just unconditionally frees it. The same bug may arise if we're calling bch2_path_put() and bch2_path_put_nokeep() on the same (refcounted) path, or two adjacent paths that point to the same btree node. This patch hacks around one of these bugs by calling bch2_path_put_nokeep() first in bch2_trans_iter_exit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: drop unnecessary journal stuck check from space calculation	Brian Foster
	The journal stuck check in bch2_journal_space_available() is particularly aggressive and can lead to premature shutdown in some rare cases. This is difficult to reproduce, but it also comes with a fatal error, so it is worthwhile to be cautious. For example, we've seen instances where the journal is under heavy reservation pressure, the journal allocation path transitions into the final available journal bucket, the journal write path immediately consumes that bucket and calls into bch2_journal_space_available(), which then in turn flags the journal as stuck because there is no available space and shuts down the filesystem instead of submitting the journal write (that would have otherwise succeeded). To avoid this problem, simplify the journal stuck checking by just relying on the higher level logic in the journal reservation path. This produces more useful debug output and is a more reliable indicator that things have bogged down. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: refactor journal stuck checking into standalone helper	Brian Foster
	bcachefs checks for journal stuck conditions both in the journal space calculation code and the journal reservation slow path. The logic in both places is rather tricky and can result in non-deterministic failure characteristics and debug output. In preparation to condense journal stuck handling to a single place, refactor the __journal_res_get() logic into a standalone helper. Since multiple callers into the reservation code can result in duplicate reports, use the ->err_seq field as a serialization mechanism for the debug dump. Finally, add some comments to help explain the logic and hopefully facilitate further improvements in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: gracefully unwind journal res slowpath on shutdown	Brian Foster
	bcachefs detects journal stuck conditions in a couple different places. If the logic in the journal reservation slow path happens to detect the problem, I've seen instances where the filesystem remains deadlocked even though it has been shut down. This is occasionally reproduced by generic/333, and usually manifests as one or more tasks stuck in the journal reservation slow path. To help avoid this problem, repeat the journal error check in __journal_res_get() once under spinlock to cover the case where the previous lock holder might have triggered shutdown. This also helps avoid spurious/duplicate stuck reports. Also, wake the journal from the halt code to make sure blocked callers of the journal res slowpath have a chance to wake up and observe the pending error. This survives an overnight looping run of generic/333 without the aforementioned lockups. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: more aggressive fast path write buffer key flushing	Brian Foster
	The btree write buffer flush code is prone to causing journal deadlock due to inefficient use and release of reservation space. Reservation is not pre-reserved for write buffered keys (as is done for key cache keys, for example), because the write buffer flush side uses a fast path that attempts insertion without need for any reservation at all. The write buffer flush attempts to deal with this by inserting keys using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on journal reservations that require blocking. Upon first error, it falls back to a slow path that inserts in journal order and supports moving the associated journal pin forward. The problem is that under pathological conditions (i.e. smaller log, larger write buffer and journal reservation pressure), we've seen instances where the fast path fails fairly quickly without having completed many insertions, and then the slow path is unable to push the journal pin forward enough to free up the space it needs to completely flush the buffer. This problem is occasionally reproduced by fstest generic/333. To avoid this problem, update the fast path algorithm to skip key inserts that fail due to inability to acquire needed journal reservation without immediately breaking out of the loop. Instead, insert as many keys as possible, zap the sequence numbers to mark them as processed, and then fall back to the slow path to process the remaining set in journal order. This reduces the amount of journal reservation that might be required to flush the entire buffer and increases the odds that the slow path is able to move the journal pin forward and free up space as keys are processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
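	The fast path / slow path split described above can be sketched as follows. This is a simplified model under stated assumptions, not the bcachefs implementation: `struct wb_key`, `would_block`, `flush_fast_path()` and `flush_slow_path()` are hypothetical, real journal sequence numbers are assumed to be nonzero, and a zeroed sequence number stands for "already processed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a write-buffered key; not the bcachefs struct. */
struct wb_key {
	uint64_t journal_seq;	/* 0 means "already flushed" in this sketch */
	bool	 would_block;	/* insert would need a blocking journal reservation */
};

/* Fast path: insert everything that does not need to block, skip the rest. */
static size_t flush_fast_path(struct wb_key *keys, size_t nr)
{
	size_t skipped = 0;

	for (size_t i = 0; i < nr; i++) {
		if (keys[i].would_block) {
			skipped++;	/* the old behaviour was to stop here */
			continue;
		}
		keys[i].journal_seq = 0;	/* zap the seq: key is processed */
	}
	return skipped;
}

static int cmp_seq(const void *l, const void *r)
{
	uint64_t a = ((const struct wb_key *) l)->journal_seq;
	uint64_t b = ((const struct wb_key *) r)->journal_seq;

	return (a > b) - (a < b);
}

/* Slow path: flush only the leftovers, in journal order. */
static size_t flush_slow_path(struct wb_key *keys, size_t nr)
{
	size_t flushed = 0;

	qsort(keys, nr, sizeof(keys[0]), cmp_seq);
	for (size_t i = 0; i < nr; i++) {
		if (!keys[i].journal_seq)
			continue;	/* already handled by the fast path */
		keys[i].journal_seq = 0;
		flushed++;
	}
	return flushed;
}
```

	Because the fast path no longer stops at the first failure, the slow path is left with fewer keys, which is the effect the commit message relies on to reduce the reservation needed to drain the buffer.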
2023-10-22	bcachefs: use dedicated workqueue for tasks holding write refs	Brian Foster
	A workqueue resource deadlock has been observed when running fsck on a filesystem with a full/stuck journal. fsck is not currently able to repair the fs due to fairly rapid emergency shutdown, but rather than exit gracefully the fsck process hangs during the shutdown sequence. Fortunately this is easily recoverable from userspace, but the root cause involves code shared between the kernel and userspace and so should be addressed. The deadlock scenario involves the main task in the bch2_fs_stop() -> bch2_fs_read_only() path waiting on write references to drain with the fs state lock held. A bch2_read_only_work() workqueue task is scheduled on the system_long_wq, blocked on the state lock. Finally, various other write ref holding workqueue tasks are scheduled to run on the same workqueue and must complete in order to release references that the initial task is waiting on. To avoid this problem, we can split the dependent workqueue tasks across different workqueues. It's a bit of a waste to create a dedicated wq for the read-only worker, but there are several tasks throughout the fs that follow the pattern of acquiring a write reference and then scheduling to the system wq. Use a local wq for such tasks to break the subtle dependency between these and the read-only worker. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: remove unused bch2_trans_log_msg()	Brian Foster
	Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix bch2_verify_bucket_evacuated()	Kent Overstreet
	We were going into an infinite loop when printing out backpointers, due to never incrementing bp_offset - whoops. Also limit the number of backpointers we print to 10; this is debug code and we only need to print a sample, not all of them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
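	The shape of the fix, advancing the cursor on every iteration and capping the sample size, can be shown with a generic sketch. The names `sample_entries()` and `MAX_PRINTED` are hypothetical, and the array of integers merely stands in for a list of backpointers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRINTED 10	/* assumed cap, matching the commit text */

/*
 * Hypothetical helper: copy at most MAX_PRINTED entries into out[]
 * (which must have room for MAX_PRINTED items) and return the count.
 */
static size_t sample_entries(const int *entries, size_t nr, int *out)
{
	size_t offset = 0, printed = 0;

	while (offset < nr && printed < MAX_PRINTED) {
		out[printed++] = entries[offset];
		offset++;	/* omitting this step makes the loop spin forever */
	}
	return printed;
}
```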
2023-10-22	bcachefs: verify_bucket_evacuated() -> set_btree_iter_dontneed()	Kent Overstreet
	This should help with excessive 'would deadlock' transaction restarts. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Make reconstruct_alloc quieter	Kent Overstreet
	We shouldn't be printing out fsck errors for expected errors - this helps make test logs more readable, and makes it easier to see what the actual failure was. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix an unhandled transaction restart error	Kent Overstreet
	This is a bit awkward: we're passing around a btree_trans, but we're not in a context where transaction restarts are handled - we should try to come up with a better way to denote situations like this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix nocow write path closure bug	Kent Overstreet
	With regular waitlists, we need to ensure we always call finish_wait(). With closures, the equivalent is that we need to call closure_sync() before returning with a stack-allocated closure. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Nocow write error path fix	Kent Overstreet
	The nocow write error path was iterating over pointers in an extent, after we'd dropped btree locks - oops. Fortunately we'd already stashed what we need in nocow_lock_bucket, so use that instead. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix bch2_extent_fallocate() in nocow mode	Kent Overstreet
	When we allocate disk space, we need to be incrementing the WRITE io clock, which perhaps should be renamed to sectors allocated - copygc uses this io clock to know when to run. Also, we should be incrementing the same clock when allocating btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Add an assert in inode_write for -ENOENT	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix bch2_evict_subvolume_inodes()	Kent Overstreet
	This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache() doesn't handle the case where i_count is already 0, we need to grab and put the inode in order for it to be dropped. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Improve error handling in bch2_ioctl_subvolume_destroy()	Kent Overstreet
	Pure style fixes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix for 'missing subvolume' error	Kent Overstreet
	Subvolumes, including their root inodes, get deleted asynchronously after an unlink. But we still need to ensure that we tell the VFS the inode has been deleted, otherwise VFS writeback could fire after asynchronous deletion has finished, and try to write to an inode/subvolume that no longer exists. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Don't run transaction hooks multiple times	Kent Overstreet
	Transaction hooks aren't supposed to run unless we know the transaction is going to commit successfully: this fixes a bug with attempting to delete a subvolume multiple times. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Add a fallback when journal_keys doesn't fit in ram	Kent Overstreet
	We may end up in a situation where allocating the buffer for the sorted journal_keys fails - but it would likely succeed, post compaction where we drop duplicates. We've had reports of this allocation failing, so this adds a slowpath to do the compaction incrementally. This is only a band-aid fix; we need to look at limiting the number of keys in the journal based on the amount of system RAM. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Improve the backpointer to missing extent message	Kent Overstreet
	We now print the pos where the backpointer was found in the btree, as well as the exact bucket:bucket_offset of the data, to aid in grepping through logs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Add error message for failing to allocate sorted journal keys	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: New erasure coding shutdown path	Kent Overstreet
	This implements a new shutdown path for erasure coding, which is needed for the upcoming BCH_WRITE_WAIT_FOR_EC write path. The process is: - Cancel new stripes being built up - Close out/cancel open buckets on write points or the partial list that are for stripes - Shutdown rebalance/copygc - Then wait for in flight new stripes to finish With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill up before they complete; the new ec shutdown path is needed for shutting down copygc/rebalance without deadlocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_fs_moving_ctxts_to_text()	Kent Overstreet
	This also adds bch2_write_op_to_text(): now we can see outstanding moves, useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC and likely for other things in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Private error codes: ENOMEM	Kent Overstreet
	This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix bch2_check_extents_to_backpointers()	Kent Overstreet
	In rare cases, bch2_check_extents_to_backpointers() would incorrectly flag an extent as having a missing backpointer when we just needed to flush the btree write buffer - we weren't tracking the last flushed position correctly. This adds a level field to the last_flushed pos, fixing a bug where we'd sometimes fail on a new root node. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix an assert in copygc thread shutdown path	Kent Overstreet
	We're not supposed to have nested (locked) btree_trans on the stack: this means copygc shutdown needs to exit our btree_trans before exiting the move_ctxt, which calls bch2_write(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: bch2_bucket_is_movable() -> BTREE_ITER_CACHED	Kent Overstreet
	BTREE_ITER_CACHED should really be the default for cached btrees - this is an easy mistake to make. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Don't use BTREE_ITER_INTENT in make_extent_indirect()	Kent Overstreet
	This is a workaround for a btree path overflow - searching with BTREE_ITER_INTENT periodically saves the iterator position for updates, which eventually overflows. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix stripe create error path	Kent Overstreet
	If we errored out on a new stripe before fully allocating it, we shouldn't be zeroing out unwritten data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Mark new snapshots earlier in create path	Kent Overstreet
	This fixes a null ptr deref when creating new snapshots: bch2_create_trans() will lookup the subvolume and find the _new_ snapshot in the BCH_CREATE_SUBVOL path that's being created in that transaction. We have to call bch2_mark_snapshot() earlier so that it's properly initialized, instead of leaving it for transaction commit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Improve bch2_new_stripes_to_text()	Kent Overstreet
	Print out the alloc reserve, and format it a bit more nicely. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Kill bch_write_op->btree_update_ready	Kent Overstreet
	This changes the write path to not add write ops to the write_point's list of pending work items until it's ready; this means we have to change the lock protecting it to an irq-safe lock, but means bch2_write_point_do_index_updates() no longer has to iterate over the list, which is beneficial with the way the new BCH_WRITE_WAIT_FOR_EC code works. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Simplify stripe_idx_to_delete	Kent Overstreet
	This is not technically correct - it's subject to a race if we ever end up with a stripe with all empty blocks (that needs to be deleted) being held open. But the "correct" version was much too inefficient, and soon we'll be adding a stripes LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Fix next_bucket()	Kent Overstreet
	This fixes an infinite loop in bch2_get_key_or_real_bucket_hole(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Second layer of refcounting for new stripes	Kent Overstreet
	This will be used for move writes, which will be waiting until the stripe is created to do the index update. They need to prevent the stripe from being reclaimed until their index update is done, so we need another refcount that just keeps the stripe open. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: ec: fall back to creating new stripes for copygc	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Rework __bch2_data_update_index_update()	Kent Overstreet
	This makes some improvements to the logic for adding/removing replicas, as part of the larger erasure coding improvements. We now directly consider number of replicas desired for the given inode, and extent/pointer durability: this ensures that the extent ends up with the desired number of replicas when we're replacing multiple pointers with one that has higher durability (e.g. erasure coded). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: Extent helper improvements	Kent Overstreet
	- __bch2_bkey_drop_ptr() -> bch2_bkey_drop_ptr_noerror(), now available outside extents. - Split bch2_bkey_has_device() and bch2_bkey_has_device_c(), const and non const versions - bch2_extent_has_ptr() now returns the pointer it found Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22	bcachefs: evacuate_bucket() no longer moves cached ptrs	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>