lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
2024-01-01	bcachefs: Fix locking when checking freespace btree	Kent Overstreet
	On transaction restart, we weren't re-validating the hole we saw. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: Check for unlinked inodes not on deleted list	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: kill INODE_LOCK, use lock_two_nondirectories()	Kent Overstreet
	In an ideal world, we'd have a common helper that could be used for sorting a list of inodes into the correct lock order, and then the same lock ordering could be used for any type of inode lock, not just i_rwsem. But the lock ordering rules for i_rwsem are a bit complicated, so - abandon that dream for now and do it the more standard way. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: Improved backpointer messages in fsck	Kent Overstreet
	When we have a key to print, we should print it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: Add extra verbose logging for ro path	Kent Overstreet
	Also log time waiting for c->writes references to be dropped; this will help in debugging why unmounts are taking longer than they should. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: Flush fsck errors before running twice	Kent Overstreet
	It's confusing if we run fsck a second time (in debug mode, to verify the second run is clean), but errors are still ratelimited from the first run. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: make RO snapshots actually RO	Kent Overstreet
	Add checks to all the VFS paths for "are we in a RO snapshot?". Note - we don't check this when setting inode options via our xattr interface, since those generally only affect data placement, not contents of data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
2024-01-01	bcachefs: bch_sb_field_downgrade	Kent Overstreet
	Add a new superblock section that contains a list of { minor version, recovery passes, errors_to_fix } that is - a list of recovery passes that must be run when downgrading past a given version, and a list of errors to silently fix. The upcoming disk accounting rewrite is not going to be fully compatible: we're going to have to regenerate accounting both when upgrading to the new version, and also from downgrading from the new version, since the new method of doing disk space accounting is a completely different architecture based on deltas, and synchronizing them for every jounal entry write to maintain compatibility is going to be too expensive and impractical. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: bch_sb.recovery_passes_required	Kent Overstreet
	Add two new superblock fields. Since the main section of the superblock is now fully, we have to add a new variable length section for them - bch_sb_field_ext. - recovery_passes_requried: recovery passes that must be run on the next mount - errors_silent: errors that will be silently fixed These are to improve upgrading and dwongrading: these fields won't be cleared until after recovery successfully completes, so there won't be any issues with crashing partway through an upgrade or a downgrade. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: Add persistent identifiers for recovery passes	Kent Overstreet
	The next patch will start to refer to recovery passes from the superblock; naturally, we now need identifiers that don't change, since the existing enum is in the order in which they are run and is not fixed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: prt_bitflags_vector()	Kent Overstreet
	similar to prt_bitflags(), but for ulong arrays Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: move BCH_SB_ERRS() to sb-errors_types.h	Kent Overstreet
	we need BCH_SB_ERR_MAX in bcachefs.h Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: fix buffer overflow in nocow write path	Kent Overstreet
	BCH_REPLICAS_MAX isn't the actual maximum number of pointers in an extent, it's the maximum number of dirty pointers. We don't have a real restriction on the number of cached pointers, and we don't want a fixed size array here anyways - so switch to DARRAY_PREALLOCATED(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-and-tested-by: Daniel J Blueman <daniel@quora.org>
2024-01-01	bcachefs: DARRAY_PREALLOCATED()	Kent Overstreet
	Add support to darray for preallocating some number of elements. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: Switch darray to kvmalloc()	Kent Overstreet
	We sometimes use darrays for quite large buffers - the btree write buffer in particular needs large buffers, since it must be sized to hold all the write buffer keys outstanding in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: Factor out darray resize slowpath	Kent Overstreet
	Move the slowpath (actually growing the darray) to an out-of-line function; also, add some helpers for the upcoming btree write buffer rewrite. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: fix setting version_upgrade_complete	Kent Overstreet
	If a superblock write hasn't happened (i.e. we never had to go rw), then c->sb.version will be out of date w.r.t. c->disk_sb.sb->version. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01	bcachefs: fix invalid free in dio write path	Kent Overstreet
	turns out iterate_iovec() mutates __iov, we need to save our own copy Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-by: Marcin Mirosław <marcin@mejor.pl>
2024-01-01	bcachefs: Fix extents iteration + snapshots interaction	Kent Overstreet
	peek_upto() checks against the end position and bails out before FILTER_SNAPSHOTS checks; this is because if we end up at a different inode number than the original search key none of the keys we see might be visibile in the current snapshot - we might be looking at inode in a completely different subvolume. But this is broken, because when we're iterating over extents we're checking against the extent start position to decide when to bail out, and the extent start position isn't monotonically increasing until after we've run FILTER_SNAPSHOTS. Fix this by adding a simple inode number check where the old bailout check was, and moving the main check to the correct position. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
2023-12-26	bcachefs: Fix promotes	Kent Overstreet
	The recent work to fix data moves w.r.t. durability broke promotes, because the caused us to bail out when the extent minus pointers being dropped still has enough pointers to satisfy the current number of replicas. Disable this check when we're adding cached replicas. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-21	bcachefs: Fix leakage of internal error code	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-21	bcachefs: Fix insufficient disk reservation with compression + snapshots	Kent Overstreet
	When overwriting and splitting existing extents, we weren't correctly accounting for a 3 way split of a compressed extent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-19	bcachefs: fix BCH_FSCK_ERR enum	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-19	bcachefs: Fix bch2_alloc_sectors_start_trans() error handling	Kent Overstreet
	When we fail to allocate because of insufficient open buckets, we don't want to retry from the full set of devices - we just want to retry in blocking mode. But if the retry in blocking mode fails with a different error code, we end up squashing the -BCH_ERR_open_buckets_empty error with an error that makes us thing we won't be able to allocate (insufficient_devices) - which is incorrect when we didn't try to allocate from the full set of devices, and causes the write to fail. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-19	bcachefs; guard against overflow in btree node split	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-19	bcachefs: btree_node_u64s_with_format() takes nr keys	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-17	bcachefs: print explicit recovery pass message only once	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-14	bcachefs: improve modprobe support by providing softdeps	Daniel Hill
	We need to help modprobe load architecture specific modules so we don't fall back to generic software implementations, this should help performance when building as a module. Signed-off-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-14	bcachefs: fix invalid memory access in bch2_fs_alloc() error path	Thomas Bertschinger
	When bch2_fs_alloc() gets an error before calling bch2_fs_btree_iter_init(), bch2_fs_btree_iter_exit() makes an invalid memory access because btree_trans_list is uninitialized. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Fixes: 6bd68ec266ad ("bcachefs: Heap allocate btree_trans") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-13	bcachefs: Fix determining required file handle length	Jan Kara
	The ->encode_fh method is responsible for setting amount of space required for storing the file handle if not enough space was provided. bch2_encode_fh() was not setting required length in that case which breaks e.g. fanotify. Fix it. Reported-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-11	bcachefs: Fix nocow locks deadlock	Kent Overstreet
	On trylock failure we were waiting for outstanding reads to complete - but nocow locks need to be held until the whole move is finished. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-10	bcachefs: Close journal entry if necessary when flushing all pins	Kent Overstreet
	Since outstanding journal buffers hold a journal pin, when flushing all pins we need to close the current journal entry if necessary so its pin can be released. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-10	bcachefs: Fix uninitialized var in bch2_journal_replay()	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-08	bcachefs: Fix deleted inode check for dirs	Kent Overstreet
	We could delete directories transactionally on rmdir()/unlink(), but we don't; instead, like with regular files we wait for the VFS to call evict(). That means that our check for directories in the deleted inodes btree is wrong - the check should be for non-empty directories. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-06	bcachefs: rebalance shouldn't attempt to compress unwritten extents	Daniel Hill
	This fixes a bug where rebalance would loop repeatedly on the same extents. Signed-off-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-06	bcachefs: don't attempt rw on unfreeze when shutdown	Brian Foster
	The internal freeze mechanism in bcachefs mostly reuses the generic rw<->ro transition code. If the fs happens to shutdown during or after freeze, a transition back to rw can fail. This is expected, but returning an error from the unfreeze callout prevents the filesystem from being unfrozen. Skip the read write transition if the fs is shutdown. This allows the fs to unfreeze at the vfs level so writes will no longer block, but will still fail due to the emergency read-only state of the fs. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-06	bcachefs: Fix creating snapshot with implict source	Kent Overstreet
	When creating a snapshot without specifying the source subvolume, we use the subvolume containing the new snapshot. Previously, this worked if the directory containing the new snapshot was the subvolume root - but we were using the incorrect helper, and got a subvolume ID of 0 when the parent directory wasn't the root of the subvolume, causing an emergency read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-04	bcachefs: Don't run indirect extent trigger unless inserting/deleting	Kent Overstreet
	This fixes a transaction path overflow reported in the snapshot deletion path, when moving extents to the correct snapshot. The root of the issue is that creating/deleting a reflink pointer can generate an unbounded number of updates, if it is allowed to reference an unbounded number of indirect extents; to prevent this, merging of reflink pointers has been disabled. But there's a hole, which is that copygc/rebalance may fragment existing extents in the course of moving them around, and if an indirect extent becomes too fragmented we'll then become unable to delete the reflink pointer. The eventual solution is going to be to tweak trigger handling so that we can process large reflink pointers incrementally when necessary, and notice that trigger updates don't need to be run for the part of the reflink pointer not changing. That is going to be a bigger project though, for another patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-04	bcachefs: Convert compression_stats to for_each_btree_key2	Kent Overstreet
	for_each_btree_key2() runs each loop iteration in a btree transaction, and thus does not cause SRCU lock hold time problems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-04	bcachefs: Fix bch2_extent_drop_ptrs() call	Kent Overstreet
	Also, make bch2_extent_drop_ptrs() safer, so it works with extents and non-extents iterators. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-04	bcachefs: Fix a journal deadlock in replay	Kent Overstreet
	Recently, journal pre-reservations were removed. They were for reserving space ahead of time in the journal for operations that are required for journal reclaim, e.g. btree key cache flushing and interior node btree updates. Instead we have watermarks - only operations for journal reclaim are allowed when the journal is low on space, and in general we're quite good about doing operations in the order that will free up space in the journal quickest when we're low on space. If we're doing a journal reclaim operation out of order, we usually do it in nonblocking mode if it's not freeing up space at the end of the journal. There's an exceptino though - interior btree node update operations have to be BCH_WATERMARK_reclaim - once they've been started, and they can't be nonblocking. Generally this is fine because they'll only be a very small fraction of transaction commits - but there's an exception, which is during journal replay. Journal replay does many btree operations, but doesn't need to commit them to the journal since they're already in the journal. So killing off of pre-reservation, plus another change to make journal replay more efficient by initially doing the replay in sorted btree order, made it possible for the interior update operations replay generates to fill and deadlock the journal. Fix this by introducing a new check on journal space at the _start_ of an interior update operation. This causes us to block if necessary in exactly the same way as we used to when interior updates took a journal pre-reservaiton, but without all the expensive accounting pre-reservations required. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-04	bcachefs; Don't use btree write buffer until journal replay is finished	Kent Overstreet
	The keys being replayed by journal replay have to be synchronized with updates by other threads that overwrite them. We rely on btree node locks for synchronizing - but since btree write buffer updates take no btree locks, that won't work. Instead, simply disable using the btree write buffer until journal replay is finished. This fixes a rare backpointers error in the merge_torture_flakey test. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-03	bcachefs: Don't drop journal pins in exit path	Kent Overstreet
	There's no need to drop journal pins in our exit paths - the code was trying to have everything cleaned up on any shutdown, but better to just tweak the assertions a bit. This fixes a bug where calling into journal reclaim in the exit path would cass a null ptr deref. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28	bcachefs: Extra kthread_should_stop() calls for copygc	Kent Overstreet
	This fixes a bug where going read-only was taking longer than it should have due to copygc forgetting to check kthread_should_stop() Additionally: fix a missing is_kthread check in bch2_move_ratelimit(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28	bcachefs: Convert gc_alloc_start() to for_each_btree_key2()	Kent Overstreet
	This eliminates some SRCU warnings: for_each_btree_key2() runs every loop iteration in a distinct transaction context. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28	bcachefs: Fix race between btree writes and metadata drop	Kent Overstreet
	btree writes update the btree node key after every write, in order to update sectors_written, and they also might need to drop pointers if one of the writes failed in a replicated btree node. But the btree node might also have had a pointer dropped while the write was in flight, by bch2_dev_metadata_drop(), and thus there was a bug where the btree node write would ovewrite the btree node's key with what it had at the start of the write. Fix this by dropping pointers not currently in the btree node key. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28	bcachefs: move journal seq assertion	Kent Overstreet
	journal_cur_seq() can legitimately be used outside of the journal lock, where this assert can race Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28	bcachefs: -EROFS doesn't count as move_extent_start_fail	Kent Overstreet
	The automated tests check if we've hit too many slowpath/error path events and fail the test - if we're just shutting down, that naturally shouldn't count. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28	bcachefs: trace_move_extent_start_fail() now includes errcode	Kent Overstreet
	Renamed from trace_move_extent_alloc_mem_fail, because there are other reasons we colud fail (disk space allocation failure). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28	bcachefs: Fix split_race livelock	Kent Overstreet
	bch2_btree_update_start() calculates which nodes are going to have to be split/rewritten, so that we know how many nodes to reserve and how deep in the tree we have to take locks. But btree node merges require inserting two keys into the parent node, not just splits. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>