Age  Commit message  Author
2023-10-22  bcachefs: bch2_bkey_get_mut() improvements  (Kent Overstreet)
- bch2_bkey_get_mut() now handles types increasing in size, allocating a buffer for the type's current size when necessary
- bch2_bkey_make_mut_typed()
- bch2_bkey_get_mut() now initializes the iterator, like bch2_bkey_get_iter()
Also, refactor so that most of the code is in functions - now macros are only used for wrappers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Move bch2_bkey_make_mut() to btree_update.h  (Kent Overstreet)
It's for doing updates - this is where it belongs, and upcoming patches will be changing these helpers to use items from btree_update.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch2_bkey_get_iter() helpers  (Kent Overstreet)
Introduce new helpers for a common pattern: bch2_trans_iter_init(); bch2_btree_iter_peek_slot();
- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a (typically stack allocated) variable; it handles the case where the value in the btree is smaller than the current version of the type, zeroing out the remainder.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bkey_ops.min_val_size  (Kent Overstreet)
This adds a new field to bkey_ops for the minimum size of the value, which standardizes that check and also enforces the new rule (previously done somewhat ad-hoc) that we can extend value types by adding new fields on to the end. To make that work we do _not_ initialize min_val_size with sizeof, instead we initialize it to the size of the first version of those values. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Converting to typed bkeys is now allowed for err, null ptrs  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Btree iterator, update flags no longer conflict  (Kent Overstreet)
Change btree_update_flags to start after the last btree iterator flag, so that we can pass both in the same flags argument. This is needed for the upcoming bch2_bkey_get_mut() helper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: remove unused key cache coherency flag  (Brian Foster)
Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: fix accounting corruption race between reclaim and dev add  (Brian Foster)
When a device is removed from a bcachefs volume, the associated content is removed from the various btrees. The alloc tree uses the key cache, so when keys are removed the deletes exist in cache for a period of time until reclaim comes along and flushes outstanding updates.

When a device is re-added to the bcachefs volume, the add process re-adds some of these previously deleted keys. When marking device superblock locations on device add, the keys will likely refer to some of the same alloc keys that were just removed. The memory triggers for these key updates are responsible for further updates, such as bch2_mark_alloc() calling into bch2_dev_usage_update() to update per-device usage accounting. When a new key is added to the key cache, the trans update path also flushes the key to the backing btree for coherency reasons for tree walks.

With all of this context, if a device is removed and re-added quickly enough such that some key deletes from the remove are still pending a key cache flush, the trans update path can view this as addition of a new key, because the old key in the insert entry refers to a deleted key. However, the deleted cached key was not filled by the absence of a btree key, but rather refers to an explicit deletion of an existing key that occurred during device removal. The trans update path adds a new update to flush the key and tags the original (cached) update to skip running the memory triggers. This results in running triggers on the non-cached update instead, which in turn will perform accounting updates based on incoherent values. For example, bch2_dev_usage_update() subtracts the old alloc key dirty sector count in the non-cached btree key from the newly initialized (i.e. zeroed) per-device counters, leading to underflow and accounting corruption.

There are at least a few ways to avoid this problem, the simplest of which may be to run triggers against the cached update rather than the non-cached update. If the key only needs to be flushed when the key is not present in the tree, however, then this still performs an unnecessary update. We could potentially use the cached key dirty state to determine whether the delete is a dirty, cached update vs. a clean cache fill, but this may require transmitting key cache dirty state across layers, which adds complexity and seems to be of limited value. Instead, update flush_new_cached_update() to handle this by simply checking for the key in the btree and performing the flush only when a backing key is not present. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Mark bch2_copygc() noinline  (Kent Overstreet)
This works around a "stack frame too large" error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Delete obsolete btree ptr check  (Kent Overstreet)
This patch deletes a .key_invalid check for btree pointers that only applies to _very_ old on disk format versions, and potentially complicates the upgrade process. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Always run topology error when CONFIG_BCACHEFS_DEBUG=y  (Kent Overstreet)
Improved test coverage. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Fix a userspace build error  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Make sure hash info gets initialized in fsck  (Kent Overstreet)
We had some bugs with setting/using first_this_inode in the inode walker in the dirents/xattr code. This patch changes to not clear first_this_inode until after initializing the new hash info. Also, we fix an error message to not print on transaction restart, and add a comment to related fsck error code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Kill bch2_verify_bucket_evacuated()  (Kent Overstreet)
With backpointers, it's now impossible for bch2_evacuate_bucket() to be completely reliable: it can race with an extent being partially overwritten or split, which needs a new write buffer flush for the backpointer to be seen. This shouldn't be a real issue in practice; the previous patch added a new tracepoint so we'll be able to see more easily if it is. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Improve move path tracepoints  (Kent Overstreet)
Move path tracepoints now include the key being moved. Also, add new tracepoints for the start of move_extent, and evacuate_bucket. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Drop a redundant error message  (Kent Overstreet)
When we're already read-only, we don't need to print out errors from writing btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: remove bucket_gens btree keys on device removal  (Brian Foster)
If a device has keys in the bucket_gens btree associated with its buckets and is removed from a bcachefs volume, fsck will complain about the presence of keys associated with an invalid device index. A repair removes the associated keys and restores correctness. Update bch2_dev_remove_alloc() to remove device related keys at device removal time to avoid the problem. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: fix NULL bch_dev deref when checking bucket_gens keys  (Brian Foster)
fsck removes bucket_gens keys for devices that do not exist in the volume (i.e., if the device was removed). In 'fsck -n' mode, the associated fsck_err_on() wrapper returns false to skip the key removal. This proceeds on to the rest of the function, which eventually segfaults on a NULL bch_dev because the device does not exist. Update bch2_check_bucket_gens_key() to skip out of the rest of the function when the associated device does not exist, regardless of running fsck in check or repair mode. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: folio pos to bch_folio_sector index helper  (Brian Foster)
Create a small helper to translate from file offset to the associated bch_folio_sector index in the underlying bch_folio. The helper assumes the file offset is covered by the passed folio. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Fix a null ptr deref in fsck check_extents()  (Kent Overstreet)
It turns out, in rare situations we need to be passing in a disk reservation, which will be used internally by the transaction commit path when needed. Pass one in... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Fix a slab-out-of-bounds  (Kent Overstreet)
In __bch2_alloc_to_v4_mut(), we overrun the buffer we allocate if the alloc key had backpointers stored in it (which we no longer support). Fix this with a max() call. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Allow answering y or n to all fsck errors of given type  (Kent Overstreet)
This changes the ask_yn() function used by fsck to accept Y or N, meaning yes or no for all errors of a given type. With this, the user can be prompted only for distinct error types - useful when a filesystem has lots of errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: use u64 for folio end pos to avoid overflows  (Brian Foster)
Some of the folio_end_*() helpers are prone to overflow of signed 64-bit types because the mapping is only limited by the max value of loff_t and the associated helpers return the start offset of the next folio. Therefore, a folio_end_pos() of the max allowable folio in a mapping returns a value that overflows loff_t. This makes it hard to rely on such values when doing folio processing across a range of a file, as bcachefs attempts to do with the recent folio changes. For example, generic/564 causes problems in the buffered write path when testing writes at max boundary conditions.

The current understanding is that the pagecache historically limited the mapping to one less page to avoid this problem and this was dropped with some of the folio conversions, but may be reinstated to properly address the problem. In the meantime, update the internal folio_end_*() helpers in bcachefs to return a u64, and all of the associated code to use or cast to u64 to avoid overflow problems. This allows generic/564 to pass and can be reverted back to using loff_t if at any point the pagecache subsystem can guarantee these boundary conditions will not overflow. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: clean up post-eof folios on -ENOSPC  (Brian Foster)
The buffered write path batches folio creations in the file mapping based on the requested size of the write. Under low free space conditions, it is possible to add a bunch of folios to the mapping and then return a short write or -ENOSPC due to lack of space. If this occurs on an extending write, the file size is updated based on the amount of data successfully written to the file. If folios were added beyond the final i_size, they may hang around until reclaimed, truncated or encountered unexpectedly by another operation. For example, generic/083 reproduces a sequence of events where a short write leaves around one or more post-EOF folios on an inode, a subsequent zero range request extends beyond i_size and overlaps with an aforementioned folio, and __bch2_truncate_folio() happens across it and complains.

Update __bch2_buffered_write() to keep track of the start offset of the last folio added to the mapping for a prospective write. After i_size is updated, check whether this offset starts beyond EOF. If so, truncate pagecache beyond the latest EOF to clean up any folios that don't reside at least partially within EOF upon completion of the write. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: fix truncate overflow if folio is beyond EOF  (Brian Foster)
generic/083 occasionally reproduces a panic caused by an overflow when accessing the bch_folio_sector array of the folio being processed by __bch2_truncate_folio(). The immediate cause of the overflow is that the folio offset is beyond i_size, and therefore the sector index calculation underflows on subtraction of the folio offset.

This is mainly observed on nocow mounts. When nocow is enabled, fallocate performs physical block allocation (as opposed to block reservation in cow mode), which range_has_data() then interprets as valid data that requires partial zeroing on truncate. Therefore, if a post-eof zero range request lands across post-eof preallocated blocks, __bch2_truncate_folio() may actually create a post-eof folio in order to perform zeroing. To avoid this problem, update range_has_data() to filter out unwritten blocks from folio creation and partial zeroing.

Even though we should never create folios beyond EOF like this, the mere existence of such folios is not necessarily a fatal error. Fix up the truncate code to warn about this condition and not overflow the sector array and possibly crash the system. The addition of this warning without the corresponding unwritten extent fix has shown that various other fstests are able to reproduce this problem fairly frequently, but often in ways that don't necessarily result in a kernel panic or a change in user observable behavior, so the problem goes undetected. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Enable large folios  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Check for folios that don't have bch_folio attached  (Kent Overstreet)
With large folios, it's now incidentally possible to end up with a clean, uptodate folio in the page cache that doesn't have a bch_folio attached, if a folio has to be split. This patch fixes __bch2_truncate_folio() to check for this; other code paths appear to handle it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch2_readahead() large folio conversion  (Kent Overstreet)
Readahead now uses the new filemap_get_contig_folios_d() helper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: filemap_get_contig_folios_d()  (Kent Overstreet)
Add a new helper for getting a range of contiguous folios and returning them in a darray. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch_folio_sector_state improvements  (Kent Overstreet)
- X-macro-ize the bch_folio_sector_state enum: this means we can easily generate strings, which is helpful for debugging.
- Add helpers for state transitions: folio_sector_dirty(), folio_sector_undirty(), folio_sector_reserve()
- Add folio_sector_set(), a single helper for changing folio sector state, just so that we have a single place to instrument when we're debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch2_truncate_page() large folio conversion  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch2_buffered_write large folio conversion  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch_folio can now handle multi-order folios  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: More assorted large folio conversion  (Kent Overstreet)
Various misc small conversions in fs-io.c for large folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch2_seek_pagecache_data() folio conversion  (Kent Overstreet)
This converts bch2_seek_pagecache_data() to handle large folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bch2_seek_pagecache_hole() folio conversion  (Kent Overstreet)
This converts bch2_seek_pagecache_hole() to handle large folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: bio_for_each_segment_all() -> bio_for_each_folio_all()  (Kent Overstreet)
This converts the writepage end_io path to folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Initial folio conversion  (Kent Overstreet)
This converts fs-io.c to pass folios, not pages. We're not handling large folios yet, there's no functional changes in this patch - just a lot of churn doing the initial type conversions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Rename bch_page_state -> bch_folio  (Kent Overstreet)
Start of the large folio conversion. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Add a bch_page_state assert  (Kent Overstreet)
Seeing an odd bug with page/folio state not being properly initialized, this is to help track it down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Add a cond_resched() call to journal_keys_sort()  (Kent Overstreet)
We're just doing cpu work here and it could take a while; a cond_resched() is definitely needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Improve trace_move_extent_fail()  (Kent Overstreet)
This greatly expands the move_extent_fail tracepoint - now it includes all the information we have available, including exactly why the extent wasn't updated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Print out counters correctly  (Kent Overstreet)
Most counters aren't in units of sectors, and the ones that are should just be switched to bytes, for simplicity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Add missing bch2_err_class() call  (Kent Overstreet)
We're not supposed to return our private error codes to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Rip out code for storing backpointers in alloc keys  (Kent Overstreet)
We don't store backpointers in alloc keys anymore, since we gained the btree write buffer. This patch drops support for backpointers in alloc keys, and revs the on disk format version so that we know a fsck is required. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: use reservation for log messages during recovery  (Brian Foster)
If we block on journal reservation attempting to log journal messages during recovery, particularly for the first message(s) before we start doing actual work, chances are the filesystem ends up deadlocked. Allow logged messages to use reserved journal space to mitigate this problem. In the worst case where no space is available whatsoever, this at least allows the fs to recognize that the journal is stuck and fail the mount gracefully. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Improve trans_restart_split_race tracepoint  (Kent Overstreet)
Seeing occasional test failures where we get stuck in a livelock that involves this event - this will help track it down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Data update path no longer leaves cached replicas  (Kent Overstreet)
It turns out that it's currently impossible to invalidate buckets containing only cached data if they're part of a stripe. The normal bucket invalidate path can't do it because we have to be able to increment the bucket's gen, which isn't correct because it's still a member of the stripe - and the bucket invalidate path makes the bucket available for reuse right away, which also isn't correct for buckets in stripes.

What would work is invalidating cached data by following backpointers, except that cached replicas don't currently get backpointers - because they would be awkward for the existing bucket invalidate path to delete and they haven't been needed elsewhere. So for the time being, to prevent running out of space in stripes, switch the data update path to not leave cached replicas; we may revisit this in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Rhashtable based buckets_in_flight for copygc  (Kent Overstreet)
Previously, copygc used a fifo for tracking buckets in flight - this had the disadvantage of being fixed size, since we pass references to elements into the move code. This restructures it to be a hash table and linked list, since with erasure coding we need to be able to pipeline across an arbitrary number of buckets. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22  bcachefs: Use BTREE_ITER_INTENT in ec_stripe_update_extent()  (Kent Overstreet)
This adds a flags param to bch2_backpointer_get_key() so that we can pass BTREE_ITER_INTENT, since ec_stripe_update_extent() is updating the extent immediately. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>