Age | Commit message (Collapse) | Author |
|
There's no point in doing copygc on non-rw devices: the fragmentation
doesn't matter if we're not writing to them, and we may not have
anywhere to put the data on our other devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, we fixed journal resize spuriousl failing with
-BCH_ERR_open_buckets_empty, but initial journal allocation was missed
because it didn't invoke the "block on allocator" loop at all.
Factor out the "loop on allocator" code to fix that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We shouldn't be setting incompatible bits or the incompatible version
field unless explicitly request or allowed - otherwise we break mounting
with old kernels or userspace.
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
__bch_truncate_folio() may return 1 to indicate dirtyness of the folio
being truncated, needed for fpunch to get the i_size writes correct.
But truncate was forgetting to clear ret, and sometimes returning it as
an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes two deadlocks:
1.pcpu_alloc_mutex involved one as pointed by syzbot[1]
2.recursion deadlock.
The root cause is that we hold the bc lock during alloc_percpu, fix it
by following the pattern used by __btree_node_mem_alloc().
[1] https://lore.kernel.org/all/66f97d9a.050a0220.6bad9.001d.GAE@google.com/T/
Reported-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com
Tested-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes occasional failures from journal resize.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This turned out to have several bugs, which were missed because the fsck
code wasn't properly reporting errors - whoops.
Kicking it out for now, hopefully it can make 6.15.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The fix alone doesn't fix [1], but should be applied before debugging
that.
[1] https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
"nonce inconstancy" is popping up again, causing us to go emergency
read-only.
This one looks less serious, i.e. specific to the encryption path and
not indicative of a data corruption bug. But we'll need more info to
track it down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We don't want to be holding the srcu lock while waiting on btree write
completions - easily fixed.
Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We had some error handling confusion here;
-BCH_ERR_missing_indirect_extent is thrown by
trans_trigger_reflink_p_segment(); at this point we haven't decide
whether we're generating an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Error handling was wrong, causing unhandled transaction restart errors.
check_directory_size() was also inefficient, since keys in multiple
snapshots would be iterated over once for every snapshot. Convert it to
the same scheme used for i_sectors and subdir count checking.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_nocow_write_convert_unwritten is already in transaction context:
00191 ========= TEST generic/648
00242 kernel BUG at fs/bcachefs/btree_iter.c:3332!
00242 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
00242 Modules linked in:
00242 CPU: 4 UID: 0 PID: 2593 Comm: fsstress Not tainted 6.13.0-rc3-ktest-g345af8f855b7 #14403
00242 Hardware name: linux,dummy-virt (DT)
00242 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00242 pc : __bch2_trans_get+0x120/0x410
00242 lr : __bch2_trans_get+0xcc/0x410
00242 sp : ffffff80d89af600
00242 x29: ffffff80d89af600 x28: ffffff80ddb23000 x27: 00000000fffff705
00242 x26: ffffff80ddb23028 x25: ffffff80d8903fe0 x24: ffffff80ebb30168
00242 x23: ffffff80c8aeb500 x22: 000000000000005d x21: ffffff80d8904078
00242 x20: ffffff80d8900000 x19: ffffff80da9e8000 x18: 0000000000000000
00242 x17: 64747568735f6c61 x16: 6e72756f6a20726f x15: 0000000000000028
00242 x14: 0000000000000004 x13: 000000000000f787 x12: ffffffc081bbcdc8
00242 x11: 0000000000000000 x10: 0000000000000003 x9 : ffffffc08094efbc
00242 x8 : 000000001092c111 x7 : 000000000000000c x6 : ffffffc083c31fc4
00242 x5 : ffffffc083c31f28 x4 : ffffff80c8aeb500 x3 : ffffff80ebb30000
00242 x2 : 0000000000000001 x1 : 0000000000000a21 x0 : 000000000000028e
00242 Call trace:
00242 __bch2_trans_get+0x120/0x410 (P)
00242 bch2_inum_offset_err_msg+0x48/0xb0
00242 bch2_nocow_write_convert_unwritten+0x3d0/0x530
00242 bch2_nocow_write+0xeb0/0x1000
00242 __bch2_write+0x330/0x4e8
00242 bch2_write+0x1f0/0x530
00242 bch2_direct_write+0x530/0xc00
00242 bch2_write_iter+0x160/0xbe0
00242 vfs_write+0x1cc/0x360
00242 ksys_write+0x5c/0xf0
00242 __arm64_sys_write+0x20/0x30
00242 invoke_syscall.constprop.0+0x54/0xe8
00242 do_el0_svc+0x44/0xc0
00242 el0_svc+0x34/0xa0
00242 el0t_64_sync_handler+0x104/0x130
00242 el0t_64_sync+0x154/0x158
00242 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000)
00242 ---[ end trace 0000000000000000 ]---
00242 Kernel panic - not syncing: Oops - BUG: Fatal exception
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
_orig_restart_count is unused now, according to the logic, trans_was_restarted
should be using _orig_restart_count.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Incorrectly handled transaction restarts can be a source of heisenbugs;
add a mode where we randomly inject them to shake them out.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
want_new_bset() returns the address of a new bset to initialize if we
wish to do so in a btree node - either because the previous one is too
big, or because it's been written.
The case for 'previous bset was written' was wrong: it's only supposed
to check for if we have space in the node for one more block, but
because it subtracted the header from the space available it would never
initialize a new bset if we were down to the last block in a node.
Fixing this results in fewer btree node splits/compactions, which fixes
a bug with flushing the journal to go read-only sometimes not
terminating or taking excessively long.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This lets us flush the journal to go read-only more effectively.
Flushing the journal and going read-only requires halting mutually
recursive processes, which strictly speaking are not guaranteed to
terminate.
Flushing btree node journal pins will kick off a btree node write, and
btree node writes on completion must do another btree update to the
parent node to update the 'sectors_written' field for that node's key.
If the parent node is full and requires a split or compaction, that's
going to generate a whole bunch of additional btree updates - alloc
info, LRU btree, and more - which then have to be flushed, and the cycle
repeats.
This process will terminate much more effectively if we tweak journal
reclaim to flush btree updates leaf to root: i.e., don't flush updates
for a given btree node (kicking off a write, and consuming space within
that node up to the next block boundary) if there might still be
unflushed updates in child nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
acc->k.data should be used with the lock hold:
00221 ========= TEST generic/187
00221 run fstests generic/187 at 2025-02-09 21:08:10
00221 spectre-v4 mitigation disabled by command-line option
00222 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00222 bcachefs (vdc): initializing new filesystem
00222 bcachefs (vdc): going read-write
00222 bcachefs (vdc): marking superblocks
00222 bcachefs (vdc): initializing freespace
00222 bcachefs (vdc): done initializing freespace
00222 bcachefs (vdc): reading snapshots table
00222 bcachefs (vdc): reading snapshots done
00222 bcachefs (vdc): done starting filesystem
00222 bcachefs (vdc): shutting down
00222 bcachefs (vdc): going read-only
00222 bcachefs (vdc): finished waiting for writes to stop
00223 bcachefs (vdc): flushing journal and stopping allocators, journal seq 6
00223 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 8
00223 bcachefs (vdc): clean shutdown complete, journal seq 9
00223 bcachefs (vdc): marking filesystem clean
00223 bcachefs (vdc): shutdown complete
00223 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00223 bcachefs (vdc): initializing new filesystem
00223 bcachefs (vdc): going read-write
00223 bcachefs (vdc): marking superblocks
00223 bcachefs (vdc): initializing freespace
00223 bcachefs (vdc): done initializing freespace
00223 bcachefs (vdc): reading snapshots table
00223 bcachefs (vdc): reading snapshots done
00223 bcachefs (vdc): done starting filesystem
00244 hrtimer: interrupt took 123350440 ns
00264 bcachefs (vdc): shutting down
00264 bcachefs (vdc): going read-only
00264 bcachefs (vdc): finished waiting for writes to stop
00264 bcachefs (vdc): flushing journal and stopping allocators, journal seq 97
00265 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 101
00265 bcachefs (vdc): clean shutdown complete, journal seq 102
00265 bcachefs (vdc): marking filesystem clean
00265 bcachefs (vdc): shutdown complete
00265 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00265 bcachefs (vdc): recovering from clean shutdown, journal seq 102
00265 bcachefs (vdc): accounting_read...
00265 ==================================================================
00265 done
00265 BUG: KASAN: slab-use-after-free in bch2_fs_to_text+0x12b4/0x1728
00265 bcachefs (vdc): alloc_read... done
00265 bcachefs (vdc): stripes_read... done
00265 Read of size 4 at addr ffffff80c57eac00 by task cat/7531
00265 bcachefs (vdc): snapshots_read... done
00265
00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103
00265 Hardware name: linux,dummy-virt (DT)
00265 Call trace:
00265 show_stack+0x1c/0x30 (C)
00265 dump_stack_lvl+0x6c/0x80
00265 print_report+0xf8/0x5d8
00265 kasan_report+0x90/0xd0
00265 __asan_report_load4_noabort+0x1c/0x28
00265 bch2_fs_to_text+0x12b4/0x1728
00265 bch2_fs_show+0x94/0x188
00265 sysfs_kf_seq_show+0x1a4/0x348
00265 kernfs_seq_show+0x12c/0x198
00265 seq_read_iter+0x27c/0xfd0
00265 kernfs_fop_read_iter+0x390/0x4f8
00265 vfs_read+0x480/0x7f0
00265 ksys_read+0xe0/0x1e8
00265 __arm64_sys_read+0x70/0xa8
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265
00265 Allocated by task 7510:
00265 kasan_save_stack+0x28/0x50
00265 kasan_save_track+0x1c/0x38
00265 kasan_save_alloc_info+0x3c/0x50
00265 __kasan_kmalloc+0xac/0xb0
00265 __kmalloc_node_noprof+0x168/0x348
00265 __kvmalloc_node_noprof+0x20/0x140
00265 __bch2_darray_resize_noprof+0x90/0x1b0
00265 __bch2_accounting_mem_insert+0x76c/0xb08
00265 bch2_accounting_mem_insert+0x224/0x3b8
00265 bch2_accounting_mem_mod_locked+0x480/0xc58
00265 bch2_accounting_read+0xa94/0x3eb8
00265 bch2_run_recovery_pass+0x80/0x178
00265 bch2_run_recovery_passes+0x340/0x698
00265 bch2_fs_recovery+0x1c98/0x2bd8
00265 bch2_fs_start+0x240/0x490
00265 bch2_fs_get_tree+0xe1c/0x1458
00265 vfs_get_tree+0x7c/0x250
00265 path_mount+0xe24/0x1648
00265 __arm64_sys_mount+0x240/0x438
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265
00265 Freed by task 7510:
00265 kasan_save_stack+0x28/0x50
00265 kasan_save_track+0x1c/0x38
00265 kasan_save_free_info+0x48/0x88
00265 __kasan_slab_free+0x48/0x60
00265 kfree+0x188/0x408
00265 kvfree+0x3c/0x50
00265 __bch2_darray_resize_noprof+0xe0/0x1b0
00265 __bch2_accounting_mem_insert+0x76c/0xb08
00265 bch2_accounting_mem_insert+0x224/0x3b8
00265 bch2_accounting_mem_mod_locked+0x480/0xc58
00265 bch2_accounting_read+0xa94/0x3eb8
00265 bch2_run_recovery_pass+0x80/0x178
00265 bch2_run_recovery_passes+0x340/0x698
00265 bch2_fs_recovery+0x1c98/0x2bd8
00265 bch2_fs_start+0x240/0x490
00265 bch2_fs_get_tree+0xe1c/0x1458
00265 vfs_get_tree+0x7c/0x250
00265 path_mount+0xe24/0x1648
00265 bcachefs (vdc): going read-write
00265 __arm64_sys_mount+0x240/0x438
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265
00265 The buggy address belongs to the object at ffffff80c57eac00
00265 which belongs to the cache kmalloc-128 of size 128
00265 The buggy address is located 0 bytes inside of
00265 freed 128-byte region [ffffff80c57eac00, ffffff80c57eac80)
00265
00265 The buggy address belongs to the physical page:
00265 page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1057ea
00265 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
00265 flags: 0x8000000000000040(head|zone=2)
00265 page_type: f5(slab)
00265 raw: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122
00265 raw: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400
00265 head: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122
00265 head: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400
00265 head: 8000000000000001 fffffffec315fa81 ffffffffffffffff 0000000000000000
00265 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000
00265 page dumped because: kasan: bad access detected
00265
00265 Memory state around the buggy address:
00265 ffffff80c57eab00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00265 ffffff80c57eab80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00265 >ffffff80c57eac00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
00265 ^
00265 ffffff80c57eac80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00265 ffffff80c57ead00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
00265 ==================================================================
00265 Kernel panic - not syncing: kasan.fault=panic set ...
00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103
00265 Hardware name: linux,dummy-virt (DT)
00265 Call trace:
00265 show_stack+0x1c/0x30 (C)
00265 dump_stack_lvl+0x30/0x80
00265 dump_stack+0x18/0x20
00265 panic+0x4d4/0x518
00265 start_report.constprop.0+0x0/0x90
00265 kasan_report+0xa0/0xd0
00265 __asan_report_load4_noabort+0x1c/0x28
00265 bch2_fs_to_text+0x12b4/0x1728
00265 bch2_fs_show+0x94/0x188
00265 sysfs_kf_seq_show+0x1a4/0x348
00265 kernfs_seq_show+0x12c/0x198
00265 seq_read_iter+0x27c/0xfd0
00265 kernfs_fop_read_iter+0x390/0x4f8
00265 vfs_read+0x480/0x7f0
00265 ksys_read+0xe0/0x1e8
00265 __arm64_sys_read+0x70/0xa8
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265 SMP: stopping secondary CPUs
00265 Kernel Offset: disabled
00265 CPU features: 0x000,00000070,00000010,8240500b
00265 Memory Limit: none
00265 ---[ end Kernel panic - not syncing: kasan.fault=panic set ... ]---
00270 ========= FAILED TIMEOUT generic.187 in 1200s
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
reflink pointers to missing indirect extents aren't deleted, they just
have an error bit set - in case the indirect extent somehow reappears.
fsck/mark and sweep thus needs to ignore these errors.
Also, they can be marked AUTOFIX now.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch_extent_rebalance
Previously, bch2_bkey_sectors_need_rebalance() called
bch2_target_accepts_data(), checking whether the target is writable.
However, this means that adding or removing devices from a target would
change the value of bch2_bkey_sectors_need_rebalance() for an existing
extent; this needs to be invariant so that the extent trigger can
correctly maintain rebalance_work accounting.
Instead, check target_accepts_data() in io_opts_to_rebalance_opts(),
before creating the bch_extent_rebalance entry.
This fixes (one?) cause of rebalance_work accounting being off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Spotted by sparse.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The discard path is supposed to issue journal flushes when there's too
many buckets empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.
Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In the previous commit b3d82c2f2761, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.
However, __journal_res_get() , which calls journal_entry_open() , calls
journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on()
internally tries to acquire journal->lock , which results in a deadlock.
So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.
Fixes: b3d82c2f2761 ("bcachefs: Guard against journal seq overflow")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For some unknown reason, checks on struct bkey_s_c_snapshot and struct
bkey_s_c_snapshot_tree pointers are missing.
Therefore, I think it would be appropriate to fix the incorrect pointer checking
through this patch.
Fixes: 4bd06f07bcb5 ("bcachefs: Fixes for snapshot_tree.master_subvol")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull misc vfs cleanups from Al Viro:
"Two unrelated patches - one is a removal of long-obsolete include in
overlayfs (it used to need fs/internal.h, but the extern it wanted has
been moved back to include/linux/namei.h) and another introduces
convenience helper constructing struct qstr by a NUL-terminated
string"
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add a string-to-qstr constructor
fs/overlayfs/namei.c: get rid of include ../internal.h
|
|
Pull bcachefs fixes from Kent Overstreet:
- second half of a fix for a bug that'd been causing oopses on
filesystems using snapshots with memory pressure (key cache fills for
snaphots btrees are tricky)
- build fix for strange compiler configurations that double stack frame
size
- "journal stuck timeout" now takes into account device latency: this
fixes some spurious warnings, and the main remaining source of SRCU
lock hold time warnings (I'm no longer seeing this in my CI, so any
users still seeing this should definitely ping me)
- fix for slow/hanging unmounts (" Improve journal pin flushing")
- some more tracepoint fixes/improvements, to chase down the "rebalance
isn't making progress" issues
* tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs:
bcachefs: Improve trace_move_extent_finish
bcachefs: Fix trace_copygc
bcachefs: Journal writes are now IOPRIO_CLASS_RT
bcachefs: Improve journal pin flushing
bcachefs: fix bch2_btree_node_flags
bcachefs: rebalance, copygc enabled are runtime opts
bcachefs: Improve decompression error messages
bcachefs: bset_blacklisted_journal_seq is now AUTOFIX
bcachefs: "Journal stuck" timeout now takes into account device latency
bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
bcachefs: Fix btree_trans_peek_key_cache()
|
|
Quite a few places want to build a struct qstr by given string;
it would be convenient to have a primitive doing that, rather
than open-coding it via QSTR_INIT().
The closest approximation was in bcachefs, but that expands to
initializer list - {.len = strlen(string), .name = string}.
It would be more useful to have it as compound literal -
(struct qstr){.len = strlen(string), .name = string}.
Unlike initializer list it's a valid expression. What's more,
it's a valid lvalue - it's an equivalent of anonymous local
variable with such initializer, so the things like
path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
are valid. It can also be used as initializer, with identical
effect -
struct qstr x = (struct qstr){.name = s, .len = strlen(s)};
is equivalent to
struct qstr anon_variable = {.name = s, .len = strlen(s)};
struct qstr x = anon_variable;
// anon_variable is never used after that point
and any even remotely sane compiler will manage to collapse that
into
struct qstr x = {.name = s, .len = strlen(s)};
What compound literals can't be used for is initialization of
global variables, but those are covered by QSTR_INIT().
This commit lifts definition(s) of QSTR() into linux/dcache.h,
converts it to compound literal (all bcachefs users are fine
with that) and converts assorted open-coded instances to using
that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We're currently debugging issues with rebalance, where it's not making
progress as quickly as it should be (or sometimes not at all).
Add the full data_update to the move_extent_finish tracepoint, so we can
check that the replicas we wrote match what we were supposed to do.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
System performance is particularly sensitive to journal write latency,
the number of outstanding journal writes is bounded and we can't issue
journal flushes until other journal writes have completed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.
The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.
So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
we can wait for key cache/write buffer pin flushing to complete
before flushing dirty btree nodes
- A new closure_waitlist is added for bch2_journal_flush_pins; this one
is only used under or when we're taking the journal lock, so it's
pretty cheap to add rigorously correct wakeups to journal_pin_set()
and journal_pin_drop().
Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.
Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" a few users have been noticing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix a regression from when these were switched to normal opts.h options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Ratelimit them, and use the new bch2_write_op_error() helper that prints
path and file offset.
Reported-by: https://github.com/koverstreet/bcachefs/issues/819
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
- Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
directly accessible via the library API, instead of requiring the
crypto API. This is much simpler and more efficient.
- Convert some users such as ext4 to use the CRC32 library API instead
of the crypto API. More conversions like this will come later.
- Add a KUnit test that tests and benchmarks multiple CRC variants.
Remove older, less-comprehensive tests that are made redundant by
this.
- Add an entry to MAINTAINERS for the kernel's CRC library code. I'm
volunteering to maintain it. I have additional cleanups and
optimizations planned for future cycles.
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits)
MAINTAINERS: add entry for CRC library
powerpc/crc: delete obsolete crc-vpmsum_test.c
lib/crc32test: delete obsolete crc32test.c
lib/crc16_kunit: delete obsolete crc16_kunit.c
lib/crc_kunit.c: add KUnit test suite for CRC library functions
powerpc/crc-t10dif: expose CRC-T10DIF function through lib
arm64/crc-t10dif: expose CRC-T10DIF function through lib
arm/crc-t10dif: expose CRC-T10DIF function through lib
x86/crc-t10dif: expose CRC-T10DIF function through lib
crypto: crct10dif - expose arch-optimized lib function
lib/crc-t10dif: add support for arch overrides
lib/crc-t10dif: stop wrapping the crypto API
scsi: target: iscsi: switch to using the crc32c library
f2fs: switch to using the crc32 library
jbd2: switch to using the crc32c library
ext4: switch to using the crc32c library
lib/crc32: make crc32c() go directly to lib
bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
x86/crc32: expose CRC32 functions through lib
x86/crc32: update prototype for crc32_pclmul_le_16()
...
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If a block device (e.g. your typical consumer SSD) is taking multiple
seconds for IOs (typically flushes), we don't want to emit the "journal
stuck" message prematurely.
Also, make sure to drop the btree_trans srcu lock if we're blocking for
more than a second.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We don't need all the helpers inlined here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.
It tells traverse to look up and lock the key cache entry if present,
but don't create one if it doesn't exist.
That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).
The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.
Fixes: bd5b09727f3d ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
|
|
Can't use memcmp() when the struct contains padding.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We've got a problem with bch_stripe that is going to take an on disk
format rev to fix - we can't access the block sector counts if the
checksum type is unknown.
Document it for now, there are a few other things to fix as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We were incorrectly checking if there'd been an io error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The transaction is going to abort, so there will be no cycle involving
this transaction anymore.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When the cycle doesn't involve the initiator of the cycle detection,
we might choose a transaction that is not involved in the cycle to abort.
It shouldn't be that since it won't break the cycle, this patch
therefore chooses the transaction in the cycle to abort.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch introduces a helper function called lock_graph_pop_from,
it pops the graph from i.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If the transaction chose itself as a victim before and restarted, it
might request a no fail lock request this time. But it might be added to
others' lock graph and be chose as the victim again, it's no longer safe
without additional check. We can also convert the cycle detector to be
fully RCU-based to solve that unsoundness, but the latency added to trans_put
and additional memory required may not worth it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If the lock has been acquired and unlocked, we don't have to do clear
and wakeup again, though harmless since we hold the intent lock. Merge
the condition might be clearer.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|