| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and IPsec.
Current release - regressions:
- do not acquire dev->tx_global_lock in netdev_watchdog_up()
- ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
- fix deadlock in nested UP notifier events
Current release - new code bugs:
- eth:
- cn20k: fix subbank free list indexing for search order
- airoha: fix BQL underflow in shared QDMA TX ring
Previous releases - regressions:
- netfilter:
- flowtable: fix offloaded ct timeout never being extended
- nf_conncount: prevent connlimit drops for early confirmed ct
Previous releases - always broken:
- require CAP_NET_ADMIN in the originating netns when modifying
cross-netns devices
- report NAPI thread PID in the caller's pid namespace
- mac802154: fix dirty frag in in-place crypto for IOT radios
- sctp: hold socket lock when dumping endpoints in sctp_diag, avoid
an overflow
- eth: gve: fix header buffer corruption with header-split and HW-GRO
- af_key: initialize alg_key_len for IPComp states, prevent OOB read"
* tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (213 commits)
selftests: bonding: add a test for VLAN propagation over a bonded real device
vlan: defer real device state propagation to netdev_work
net: add the driver-facing netdev_work scheduling API
net: turn the rx_mode work into a generic netdev_work facility
net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
rxrpc: Fix rxrpc_rotate_tx_rotate() to check there's something to rotate
rxrpc: Fix leak of released call in recvmsg(MSG_PEEK)
rxrpc: Fix socket notification race
rxrpc: Fix potential infinite loop in rxrpc_recvmsg()
rxrpc: Fix oob challenge leak in cleanup after notification failure
rxrpc: Fix the reception of a reply packet before data transmission
afs: Fix uncancelled rxrpc OOB message handler
afs: Fix further netns teardown to cancel the preallocation charger
rxrpc: Fix double unlock in rxrpc_recvmsg()
rxrpc: Fix leak of connection from OOB challenge
rxrpc: Fix ACKALL packet handling
net: hns3: differentiate autoneg default values between copper and fiber
net: hns3: fix permanent link down deadlock after reset
net: hns3: refactor MAC autoneg and speed configuration
net: hns3: unify copper port ksettings configuration path
...
|
|
Fix AFS to cancel its OOB message processing (typically to respond to
security challenges). Also move OOB message processing to afs_wq so that
it's also waited for and make the OOB handler just return if the net
namespace is no longer live.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260624163819.3017002-6-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When an afs network namespace is torn down, it cancels and waits for the
work item that keeps the preallocated rxrpc call/conn/peer queue charged
before disabling incoming (i.e. listen 0), but there's a small window in
which it can be requeued by an incoming call wending through the I/O
thread.
Fix this by cancelling the charger work item again after reducing the
listen backlog to zero.
Fixes: 47694fbc9d24 ("afs: Fix netns teardown to cancel the preallocation charger")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260624163819.3017002-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"Added:
- depth limit to indx_find_buffer() to prevent stack overflow
- validate split-point offset in indx_insert_into_buffer()
- bounds check to run_get_highest_vcn()
- fileattr_get() and fileattr_set() support
- zero stale pagecache beyond valid data length
- handle delayed allocation overlap in run lookup
- validate lcns_follow in log_replay() conversion
- cap RESTART_TABLE free-chain walker at rt->used
- resize log->one_page_buf when adopting on-disk page size
- reject direct userspace writes to reserved $LX* xattrs
Fixed:
- out-of-bounds read in decompress_lznt()
- avoid -Wmaybe-uninitialized warnings
- hold ni_lock across readdir metadata walk
- preserve non-DOS attribute bits in system.dos_attrib
- validate index entry key bounds
- syncing wrong inode on DIRSYNC cross-directory rename
- validate Dirty Page Table capacity in log_replay() copy_lcns
- wrong LCN in run_remove_range() when splitting a run
- allocate iomap inline_data using alloc_page
- mount failure on 64K page-size kernels
- out-of-bounds read in ntfs_dir_emit() and hdr_find_e()
- bound attr_off in UpdateResidentValue against data_off
- bound DeleteIndexEntryAllocation memmove length
- bound copy_lcns dp->page_lcns[] index in analysis pass
- bound NTFS_DE view.data_off in UpdateRecordData{Root,Allocation}
- prevent potential lcn remains uninitialized
Changed:
- bound to_move in indx_insert_into_root() before hdr_insert_head()
- call _ntfs_bad_inode() when failing to rename
- fold resident writeback into writepages loop
- force waiting for direct I/O completion
- fold file size handling into ntfs_set_size()
- reject SEEK_DATA and SEEK_HOLE past EOF early
- format code, add descriptive comments and remove non-useful"
* tag 'ntfs3_for_7.2' of https://github.com/Paragon-Software-Group/linux-ntfs3: (34 commits)
ntfs3: reject direct userspace writes to reserved $LX* xattrs
fs/ntfs3: resize log->one_page_buf when adopting on-disk page size
fs/ntfs3: prevent potential lcn remains uninitialized
ntfs3: cap RESTART_TABLE free-chain walker at rt->used
fs/ntfs3: bound NTFS_DE view.data_off in UpdateRecordData{Root,Allocation}
fs/ntfs3: validate lcns_follow in log_replay conversion
fs/ntfs3: bound copy_lcns dp->page_lcns[] index in analysis pass
fs/ntfs3: bound DeleteIndexEntryAllocation memmove length
fs/ntfs3: bound attr_off in UpdateResidentValue against data_off
ntfs3: fix out-of-bounds read in ntfs_dir_emit() and hdr_find_e()
fs/ntfs3: fix mount failure on 64K page-size kernels
ntfs3: avoid another -Wmaybe-uninitialized warning
ntfs3: Allocate iomap inline_data using alloc_page
fs/ntfs3: format code, deal with comments
fs/ntfs3: reject SEEK_DATA and SEEK_HOLE past EOF early
fs/ntfs3: fold file size handling into ntfs_set_size()
fs/ntfs3: force waiting for direct I/O completion
fs/ntfs3: fold resident writeback into writepages loop
fs/ntfs3: handle delayed allocation overlap in run lookup
fs/ntfs3: zero stale pagecache beyond valid data length
...
|
|
Pull NFS client updates from Anna Schumaker:
"New features:
- XPRTRDMA: Decouple req recycling from RPC completion
- NFS: Expose FMODE_NOWAIT for read-only files
Bugfixes:
- SUNRPC:
- Fix sunrpc sysfs error handling
- Fix uninitialized xprt_create_args structure
- XPRTRDMA:
- Harden connect and reply handling
- NFS:
- Fix EOF updates after fallocate/zero-range
- Keep PG_UPTODATE clear after read errors in page groups
- Use nfsi->rwsem to protect traversal of the file lock list
- Prevent resource leak in nfs_alloc_server()
- NFSv4:
- Clear exception state on successful mkdir retry
- Don't skip revalidate when holding a dir delegation and attrs are stale
- pNFS:
- Fix use-after-free in pnfs_update_layout()
- Defer return_range callbacks until after inode unlock
- Fix LAYOUTCOMMIT retry loop on OLD_STATEID
- Reject zero-length r_addr in nfs4_decode_mp_ds_addr
- NFS/flexfiles:
- Reject zero-length filehandle version arrays
- Fix checking if a layout is striped
- Fixes for honoring FF_FLAGS_NO_IO_THRU_MDS
Other cleanups and improvements:
- Remove the fileid field from struct nfs_inode
- Move long-delayed xprtrdma work onto the system_dfl_long_wq
- Convert xprtrdma send buffer free list to an llist
- Show "<redacted>" for cert_serial and privkey_serial mount options"
* tag 'nfs-for-7.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (42 commits)
NFS: Use common error handling code in nfs_alloc_server()
NFS: Prevent resource leak in nfs_alloc_server()
NFSv4/pNFS: reject zero-length r_addr in nfs4_decode_mp_ds_addr
nfs: don't skip revalidate on directory delegation when attrs flagged stale
xprtrdma: Return sendctx slot after Send preparation failure
xprtrdma: Repost Receive buffers for malformed replies
xprtrdma: Sanitize the reply credit grant after parsing
xprtrdma: Fix bcall rep leak and unbounded peek
xprtrdma: Resize reply buffers before reposting receives
xprtrdma: Check frwr_wp_create() during connect
xprtrdma: Initialize re_id before removal registration
xprtrdma: Fix ep kref imbalance on ADDR_CHANGE
xprtrdma: Convert send buffer free list to llist
NFS: correct CONFIG_NFS_V4 macro name in #endif comment
nfs: use nfsi->rwsem to protect traversal of the file lock list
NFSv4.1/pNFS: fix LAYOUTCOMMIT retry loop on OLD_STATEID
nfs: expose FMODE_NOWAIT for read-only files
nfs: add nowait version of nfs_start_io_direct
NFSv4/flexfiles: honor FF_FLAGS_NO_IO_THRU_MDS in pg_get_mirror_count_write
NFSv4/flexfiles: honor FF_FLAGS_NO_IO_THRU_MDS on fatal DS connect errors
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"The changes primarily focus on filesystem error reporting, reducing
memory footprint by reverting in-memory data structures used for
runtime validation, honoring FDP hints, and adding trace and debug
logs. In addition, there are critical bug fixes resolving
out-of-bounds read vulnerabilities in inline directory and ACL
handling, potential deadlocks in balance_fs, use-after-free issues in
atomic writes, and false data/node type assignments in large sections.
Enhancements:
- Revert in-memory sit version and block bitmaps
- support to report fserror
- add trace_f2fs_fault_report
- add iostat latency tracking for direct IO
- add logs in f2fs_disable_checkpoint()
- honor per-I/O write streams for direct writes
- map data writes to FDP streams
- skip inode folio lookup for cached overwrite
- skip direct I/O iostat context when disabled
- revert "check in-memory block bitmap"
- revert "check in-memory sit version bitmap"
Fixes:
- optimize representative type determination in GC
- fix incorrect FI_NO_EXTENT handling in __destroy_extent_node()
- fix potential deadlock in f2fs_balance_fs()
- fix potential deadlock in gc_merge path of f2fs_balance_fs()
- atomic: fix UAF issue on f2fs_inode_info.atomic_inode
- fix missing read bio submission on large folio error
- pass correct iostat type for single node writes
- fix to do sanity check on f2fs_get_node_folio_ra()
- validate orphan inode entry count
- keep atomic write retry from zeroing original data
- read COW data with the original inode during atomic write
- validate inline dentry name lengths before conversion
- validate dentry name length before lookup compares it
- reject setattr size changes on large folio files
- revert "remove non-uptodate folio from the page cache in move_data_block"
- validate ACL entry sizes in f2fs_acl_from_disk()
- bound i_inline_xattr_size for non-inline-xattr inodes
- fix listxattr handling of corrupted xattr entries
- fix to round down start offset of fallocate for pin file"
* tag 'f2fs-for-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (42 commits)
f2fs: fix to round down start offset of fallocate for pin file
f2fs: fix listxattr handling of corrupted xattr entries
f2fs: skip direct I/O iostat context when disabled
f2fs: remove unneeded f2fs_is_compressed_page()
f2fs: avoid unnecessary fscrypt_finalize_bounce_page()
f2fs: avoid unnecessary sanity check on ckpt_valid_blocks
f2fs: misc cleanup in f2fs_record_stop_reason()
f2fs: fix wrong description in printed log
f2fs: bound i_inline_xattr_size for non-inline-xattr inodes
f2fs: validate ACL entry sizes in f2fs_acl_from_disk()
Revert "f2fs: remove non-uptodate folio from the page cache in move_data_block"
f2fs: Split f2fs_write_end_io()
f2fs: Rename f2fs_post_read_wq into f2fs_wq
f2fs: Prepare for supporting delayed bio completion
f2fs: reject setattr size changes on large folio files
f2fs: validate dentry name length before lookup compares it
f2fs: validate inline dentry name lengths before conversion
f2fs: read COW data with the original inode during atomic write
f2fs: skip inode folio lookup for cached overwrite
f2fs: keep atomic write retry from zeroing original data
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "khugepaged: add mTHP collapse support" (Nico Pache)
Provide khugepaged with the capability to collapse anonymous memory
regions to mTHPs
- "Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable
files" (Zi Yan)
Remove the READ_ONLY_THP_FOR_FS check in file_thp_enabled(), so that
khugepaged and MADV_COLLAPSE can run on filesystems with PMD THP
pagecache support even without READ_ONLY_THP_FOR_FS enabled
- "make MM selftests more CI friendly" (Mike Rapoport)
General fixes and cleanups to the MM selftests. Also move more MM
selftests under the kselftest framework, making them more amenable to
ongoing CI testing
- "selftests/mm: fix failures and robustness improvements" and
"selftests/mm: assorted fixes for hmm-tests" (Sayali Patil)
Fix several issues in MM selftests which were revealed by powerpc 64k
pagesize
* tag 'mm-stable-2026-06-23-08-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (118 commits)
Revert "mm: limit filemap_fault readahead to VMA boundaries"
mm/vmscan: pass NULL to trace vmscan node reclaim
mm: use mapping_mapped to simplify the code
selftests/mm: fix exclusive_cow test fork() handling
selftests/mm: remove hardcoded THP sizing assumptions in hmm tests
selftests/mm: allow PUD-level entries in compound testcase of hmm tests
mm/gup_test: reject wrapped user ranges
mm/page_frag: reject invalid CPUs in page_frag_test
mm/damon/core: always put unsuccessfully committed target pids
mm: page_isolation: avoid unsafe folio reads while scanning compound pages
mm/shrinker: do not hold RCU lock in shrinker_debugfs_count_show()
selftests: mm: fix and speedup "droppable" test
mm: merge writeout into pageout
MAINTAINERS: add Hao Ge as reviewer for codetag and alloc_tag
selftests/mm: clarify alternate unmapping in compaction_test
selftests/mm: move hwpoison setup into run_test() and silence modprobe output for memory-failure category
selftests/mm: skip uffd-stress test when nr_pages_per_cpu is zero
selftests/mm: skip uffd-wp-mremap if UFFD write-protect is unsupported
selftests/mm: ensure destination is hugetlb-backed in hugetlb-mremap
selftest/mm: register existing mapping with userfaultfd in hugetlb-mremap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"The most notable change is the removal of the fscache backend: it has
been deprecated for almost two years, mainly because EROFS file-backed
mounts and fanotify pre-content hooks (together with erofs-utils) now
provide better functionality and simpler codebase. In addition,
fscache has depended on netfslib for years, which is undesirable for
EROFS since it is a local filesystem. More details in [1].
In addition, sparse support has been added to the pcluster layout,
which is helpful for large sparse AI datasets, and map requests for
chunk-based inodes have been optimized to be more efficient as well.
There are also the usual fixes and cleanups.
Summary:
- Report more consecutive chunks of the same type for
each iomap request
- Add sparse support for the pcluster layout
- Update the EROFS documentation overview
- Remove the deprecated fscache backend
- Various fixes and cleanups"
Link: https://lore.kernel.org/r/20260622013622.934174-1-hsiangkao@linux.alibaba.com [1]
* tag 'erofs-for-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: handle 48-bit blocks_hi for compressed inodes
erofs: remove fscache backend entirely
erofs: simplify RCU read critical sections
erofs: add sparse support to pcluster layout
erofs: add folio order to trace_erofs_read_folio
erofs: introduce erofs_map_chunks()
erofs: call erofs_exit_ishare() before rcu_barrier()
erofs: update the overview of the documentation
erofs: clean up erofs_ishare_fill_inode()
|
|
Currently, the length of fallocate for pin file is section-aligned to
keep allocated sections from being selected as victims of GC. However,
for the case that the start offset of fallocate is not aligned in
section, the allocated sections can't be fully utilized. It's because a
new section is allocated by f2fs_allocate_pinning_section() after using
blks_per_sec blocks regardless of the start offset. As a result, several
unexpected dirty segments may be created, including blocks assigned to
the pinned file.
To address this issue, let's round down the start offset of fallocate
to the length of section.
The reproducing scenario is as below
chunk=$(((2<<20)+4096)) # 2MB + 4KB
touch test
f2fs_io pinfile set test
f2fs_io fallocate 0 0 $chunk test
f2fs_io fallocate 0 $chunk $chunk test
f2fs_io fallocate 0 $((chunk*2)) $chunk test
f2fs_io fiemap 0 $((chunk*3)) test
Fiemap: offset = 0 len = 12288
logical addr. physical addr. length flags
0 0000000000000000 000000068c600000 0000000000400000 00001088
1 0000000000400000 000000003d400000 0000000000001000 00001088
2 0000000000401000 00000003eb200000 0000000000200000 00001088
3 0000000000601000 00000005e4200000 0000000000001000 00001088
4 0000000000602000 0000000605400000 0000000000200000 00001089
Cc: stable@vger.kernel.org
Fixes: f5a53edcf01e ("f2fs: support aligned pinned file")
Reviewed-by: Yunji Kang <yunji0.kang@samsung.com>
Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Validate the xattr entry before reading its fields in f2fs_listxattr().
Return -EFSCORRUPTED when the entry is outside the valid xattr storage
area instead of returning a successful partial result.
Fixes: 688078e7f36c ("f2fs: fix to avoid memory leakage in f2fs_listxattr")
Cc: stable@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Keshav Verma <iganschel@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
F2FS iostat is optional and is disabled by default. Direct I/O still
allocates and binds a bio_iostat_ctx, updates the submit timestamp, and
replaces bi_end_io for every DIO bio even when sbi->iostat_enable is
false.
The byte accounting calls do not need an extra guard because
f2fs_update_iostat() already checks sbi->iostat_enable. Only skip the
DIO bio context setup when iostat is disabled. If iostat is enabled
through sysfs before submission, the existing context allocation and
latency accounting path is still used.
QEMU benchmark on a 1GiB F2FS virtio-blk image, with iostat_enable=0,
4KiB O_DIRECT I/O over a 64MiB file, 50000 iterations per run:
baseline patched
direct_read median 65264.50 ns 55470.95 ns
direct_read recheck 65553.75 ns 55470.95 ns
direct_write median 68054.62 ns 56309.44 ns
direct_write recheck 66873.51 ns 56309.44 ns
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We have checked f2fs_is_compressed_page() before f2fs_compress_write_end_io(),
so we don't need to check the status again, remove it.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
fscrypt_finalize_bounce_page() should be called only if we use fs layer
crypto, let's avoid unnecessary fscrypt_finalize_bounce_page() in error
path of f2fs_write_compressed_pages().
BTW, fscrypt_finalize_bounce_page() will check mapping of bounced page
before retrieving original page, so, previously it won't cause any issue
w/ fscrypt_finalize_bounce_page(), but still we'd better avoid coupling
w/ any logic inside fscrypt_finalize_bounce_page().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The calculation of sec->ckpt_valid_blocks are the same in both
set_ckpt_valid_blocks() and sanity_check_valid_blocks(), so it
doesn't necessary to call sanity_check_valid_blocks() right after
set_ckpt_valid_blocks().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch fixes wrong description in printed log:
"SSA and SIT" -> "SIT and SSA"
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When the flexible_inline_xattr feature is enabled, do_read_inode() loads
the on-disk i_inline_xattr_size unconditionally:
if (f2fs_sb_has_flexible_inline_xattr(sbi))
fi->i_inline_xattr_size = le16_to_cpu(ri->i_inline_xattr_size);
but sanity_check_inode() only range-checks it when the inode also has the
FI_INLINE_XATTR flag set. An inode that carries an inline dentry or inline
data but not FI_INLINE_XATTR -- the normal layout for an inline
directory -- therefore keeps a fully attacker-controlled
i_inline_xattr_size from a crafted image.
get_inline_xattr_addrs() returns that value with no flag gating, so it
feeds the inode geometry:
MAX_INLINE_DATA() = 4 * (CUR_ADDRS_PER_INODE - i_inline_xattr_size - 1)
NR_INLINE_DENTRY() = MAX_INLINE_DATA() * BITS_PER_BYTE / (...)
addrs_per_page() = CUR_ADDRS_PER_INODE - i_inline_xattr_size
A large i_inline_xattr_size drives MAX_INLINE_DATA() and NR_INLINE_DENTRY()
negative, so make_dentry_ptr_inline() sets d->max (int) to a negative
value. The inline directory walk then compares an unsigned long bit_pos
against that negative d->max, which is promoted to a huge unsigned bound,
and reads far past the inline area:
while (bit_pos < d->max) /* fs/f2fs/dir.c */
... test_bit_le(bit_pos, d->bitmap) / d->dentry[bit_pos] ...
Mounting a crafted image and reading such a directory triggers an
out-of-bounds read in f2fs_fill_dentries(); the same underflow also
corrupts ADDRS_PER_INODE for regular files.
Validate i_inline_xattr_size against MAX_INLINE_XATTR_SIZE whenever the
flexible_inline_xattr feature is enabled -- i.e. whenever the value is
loaded from disk and consumed -- and keep the lower MIN_INLINE_XATTR_SIZE
bound gated on inodes that actually carry an inline xattr, so legitimate
inodes with i_inline_xattr_size == 0 are still accepted.
Cc: stable@vger.kernel.org
Fixes: 6afc662e68b5 ("f2fs: support flexible inline xattr size")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs_acl_count() only validates the aggregate ACL xattr length. A
malformed ACL can still place ACL_USER or ACL_GROUP in a slot that only
contains struct f2fs_acl_entry_short bytes, and f2fs_acl_from_disk()
then reads entry->e_id before verifying that a full entry fits.
Require a short entry before reading e_tag and e_perm, and require a
full entry before reading e_id for ACL_USER and ACL_GROUP. Return
-EFSCORRUPTED from these new truncated-entry checks, while keeping the
pre-existing -EINVAL paths unchanged.
Validation reproduced this kernel report:
KASAN slab-out-of-bounds in __f2fs_get_acl+0x6fb/0x7e0
RIP: 0033:0x7f4b835ea7aa
The buggy address belongs to the object at ffff888114589960 which belongs
to the cache kmalloc-8 of size 8
The buggy address is located 0 bytes to the right of allocated 8-byte
region [ffff888114589960, ffff888114589968)
Read of size 4
Call trace:
dump_stack_lvl+0x66/0xa0 (?:?)
print_report+0xce/0x630 (?:?)
__f2fs_get_acl+0x6fb/0x7e0 (fs/f2fs/acl.c:169)
srso_alias_return_thunk+0x5/0xfbef5 (?:?)
__virt_addr_valid+0x224/0x430 (?:?)
kasan_report+0xe0/0x110 (?:?)
__f2fs_get_acl+0x5/0x7e0 (fs/f2fs/acl.c:169)
__get_acl+0x281/0x380 (?:?)
vfs_get_acl+0x10b/0x190 (?:?)
do_get_acl+0x2a/0x410 (?:?)
do_get_acl+0x9/0x410 (?:?)
do_getxattr+0xe8/0x260 (?:?)
filename_getxattr+0xd1/0x140 (?:?)
do_getname+0x2d/0x2d0 (?:?)
path_getxattrat+0x16c/0x200 (?:?)
lock_release+0xc8/0x290 (?:?)
cgroup_update_frozen+0x9d/0x320 (?:?)
lockdep_hardirqs_on_prepare+0xea/0x1a0 (?:?)
trace_hardirqs_on+0x1a/0x170 (?:?)
_raw_spin_unlock_irq+0x28/0x50 (?:?)
do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
entry_SYSCALL_64_after_hwframe+0x77/0x7f (?:?)
Cc: stable@kernel.org
Fixes: af48b85b8cd3 ("f2fs: add xattr and acl functionalities")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This reverts commit 9609dd704725a40cd63d915f2ab6c44248a44598.
The kernel panics are keeping to be reported especially when the f2fs
partition get almost full. By investigation, we find that the reason is
one f2fs page got freed to buddy without being deleted from LRU and the
root cause is the race happened in [2] which is enrolled by this commit.
There are 3 race processes in this scenario, please find below for their
main activities.
The changed code in move_data_block() lets the GC path evict the tail-end
folio from the page cache through folio_end_dropbehind(). Once
folio_unmap_invalidate() removes the folio from mapping->i_pages, the
page-cache references for all pages in the folio are dropped. The folio
is then kept alive only by temporary external references, which allows a
later split to operate on a folio whose subpages are no longer protected
by page-cache references.
After the page-cache references are gone, split_folio_to_order() can
split the big folio into individual pages and put the resulting subpages
back on the LRU. For tail pages beyond EOF, split removes them from the
page cache and drops their page-cache references. A tail page can then
remain on the LRU with PG_lru set while holding only the split caller's
temporary reference. When free_folio_and_swap_cache() drops that final
reference, the page enters the final folio_put() release path.
In parallel, folio_isolate_lru() can observe the same tail page with a
non-zero refcount and PG_lru set. It clears PG_lru before taking its own
reference. If this races with the final folio_put() from the split path,
__folio_put() sees PG_lru already cleared and skips lruvec_del_folio().
The page is then freed back to the allocator while its lru links are
still present in the LRU list. A later LRU operation on a neighboring
page detects the stale link and reports list corruption.
[1]
[ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88)
[ 22.486130] ------------[ cut here ]------------
[ 22.486134] kernel BUG at lib/list_debug.c:67!
[ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE
[ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT)
[ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154
[ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154
[ 22.488539] sp : ffffffc08006b830
[ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000
[ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0
[ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122
[ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060
[ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058
[ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003
[ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00
[ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c
[ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010
[ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d
[ 22.488647] Call trace:
[ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P)
[ 22.488661] __folio_put+0x2bc/0x434
[ 22.488670] folio_put+0x28/0x58
[ 22.488678] do_garbage_collect+0x1a34/0x2584
[ 22.488689] f2fs_gc+0x230/0x9b4
[ 22.488697] f2fs_fallocate+0xb90/0xdf4
[ 22.488706] vfs_fallocate+0x1b4/0x2bc
[ 22.488716] __arm64_sys_fallocate+0x44/0x78
[ 22.488725] invoke_syscall+0x58/0xe4
[ 22.488732] do_el0_svc+0x48/0xdc
[ 22.488739] el0_svc+0x3c/0x98
[ 22.488747] el0t_64_sync_handler+0x20/0x130
[ 22.488754] el0t_64_sync+0x1c4/0x1c8
[2]
CPU0 (f2fs GC) CPU1 (split_folio_to_order) CPU2 (folio_isolate_lru)
F: pagecache refs = n
F: extra refs = GC + split
F: PG_lru set
move_data_block()
folio = f2fs_grab_cache_folio(F)
...
__folio_set_dropbehind(F)
folio_unlock(F)
folio_end_dropbehind(F)
folio_unmap_invalidate(F)
__filemap_remove_folio(F)
folio_put_refs(F, n)
folio_put(F)
split_folio_to_order(F)
folio_ref_freeze(F, 1)
...
lru_add_split_folio(T)
list_add_tail(&T->lru, &F->lru)
folio_set_lru(T)
__filemap_remove_folio(T)
folio_put_refs(T, 1)
/* T refcount == 1, PageLRU set */
folio_isolate_lru(T)
folio_test_clear_lru(T)
free_folio_and_swap_cache(T)
folio_put(T)
/* refcount: 1 -> 0 */
__folio_put(T)
__page_cache_release(T)
folio_test_lru(T) == false
/* skip lruvec_del_folio(T) */
free_frozen_pages(T)
folio_get(T)
lruvec_del_folio(T)
later:
list_del(adjacent->lru)
next == &T->lru
next->prev == LIST_POISON / PCP freelist
BUG
Cc: stable@vger.kernel.org
Fixes: 9609dd704725 ("f2fs: remove non-uptodate folio from the page cache in move_data_block")
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Prepare for running most of the write completion work asynchronously.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Rename f2fs_post_read_wq into f2fs_wq. Create it unconditionally.
Prepare for using this workqueue for completing write bios.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Use bio frontpadding to allocate memory for a work_struct when
allocating a bio.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
F2FS large folios are only enabled for immutable non-compressed files.
Writable open and writable mmap reject such mappings, but truncate(2)
through f2fs_setattr() misses the same guard.
If FS_IMMUTABLE_FL is cleared while the inode is still cached, the mapping
can keep large-folio support and ATTR_SIZE can change i_size. Reject size
changes in that state.
Cc: stable@kernel.org
Fixes: 05e65c14ea59 ("f2fs: support large folio for immutable non-compressed case")
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The f2fs dentry lookup path can use the on-disk name length before
checking that the name fits in the dentry filename area. A corrupted
dentry can then make lookup read beyond the filename slots.
The bounds check needs to happen before any comparison that consumes
the name length from disk.
Reject dentries with invalid name lengths before comparing their names.
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Inline dentry conversion copies names out of the inline dentry area
before checking that each recorded name length fits in the available
filename slots.
A corrupted image can therefore make the conversion path read past
the inline filename storage while building the regular dentry block.
Validate each inline dentry name length against the inline filename
area before copying it.
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <samuel.moelius@trailofbits.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When updating an atomic-write file, f2fs_write_begin() may read the
previously written data back from the COW inode:
prepare_atomic_write_begin() locates the block in the COW inode and sets
use_cow, and the read bio is then built with the COW inode:
f2fs_submit_page_read(use_cow ? F2FS_I(inode)->cow_inode : inode,
...);
and f2fs_grab_read_bio() decides whether to schedule fs-layer decryption
(STEP_DECRYPT) for the bio based on that inode via
fscrypt_inode_uses_fs_layer_crypto().
However, the folio being filled belongs to the original inode
(folio->mapping->host == inode), and the data stored in the COW block was
encrypted (or left as plaintext) using the original inode's context, not
the COW inode's -- see f2fs_encrypt_one_page(), which keys off
fio->page->mapping->host. fscrypt_decrypt_pagecache_blocks() likewise
operates on folio->mapping->host.
The COW inode is created as a tmpfile in the parent directory and inherits
its encryption policy from there. With test_dummy_encryption the newly
created COW inode gets the dummy policy and becomes encrypted, while a
pre-existing regular file -- created before the policy applied, e.g.
already present in the on-disk image -- stays unencrypted. The read
path then sets STEP_DECRYPT based on the encrypted COW inode and calls
fscrypt_decrypt_pagecache_blocks() on a folio whose host (the unencrypted
original inode) has a NULL ->i_crypt_info, dereferencing it:
Oops: general protection fault, probably for non-canonical address ...
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
RIP: 0010:fscrypt_decrypt_pagecache_blocks+0xa0/0x310
Workqueue: f2fs_post_read_wq f2fs_post_read_work
Call Trace:
fscrypt_decrypt_bio+0x1eb/0x340
f2fs_post_read_work+0xba/0x140
process_one_work+0x91c/0x1a40
worker_thread+0x677/0xe90
kthread+0x2bc/0x3a0
The COW inode is only needed to locate the on-disk block, and that block
address is already resolved into @blkaddr by prepare_atomic_write_begin()
via __find_data_block(cow_inode, ...); f2fs_submit_page_read() then reads
from that physical @blkaddr directly, so the inode argument only selects
the post-read crypto context, not which block is fetched. Reading with
@inode therefore returns the same (latest, not-yet-committed) COW data,
while making both the fs-layer decryption decision and the inline crypto
path use the correct (original inode's) key.
With the COW inode no longer used at the read site, the use_cow flag has no
remaining consumer; drop it from f2fs_write_begin() and
prepare_atomic_write_begin().
Fixes: 591fc34e1f98 ("f2fs: use cow inode data when updating atomic write")
Cc: stable@vger.kernel.org
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
prepare_write_begin() first gets the inode folio and builds a dnode,
then checks the read extent cache. For an ordinary overwrite of a
non-inline and non-compressed file, an extent-cache hit already gives the
data block address and the following path does not need to allocate or
update any node state.
Check the read extent cache before fetching the inode folio for that
narrow case. Keep the existing paths for inline data, compressed files,
and writes that may extend past EOF, where the helper may need inline
conversion, compression preparation, or block reservation.
This avoids a node-folio lookup in the buffered overwrite fast path when
the mapping is already cached.
In a QEMU/KASAN x86_64 VM, using a small buffered overwrite workload on
an existing 1MiB file, median time improved as follows:
64-byte overwrites: 1724.93 ns/write -> 1560.24 ns/write
256-byte overwrites: 1713.38 ns/write -> 1577.85 ns/write
Function profiling of 20k 64-byte overwrites showed
f2fs_get_inode_folio() calls drop from 20004 to 4.
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
A partial atomic write reserves a block in the COW inode before reading the
original data page for the untouched bytes in that page.
If that read fails, write_begin returns an error but leaves the COW inode
entry as NEW_ADDR. A retry of the same partial write then finds the COW
entry, treats it as existing COW data, and f2fs_write_begin() zeroes the
whole folio because blkaddr is NEW_ADDR.
If the retry is committed, the bytes outside the retried write range are
committed as zeroes instead of preserving the original file contents.
Only use the COW inode as the read source when it already has a real data
block. If the COW entry is still NEW_ADDR, treat it as a reservation to
reuse: keep reading the old data from the original inode and avoid
reserving or accounting the same atomic block again.
Cc: stable@kernel.org
Fixes: 3db1de0e582c ("f2fs: change the current atomic write way")
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs_recover_orphan_inodes() trusts the orphan block entry_count when
replaying orphan inodes from the checkpoint pack. A corrupted entry_count
larger than F2FS_ORPHANS_PER_BLOCK makes the recovery loop read past the
ino[] array and interpret footer or following data as inode numbers.
On a crafted image, mounting an unpatched kernel can drive orphan recovery
into f2fs_bug_on() and panic the kernel. Validate entry_count before
consuming entries so corrupted checkpoint data fails the mount with
-EFSCORRUPTED and requests fsck instead.
Set ERROR_INCONSISTENT_ORPHAN as well, so the corruption reason can be
recorded in the superblock s_errors[] field. This gives fsck a persistent
hint even though mount-time orphan recovery failure may leave no chance to
persist SBI_NEED_FSCK through a checkpoint.
Cc: stable@kernel.org
Fixes: 127e670abfa7 ("f2fs: add checkpoint operations")
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
io_uring can pass a per-I/O write stream through kiocb->ki_write_stream,
and block direct I/O propagates that value to bio->bi_write_stream.
F2FS added FDP stream mapping for DATA writes, but its direct write
submit hook always rewrites bio->bi_write_stream from the inode write
hint and F2FS temperature. As a result, a direct write with an explicit
io_uring write_stream is submitted to the F2FS-selected stream instead
of the user-requested stream.
Validate an explicit write stream before starting F2FS direct I/O, pass
the kiocb through the iomap private pointer, and preserve the per-I/O
stream in the direct write bio. When no per-I/O stream is supplied, keep
using the existing F2FS temperature-to-stream mapping.
Fixes: 42f7a7a50a33 ("f2fs: map data writes to FDP streams")
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
kernel BUG at fs/f2fs/file.c:845!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5336 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:f2fs_do_truncate_blocks+0x1115/0x1140 fs/f2fs/file.c:845
Code: fc fc 90 0f 0b e8 8b 9d 9a fd 90 0f 0b e8 83 9d 9a fd 48 89 df 48 c7 c6 60 d1 1a 8c e8 54 f1 fc fc 90 0f 0b e8 6c 9d 9a fd 90 <0f> 0b e8 64 9d 9a fd 90 0f 0b 90 e9 93 fd ff ff e8 56 9d 9a fd 90
RSP: 0018:ffffc9000e4474c0 EFLAGS: 00010283
RAX: ffffffff842b1d34 RBX: 0000000000000003 RCX: 0000000000100000
RDX: ffffc9000f03a000 RSI: 0000000000035503 RDI: 0000000000035504
RBP: ffffc9000e447608 R08: ffff8880123b0000 R09: 0000000000000002
R10: 00000000fffffffe R11: 0000000000000002 R12: 0000000000000001
R13: 0000000000000000 R14: 1ffff92001c88ea0 R15: 00000000ffff039c
FS: 00007f7e02ee36c0(0000) GS:ffff88808c887000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff0305c4000 CR3: 0000000012d4c000 CR4: 0000000000352ef0
Call Trace:
<TASK>
f2fs_truncate_blocks+0x10a/0x300 fs/f2fs/file.c:882
f2fs_truncate+0x471/0x7c0 fs/f2fs/file.c:940
f2fs_evict_inode+0xa3f/0x1ac0 fs/f2fs/inode.c:907
evict+0x61e/0xb10 fs/inode.c:841
f2fs_fill_super+0x5f43/0x78f0 fs/f2fs/super.c:5224
get_tree_bdev_flags+0x431/0x4f0 fs/super.c:1694
vfs_get_tree+0x92/0x2a0 fs/super.c:1754
fc_mount fs/namespace.c:1193 [inline]
do_new_mount_fc fs/namespace.c:3758 [inline]
do_new_mount+0x341/0xd30 fs/namespace.c:3834
do_mount fs/namespace.c:4167 [inline]
__do_sys_mount fs/namespace.c:4383 [inline]
__se_sys_mount+0x31d/0x420 fs/namespace.c:4360
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
count = ADDRS_PER_PAGE(dn.node_folio, inode);
count -= dn.ofs_in_node;
f2fs_bug_on(sbi, count < 0);
The fuzz test will trigger above bug_on in f2fs.
The root cause should be: in the corrupted inode, there is a direct node
which has the same ino and nid in its footer, so in f2fs_do_truncate_blocks(),
after f2fs_get_dnode_of_data() finds such dnode:
1) ADDRS_PER_PAGE(dn.node_folio, inode) will return 923
2) once dn.ofs_in_node points to addr[923, 1017]
Then it will trigger the system panic.
Let's introduce NODE_TYPE_NON_IXNODE to indicate current node should
not be an inode or xattr node, and then use it in below path to detect
inconsistent node chain in inode mapping table:
- f2fs_do_truncate_blocks
- f2fs_get_dnode_of_data
- f2fs_get_node_folio_ra
- __get_node_folio
- f2fs_sanity_check_node_footer
- case NODE_TYPE_NON_IXNODE -> check whether it is inode|xnode
Cc: stable@kernel.org
Reported-by: syzbot+2488d8d751b27f7ce268@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69fa3697.170a0220.59368.0018.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Commit ae27d62e6bef ("f2fs: check in-memory sit version bitmap") added
a mirror for sit version bitmap, it expects to detect in-memory
corruption, however we never got any reports from the check points
for almost decade, let's remove the code, it can help to save
memories.
Cc: wallentx <william.allentx@gmail.com>
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Commit 355e78913c0d ("f2fs: check in-memory block bitmap") added
a mirror for valid block bitmap, it expects to detect in-memory
corruption, however we never got any reports from the check points
for almost decade, let's remove the code, it can help to save
memories.
Cc: wallentx <william.allentx@gmail.com>
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
F2FS records image errors and checkpoint-stop reasons through the same
s_error_work worker. The ordinary f2fs_handle_error() path only updates
s_errors, but the worker still calls fserror_report_shutdown()
unconditionally after committing the superblock.
As a result, a metadata corruption report can be followed by a synthetic
FAN_FS_ERROR event with ESHUTDOWN and an invalid superblock file handle,
even though no stop reason was recorded.
Track whether save_stop_reason() actually changed the stop_reason array
and only report the shutdown fserror for that case. Pure s_errors updates
still commit the superblock, but no longer generate a false shutdown event.
Fixes: 50faed607d32 ("f2fs: support to report fserror")
Cc: stable@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
F2FS_COMPRESS_INO() uses NM_I(sbi)->max_nid as the synthetic inode
number for the compressed page cache inode. That inode only exists when
the compress_cache mount option is enabled.
When compress_cache is disabled, max_nid is outside the valid inode
range. A corrupted directory entry that points to ino == max_nid should
therefore be rejected by f2fs_check_nid_range(). However, is_meta_ino()
currently treats F2FS_COMPRESS_INO() as a meta inode unconditionally,
so f2fs_iget() bypasses do_read_inode() and its nid range check, and
instantiates a fake internal inode instead.
Gate the compressed cache inode case on COMPRESS_CACHE, matching
f2fs_init_compress_inode(). With compress_cache disabled, ino ==
max_nid now follows the normal inode path and is rejected as an
out-of-range nid.
Cc: stable@kernel.org
Fixes: 6ce19aff0b8c ("f2fs: compress: add compress_inode to cache compressed blocks")
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs_write_single_node_folio() takes an io_type argument, but still
passes FS_GC_NODE_IO to __write_node_folio() unconditionally.
This was harmless while the helper was only used by
f2fs_move_node_folio(), whose caller passes FS_GC_NODE_IO. However,
commit fe9b8b30b971 ("f2fs: fix inline data not being written to disk
in writeback path") made f2fs_inline_data_fiemap() call the helper with
FS_NODE_IO for FIEMAP_FLAG_SYNC.
Honor the caller supplied io_type so inline-data FIEMAP sync writeback is
accounted as normal node IO instead of GC node IO, while the GC path
continues to pass FS_GC_NODE_IO explicitly.
Cc: stable@kernel.org
Fixes: fe9b8b30b971 ("f2fs: fix inline data not being written to disk in writeback path")
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs_read_data_large_folio() can keep a read bio across multiple
readahead folios. If a later folio hits an error before any of its
blocks are added to the bio, folio_in_bio is false and the current error
path returns immediately after ending that folio.
This can leave the bio accumulated for earlier folios unsubmitted. Those
folios then never receive read completion, and readers can wait
indefinitely on the locked folios.
Route errors through the common out path so any pending bio is submitted
before returning. Stop consuming more readahead folios once an error is
seen, and only wait on and clear the current folio when it was actually
added to the bio.
Cc: stable@kernel.org
Fixes: a5d8b9d94e18 ("f2fs: fix to unlock folio in f2fs_read_data_large_folio()")
Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
- ioctl(F2FS_IOC_GARBAGE_COLLECT_RANGE) - shrink
- f2fs_gc
- gc_data_segment
- ra_data_block(cow_inode)
- mapping = F2FS_I(inode)->atomic_inode->i_mapping
: f2fs_is_cow_file(cow_inode) is true
- f2fs_evict_inode(atomic_inode)
- clear_inode_flag(fi->cow_inode, FI_COW_FILE)
- F2FS_I(fi->cow_inode)->atomic_inode = NULL
...
- truncate_inode_pages_final(atomic_inode)
- f2fs_grab_cache_folio(mapping)
: create folio in atomic_inode->mapping
- clear_inode(atomic_inode)
- BUG_ON(atomic_inode->i_data.nrpages)
We need to add a reference on fi->atomic_inode before using its mapping
field during garbage collection, otherwise, it will cause UAF issue.
Cc: stable@kernel.org
Cc: Daeho Jeong <daehojeong@google.com>
Cc: Sunmin Jeong <s_min.jeong@samsung.com>
Fixes: 3db1de0e582c ("f2fs: change the current atomic write way")
Fixes: f18d00769336 ("f2fs: use meta inode for GC of COW file")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When we mount device w/ gc_merge mount option, we may suffer below
potential deadlock:
Kworker GC trehad Truncator
- f2fs_write_cache_pages
- f2fs_write_single_data_page
- f2fs_do_write_data_page
- folio_start_writeback --- set writeback flag on folio
- f2fs_outplace_write_data
: cached folio in internal bio cache
- f2fs_balance_fs
- wake_up(gc_thread)
: wake up gc thread to run foreground GC
- finish_wait(fggc_wq)
: wait on the waitqueue --- wait on GC thread to finish the work
- truncate_inode_pages_range
- __filemap_get_folio(, FGP_LOCK) --- lock folio
- truncate_inode_partial_folio
- folio_wait_writeback --- wait on writeback being cleared
- do_garbage_collect
- move_data_page
- f2fs_get_lock_data_folio
- lock on folio --- blocked on folio's lock
In order to avoid such deadlock, let's call below functions to commit
cached bios in GC_MERGE path of f2fs_balance_fs() as the same as we did
in NOGC_MERGE path.
- f2fs_submit_merged_write(sbi, DATA);
- f2fs_submit_all_merged_ipu_writes(sbi);
Cc: stable@kernel.org
Fixes: 351df4b20115 ("f2fs: add segment operations")
Cc: Ruipeng Qi <ruipengqi3@gmail.com>
Reported: Sandeep Dhavale <dhavale@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Chao Yu <chaseyu@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In order to troubleshoot in which step we may block on during
mount w/ checkpoint_disable mount option.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
F2FS did not collect iostat latency for direct IO reads and writes,
hook iomap_dio_ops.submit_io to bind an iostat context and record the
submission timestamp. Replace bi_end_io with f2fs_dio_end_bio() to
collect IO latency on completion before calling back to the original
iomap_dio_bio_end_io(), to add iostat latency tracking support for
F2FS DIO.
Signed-off-by: shengyong1 <shengyong1@xiaomi.com>
Signed-off-by: liujinbao1 <liujinbao1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In large section mode, do_garbage_collect() previously determined the
section's representative type by looking only at the first segment of
the section. However, if data was fsynced into an area previously used
as a node section, and this area is recovered during roll-forward
recovery after sudden power off (SPO), GC would incorrectly assume the
section's type based on an empty or obsolete first segment. This caused
the recovered data segment to be misunderstood as being stuck inside a
node section, triggering false inconsistency panics (Inconsistent
segment type in SSA and SIT) and subsequent mount failures.
This patch optimizes do_garbage_collect() to determine the section's
representative type by identifying the first segment that actually
contains valid blocks (valid_blocks > 0) during the main GC loop. This
eliminates false alarms from empty/obsolete leading segments while
maintaining strict section-level type consistency checks for genuine
corruption.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add trace_f2fs_fault_report to trigger reporting upon f2fs_bug_on,
need_fsck, stop_checkpoint, and handle_eio. Since f2fs_bug_on and
need_fsck can be triggered in hundreds of scenarios, define set_sbi_flag
as a macro to help capture the effective fault function and line number.
Signed-off-by: shengyong1 <shengyong1@xiaomi.com>
Signed-off-by: liujinbao1 <liujinbao1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
nat_cnt[] is updated while callers hold nat_tree_lock, but F2FS samples
the counters locklessly in f2fs_available_free_memory(),
excess_dirty_nats(), and excess_cached_nats(). Those helpers only steer
cache reclaim and background sync heuristics; they do not control NAT
entry lifetime or checkpoint correctness.
Document the intent with data_race(READ_ONCE()) and a short comment
instead of adding locking to the balance path.
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs stores mount-wide activity timestamps in sbi->last_time[] and
samples them from background discard, GC, and balance paths without a
dedicated lock. The timestamps are used as best-effort heuristics to
decide whether background work should run now or sleep a bit longer.
The current helpers use plain loads and stores, so KCSAN can report races
between frequent foreground updates and background readers. Exact
freshness is not required here, but the intentional lockless accesses
should be marked explicitly.
Use WRITE_ONCE() in f2fs_update_time() and READ_ONCE() in
f2fs_time_over() and f2fs_time_to_wait(). This preserves the existing
heuristic behavior and avoids adding locking to hot paths.
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Combine i_nb.blocks_hi with i_u.blocks_lo when computing
inode->i_blocks for compressed inodes, mirroring the startblk_hi
handling for unencoded inodes a few lines above. Also evaluate
the shift in u64 to avoid truncation.
Fixes: efb2aef569b3 ("erofs: add encoded extent on-disk definition")
Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
EROFS over fscache was introduced to provide image lazy pulling
functionality. After the feature landed, the fscache subsystem made
netfs a new hard dependency, which is unexpected for a local filesystem
and has an kernel-defined caching hierarchy which could be inflexible
compared to the fanotify pre-content hooks. Therefore, this feature has
been deprecated for almost two years.
As EROFS file-backed mounts and fanotify pre-content hooks both upstream
for a while and already providing equivalent functionality (erofs-utils
has supported fanotify pre-content hooks), let's remove the fscache
backend now.
The main application of this feature is Nydus [1], and they plan to move
to use fanotify pre-content hooks in the near future too.
I hope this patch can be merged into Linux 7.2, which is also motivated
by newly found implementation issues [2][3] that are not worth
investigating given the deprecation and limited development resources.
The associated fscache/cachefiles cleanup patch will follow separately
through the vfs tree (netfs) later: it seems fine since the codebase is
isolated by CONFIG_CACHEFILES_ONDEMAND.
[1] https://github.com/dragonflyoss/nydus/blob/v2.1.0/docs/nydus-fscache.md
[2] https://github.com/dragonflyoss/nydus/pull/1824
[3] https://lore.kernel.org/r/20260619135800.1594811-1-michael.bommarito@gmail.com
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
- use scoped_guard() for RCU read critical section in
z_erofs_decompress_kickoff();
- simplify the RCU critical section loop in
z_erofs_pcluster_begin().
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Although zeros can be compressed transparently on EROFS using fixed-size
output compression so that it is never prioritized in the Android use
cases, indicating entire pclusters as holes is still useful to preserve
holes in the sparse datasets; otherwise overlayfs will allocate more
space when copying up, and SEEK_HOLE won't report any hole.
This patch introduces two ways to mark a pcluster as a hole:
- A new Z_EROFS_LI_HOLE compatible flag (bit 14) in the HEAD lcluster
advise field for non-compact (full) indexes;
- A 0-block CBLKCNT value on the first NONHEAD lcluster.
The hole tag is preferred for maximum compatibility since pre-existing
kernels that do not understand Z_EROFS_LI_HOLE will decompress at the
stored blkaddr (the same blkaddr will be shared among all sparse
pclusters). Only the 0-block CBLKCNT approach also works for compact
indexes, but it is limited to big pclusters and new kernels.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "taskstats: fix TGID dead-thread stat retention" (Yiyang Chen)
Fix a taskstats TGID aggregation bug where fields added in the TGID
query path were not preserved after thread exit, and adds a kselftest
covering the regression.
- "lib/tests: string_helpers: Slight improvements" (Andy Shevchenko)
Improve lib/tests/string_helpers_kunit.c a little
- "lib/base64: decode fixes" (Josh Law)
Address minor issues in lib/base64.c
- "selftests/filelock: Make output more kselftestish" (Mark Brown)
Make the output from the ofdlocks test a bit easier for tooling to
work with. Also ignore the generated file
- "uaccess: unify inline vs outline copy_{from,to}_user() selection"
(Yury Norov)
Simplify the usercopy code by removing the selectability of inlining
copy_{from,to}_user().
- "ocfs2: validate inline xattr header consumers" (ZhengYuan Huang)
Fix a number of possible issues in the ocfs2 xattr code
- "lib and lib/cmdline enhancements" (Dmitry Antipov)
Provide additional robustness checking in the cmdline handling code
and its in-kernel testing and selftests
- "cleanup the RAID6 P/Q library" (Christoph Hellwig)
Clean up the RAID6 P/Q library to match the recent updates to the
RAID 5 XOR library and other CRC/crypto libraries
- "ocfs2: harden inode validators against forged metadata" (Michael
Bommarito)
Add three structural checks to OCFS2 dinode validation so malformed
on-disk fields are rejected before ocfs2_populate_inode() copies them
into the in-core inode
- "lib/raid: replace __get_free_pages() call with kmalloc()" (Mike
Rapoport)
Clean up the lib/raid code by using kmalloc() in more places
* tag 'mm-nonmm-stable-2026-06-21-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (108 commits)
ocfs2: fix circular locking dependency in ocfs2_dio_end_io_write
ocfs2: fix NULL h_transaction deref in ocfs2_assure_trans_credits
lib: interval_tree_test: validate benchmark parameters
ocfs2: avoid moving extents to occupied clusters
treewide: fix transposed "sign" typos and update spelling.txt
ocfs2: fix UBSAN array-index-out-of-bounds in ocfs2_sum_rightmost_rec
fat: reject BPB volumes whose data area starts beyond total sectors
selftests/uevent: increase __UEVENT_BUFFER_SIZE to avoid ENOBUFS on busy systems
lib/test_firmware: allocate the configured into_buf size
fs: efs: remove unneeded debug prints
checkpatch: cuppress warnings when Reported-by: is followed by Link:
MAINTAINERS: add Alexander as a kcov reviewer
mailmap: update Alexander Sverdlin's Email addresses
fs: fat: inode: replace sprintf() with scnprintf()
ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent
ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release()
ocfs2/dlm: require a ref for locking_state debugfs open
ocfs2: reject FITRIM ranges shorter than a cluster
ocfs2: validate fast symlink target during inode read
ocfs2: add journal NULL check in ocfs2_checkpoint_inode()
...
|