lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
12 hours	Merge tag 'net-7.2-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and IPsec. Current release - regressions: - do not acquire dev->tx_global_lock in netdev_watchdog_up() - ethtool: keep rtnl_lock for ops using ethtool_op_get_link() - fix deadlock in nested UP notifier events Current release - new code bugs: - eth: - cn20k: fix subbank free list indexing for search order - airoha: fix BQL underflow in shared QDMA TX ring Previous releases - regressions: - netfilter: - flowtable: fix offloaded ct timeout never being extended - nf_conncount: prevent connlimit drops for early confirmed ct Previous releases - always broken: - require CAP_NET_ADMIN in the originating netns when modifying cross-netns devices - report NAPI thread PID in the caller's pid namespace - mac802154: fix dirty frag in in-place crypto for IOT radios - sctp: hold socket lock when dumping endpoints in sctp_diag, avoid an overflow - eth: gve: fix header buffer corruption with header-split and HW-GRO - af_key: initialize alg_key_len for IPComp states, prevent OOB read" * tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (213 commits) selftests: bonding: add a test for VLAN propagation over a bonded real device vlan: defer real device state propagation to netdev_work net: add the driver-facing netdev_work scheduling API net: turn the rx_mode work into a generic netdev_work facility net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link() rxrpc: Fix rxrpc_rotate_tx_rotate() to check there's something to rotate rxrpc: Fix leak of released call in recvmsg(MSG_PEEK) rxrpc: Fix socket notification race rxrpc: Fix potential infinite loop in rxrpc_recvmsg() rxrpc: Fix oob challenge leak in cleanup after notification failure rxrpc: Fix the reception of a reply packet before data transmission afs: Fix uncancelled rxrpc OOB message handler afs: Fix further netns teardown to cancel the preallocation charger rxrpc: Fix double unlock in rxrpc_recvmsg() rxrpc: Fix leak of connection from OOB challenge rxrpc: Fix ACKALL packet handling net: hns3: differentiate autoneg default values between copper and fiber net: hns3: fix permanent link down deadlock after reset net: hns3: refactor MAC autoneg and speed configuration net: hns3: unify copper port ksettings configuration path ...
15 hours	afs: Fix uncancelled rxrpc OOB message handler	David Howells
	Fix AFS to cancel its OOB message processing (typically to respond to security challenges). Also move OOB message processing to afs_wq so that it's also waited for and make the OOB handler just return if the net namespace is no longer live. Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE") Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Li Daming <d4n.for.sec@gmail.com> cc: Ren Wei <n05ec@lzu.edu.cn> cc: Marc Dionne <marc.dionne@auristor.com> cc: Jeffrey Altman <jaltman@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260624163819.3017002-6-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
15 hours	afs: Fix further netns teardown to cancel the preallocation charger	David Howells
	When an afs network namespace is torn down, it cancels and waits for the work item that keeps the preallocated rxrpc call/conn/peer queue charged before disabling incoming (i.e. listen 0), but there's a small window in which it can be requeued by an incoming call wending through the I/O thread. Fix this by cancelling the charger work item again after reducing the listen backlog to zero. Fixes: 47694fbc9d24 ("afs: Fix netns teardown to cancel the preallocation charger") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com cc: Li Daming <d4n.for.sec@gmail.com> cc: Ren Wei <n05ec@lzu.edu.cn> cc: Marc Dionne <marc.dionne@auristor.com> cc: Jeffrey Altman <jaltman@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/20260624163819.3017002-5-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
39 hours	Merge tag 'ntfs3_for_7.2' of ↵	Linus Torvalds
	https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs3 updates from Konstantin Komarov: "Added: - depth limit to indx_find_buffer() to prevent stack overflow - validate split-point offset in indx_insert_into_buffer() - bounds check to run_get_highest_vcn() - fileattr_get() and fileattr_set() support - zero stale pagecache beyond valid data length - handle delayed allocation overlap in run lookup - validate lcns_follow in log_replay() conversion - cap RESTART_TABLE free-chain walker at rt->used - resize log->one_page_buf when adopting on-disk page size - reject direct userspace writes to reserved $LX* xattrs Fixed: - out-of-bounds read in decompress_lznt() - avoid -Wmaybe-uninitialized warnings - hold ni_lock across readdir metadata walk - preserve non-DOS attribute bits in system.dos_attrib - validate index entry key bounds - syncing wrong inode on DIRSYNC cross-directory rename - validate Dirty Page Table capacity in log_replay() copy_lcns - wrong LCN in run_remove_range() when splitting a run - allocate iomap inline_data using alloc_page - mount failure on 64K page-size kernels - out-of-bounds read in ntfs_dir_emit() and hdr_find_e() - bound attr_off in UpdateResidentValue against data_off - bound DeleteIndexEntryAllocation memmove length - bound copy_lcns dp->page_lcns[] index in analysis pass - bound NTFS_DE view.data_off in UpdateRecordData{Root,Allocation} - prevent potential lcn remains uninitialized Changed: - bound to_move in indx_insert_into_root() before hdr_insert_head() - call _ntfs_bad_inode() when failing to rename - fold resident writeback into writepages loop - force waiting for direct I/O completion - fold file size handling into ntfs_set_size() - reject SEEK_DATA and SEEK_HOLE past EOF early - format code, add descriptive comments and remove non-useful" * tag 'ntfs3_for_7.2' of https://github.com/Paragon-Software-Group/linux-ntfs3: (34 commits) ntfs3: reject direct userspace writes to reserved $LX* xattrs fs/ntfs3: resize log->one_page_buf when adopting on-disk page size fs/ntfs3: prevent potential lcn remains uninitialized ntfs3: cap RESTART_TABLE free-chain walker at rt->used fs/ntfs3: bound NTFS_DE view.data_off in UpdateRecordData{Root,Allocation} fs/ntfs3: validate lcns_follow in log_replay conversion fs/ntfs3: bound copy_lcns dp->page_lcns[] index in analysis pass fs/ntfs3: bound DeleteIndexEntryAllocation memmove length fs/ntfs3: bound attr_off in UpdateResidentValue against data_off ntfs3: fix out-of-bounds read in ntfs_dir_emit() and hdr_find_e() fs/ntfs3: fix mount failure on 64K page-size kernels ntfs3: avoid another -Wmaybe-uninitialized warning ntfs3: Allocate iomap inline_data using alloc_page fs/ntfs3: format code, deal with comments fs/ntfs3: reject SEEK_DATA and SEEK_HOLE past EOF early fs/ntfs3: fold file size handling into ntfs_set_size() fs/ntfs3: force waiting for direct I/O completion fs/ntfs3: fold resident writeback into writepages loop fs/ntfs3: handle delayed allocation overlap in run lookup fs/ntfs3: zero stale pagecache beyond valid data length ...
2 days	Merge tag 'nfs-for-7.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs	Linus Torvalds
	Pull NFS client updates from Anna Schumaker: "New features: - XPRTRDMA: Decouple req recycling from RPC completion - NFS: Expose FMODE_NOWAIT for read-only files Bugfixes: - SUNRPC: - Fix sunrpc sysfs error handling - Fix uninitialized xprt_create_args structure - XPRTRDMA: - Harden connect and reply handling - NFS: - Fix EOF updates after fallocate/zero-range - Keep PG_UPTODATE clear after read errors in page groups - Use nfsi->rwsem to protect traversal of the file lock list - Prevent resource leak in nfs_alloc_server() - NFSv4: - Clear exception state on successful mkdir retry - Don't skip revalidate when holding a dir delegation and attrs are stale - pNFS: - Fix use-after-free in pnfs_update_layout() - Defer return_range callbacks until after inode unlock - Fix LAYOUTCOMMIT retry loop on OLD_STATEID - Reject zero-length r_addr in nfs4_decode_mp_ds_addr - NFS/flexfiles: - Reject zero-length filehandle version arrays - Fix checking if a layout is striped - Fixes for honoring FF_FLAGS_NO_IO_THRU_MDS Other cleanups and improvements: - Remove the fileid field from struct nfs_inode - Move long-delayed xprtrdma work onto the system_dfl_long_wq - Convert xprtrdma send buffer free list to an llist - Show "<redacted>" for cert_serial and privkey_serial mount options" * tag 'nfs-for-7.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (42 commits) NFS: Use common error handling code in nfs_alloc_server() NFS: Prevent resource leak in nfs_alloc_server() NFSv4/pNFS: reject zero-length r_addr in nfs4_decode_mp_ds_addr nfs: don't skip revalidate on directory delegation when attrs flagged stale xprtrdma: Return sendctx slot after Send preparation failure xprtrdma: Repost Receive buffers for malformed replies xprtrdma: Sanitize the reply credit grant after parsing xprtrdma: Fix bcall rep leak and unbounded peek xprtrdma: Resize reply buffers before reposting receives xprtrdma: Check frwr_wp_create() during connect xprtrdma: Initialize re_id before removal registration xprtrdma: Fix ep kref imbalance on ADDR_CHANGE xprtrdma: Convert send buffer free list to llist NFS: correct CONFIG_NFS_V4 macro name in #endif comment nfs: use nfsi->rwsem to protect traversal of the file lock list NFSv4.1/pNFS: fix LAYOUTCOMMIT retry loop on OLD_STATEID nfs: expose FMODE_NOWAIT for read-only files nfs: add nowait version of nfs_start_io_direct NFSv4/flexfiles: honor FF_FLAGS_NO_IO_THRU_MDS in pg_get_mirror_count_write NFSv4/flexfiles: honor FF_FLAGS_NO_IO_THRU_MDS on fatal DS connect errors ...
2 days	Merge tag 'f2fs-for-7.2-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "The changes primarily focus on filesystem error reporting, reducing memory footprint by reverting in-memory data structures used for runtime validation, honoring FDP hints, and adding trace and debug logs. In addition, there are critical bug fixes resolving out-of-bounds read vulnerabilities in inline directory and ACL handling, potential deadlocks in balance_fs, use-after-free issues in atomic writes, and false data/node type assignments in large sections. Enhancements: - Revert in-memory sit version and block bitmaps - support to report fserror - add trace_f2fs_fault_report - add iostat latency tracking for direct IO - add logs in f2fs_disable_checkpoint() - honor per-I/O write streams for direct writes - map data writes to FDP streams - skip inode folio lookup for cached overwrite - skip direct I/O iostat context when disabled - revert "check in-memory block bitmap" - revert "check in-memory sit version bitmap" Fixes: - optimize representative type determination in GC - fix incorrect FI_NO_EXTENT handling in __destroy_extent_node() - fix potential deadlock in f2fs_balance_fs() - fix potential deadlock in gc_merge path of f2fs_balance_fs() - atomic: fix UAF issue on f2fs_inode_info.atomic_inode - fix missing read bio submission on large folio error - pass correct iostat type for single node writes - fix to do sanity check on f2fs_get_node_folio_ra() - validate orphan inode entry count - keep atomic write retry from zeroing original data - read COW data with the original inode during atomic write - validate inline dentry name lengths before conversion - validate dentry name length before lookup compares it - reject setattr size changes on large folio files - revert "remove non-uptodate folio from the page cache in move_data_block" - validate ACL entry sizes in f2fs_acl_from_disk() - bound i_inline_xattr_size for non-inline-xattr inodes - fix listxattr handling of corrupted xattr entries - fix to round down start offset of fallocate for pin file" * tag 'f2fs-for-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (42 commits) f2fs: fix to round down start offset of fallocate for pin file f2fs: fix listxattr handling of corrupted xattr entries f2fs: skip direct I/O iostat context when disabled f2fs: remove unneeded f2fs_is_compressed_page() f2fs: avoid unnecessary fscrypt_finalize_bounce_page() f2fs: avoid unnecessary sanity check on ckpt_valid_blocks f2fs: misc cleanup in f2fs_record_stop_reason() f2fs: fix wrong description in printed log f2fs: bound i_inline_xattr_size for non-inline-xattr inodes f2fs: validate ACL entry sizes in f2fs_acl_from_disk() Revert "f2fs: remove non-uptodate folio from the page cache in move_data_block" f2fs: Split f2fs_write_end_io() f2fs: Rename f2fs_post_read_wq into f2fs_wq f2fs: Prepare for supporting delayed bio completion f2fs: reject setattr size changes on large folio files f2fs: validate dentry name length before lookup compares it f2fs: validate inline dentry name lengths before conversion f2fs: read COW data with the original inode during atomic write f2fs: skip inode folio lookup for cached overwrite f2fs: keep atomic write retry from zeroing original data ...
3 days	Merge tag 'mm-stable-2026-06-23-08-55' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "khugepaged: add mTHP collapse support" (Nico Pache) Provide khugepaged with the capability to collapse anonymous memory regions to mTHPs - "Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files" (Zi Yan) Remove the READ_ONLY_THP_FOR_FS check in file_thp_enabled(), so that khugepaged and MADV_COLLAPSE can run on filesystems with PMD THP pagecache support even without READ_ONLY_THP_FOR_FS enabled - "make MM selftests more CI friendly" (Mike Rapoport) General fixes and cleanups to the MM selftests. Also move more MM selftests under the kselftest framework, making them more amenable to ongoing CI testing - "selftests/mm: fix failures and robustness improvements" and "selftests/mm: assorted fixes for hmm-tests" (Sayali Patil) Fix several issues in MM selftests which were revealed by powerpc 64k pagesize * tag 'mm-stable-2026-06-23-08-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (118 commits) Revert "mm: limit filemap_fault readahead to VMA boundaries" mm/vmscan: pass NULL to trace vmscan node reclaim mm: use mapping_mapped to simplify the code selftests/mm: fix exclusive_cow test fork() handling selftests/mm: remove hardcoded THP sizing assumptions in hmm tests selftests/mm: allow PUD-level entries in compound testcase of hmm tests mm/gup_test: reject wrapped user ranges mm/page_frag: reject invalid CPUs in page_frag_test mm/damon/core: always put unsuccessfully committed target pids mm: page_isolation: avoid unsafe folio reads while scanning compound pages mm/shrinker: do not hold RCU lock in shrinker_debugfs_count_show() selftests: mm: fix and speedup "droppable" test mm: merge writeout into pageout MAINTAINERS: add Hao Ge as reviewer for codetag and alloc_tag selftests/mm: clarify alternate unmapping in compaction_test selftests/mm: move hwpoison setup into run_test() and silence modprobe output for memory-failure category selftests/mm: skip uffd-stress test when nr_pages_per_cpu is zero selftests/mm: skip uffd-wp-mremap if UFFD write-protect is unsupported selftests/mm: ensure destination is hugetlb-backed in hugetlb-mremap selftest/mm: register existing mapping with userfaultfd in hugetlb-mremap ...
3 days	Merge tag 'erofs-for-7.2-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "The most notable change is the removal of the fscache backend: it has been deprecated for almost two years, mainly because EROFS file-backed mounts and fanotify pre-content hooks (together with erofs-utils) now provide better functionality and simpler codebase. In addition, fscache has depended on netfslib for years, which is undesirable for EROFS since it is a local filesystem. More details in [1]. In addition, sparse support has been added to the pcluster layout, which is helpful for large sparse AI datasets, and map requests for chunk-based inodes have been optimized to be more efficient as well. There are also the usual fixes and cleanups. Summary: - Report more consecutive chunks of the same type for each iomap request - Add sparse support for the pcluster layout - Update the EROFS documentation overview - Remove the deprecated fscache backend - Various fixes and cleanups" Link: https://lore.kernel.org/r/20260622013622.934174-1-hsiangkao@linux.alibaba.com [1] * tag 'erofs-for-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: handle 48-bit blocks_hi for compressed inodes erofs: remove fscache backend entirely erofs: simplify RCU read critical sections erofs: add sparse support to pcluster layout erofs: add folio order to trace_erofs_read_folio erofs: introduce erofs_map_chunks() erofs: call erofs_exit_ishare() before rcu_barrier() erofs: update the overview of the documentation erofs: clean up erofs_ishare_fill_inode()
3 days	f2fs: fix to round down start offset of fallocate for pin file	Sunmin Jeong
	Currently, the length of fallocate for pin file is section-aligned to keep allocated sections from being selected as victims of GC. However, for the case that the start offset of fallocate is not aligned in section, the allocated sections can't be fully utilized. It's because a new section is allocated by f2fs_allocate_pinning_section() after using blks_per_sec blocks regardless of the start offset. As a result, several unexpected dirty segments may be created, including blocks assigned to the pinned file. To address this issue, let's round down the start offset of fallocate to the length of section. The reproducing scenario is as below chunk=$(((2<<20)+4096)) # 2MB + 4KB touch test f2fs_io pinfile set test f2fs_io fallocate 0 0 $chunk test f2fs_io fallocate 0 $chunk $chunk test f2fs_io fallocate 0 $((chunk2)) $chunk test f2fs_io fiemap 0 $((chunk3)) test Fiemap: offset = 0 len = 12288 logical addr. physical addr. length flags 0 0000000000000000 000000068c600000 0000000000400000 00001088 1 0000000000400000 000000003d400000 0000000000001000 00001088 2 0000000000401000 00000003eb200000 0000000000200000 00001088 3 0000000000601000 00000005e4200000 0000000000001000 00001088 4 0000000000602000 0000000605400000 0000000000200000 00001089 Cc: stable@vger.kernel.org Fixes: f5a53edcf01e ("f2fs: support aligned pinned file") Reviewed-by: Yunji Kang <yunji0.kang@samsung.com> Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: fix listxattr handling of corrupted xattr entries	Keshav Verma
	Validate the xattr entry before reading its fields in f2fs_listxattr(). Return -EFSCORRUPTED when the entry is outside the valid xattr storage area instead of returning a successful partial result. Fixes: 688078e7f36c ("f2fs: fix to avoid memory leakage in f2fs_listxattr") Cc: stable@kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Keshav Verma <iganschel@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: skip direct I/O iostat context when disabled	Wenjie Qi
	F2FS iostat is optional and is disabled by default. Direct I/O still allocates and binds a bio_iostat_ctx, updates the submit timestamp, and replaces bi_end_io for every DIO bio even when sbi->iostat_enable is false. The byte accounting calls do not need an extra guard because f2fs_update_iostat() already checks sbi->iostat_enable. Only skip the DIO bio context setup when iostat is disabled. If iostat is enabled through sysfs before submission, the existing context allocation and latency accounting path is still used. QEMU benchmark on a 1GiB F2FS virtio-blk image, with iostat_enable=0, 4KiB O_DIRECT I/O over a 64MiB file, 50000 iterations per run: baseline patched direct_read median 65264.50 ns 55470.95 ns direct_read recheck 65553.75 ns 55470.95 ns direct_write median 68054.62 ns 56309.44 ns direct_write recheck 66873.51 ns 56309.44 ns Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: remove unneeded f2fs_is_compressed_page()	Chao Yu
	We have checked f2fs_is_compressed_page() before f2fs_compress_write_end_io(), so we don't need to check the status again, remove it. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: avoid unnecessary fscrypt_finalize_bounce_page()	Chao Yu
	fscrypt_finalize_bounce_page() should be called only if we use fs layer crypto, let's avoid unnecessary fscrypt_finalize_bounce_page() in error path of f2fs_write_compressed_pages(). BTW, fscrypt_finalize_bounce_page() will check mapping of bounced page before retrieving original page, so, previously it won't cause any issue w/ fscrypt_finalize_bounce_page(), but still we'd better avoid coupling w/ any logic inside fscrypt_finalize_bounce_page(). Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: avoid unnecessary sanity check on ckpt_valid_blocks	Chao Yu
	The calculation of sec->ckpt_valid_blocks are the same in both set_ckpt_valid_blocks() and sanity_check_valid_blocks(), so it doesn't necessary to call sanity_check_valid_blocks() right after set_ckpt_valid_blocks(). Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: misc cleanup in f2fs_record_stop_reason()	Chao Yu
	Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: fix wrong description in printed log	Chao Yu
	This patch fixes wrong description in printed log: "SSA and SIT" -> "SIT and SSA" Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: bound i_inline_xattr_size for non-inline-xattr inodes	Bryam Vargas
	When the flexible_inline_xattr feature is enabled, do_read_inode() loads the on-disk i_inline_xattr_size unconditionally: if (f2fs_sb_has_flexible_inline_xattr(sbi)) fi->i_inline_xattr_size = le16_to_cpu(ri->i_inline_xattr_size); but sanity_check_inode() only range-checks it when the inode also has the FI_INLINE_XATTR flag set. An inode that carries an inline dentry or inline data but not FI_INLINE_XATTR -- the normal layout for an inline directory -- therefore keeps a fully attacker-controlled i_inline_xattr_size from a crafted image. get_inline_xattr_addrs() returns that value with no flag gating, so it feeds the inode geometry: MAX_INLINE_DATA() = 4 * (CUR_ADDRS_PER_INODE - i_inline_xattr_size - 1) NR_INLINE_DENTRY() = MAX_INLINE_DATA() * BITS_PER_BYTE / (...) addrs_per_page() = CUR_ADDRS_PER_INODE - i_inline_xattr_size A large i_inline_xattr_size drives MAX_INLINE_DATA() and NR_INLINE_DENTRY() negative, so make_dentry_ptr_inline() sets d->max (int) to a negative value. The inline directory walk then compares an unsigned long bit_pos against that negative d->max, which is promoted to a huge unsigned bound, and reads far past the inline area: while (bit_pos < d->max) /* fs/f2fs/dir.c */ ... test_bit_le(bit_pos, d->bitmap) / d->dentry[bit_pos] ... Mounting a crafted image and reading such a directory triggers an out-of-bounds read in f2fs_fill_dentries(); the same underflow also corrupts ADDRS_PER_INODE for regular files. Validate i_inline_xattr_size against MAX_INLINE_XATTR_SIZE whenever the flexible_inline_xattr feature is enabled -- i.e. whenever the value is loaded from disk and consumed -- and keep the lower MIN_INLINE_XATTR_SIZE bound gated on inodes that actually carry an inline xattr, so legitimate inodes with i_inline_xattr_size == 0 are still accepted. Cc: stable@vger.kernel.org Fixes: 6afc662e68b5 ("f2fs: support flexible inline xattr size") Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: validate ACL entry sizes in f2fs_acl_from_disk()	Zhang Cen
	f2fs_acl_count() only validates the aggregate ACL xattr length. A malformed ACL can still place ACL_USER or ACL_GROUP in a slot that only contains struct f2fs_acl_entry_short bytes, and f2fs_acl_from_disk() then reads entry->e_id before verifying that a full entry fits. Require a short entry before reading e_tag and e_perm, and require a full entry before reading e_id for ACL_USER and ACL_GROUP. Return -EFSCORRUPTED from these new truncated-entry checks, while keeping the pre-existing -EINVAL paths unchanged. Validation reproduced this kernel report: KASAN slab-out-of-bounds in __f2fs_get_acl+0x6fb/0x7e0 RIP: 0033:0x7f4b835ea7aa The buggy address belongs to the object at ffff888114589960 which belongs to the cache kmalloc-8 of size 8 The buggy address is located 0 bytes to the right of allocated 8-byte region [ffff888114589960, ffff888114589968) Read of size 4 Call trace: dump_stack_lvl+0x66/0xa0 (?:?) print_report+0xce/0x630 (?:?) __f2fs_get_acl+0x6fb/0x7e0 (fs/f2fs/acl.c:169) srso_alias_return_thunk+0x5/0xfbef5 (?:?) __virt_addr_valid+0x224/0x430 (?:?) kasan_report+0xe0/0x110 (?:?) __f2fs_get_acl+0x5/0x7e0 (fs/f2fs/acl.c:169) __get_acl+0x281/0x380 (?:?) vfs_get_acl+0x10b/0x190 (?:?) do_get_acl+0x2a/0x410 (?:?) do_get_acl+0x9/0x410 (?:?) do_getxattr+0xe8/0x260 (?:?) filename_getxattr+0xd1/0x140 (?:?) do_getname+0x2d/0x2d0 (?:?) path_getxattrat+0x16c/0x200 (?:?) lock_release+0xc8/0x290 (?:?) cgroup_update_frozen+0x9d/0x320 (?:?) lockdep_hardirqs_on_prepare+0xea/0x1a0 (?:?) trace_hardirqs_on+0x1a/0x170 (?:?) _raw_spin_unlock_irq+0x28/0x50 (?:?) do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87) entry_SYSCALL_64_after_hwframe+0x77/0x7f (?:?) Cc: stable@kernel.org Fixes: af48b85b8cd3 ("f2fs: add xattr and acl functionalities") Assisted-by: Codex:gpt-5.5 Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	Revert "f2fs: remove non-uptodate folio from the page cache in move_data_block"	Zhaoyang Huang
	This reverts commit 9609dd704725a40cd63d915f2ab6c44248a44598. The kernel panics are keeping to be reported especially when the f2fs partition get almost full. By investigation, we find that the reason is one f2fs page got freed to buddy without being deleted from LRU and the root cause is the race happened in [2] which is enrolled by this commit. There are 3 race processes in this scenario, please find below for their main activities. The changed code in move_data_block() lets the GC path evict the tail-end folio from the page cache through folio_end_dropbehind(). Once folio_unmap_invalidate() removes the folio from mapping->i_pages, the page-cache references for all pages in the folio are dropped. The folio is then kept alive only by temporary external references, which allows a later split to operate on a folio whose subpages are no longer protected by page-cache references. After the page-cache references are gone, split_folio_to_order() can split the big folio into individual pages and put the resulting subpages back on the LRU. For tail pages beyond EOF, split removes them from the page cache and drops their page-cache references. A tail page can then remain on the LRU with PG_lru set while holding only the split caller's temporary reference. When free_folio_and_swap_cache() drops that final reference, the page enters the final folio_put() release path. In parallel, folio_isolate_lru() can observe the same tail page with a non-zero refcount and PG_lru set. It clears PG_lru before taking its own reference. If this races with the final folio_put() from the split path, __folio_put() sees PG_lru already cleared and skips lruvec_del_folio(). The page is then freed back to the allocator while its lru links are still present in the LRU list. A later LRU operation on a neighboring page detects the stale link and reports list corruption. [1] [ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88) [ 22.486130] ------------[ cut here ]------------ [ 22.486134] kernel BUG at lib/list_debug.c:67! [ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE [ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT) [ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154 [ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154 [ 22.488539] sp : ffffffc08006b830 [ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000 [ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0 [ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122 [ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060 [ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058 [ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003 [ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00 [ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c [ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010 [ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d [ 22.488647] Call trace: [ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P) [ 22.488661] __folio_put+0x2bc/0x434 [ 22.488670] folio_put+0x28/0x58 [ 22.488678] do_garbage_collect+0x1a34/0x2584 [ 22.488689] f2fs_gc+0x230/0x9b4 [ 22.488697] f2fs_fallocate+0xb90/0xdf4 [ 22.488706] vfs_fallocate+0x1b4/0x2bc [ 22.488716] __arm64_sys_fallocate+0x44/0x78 [ 22.488725] invoke_syscall+0x58/0xe4 [ 22.488732] do_el0_svc+0x48/0xdc [ 22.488739] el0_svc+0x3c/0x98 [ 22.488747] el0t_64_sync_handler+0x20/0x130 [ 22.488754] el0t_64_sync+0x1c4/0x1c8 [2] CPU0 (f2fs GC) CPU1 (split_folio_to_order) CPU2 (folio_isolate_lru) F: pagecache refs = n F: extra refs = GC + split F: PG_lru set move_data_block() folio = f2fs_grab_cache_folio(F) ... __folio_set_dropbehind(F) folio_unlock(F) folio_end_dropbehind(F) folio_unmap_invalidate(F) __filemap_remove_folio(F) folio_put_refs(F, n) folio_put(F) split_folio_to_order(F) folio_ref_freeze(F, 1) ... lru_add_split_folio(T) list_add_tail(&T->lru, &F->lru) folio_set_lru(T) __filemap_remove_folio(T) folio_put_refs(T, 1) /* T refcount == 1, PageLRU set / folio_isolate_lru(T) folio_test_clear_lru(T) free_folio_and_swap_cache(T) folio_put(T) / refcount: 1 -> 0 / __folio_put(T) __page_cache_release(T) folio_test_lru(T) == false / skip lruvec_del_folio(T) */ free_frozen_pages(T) folio_get(T) lruvec_del_folio(T) later: list_del(adjacent->lru) next == &T->lru next->prev == LIST_POISON / PCP freelist BUG Cc: stable@vger.kernel.org Fixes: 9609dd704725 ("f2fs: remove non-uptodate folio from the page cache in move_data_block") Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: Split f2fs_write_end_io()	Bart Van Assche
	Prepare for running most of the write completion work asynchronously. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: Rename f2fs_post_read_wq into f2fs_wq	Bart Van Assche
	Rename f2fs_post_read_wq into f2fs_wq. Create it unconditionally. Prepare for using this workqueue for completing write bios. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: Prepare for supporting delayed bio completion	Bart Van Assche
	Use bio frontpadding to allocate memory for a work_struct when allocating a bio. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: reject setattr size changes on large folio files	Wenjie Qi
	F2FS large folios are only enabled for immutable non-compressed files. Writable open and writable mmap reject such mappings, but truncate(2) through f2fs_setattr() misses the same guard. If FS_IMMUTABLE_FL is cleared while the inode is still cached, the mapping can keep large-folio support and ATTR_SIZE can change i_size. Reject size changes in that state. Cc: stable@kernel.org Fixes: 05e65c14ea59 ("f2fs: support large folio for immutable non-compressed case") Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: validate dentry name length before lookup compares it	Samuel Moelius
	The f2fs dentry lookup path can use the on-disk name length before checking that the name fits in the dentry filename area. A corrupted dentry can then make lookup read beyond the filename slots. The bounds check needs to happen before any comparison that consumes the name length from disk. Reject dentries with invalid name lengths before comparing their names. Assisted-by: Codex:gpt-5.5-cyber-preview Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: validate inline dentry name lengths before conversion	Samuel Moelius
	Inline dentry conversion copies names out of the inline dentry area before checking that each recorded name length fits in the available filename slots. A corrupted image can therefore make the conversion path read past the inline filename storage while building the regular dentry block. Validate each inline dentry name length against the inline filename area before copying it. Assisted-by: Codex:gpt-5.5-cyber-preview Signed-off-by: Samuel Moelius <samuel.moelius@trailofbits.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: read COW data with the original inode during atomic write	Mikhail Lobanov
	When updating an atomic-write file, f2fs_write_begin() may read the previously written data back from the COW inode: prepare_atomic_write_begin() locates the block in the COW inode and sets use_cow, and the read bio is then built with the COW inode: f2fs_submit_page_read(use_cow ? F2FS_I(inode)->cow_inode : inode, ...); and f2fs_grab_read_bio() decides whether to schedule fs-layer decryption (STEP_DECRYPT) for the bio based on that inode via fscrypt_inode_uses_fs_layer_crypto(). However, the folio being filled belongs to the original inode (folio->mapping->host == inode), and the data stored in the COW block was encrypted (or left as plaintext) using the original inode's context, not the COW inode's -- see f2fs_encrypt_one_page(), which keys off fio->page->mapping->host. fscrypt_decrypt_pagecache_blocks() likewise operates on folio->mapping->host. The COW inode is created as a tmpfile in the parent directory and inherits its encryption policy from there. With test_dummy_encryption the newly created COW inode gets the dummy policy and becomes encrypted, while a pre-existing regular file -- created before the policy applied, e.g. already present in the on-disk image -- stays unencrypted. The read path then sets STEP_DECRYPT based on the encrypted COW inode and calls fscrypt_decrypt_pagecache_blocks() on a folio whose host (the unencrypted original inode) has a NULL ->i_crypt_info, dereferencing it: Oops: general protection fault, probably for non-canonical address ... KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] RIP: 0010:fscrypt_decrypt_pagecache_blocks+0xa0/0x310 Workqueue: f2fs_post_read_wq f2fs_post_read_work Call Trace: fscrypt_decrypt_bio+0x1eb/0x340 f2fs_post_read_work+0xba/0x140 process_one_work+0x91c/0x1a40 worker_thread+0x677/0xe90 kthread+0x2bc/0x3a0 The COW inode is only needed to locate the on-disk block, and that block address is already resolved into @blkaddr by prepare_atomic_write_begin() via __find_data_block(cow_inode, ...); f2fs_submit_page_read() then reads from that physical @blkaddr directly, so the inode argument only selects the post-read crypto context, not which block is fetched. Reading with @inode therefore returns the same (latest, not-yet-committed) COW data, while making both the fs-layer decryption decision and the inline crypto path use the correct (original inode's) key. With the COW inode no longer used at the read site, the use_cow flag has no remaining consumer; drop it from f2fs_write_begin() and prepare_atomic_write_begin(). Fixes: 591fc34e1f98 ("f2fs: use cow inode data when updating atomic write") Cc: stable@vger.kernel.org Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: skip inode folio lookup for cached overwrite	Wenjie Qi
	prepare_write_begin() first gets the inode folio and builds a dnode, then checks the read extent cache. For an ordinary overwrite of a non-inline and non-compressed file, an extent-cache hit already gives the data block address and the following path does not need to allocate or update any node state. Check the read extent cache before fetching the inode folio for that narrow case. Keep the existing paths for inline data, compressed files, and writes that may extend past EOF, where the helper may need inline conversion, compression preparation, or block reservation. This avoids a node-folio lookup in the buffered overwrite fast path when the mapping is already cached. In a QEMU/KASAN x86_64 VM, using a small buffered overwrite workload on an existing 1MiB file, median time improved as follows: 64-byte overwrites: 1724.93 ns/write -> 1560.24 ns/write 256-byte overwrites: 1713.38 ns/write -> 1577.85 ns/write Function profiling of 20k 64-byte overwrites showed f2fs_get_inode_folio() calls drop from 20004 to 4. Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: keep atomic write retry from zeroing original data	Wenjie Qi
	A partial atomic write reserves a block in the COW inode before reading the original data page for the untouched bytes in that page. If that read fails, write_begin returns an error but leaves the COW inode entry as NEW_ADDR. A retry of the same partial write then finds the COW entry, treats it as existing COW data, and f2fs_write_begin() zeroes the whole folio because blkaddr is NEW_ADDR. If the retry is committed, the bytes outside the retried write range are committed as zeroes instead of preserving the original file contents. Only use the COW inode as the read source when it already has a real data block. If the COW entry is still NEW_ADDR, treat it as a reservation to reuse: keep reading the old data from the original inode and avoid reserving or accounting the same atomic block again. Cc: stable@kernel.org Fixes: 3db1de0e582c ("f2fs: change the current atomic write way") Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: validate orphan inode entry count	Wenjie Qi
	f2fs_recover_orphan_inodes() trusts the orphan block entry_count when replaying orphan inodes from the checkpoint pack. A corrupted entry_count larger than F2FS_ORPHANS_PER_BLOCK makes the recovery loop read past the ino[] array and interpret footer or following data as inode numbers. On a crafted image, mounting an unpatched kernel can drive orphan recovery into f2fs_bug_on() and panic the kernel. Validate entry_count before consuming entries so corrupted checkpoint data fails the mount with -EFSCORRUPTED and requests fsck instead. Set ERROR_INCONSISTENT_ORPHAN as well, so the corruption reason can be recorded in the superblock s_errors[] field. This gives fsck a persistent hint even though mount-time orphan recovery failure may leave no chance to persist SBI_NEED_FSCK through a checkpoint. Cc: stable@kernel.org Fixes: 127e670abfa7 ("f2fs: add checkpoint operations") Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: honor per-I/O write streams for direct writes	Wenjie Qi
	io_uring can pass a per-I/O write stream through kiocb->ki_write_stream, and block direct I/O propagates that value to bio->bi_write_stream. F2FS added FDP stream mapping for DATA writes, but its direct write submit hook always rewrites bio->bi_write_stream from the inode write hint and F2FS temperature. As a result, a direct write with an explicit io_uring write_stream is submitted to the F2FS-selected stream instead of the user-requested stream. Validate an explicit write stream before starting F2FS direct I/O, pass the kiocb through the iomap private pointer, and preserve the per-I/O stream in the direct write bio. When no per-I/O stream is supplied, keep using the existing F2FS temperature-to-stream mapping. Fixes: 42f7a7a50a33 ("f2fs: map data writes to FDP streams") Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: fix to do sanity check on f2fs_get_node_folio_ra()	Chao Yu
	kernel BUG at fs/f2fs/file.c:845! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5336 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:f2fs_do_truncate_blocks+0x1115/0x1140 fs/f2fs/file.c:845 Code: fc fc 90 0f 0b e8 8b 9d 9a fd 90 0f 0b e8 83 9d 9a fd 48 89 df 48 c7 c6 60 d1 1a 8c e8 54 f1 fc fc 90 0f 0b e8 6c 9d 9a fd 90 <0f> 0b e8 64 9d 9a fd 90 0f 0b 90 e9 93 fd ff ff e8 56 9d 9a fd 90 RSP: 0018:ffffc9000e4474c0 EFLAGS: 00010283 RAX: ffffffff842b1d34 RBX: 0000000000000003 RCX: 0000000000100000 RDX: ffffc9000f03a000 RSI: 0000000000035503 RDI: 0000000000035504 RBP: ffffc9000e447608 R08: ffff8880123b0000 R09: 0000000000000002 R10: 00000000fffffffe R11: 0000000000000002 R12: 0000000000000001 R13: 0000000000000000 R14: 1ffff92001c88ea0 R15: 00000000ffff039c FS: 00007f7e02ee36c0(0000) GS:ffff88808c887000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff0305c4000 CR3: 0000000012d4c000 CR4: 0000000000352ef0 Call Trace: <TASK> f2fs_truncate_blocks+0x10a/0x300 fs/f2fs/file.c:882 f2fs_truncate+0x471/0x7c0 fs/f2fs/file.c:940 f2fs_evict_inode+0xa3f/0x1ac0 fs/f2fs/inode.c:907 evict+0x61e/0xb10 fs/inode.c:841 f2fs_fill_super+0x5f43/0x78f0 fs/f2fs/super.c:5224 get_tree_bdev_flags+0x431/0x4f0 fs/super.c:1694 vfs_get_tree+0x92/0x2a0 fs/super.c:1754 fc_mount fs/namespace.c:1193 [inline] do_new_mount_fc fs/namespace.c:3758 [inline] do_new_mount+0x341/0xd30 fs/namespace.c:3834 do_mount fs/namespace.c:4167 [inline] __do_sys_mount fs/namespace.c:4383 [inline] __se_sys_mount+0x31d/0x420 fs/namespace.c:4360 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f count = ADDRS_PER_PAGE(dn.node_folio, inode); count -= dn.ofs_in_node; f2fs_bug_on(sbi, count < 0); The fuzz test will trigger above bug_on in f2fs. The root cause should be: in the corrupted inode, there is a direct node which has the same ino and nid in its footer, so in f2fs_do_truncate_blocks(), after f2fs_get_dnode_of_data() finds such dnode: 1) ADDRS_PER_PAGE(dn.node_folio, inode) will return 923 2) once dn.ofs_in_node points to addr[923, 1017] Then it will trigger the system panic. Let's introduce NODE_TYPE_NON_IXNODE to indicate current node should not be an inode or xattr node, and then use it in below path to detect inconsistent node chain in inode mapping table: - f2fs_do_truncate_blocks - f2fs_get_dnode_of_data - f2fs_get_node_folio_ra - __get_node_folio - f2fs_sanity_check_node_footer - case NODE_TYPE_NON_IXNODE -> check whether it is inode\|xnode Cc: stable@kernel.org Reported-by: syzbot+2488d8d751b27f7ce268@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69fa3697.170a0220.59368.0018.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	Revert: "f2fs: check in-memory sit version bitmap"	Chao Yu
	Commit ae27d62e6bef ("f2fs: check in-memory sit version bitmap") added a mirror for sit version bitmap, it expects to detect in-memory corruption, however we never got any reports from the check points for almost decade, let's remove the code, it can help to save memories. Cc: wallentx <william.allentx@gmail.com> Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	Revert: "f2fs: check in-memory block bitmap"	Chao Yu
	Commit 355e78913c0d ("f2fs: check in-memory block bitmap") added a mirror for valid block bitmap, it expects to detect in-memory corruption, however we never got any reports from the check points for almost decade, let's remove the code, it can help to save memories. Cc: wallentx <william.allentx@gmail.com> Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: avoid false shutdown fserror reports	Wenjie Qi
	F2FS records image errors and checkpoint-stop reasons through the same s_error_work worker. The ordinary f2fs_handle_error() path only updates s_errors, but the worker still calls fserror_report_shutdown() unconditionally after committing the superblock. As a result, a metadata corruption report can be followed by a synthetic FAN_FS_ERROR event with ESHUTDOWN and an invalid superblock file handle, even though no stop reason was recorded. Track whether save_stop_reason() actually changed the stop_reason array and only report the shutdown fserror for that case. Pure s_errors updates still commit the superblock, but no longer generate a false shutdown event. Fixes: 50faed607d32 ("f2fs: support to report fserror") Cc: stable@kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: validate compress cache inode only when enabled	Wenjie Qi
	F2FS_COMPRESS_INO() uses NM_I(sbi)->max_nid as the synthetic inode number for the compressed page cache inode. That inode only exists when the compress_cache mount option is enabled. When compress_cache is disabled, max_nid is outside the valid inode range. A corrupted directory entry that points to ino == max_nid should therefore be rejected by f2fs_check_nid_range(). However, is_meta_ino() currently treats F2FS_COMPRESS_INO() as a meta inode unconditionally, so f2fs_iget() bypasses do_read_inode() and its nid range check, and instantiates a fake internal inode instead. Gate the compressed cache inode case on COMPRESS_CACHE, matching f2fs_init_compress_inode(). With compress_cache disabled, ino == max_nid now follows the normal inode path and is rejected as an out-of-range nid. Cc: stable@kernel.org Fixes: 6ce19aff0b8c ("f2fs: compress: add compress_inode to cache compressed blocks") Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: pass correct iostat type for single node writes	Wenjie Qi
	f2fs_write_single_node_folio() takes an io_type argument, but still passes FS_GC_NODE_IO to __write_node_folio() unconditionally. This was harmless while the helper was only used by f2fs_move_node_folio(), whose caller passes FS_GC_NODE_IO. However, commit fe9b8b30b971 ("f2fs: fix inline data not being written to disk in writeback path") made f2fs_inline_data_fiemap() call the helper with FS_NODE_IO for FIEMAP_FLAG_SYNC. Honor the caller supplied io_type so inline-data FIEMAP sync writeback is accounted as normal node IO instead of GC node IO, while the GC path continues to pass FS_GC_NODE_IO explicitly. Cc: stable@kernel.org Fixes: fe9b8b30b971 ("f2fs: fix inline data not being written to disk in writeback path") Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: fix missing read bio submission on large folio error	Wenjie Qi
	f2fs_read_data_large_folio() can keep a read bio across multiple readahead folios. If a later folio hits an error before any of its blocks are added to the bio, folio_in_bio is false and the current error path returns immediately after ending that folio. This can leave the bio accumulated for earlier folios unsubmitted. Those folios then never receive read completion, and readers can wait indefinitely on the locked folios. Route errors through the common out path so any pending bio is submitted before returning. Stop consuming more readahead folios once an error is seen, and only wait on and clear the current folio when it was actually added to the bio. Cc: stable@kernel.org Fixes: a5d8b9d94e18 ("f2fs: fix to unlock folio in f2fs_read_data_large_folio()") Signed-off-by: Wenjie Qi <qiwenjie@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: atomic: fix UAF issue on f2fs_inode_info.atomic_inode	Chao Yu
	- ioctl(F2FS_IOC_GARBAGE_COLLECT_RANGE) - shrink - f2fs_gc - gc_data_segment - ra_data_block(cow_inode) - mapping = F2FS_I(inode)->atomic_inode->i_mapping : f2fs_is_cow_file(cow_inode) is true - f2fs_evict_inode(atomic_inode) - clear_inode_flag(fi->cow_inode, FI_COW_FILE) - F2FS_I(fi->cow_inode)->atomic_inode = NULL ... - truncate_inode_pages_final(atomic_inode) - f2fs_grab_cache_folio(mapping) : create folio in atomic_inode->mapping - clear_inode(atomic_inode) - BUG_ON(atomic_inode->i_data.nrpages) We need to add a reference on fi->atomic_inode before using its mapping field during garbage collection, otherwise, it will cause UAF issue. Cc: stable@kernel.org Cc: Daeho Jeong <daehojeong@google.com> Cc: Sunmin Jeong <s_min.jeong@samsung.com> Fixes: 3db1de0e582c ("f2fs: change the current atomic write way") Fixes: f18d00769336 ("f2fs: use meta inode for GC of COW file") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: fix potential deadlock in gc_merge path of f2fs_balance_fs()	Chao Yu
	When we mount device w/ gc_merge mount option, we may suffer below potential deadlock: Kworker GC trehad Truncator - f2fs_write_cache_pages - f2fs_write_single_data_page - f2fs_do_write_data_page - folio_start_writeback --- set writeback flag on folio - f2fs_outplace_write_data : cached folio in internal bio cache - f2fs_balance_fs - wake_up(gc_thread) : wake up gc thread to run foreground GC - finish_wait(fggc_wq) : wait on the waitqueue --- wait on GC thread to finish the work - truncate_inode_pages_range - __filemap_get_folio(, FGP_LOCK) --- lock folio - truncate_inode_partial_folio - folio_wait_writeback --- wait on writeback being cleared - do_garbage_collect - move_data_page - f2fs_get_lock_data_folio - lock on folio --- blocked on folio's lock In order to avoid such deadlock, let's call below functions to commit cached bios in GC_MERGE path of f2fs_balance_fs() as the same as we did in NOGC_MERGE path. - f2fs_submit_merged_write(sbi, DATA); - f2fs_submit_all_merged_ipu_writes(sbi); Cc: stable@kernel.org Fixes: 351df4b20115 ("f2fs: add segment operations") Cc: Ruipeng Qi <ruipengqi3@gmail.com> Reported: Sandeep Dhavale <dhavale@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Chao Yu <chaseyu@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: add logs in f2fs_disable_checkpoint()	Chao Yu
	In order to troubleshoot in which step we may block on during mount w/ checkpoint_disable mount option. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: add iostat latency tracking for direct IO	liujinbao1
	F2FS did not collect iostat latency for direct IO reads and writes, hook iomap_dio_ops.submit_io to bind an iostat context and record the submission timestamp. Replace bi_end_io with f2fs_dio_end_bio() to collect IO latency on completion before calling back to the original iomap_dio_bio_end_io(), to add iostat latency tracking support for F2FS DIO. Signed-off-by: shengyong1 <shengyong1@xiaomi.com> Signed-off-by: liujinbao1 <liujinbao1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: optimize representative type determination in GC	Daeho Jeong
	In large section mode, do_garbage_collect() previously determined the section's representative type by looking only at the first segment of the section. However, if data was fsynced into an area previously used as a node section, and this area is recovered during roll-forward recovery after sudden power off (SPO), GC would incorrectly assume the section's type based on an empty or obsolete first segment. This caused the recovered data segment to be misunderstood as being stuck inside a node section, triggering false inconsistency panics (Inconsistent segment type in SSA and SIT) and subsequent mount failures. This patch optimizes do_garbage_collect() to determine the section's representative type by identifying the first segment that actually contains valid blocks (valid_blocks > 0) during the main GC loop. This eliminates false alarms from empty/obsolete leading segments while maintaining strict section-level type consistency checks for genuine corruption. Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: Add trace_f2fs_fault_report	liujinbao1
	Add trace_f2fs_fault_report to trigger reporting upon f2fs_bug_on, need_fsck, stop_checkpoint, and handle_eio. Since f2fs_bug_on and need_fsck can be triggered in hundreds of scenarios, define set_sbi_flag as a macro to help capture the effective fault function and line number. Signed-off-by: shengyong1 <shengyong1@xiaomi.com> Signed-off-by: liujinbao1 <liujinbao1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: annotate lockless NAT counter reads	Cen Zhang
	nat_cnt[] is updated while callers hold nat_tree_lock, but F2FS samples the counters locklessly in f2fs_available_free_memory(), excess_dirty_nats(), and excess_cached_nats(). Those helpers only steer cache reclaim and background sync heuristics; they do not control NAT entry lifetime or checkpoint correctness. Document the intent with data_race(READ_ONCE()) and a short comment instead of adding locking to the balance path. Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
3 days	f2fs: annotate lockless last_time[] accesses	Cen Zhang
	f2fs stores mount-wide activity timestamps in sbi->last_time[] and samples them from background discard, GC, and balance paths without a dedicated lock. The timestamps are used as best-effort heuristics to decide whether background work should run now or sleep a bit longer. The current helpers use plain loads and stores, so KCSAN can report races between frequent foreground updates and background readers. Exact freshness is not required here, but the intentional lockless accesses should be marked explicitly. Use WRITE_ONCE() in f2fs_update_time() and READ_ONCE() in f2fs_time_over() and f2fs_time_to_wait(). This preserves the existing heuristic behavior and avoids adding locking to hot paths. Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
4 days	erofs: handle 48-bit blocks_hi for compressed inodes	Zhan Xusheng
	Combine i_nb.blocks_hi with i_u.blocks_lo when computing inode->i_blocks for compressed inodes, mirroring the startblk_hi handling for unencoded inodes a few lines above. Also evaluate the shift in u64 to avoid truncation. Fixes: efb2aef569b3 ("erofs: add encoded extent on-disk definition") Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
4 days	erofs: remove fscache backend entirely	Gao Xiang
	EROFS over fscache was introduced to provide image lazy pulling functionality. After the feature landed, the fscache subsystem made netfs a new hard dependency, which is unexpected for a local filesystem and has an kernel-defined caching hierarchy which could be inflexible compared to the fanotify pre-content hooks. Therefore, this feature has been deprecated for almost two years. As EROFS file-backed mounts and fanotify pre-content hooks both upstream for a while and already providing equivalent functionality (erofs-utils has supported fanotify pre-content hooks), let's remove the fscache backend now. The main application of this feature is Nydus [1], and they plan to move to use fanotify pre-content hooks in the near future too. I hope this patch can be merged into Linux 7.2, which is also motivated by newly found implementation issues [2][3] that are not worth investigating given the deprecation and limited development resources. The associated fscache/cachefiles cleanup patch will follow separately through the vfs tree (netfs) later: it seems fine since the codebase is isolated by CONFIG_CACHEFILES_ONDEMAND. [1] https://github.com/dragonflyoss/nydus/blob/v2.1.0/docs/nydus-fscache.md [2] https://github.com/dragonflyoss/nydus/pull/1824 [3] https://lore.kernel.org/r/20260619135800.1594811-1-michael.bommarito@gmail.com Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
4 days	erofs: simplify RCU read critical sections	Gao Xiang
	- use scoped_guard() for RCU read critical section in z_erofs_decompress_kickoff(); - simplify the RCU critical section loop in z_erofs_pcluster_begin(). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
4 days	erofs: add sparse support to pcluster layout	Gao Xiang
	Although zeros can be compressed transparently on EROFS using fixed-size output compression so that it is never prioritized in the Android use cases, indicating entire pclusters as holes is still useful to preserve holes in the sparse datasets; otherwise overlayfs will allocate more space when copying up, and SEEK_HOLE won't report any hole. This patch introduces two ways to mark a pcluster as a hole: - A new Z_EROFS_LI_HOLE compatible flag (bit 14) in the HEAD lcluster advise field for non-compact (full) indexes; - A 0-block CBLKCNT value on the first NONHEAD lcluster. The hole tag is preferred for maximum compatibility since pre-existing kernels that do not understand Z_EROFS_LI_HOLE will decompress at the stored blkaddr (the same blkaddr will be shared among all sparse pclusters). Only the 0-block CBLKCNT approach also works for compact indexes, but it is limited to big pclusters and new kernels. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
4 days	Merge tag 'mm-nonmm-stable-2026-06-21-10-22' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "taskstats: fix TGID dead-thread stat retention" (Yiyang Chen) Fix a taskstats TGID aggregation bug where fields added in the TGID query path were not preserved after thread exit, and adds a kselftest covering the regression. - "lib/tests: string_helpers: Slight improvements" (Andy Shevchenko) Improve lib/tests/string_helpers_kunit.c a little - "lib/base64: decode fixes" (Josh Law) Address minor issues in lib/base64.c - "selftests/filelock: Make output more kselftestish" (Mark Brown) Make the output from the ofdlocks test a bit easier for tooling to work with. Also ignore the generated file - "uaccess: unify inline vs outline copy_{from,to}_user() selection" (Yury Norov) Simplify the usercopy code by removing the selectability of inlining copy_{from,to}_user(). - "ocfs2: validate inline xattr header consumers" (ZhengYuan Huang) Fix a number of possible issues in the ocfs2 xattr code - "lib and lib/cmdline enhancements" (Dmitry Antipov) Provide additional robustness checking in the cmdline handling code and its in-kernel testing and selftests - "cleanup the RAID6 P/Q library" (Christoph Hellwig) Clean up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries - "ocfs2: harden inode validators against forged metadata" (Michael Bommarito) Add three structural checks to OCFS2 dinode validation so malformed on-disk fields are rejected before ocfs2_populate_inode() copies them into the in-core inode - "lib/raid: replace __get_free_pages() call with kmalloc()" (Mike Rapoport) Clean up the lib/raid code by using kmalloc() in more places * tag 'mm-nonmm-stable-2026-06-21-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (108 commits) ocfs2: fix circular locking dependency in ocfs2_dio_end_io_write ocfs2: fix NULL h_transaction deref in ocfs2_assure_trans_credits lib: interval_tree_test: validate benchmark parameters ocfs2: avoid moving extents to occupied clusters treewide: fix transposed "sign" typos and update spelling.txt ocfs2: fix UBSAN array-index-out-of-bounds in ocfs2_sum_rightmost_rec fat: reject BPB volumes whose data area starts beyond total sectors selftests/uevent: increase __UEVENT_BUFFER_SIZE to avoid ENOBUFS on busy systems lib/test_firmware: allocate the configured into_buf size fs: efs: remove unneeded debug prints checkpatch: cuppress warnings when Reported-by: is followed by Link: MAINTAINERS: add Alexander as a kcov reviewer mailmap: update Alexander Sverdlin's Email addresses fs: fat: inode: replace sprintf() with scnprintf() ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release() ocfs2/dlm: require a ref for locking_state debugfs open ocfs2: reject FITRIM ranges shorter than a cluster ocfs2: validate fast symlink target during inode read ocfs2: add journal NULL check in ocfs2_checkpoint_inode() ...