Age | Commit message (Collapse) | Author |
|
Pull more xfs updates from Darrick Wong:
"The second large pile of new stuff for 5.10, with changes even more
monumental than last week!
We are formally announcing the deprecation of the V4 filesystem format
in 2030. All users must upgrade to the V5 format, which contains
design improvements that greatly strengthen metadata validation,
supports reflink and online fsck, and is the intended vehicle for
handling timestamps past 2038. We're also deprecating the old Irix
behavioral tweaks in September 2025.
Coming along for the ride are two design changes to the deferred
metadata ops subsystem. One of the improvements is to retain correct
logical ordering of tasks and subtasks, which is a more logical design
for upper layers of XFS and will become necessary when we add atomic
file range swaps and commits. The second improvement to deferred ops
improves the scalability of the log by helping the log tail to move
forward during long-running operations. This reduces log contention
when there are a large number of threads trying to run transactions.
In addition to that, this fixes numerous small bugs in log recovery;
refactors logical intent log item recovery to remove the last
remaining place in XFS where we could have nested transactions; fixes
a couple of ways that intent log item recovery could fail in ways that
wouldn't have happened in the regular commit paths; fixes a deadlock
vector in the GETFSMAP implementation (which improves its performance
by 20%); and fixes serious bugs in the realtime growfs, fallocate, and
bitmap handling code.
Summary:
- Deprecate the V4 filesystem format, some disused mount options, and
some legacy sysctl knobs now that we can support dates into the
25th century. Note that removal of V4 support will not happen until
the early 2030s.
- Fix some probles with inode realtime flag propagation.
- Fix some buffer handling issues when growing a rt filesystem.
- Fix a problem where a BMAP_REMAP unmap call would free rt extents
even though the purpose of BMAP_REMAP is to avoid freeing the
blocks.
- Strengthen the dabtree online scrubber to check hash values on
child dabtree blocks.
- Actually log new intent items created as part of recovering log
intent items.
- Fix a bug where quotas weren't attached to an inode undergoing bmap
intent item recovery.
- Fix a buffer overrun problem with specially crafted log buffer
headers.
- Various cleanups to type usage and slightly inaccurate comments.
- More cleanups to the xattr, log, and quota code.
- Don't run the (slower) shared-rmap operations on attr fork
mappings.
- Fix a bug where we failed to check the LSN of finobt blocks during
replay and could therefore overwrite newer data with older data.
- Clean up the ugly nested transaction mess that log recovery uses to
stage intent item recovery in the correct order by creating a
proper data structure to capture recovered chains.
- Use the capture structure to resume intent item chains with the
same log space and block reservations as when they were captured.
- Fix a UAF bug in bmap intent item recovery where we failed to
maintain our reference to the incore inode if the bmap operation
needed to relog itself to continue.
- Rearrange the defer ops mechanism to finish newly created subtasks
of a parent task before moving on to the next parent task.
- Automatically relog intent items in deferred ops chains if doing so
would help us avoid pinning the log tail. This will help fix some
log scaling problems now and will facilitate atomic file updates
later.
- Fix a deadlock in the GETFSMAP implementation by using an internal
memory buffer to reduce indirect calls and copies to userspace,
thereby improving its performance by ~20%.
- Fix various problems when calling growfs on a realtime volume would
not fully update the filesystem metadata.
- Fix broken Kconfig asking about deprecated XFS when XFS is
disabled"
* tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
xfs: fix high key handling in the rt allocator's query_range function
xfs: annotate grabbing the realtime bitmap/summary locks in growfs
xfs: make xfs_growfs_rt update secondary superblocks
xfs: fix realtime bitmap/summary file truncation when growing rt volume
xfs: fix the indent in xfs_trans_mod_dquot
xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
xfs: fix deadlock and streamline xfs_getfsmap performance
xfs: limit entries returned when counting fsmap records
xfs: only relog deferred intent items if free space in the log gets low
xfs: expose the log push threshold
xfs: periodically relog deferred intent items
xfs: change the order in which child and parent defer ops are finished
xfs: fix an incore inode UAF in xfs_bui_recover
xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
xfs: clean up bmap intent item recovery checking
xfs: xfs_defer_capture should absorb remaining transaction reservation
xfs: xfs_defer_capture should absorb remaining block reservations
xfs: proper replay of deferred ops queued during log recovery
xfs: remove XFS_LI_RECOVERED
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Support directly accessing host page cache from virtiofs. This can
improve I/O performance for various workloads, as well as reducing
the memory requirement by eliminating double caching. Thanks to Vivek
Goyal for doing most of the work on this.
- Allow automatic submounting inside virtiofs. This allows unique
st_dev/ st_ino values to be assigned inside the guest to files
residing on different filesystems on the host. Thanks to Max Reitz
for the patches.
- Fix an old use after free bug found by Pradeep P V K.
* tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
virtiofs: calculate number of scatter-gather elements accurately
fuse: connection remove fix
fuse: implement crossmounts
fuse: Allow fuse_fill_super_common() for submounts
fuse: split fuse_mount off of fuse_conn
fuse: drop fuse_conn parameter where possible
fuse: store fuse_conn in fuse_req
fuse: add submount support to <uapi/linux/fuse.h>
fuse: fix page dereference after free
virtiofs: add logic to free up a memory range
virtiofs: maintain a list of busy elements
virtiofs: serialize truncate/punch_hole and dax fault path
virtiofs: define dax address space operations
virtiofs: add DAX mmap support
virtiofs: implement dax read/write operations
virtiofs: introduce setupmapping/removemapping commands
virtiofs: implement FUSE_INIT map_alignment field
virtiofs: keep a list of free dax memory ranges
virtiofs: add a mount option to enable dax
virtiofs: set up virtio_fs dax_device
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs updates from Damien Le Moal:
"Add an 'explicit-open' mount option to automatically issue a
REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
is open for writing for the first time.
This avoids 'insufficient zone resources' errors for write operations
on some drives with limited zone resources or on ZNS drives with a
limited number of active zones. From Johannes"
* tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: document the explicit-open mount option
zonefs: open/close zone on file open/close
zonefs: provide no-lock zonefs_io_error variant
zonefs: introduce helper for zone management
|
|
Merge yet more updates from Andrew Morton:
"Subsystems affected by this patch series: mm (memcg, migration,
pagemap, gup, madvise, vmalloc), ia64, and misc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
mm: remove duplicate include statement in mmu.c
mm: remove the filename in the top of file comment in vmalloc.c
mm: cleanup the gfp_mask handling in __vmalloc_area_node
mm: remove alloc_vm_area
x86/xen: open code alloc_vm_area in arch_gnttab_valloc
xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv
drm/i915: use vmap in i915_gem_object_map
drm/i915: stop using kmap in i915_gem_object_map
drm/i915: use vmap in shmem_pin_map
zsmalloc: switch from alloc_vm_area to get_vm_area
mm: allow a NULL fn callback in apply_to_page_range
mm: add a vmap_pfn function
mm: add a VM_MAP_PUT_PAGES flag for vmap
mm: update the documentation for vfree
mm/madvise: introduce process_madvise() syscall: an external memory hinting API
pid: move pidfd_get_pid() to pid.c
mm/madvise: pass mm to do_madvise
selftests/vm: 10x speedup for hmm-tests
binfmt_elf: take the mmap lock around find_extend_vma()
mm/gup_benchmark: take the mmap lock around GUP
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull more ubi and ubifs updates from Richard Weinberger:
"UBI:
- Correctly use kthread_should_stop in ubi worker
UBIFS:
- Fixes for memory leaks while iterating directory entries
- Fix for a user triggerable error message
- Fix for a space accounting bug in authenticated mode"
* tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: journal: Make sure to not dirty twice for auth nodes
ubifs: setflags: Don't show error message when vfs_ioc_setflags_prepare() fails
ubifs: ubifs_jnl_change_xattr: Remove assertion 'nlink > 0' for host inode
ubi: check kthread_should_stop() after the setting of task state
ubifs: dent: Fix some potential memory leaks while iterating entries
ubifs: xattr: Fix some potential memory leaks while iterating entries
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull ubifs updates from Richard Weinberger:
- Kernel-doc fixes
- Fixes for memory leaks in authentication option parsing
* tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: mount_ubifs: Release authentication resource in error handling path
ubifs: Don't parse authentication mount options in remount process
ubifs: Fix a memleak after dumping authentication mount options
ubifs: Fix some kernel-doc warnings in tnc.c
ubifs: Fix some kernel-doc warnings in replay.c
ubifs: Fix some kernel-doc warnings in gc.c
ubifs: Fix 'hash' kernel-doc warning in auth.c
|
|
Patch series "introduce memory hinting API for external process", v9.
Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With
that, application could give hints to kernel what memory range are
preferred to be reclaimed. However, in some platform(e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.
To solve the concern, this patch introduces new syscall -
process_madvise(2). Bascially, it's same with madvise(2) syscall but it
has some differences.
1. It needs pidfd of target process to provide the hint
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
moment. Other hints in madvise will be opened when there are explicit
requests from community to prevent unexpected bugs we couldn't support.
3. Only privileged processes can do something for other process's
address space.
For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.
This patch (of 3):
In upcoming patches, do_madvise will be called from external process
context so we shouldn't asssume "current" is always hinted process's
task_struct.
Furthermore, we must not access mm_struct via task->mm, but obtain it via
access_mm() once (in the following patch) and only use that pointer [1],
so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
safe, so we can use them further down the call stack.
And let's pass current->mm as arguments of do_madvise so it shouldn't
change existing behavior but prepare next patch to make review easy.
[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
create_elf_tables() runs after setup_new_exec(), so other tasks can
already access our new mm and do things like process_madvise() on it. (At
the time I'm writing this commit, process_madvise() is not in mainline
yet, but has been in akpm's tree for some time.)
While I believe that there are currently no APIs that would actually allow
another process to mess up our VMA tree (process_madvise() is limited to
MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
under which no syscalls have been executed yet), this seems like an
accident waiting to happen.
Let's make sure that we always take the mmap lock around GUP paths as long
as another process might be able to see the mm.
(Yes, this diff looks suspicious because we drop the lock before doing
anything with `vma`, but that's because we actually don't do anything with
it apart from the NULL check.)
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently the remote memcg charging API consists of two functions:
memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
memcg value, which overwrites the memcg of the current task.
memalloc_use_memcg(target_memcg);
<...>
memalloc_unuse_memcg();
It works perfectly for allocations performed from a normal context,
however an attempt to call it from an interrupt context or just nest two
remote charging blocks will lead to an incorrect accounting. On exit from
the inner block the active memcg will be cleared instead of being
restored.
memalloc_use_memcg(target_memcg);
memalloc_use_memcg(target_memcg_2);
<...>
memalloc_unuse_memcg();
Error: allocation here are charged to the memcg of the current
process instead of target_memcg.
memalloc_unuse_memcg();
This patch extends the remote charging API by switching to a single
function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
which sets the new value and returns the old one. So a remote charging
block will look like:
old_memcg = set_active_memcg(target_memcg);
<...>
set_active_memcg(old_memcg);
This patch is heavily based on the patch by Johannes Weiner, which can be
found here: https://lkml.org/lkml/2020/5/28/806 .
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Schatzberg <dschatzberg@fb.com>
Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pavel Machek complained that the question about supporting deprecated
XFS v4 comes up even when XFS is disabled. This clearly makes no sense,
so fix Kconfig.
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
|
Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key
in the realtime bitmap is always one less than the number of rt extents,
which means that the key clamp at the start of the function is wrong.
The 4th argument to xfs_rtfind_forw is the highest rt extent that we
want to probe, which means that passing 1 less than the high key is
wrong. Finally, drop the rem variable that controls the loop because we
can compare the iteration point (rtstart) against the high key directly.
The sordid history of this function is that the original commit (fb3c3)
incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
to xfs_rtfind_forw. This was wrong because the "high key" is supposed
to be the largest key for which the caller wants result rows, not the
key for the first row that could possibly be outside the range that the
caller wants to see.
A subsequent attempt (8ad56) to strengthen the parameter checking added
incorrect clamping of the parameters to the number of rt blocks in the
system (despite the bitmap functions all taking units of rt extents) to
avoid querying ranges past the end of rt bitmap file but failed to fix
the incorrect _rtfind_forw parameter. The original _rtfind_forw
parameter error then survived the conversion of the startblock and
blockcount fields to rt extents (a0e5c), and the most recent off-by-one
fix (a3a37) thought it was patching a problem when the end of the rt
volume is not in use, but none of these fixes actually solved the
original problem that the author was confused about the "limit" argument
to xfs_rtfind_forw.
Sadly, all four of these patches were written by this author and even
his own usage of this function and rt testing were inadequate to get
this fixed quickly.
Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap")
Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks")
Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units")
Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
- Improve performance for certain container setups by introducing a
"volatile" mode
- ioctl improvements
- continue preparation for unprivileged overlay mounts
* tag 'ovl-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: use generic vfs_ioc_setflags_prepare() helper
ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories
ovl: rearrange ovl_can_list()
ovl: enumerate private xattrs
ovl: pass ovl_fs down to functions accessing private xattrs
ovl: drop flags argument from ovl_do_setxattr()
ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs
ovl: use ovl_do_getxattr() for private xattr
ovl: fold ovl_getxattr() into ovl_get_redirect_xattr()
ovl: clean up ovl_getxattr() in copy_up.c
duplicate ovl_getxattr()
ovl: provide a mount option "volatile"
ovl: check for incompatible features in work dir
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs updates from David Howells:
"A collection of fixes to fix afs_cell struct refcounting, thereby
fixing a slew of related syzbot bugs:
- Fix the cell tree in the netns to use an rwsem rather than RCU.
There seem to be some problems deriving from the use of RCU and a
seqlock to walk the rbtree, but it's not entirely clear what since
there are several different failures being seen.
Changing things to use an rwsem instead makes it more robust. The
extra performance derived from using RCU isn't necessary in this
case since the only time we're looking up a cell is during mount or
when cells are being manually added.
- Fix the refcounting by splitting the usage counter into a memory
refcount and an active users counter. The usage counter was doing
double duty, keeping track of whether a cell is still in use and
keeping track of when it needs to be destroyed - but this makes the
clean up tricky. Separating these out simplifies the logic.
- Fix purging a cell that has an alias. A cell alias pins the cell
it's an alias of, but the alias is always later in the list. Trying
to purge in a single pass causes rmmod to hang in such a case.
- Fix cell removal. If a cell's manager is requeued whilst it's
removing itself, the manager will run again and re-remove itself,
causing problems in various places. Follow Hillf Danton's
suggestion to insert a more terminal state that causes the manager
to do nothing post-removal.
In additional to the above, two other changes:
- Add a tracepoint for the cell refcount and active users count. This
helped with debugging the above and may be useful again in future.
- Downgrade an assertion to a print when a still-active server is
seen during purging. This was happening as a consequence of
incomplete cell removal before the servers were cleaned up"
* tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Don't assert on unpurgeable server records
afs: Add tracing for cell refcount and active user count
afs: Fix cell removal
afs: Fix cell purging with aliases
afs: Fix cell refcounting by splitting the usage counter
afs: Fix rapid cell addition/removal by not using RCU on cells tree
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added new features such as zone capacity for ZNS
and a new GC policy, ATGC, along with in-memory segment management. In
addition, we could improve the decompression speed significantly by
changing virtual mapping method. Even though we've fixed lots of small
bugs in compression support, I feel that it becomes more stable so
that I could give it a try in production.
Enhancements:
- suport zone capacity in NVMe Zoned Namespace devices
- introduce in-memory current segment management
- add standart casefolding support
- support age threshold based garbage collection
- improve decompression speed by changing virtual mapping method
Bug fixes:
- fix condition checks in some ioctl() such as compression, move_range, etc
- fix 32/64bits support in data structures
- fix memory allocation in zstd decompress
- add some boundary checks to avoid kernel panic on corrupted image
- fix disallowing compression for non-empty file
- fix slab leakage of compressed block writes
In addition, it includes code refactoring for better readability and
minor bug fixes for compression and zoned device support"
* tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
f2fs: code cleanup by removing unnecessary check
f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
f2fs: fix writecount false positive in releasing compress blocks
f2fs: introduce check_swap_activate_fast()
f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case
f2fs: handle errors of f2fs_get_meta_page_nofail
f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode
f2fs: reject CASEFOLD inode flag without casefold feature
f2fs: fix memory alignment to support 32bit
f2fs: fix slab leak of rpages pointer
f2fs: compress: fix to disallow enabling compress on non-empty file
f2fs: compress: introduce cic/dic slab cache
f2fs: compress: introduce page array slab cache
f2fs: fix to do sanity check on segment/section count
f2fs: fix to check segment boundary during SIT page readahead
f2fs: fix uninit-value in f2fs_lookup
f2fs: remove unneeded parameter in find_in_block()
f2fs: fix wrong total_sections check and fsmeta check
f2fs: remove duplicated code in sanity_check_area_boundary
f2fs: remove unused check on version_bitmap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
|
|
Merge more updates from Andrew Morton:
"155 patches.
Subsystems affected by this patch series: mm (dax, debug, thp,
readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
romfs, and fault-injection"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
lib, uaccess: add failure injection to usercopy functions
lib, include/linux: add usercopy failure capability
ROMFS: support inode blocks calculation
ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
sched.h: drop in_ubsan field when UBSAN is in trap mode
scripts/gdb/tasks: add headers and improve spacing format
scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
kernel/relay.c: drop unneeded initialization
panic: dump registers on panic_on_warn
rapidio: fix the missed put_device() for rio_mport_add_riodev
rapidio: fix error handling path
nilfs2: fix some kernel-doc warnings for nilfs2
autofs: harden ioctl table
ramfs: fix nommu mmap with gaps in the page cache
mm: remove the now-unnecessary mmget_still_valid() hack
mm/gup: take mmap_lock in get_dump_page()
binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
coredump: rework elf/elf_fdpic vma_dump_size() into common helper
coredump: refactor page range dumping into common helper
coredump: let dump_emit() bail out on short writes
...
|
|
When use 'stat' tool to display file status, the 'Blocks' field always in
'0', this is not good for tool 'du'(e.g.: busybox 'du'), it always output
'0' size for the files under ROMFS since such tool calculates number of
512B Blocks.
This patch calculates approx. number of 512B blocks based on inode size.
Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200811052606.4243-1-libing.zhou@nokia-sbell.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fixes the following W=1 kernel build warning(s):
fs/nilfs2/bmap.c:378: warning: Excess function parameter 'bhp' description in 'nilfs_bmap_assign'
fs/nilfs2/cpfile.c:907: warning: Excess function parameter 'status' description in 'nilfs_cpfile_change_cpmode'
fs/nilfs2/cpfile.c:946: warning: Excess function parameter 'stat' description in 'nilfs_cpfile_get_stat'
fs/nilfs2/page.c:76: warning: Excess function parameter 'inode' description in 'nilfs_forget_buffer'
fs/nilfs2/sufile.c:563: warning: Excess function parameter 'stat' description in 'nilfs_sufile_get_stat'
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1601386269-2423-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The table of ioctl functions should be marked const in order to put them
in read-only memory, and we should use array_index_nospec() to avoid
speculation disclosing the contents of kernel memory to userspace.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ian Kent <raven@themaw.net>
Link: https://lkml.kernel.org/r/20200818122203.GO17456@casper.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ramfs needs to check that pages are both physically contiguous and
contiguous in the file. If the page cache happens to have, eg, page A for
index 0 of the file, no page for index 1, and page A+1 for index 2, then
an mmap of the first two pages of the file will succeed when it should
fail.
Fixes: 642fb4d1f1dd ("[PATCH] NOMMU: Provide shared-writable mmap support on ramfs")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Link: https://lkml.kernel.org/r/20200914122239.GO6583@casper.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The preceding patches have ensured that core dumping properly takes the
mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
its users.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_lock, and then use that
snapshot instead of walking the VMA list without locking.
An alternative approach would be to keep the mmap_lock held across the
entire core dumping operation; however, keeping the mmap_lock locked while
we may be blocked for an unbounded amount of time (e.g. because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because
someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
something like that.
Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in
the future, they'd probably need it, too), implement this with a common
helper in fs/coredump.c.
A downside of this approach is that we now need a bigger amount of kernel
memory per userspace VMA in the normal ELF case, and that we need O(n)
kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
be terribly bad.
There currently is a data race between stack expansion and anything that
reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
mitigate that for core dumping, take the mmap_lock in write mode when
taking a snapshot of the VMA hierarchy. (If we only took the mmap_lock in
read mode, we could end up with a corrupted core dump if someone does
get_user_pages_remote() concurrently. Not really a major problem, but
taking the mmap_lock either way works here, so we might as well avoid the
issue.) (This doesn't do anything about the existing data races with stack
expansion in other mm code.)
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
different code to figure out which VMAs should be dumped, and if so,
whether the dump should contain the entire VMA or just its first page.
Eliminate duplicate code by reworking the binfmt_elf version into a
generic core dumping helper in coredump.c.
As part of that, change the heuristic for detecting executable/library
header pages to check whether the inode is executable instead of looking
at the file mode.
This is less problematic in terms of locking because it lets us avoid
get_user() under the mmap_sem. (And arguably it looks nicer and makes
more sense in generic code.)
Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
has been written to.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
pages into the coredump file. Extract that logic into a common helper.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
dump_emit() has a retry loop, but there seems to be no way for that retry
logic to actually be used; and it was also buggy, writing the same data
repeatedly after a short write.
Let's just bail out on a short write.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-3-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.
At the moment, we have that rather ugly mmget_still_valid() helper to work
around <https://crbug.com/project-zero/1790>: ELF core dumping doesn't
take the mmap_sem while traversing the task's VMAs, and if anything (like
userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
at the moment we use mmget_still_valid() to bail out in any writers that
might be operating on a remote mm's VMAs.
With this series, I'm trying to get rid of the need for that as cleanly as
possible. ("cleanly" meaning "avoid holding the mmap_lock across
unbounded sleeps".)
Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
dumping code.
Patches 5 and 6 implement the main change: Instead of repeatedly accessing
the VMA list with sleeps in between, we snapshot it at the start with
proper locking, and then later we just use our copy of the VMA list. This
ensures that the kernel won't crash, that VMA metadata in the coredump is
consistent even in the presence of concurrent modifications, and that any
virtual addresses that aren't being concurrently modified have their
contents show up in the core dump properly.
The disadvantage of this approach is that we need a bit more memory during
core dumping for storing metadata about all VMAs.
At the end of the series, patch 7 removes the old workaround for this
issue (mmget_still_valid()).
I have tested:
- Creating a simple core dump on X86-64 still works.
- The created coredump on X86-64 opens in GDB and looks plausible.
- X86-64 core dumps contain the first page for executable mappings at
offset 0, and don't contain the first page for non-executable file
mappings or executable mappings at offset !=0.
- NOMMU 32-bit ARM can still generate plausible-looking core dumps
through the FDPIC implementation. (I can't test this with GDB because
GDB is missing some structure definition for nommu ARM, but I've
poked around in the hexdump and it looked decent.)
This patch (of 7):
dump_emit() is for kernel pointers, and VMAs describe userspace memory.
Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
even if it probably doesn't matter much on !MMU systems - especially given
that it looks like we can just use the same get_dump_page() as on MMU if
we move it out of the CONFIG_MMU block.
One small change we have to make in get_dump_page() is to use
__get_user_pages_locked() instead of __get_user_pages(), since the latter
doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
just call __get_user_pages() for us.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "Selecting Load Addresses According to p_align", v3.
The current ELF loading mechancism provides page-aligned mappings. This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.
While specifying -z,max-page-size=0x200000 to the linker will generate
suitably aligned segments for huge pages on x86_64, the executable needs
to be loaded at a suitably aligned address as well. This alignment
requires the binary's cooperation, as distinct segments need to be
appropriately paddded to be eligible for THP.
For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.
This patch (of 2):
The current ELF loading mechancism provides page-aligned mappings. This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.
For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.
Tested by verifying program with -Wl,-z,max-page-size=0x200000 loading.
[akpm@linux-foundation.org: fix max() warning]
[ckennelly@google.com: augment comment]
Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com
Signed-off-by: Chris Kennelly <ckennelly@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com
Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Drop duplicated words {the, that} in comments.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20200811021826.25032-1-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Define it in the callers instead of in page_cache_ra_unbounded().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Link: https://lkml.kernel.org/r/20200903140844.14194-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The page cache needs to know whether the filesystem supports THPs so that
it doesn't send THPs to filesystems which can't handle them. Dave Chinner
points out that getting from the page mapping to the filesystem type is
too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
information in the address space flags.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Don't give an assertion failure on unpurgeable afs_server records - which
kills the thread - but rather emit a trace line when we are purging a
record (which only happens during network namespace removal or rmmod) and
print a notice of the problem.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Add a tracepoint to log the cell refcount and active user count and pass in
a reason code through various functions that manipulate these counters.
Additionally, a helper function, afs_see_cell(), is provided to log
interesting places that deal with a cell without actually doing any
accounting directly.
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Fix cell removal by inserting a more final state than AFS_CELL_FAILED that
indicates that the cell has been unpublished in case the manager is already
requeued and will go through again. The new AFS_CELL_REMOVED state will
just immediately leave the manager function.
Going through a second time in the AFS_CELL_FAILED state will cause it to
try to remove the cell again, potentially leading to the proc list being
removed.
Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
Reported-by: syzbot+b994ecf2b023f14832c1@syzkaller.appspotmail.com
Reported-by: syzbot+0e0db88e1eb44a91ae8d@syzkaller.appspotmail.com
Reported-by: syzbot+2d0585e5efcd43d113c2@syzkaller.appspotmail.com
Reported-by: syzbot+1ecc2f9d3387f1d79d42@syzkaller.appspotmail.com
Reported-by: syzbot+18d51774588492bf3f69@syzkaller.appspotmail.com
Reported-by: syzbot+a5e4946b04d6ca8fa5f3@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hillf Danton <hdanton@sina.com>
|
|
When the afs module is removed, one of the things that has to be done is to
purge the cell database. afs_cell_purge() cancels the management timer and
then starts the cell manager work item to do the purging. This does a
single run through and then assumes that all cells are now purged - but
this is no longer the case.
With the introduction of alias detection, a later cell in the database can
now be holding an active count on an earlier cell (cell->alias_of). The
purge scan passes by the earlier cell first, but this can't be got rid of
until it has discarded the alias. Ordinarily, afs_unuse_cell() would
handle this by setting the management timer to trigger another pass - but
afs_set_cell_timer() doesn't do anything if the namespace is being removed
(net->live == false). rmmod then hangs in the wait on cells_outstanding in
afs_cell_purge().
Fix this by making afs_set_cell_timer() directly queue the cell manager if
net->live is false. This causes additional management passes.
Queueing the cell manager increments cells_outstanding to make sure the
wait won't complete until all cells are destroyed.
Fixes: 8a070a964877 ("afs: Detect cell aliases 1 - Cells with root volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
Management of the lifetime of afs_cell struct has some problems due to the
usage counter being used to determine whether objects of that type are in
use in addition to whether anyone might be interested in the structure.
This is made trickier by cell objects being cached for a period of time in
case they're quickly reused as they hold the result of a setup process that
may be slow (DNS lookups, AFS RPC ops).
Problems include the cached root volume from alias resolution pinning its
parent cell record, rmmod occasionally hanging and occasionally producing
assertion failures.
Fix this by splitting the count of active users from the struct reference
count. Things then work as follows:
(1) The cell cache keeps +1 on the cell's activity count and this has to
be dropped before the cell can be removed. afs_manage_cell() tries to
exchange the 1 to a 0 with the cells_lock write-locked, and if
successful, the record is removed from the net->cells.
(2) One struct ref is 'owned' by the activity count. That is put when the
active count is reduced to 0 (final_destruction label).
(3) A ref can be held on a cell whilst it is queued for management on a
work queue without confusing the active count. afs_queue_cell() is
added to wrap this.
(4) The queue's ref is dropped at the end of the management. This is
split out into a separate function, afs_manage_cell_work().
(5) The root volume record is put after a cell is removed (at the
final_destruction label) rather then in the RCU destruction routine.
(6) Volumes hold struct refs, but aren't active users.
(7) Both counts are displayed in /proc/net/afs/cells.
There are some management function changes:
(*) afs_put_cell() now just decrements the refcount and triggers the RCU
destruction if it becomes 0. It no longer sets a timer to have the
manager do this.
(*) afs_use_cell() and afs_unuse_cell() are added to increase and decrease
the active count. afs_unuse_cell() sets the management timer.
(*) afs_queue_cell() is added to queue a cell with approprate refs.
There are also some other fixes:
(*) Don't let /proc/net/afs/cells access a cell's vllist if it's NULL.
(*) Make sure that candidate cells in lookups are properly destroyed
rather than being simply kfree'd. This ensures the bits it points to
are destroyed also.
(*) afs_dec_cells_outstanding() is now called in cell destruction rather
than at "final_destruction". This ensures that cell->net is still
valid to the end of the destructor.
(*) As a consequence of the previous two changes, move the increment of
net->cells_outstanding that was at the point of insertion into the
tree to the allocation routine to correctly balance things.
Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
Signed-off-by: David Howells <dhowells@redhat.com>
|
|
There are a number of problems that are being seen by the rapidly mounting
and unmounting an afs dynamic root with an explicit cell and volume
specified (which should probably be rejected, but that's a separate issue):
What the tests are doing is to look up/create a cell record for the name
given and then tear it down again without actually using it to try to talk
to a server. This is repeated endlessly, very fast, and the new cell
collides with the old one if it's not quick enough to reuse it.
It appears (as suggested by Hillf Danton) that the search through the RB
tree under a read_seqbegin_or_lock() under RCU conditions isn't safe and
that it's not blocking the write_seqlock(), despite taking two passes at
it. He suggested that the code should take a ref on the cell it's
attempting to look at - but this shouldn't be necessary until we've
compared the cell names. It's possible that I'm missing a barrier
somewhere.
However, using an RCU search for this is overkill, really - we only need to
access the cell name in a few places, and they're places where we're may
end up sleeping anyway.
Fix this by switching to an R/W semaphore instead.
Additionally, draw the down_read() call inside the function (renamed to
afs_find_cell()) since all the callers were taking the RCU read lock (or
should've been[*]).
[*] afs_probe_cell_name() should have been, but that doesn't appear to be
involved in the bug reports.
The symptoms of this look like:
general protection fault, probably for non-canonical address 0xf27d208691691fdb: 0000 [#1] PREEMPT SMP KASAN
KASAN: maybe wild-memory-access in range [0x93e924348b48fed8-0x93e924348b48fedf]
...
RIP: 0010:strncasecmp lib/string.c:52 [inline]
RIP: 0010:strncasecmp+0x5f/0x240 lib/string.c:43
afs_lookup_cell_rcu+0x313/0x720 fs/afs/cell.c:88
afs_lookup_cell+0x2ee/0x1440 fs/afs/cell.c:249
afs_parse_source fs/afs/super.c:290 [inline]
...
Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
Reported-by: syzbot+459a5dce0b4cb70fd076@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hillf Danton <hdanton@sina.com>
cc: syzkaller-bugs@googlegroups.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
- Add redirect_neigh() BPF packet redirect helper, allowing to limit
stack traversal in common container configs and improving TCP
back-pressure.
Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
- Expand netlink policy support and improve policy export to user
space. (Ge)netlink core performs request validation according to
declared policies. Expand the expressiveness of those policies
(min/max length and bitmasks). Allow dumping policies for particular
commands. This is used for feature discovery by user space (instead
of kernel version parsing or trial and error).
- Support IGMPv3/MLDv2 multicast listener discovery protocols in
bridge.
- Allow more than 255 IPv4 multicast interfaces.
- Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
packets of TCPv6.
- In Multi-patch TCP (MPTCP) support concurrent transmission of data on
multiple subflows in a load balancing scenario. Enhance advertising
addresses via the RM_ADDR/ADD_ADDR options.
- Support SMC-Dv2 version of SMC, which enables multi-subnet
deployments.
- Allow more calls to same peer in RxRPC.
- Support two new Controller Area Network (CAN) protocols - CAN-FD and
ISO 15765-2:2016.
- Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
kernel problem.
- Add TC actions for implementing MPLS L2 VPNs.
- Improve nexthop code - e.g. handle various corner cases when nexthop
objects are removed from groups better, skip unnecessary
notifications and make it easier to offload nexthops into HW by
converting to a blocking notifier.
- Support adding and consuming TCP header options by BPF programs,
opening the doors for easy experimental and deployment-specific TCP
option use.
- Reorganize TCP congestion control (CC) initialization to simplify
life of TCP CC implemented in BPF.
- Add support for shipping BPF programs with the kernel and loading
them early on boot via the User Mode Driver mechanism, hence reusing
all the user space infra we have.
- Support sleepable BPF programs, initially targeting LSM and tracing.
- Add bpf_d_path() helper for returning full path for given 'struct
path'.
- Make bpf_tail_call compatible with bpf-to-bpf calls.
- Allow BPF programs to call map_update_elem on sockmaps.
- Add BPF Type Format (BTF) support for type and enum discovery, as
well as support for using BTF within the kernel itself (current use
is for pretty printing structures).
- Support listing and getting information about bpf_links via the bpf
syscall.
- Enhance kernel interfaces around NIC firmware update. Allow
specifying overwrite mask to control if settings etc. are reset
during update; report expected max time operation may take to users;
support firmware activation without machine reboot incl. limits of
how much impact reset may have (e.g. dropping link or not).
- Extend ethtool configuration interface to report IEEE-standard
counters, to limit the need for per-vendor logic in user space.
- Adopt or extend devlink use for debug, monitoring, fw update in many
drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
dpaa2-eth).
- In mlxsw expose critical and emergency SFP module temperature alarms.
Refactor port buffer handling to make the defaults more suitable and
support setting these values explicitly via the DCBNL interface.
- Add XDP support for Intel's igb driver.
- Support offloading TC flower classification and filtering rules to
mscc_ocelot switches.
- Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
fixed interval period pulse generator and one-step timestamping in
dpaa-eth.
- Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
offload.
- Add Lynx PHY/PCS MDIO module, and convert various drivers which have
this HW to use it. Convert mvpp2 to split PCS.
- Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
7-port Mediatek MT7531 IP.
- Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
and wcn3680 support in wcn36xx.
- Improve performance for packets which don't require much offloads on
recent Mellanox NICs by 20% by making multiple packets share a
descriptor entry.
- Move chelsio inline crypto drivers (for TLS and IPsec) from the
crypto subtree to drivers/net. Move MDIO drivers out of the phy
directory.
- Clean up a lot of W=1 warnings, reportedly the actively developed
subsections of networking drivers should now build W=1 warning free.
- Make sure drivers don't use in_interrupt() to dynamically adapt their
code. Convert tasklets to use new tasklet_setup API (sadly this
conversion is not yet complete).
* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
net, sockmap: Don't call bpf_prog_put() on NULL pointer
bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
bpf, sockmap: Add locking annotations to iterator
netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
net: fix pos incrementment in ipv6_route_seq_next
net/smc: fix invalid return code in smcd_new_buf_create()
net/smc: fix valid DMBE buffer sizes
net/smc: fix use-after-free of delayed events
bpfilter: Fix build error with CONFIG_BPFILTER_UMH
cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
bpf: Fix register equivalence tracking.
rxrpc: Fix loss of final ack on shutdown
rxrpc: Fix bundle counting for exclusive connections
netfilter: restore NF_INET_NUMHOOKS
ibmveth: Identify ingress large send packets.
ibmveth: Switch order of ibmveth_helper calls.
cxgb4: handle 4-tuple PEDIT to NAT mode translation
selftests: Add VRF route leaking tests
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
"The latest advances in computer science from the trivial queue"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
xtensa: fix Kconfig typo
spelling.txt: Remove some duplicate entries
mtd: rawnand: oxnas: cleanup/simplify code
selftests: vm: add fragment CONFIG_GUP_BENCHMARK
perf: Fix opt help text for --no-bpf-event
HID: logitech-dj: Fix spelling in comment
bootconfig: Fix kernel message mentioning CONFIG_BOOT_CONFIG
MAINTAINERS: rectify MMP SUPPORT after moving cputype.h
scif: Fix spelling of EACCES
printk: fix global comment
lib/bitmap.c: fix spello
fs: Fix missing 'bit' in comment
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull direct-io fix from Jan Kara:
"Fix for unaligned direct IO read past EOF in legacy DIO code"
* tag 'dio_for_v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
direct-io: defer alignment check until after the EOF check
direct-io: don't force writeback for reads beyond EOF
direct-io: clean up error paths of do_blockdev_direct_IO
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF, reiserfs, ext2, quota fixes from Jan Kara:
- a couple of UDF fixes for issues found by syzbot fuzzing
- a couple of reiserfs fixes for issues found by syzbot fuzzing
- some minor ext2 cleanups
- quota patches to support grace times beyond year 2038 for XFS quota
APIs
* tag 'fs_for_v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Fix oops during mount
udf: Limit sparing table size
udf: Remove pointless union in udf_inode_info
udf: Avoid accessing uninitialized data on failed inode read
quota: clear padding in v2r1_mem2diskdqb()
reiserfs: Initialize inode keys properly
udf: Fix memory leak when mounting
udf: Remove redundant initialization of variable ret
reiserfs: only call unlock_new_inode() if I_NEW
ext2: Fix some kernel-doc warnings in balloc.c
quota: Expand comment describing d_itimer
quota: widen timestamps for the fs_disk_quota structure
reiserfs: Fix memory leak in reiserfs_parse_options()
udf: Use kvzalloc() in udf_sb_alloc_bitmap()
ext2: remove duplicate include
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of char, misc, and other assorted driver subsystem
patches for 5.10-rc1.
There's a lot of different things in here, all over the drivers/
directory. Some summaries:
- soundwire driver updates
- habanalabs driver updates
- extcon driver updates
- nitro_enclaves new driver
- fsl-mc driver and core updates
- mhi core and bus updates
- nvmem driver updates
- eeprom driver updates
- binder driver updates and fixes
- vbox minor bugfixes
- fsi driver updates
- w1 driver updates
- coresight driver updates
- interconnect driver updates
- misc driver updates
- other minor driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits)
binder: fix UAF when releasing todo list
docs: w1: w1_therm: Fix broken xref, mistakes, clarify text
misc: Kconfig: fix a HISI_HIKEY_USB dependency
LSM: Fix type of id parameter in kernel_post_load_data prototype
misc: Kconfig: add a new dependency for HISI_HIKEY_USB
firmware_loader: fix a kernel-doc markup
w1: w1_therm: make w1_poll_completion static
binder: simplify the return expression of binder_mmap
test_firmware: Test partial read support
firmware: Add request_partial_firmware_into_buf()
firmware: Store opt_flags in fw_priv
fs/kernel_file_read: Add "offset" arg for partial reads
IMA: Add support for file reads without contents
LSM: Add "contents" flag to kernel_read_file hook
module: Call security_kernel_post_load_data()
firmware_loader: Use security_post_load_data()
LSM: Introduce kernel_post_load_data() hook
fs/kernel_read_file: Add file_size output argument
fs/kernel_read_file: Switch buffer size arg to size_t
fs/kernel_read_file: Remove redundant size argument
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core patches for 5.10-rc1
They include a lot of different things, all related to the driver core
and/or some driver logic:
- sysfs common write functions to make it easier to audit sysfs
attributes
- device connection cleanups and fixes
- devm helpers for a few functions
- NOIO allocations for when devices are being removed
- minor cleanups and fixes
All have been in linux-next for a while with no reported issues"
* tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
regmap: debugfs: use semicolons rather than commas to separate statements
platform/x86: intel_pmc_core: do not create a static struct device
drivers core: node: Use a more typical macro definition style for ACCESS_ATTR
drivers core: Use sysfs_emit for shared_cpu_map_show and shared_cpu_list_show
mm: and drivers core: Convert hugetlb_report_node_meminfo to sysfs_emit
drivers core: Miscellaneous changes for sysfs_emit
drivers core: Reindent a couple uses around sysfs_emit
drivers core: Remove strcat uses around sysfs_emit and neaten
drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
dyndbg: use keyword, arg varnames for query term pairs
driver core: force NOIO allocations during unplug
platform_device: switch to simpler IDA interface
driver core: platform: Document return type of more functions
Revert "driver core: Annotate dev_err_probe() with __must_check"
Revert "test_firmware: Test platform fw loading on non-EFI systems"
iio: adc: xilinx-xadc: use devm_krealloc()
hwmon: pmbus: use more devres helpers
devres: provide devm_krealloc()
syscore: Use pm_pr_dbg() for syscore_{suspend,resume}()
...
|
|
Fix data race in prepend_path() with re-reading mnt->mnt_ns twice
without holding the lock.
is_mounted() does check for NULL, but is_anon_ns(mnt->mnt_ns) might
re-read the pointer again which could be NULL already, if in between
reads one of kern_unmount()/kern_unmount_array()/umount_tree() sets
mnt->mnt_ns to NULL.
This is seen in production with the following stack trace:
BUG: kernel NULL pointer dereference, address: 0000000000000048
...
RIP: 0010:prepend_path.isra.4+0x1ce/0x2e0
Call Trace:
d_path+0xe6/0x150
proc_pid_readlink+0x8f/0x100
vfs_readlink+0xf8/0x110
do_readlinkat+0xfd/0x120
__x64_sys_readlinkat+0x1a/0x20
do_syscall_64+0x42/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: f2683bd8d5bd ("[PATCH] fix d_absolute_path() interplay with fsmount()")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull xfs updates from Darrick Wong:
"The biggest changes are two new features for the ondisk metadata: one
to record the sizes of the inode btrees in the AG to increase
redundancy checks and to improve mount times; and a second new feature
to support timestamps until the year 2486.
We also fixed a problem where reflinking into a file that requires
synchronous writes wouldn't actually flush the updates to disk; clean
up a fair amount of cruft; and started fixing some bugs in the
realtime volume code.
Summary:
- Clean up the buffer ioend calling path so that the retry strategy
isn't quite so scattered everywhere.
- Clean up m_sb_bp handling.
- New feature: storing inode btree counts in the AGI to speed up
certain mount time per-AG block reservation operatoins and add a
little more metadata redundancy.
- New feature: Widen inode timestamps and quota grace expiration
timestamps to support dates through the year 2486.
- Get rid of more of our custom buffer allocation API wrappers.
- Use a proper VLA for shortform xattr structure namevals.
- Force the log after reflinking or deduping into a file that is
opened with O_SYNC or O_DSYNC.
- Fix some math errors in the realtime allocator"
* tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (42 commits)
xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size
xfs: make sure the rt allocator doesn't run off the end
xfs: Remove unneeded semicolon
xfs: force the log after remapping a synchronous-writes file
xfs: Convert xfs_attr_sf macros to inline functions
xfs: Use variable-size array for nameval in xfs_attr_sf_entry
xfs: Remove typedef xfs_attr_shortform_t
xfs: remove typedef xfs_attr_sf_entry_t
xfs: Remove kmem_zalloc_large()
xfs: enable big timestamps
xfs: trace timestamp limits
xfs: widen ondisk quota expiration timestamps to handle y2038+
xfs: widen ondisk inode timestamps to deal with y2038+
xfs: redefine xfs_ictimestamp_t
xfs: redefine xfs_timestamp_t
xfs: move xfs_log_dinode_to_disk to the log recovery code
xfs: refactor quota timestamp coding
xfs: refactor default quota grace period setting code
xfs: refactor quota expiration timer modification
xfs: explicitly define inode timestamp range
...
|
|
f2fs_seek_block() is only used for regular file,
so don't have to check inline dentry in it.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
syzkaller found that with CONFIG_DEBUG_KOBJECT_RELEASE=y, unmounting an
f2fs filesystem could result in the following splat:
kobject: 'loop5' ((____ptrval____)): kobject_release, parent 0000000000000000 (delayed 250)
kobject: 'f2fs_xattr_entry-7:5' ((____ptrval____)): kobject_release, parent 0000000000000000 (delayed 750)
------------[ cut here ]------------
ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x98
WARNING: CPU: 0 PID: 699 at lib/debugobjects.c:485 debug_print_object+0x180/0x240
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 699 Comm: syz-executor.5 Tainted: G S 5.9.0-rc8+ #101
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x4d8
show_stack+0x34/0x48
dump_stack+0x174/0x1f8
panic+0x360/0x7a0
__warn+0x244/0x2ec
report_bug+0x240/0x398
bug_handler+0x50/0xc0
call_break_hook+0x160/0x1d8
brk_handler+0x30/0xc0
do_debug_exception+0x184/0x340
el1_dbg+0x48/0xb0
el1_sync_handler+0x170/0x1c8
el1_sync+0x80/0x100
debug_print_object+0x180/0x240
debug_check_no_obj_freed+0x200/0x430
slab_free_freelist_hook+0x190/0x210
kfree+0x13c/0x460
f2fs_put_super+0x624/0xa58
generic_shutdown_super+0x120/0x300
kill_block_super+0x94/0xf8
kill_f2fs_super+0x244/0x308
deactivate_locked_super+0x104/0x150
deactivate_super+0x118/0x148
cleanup_mnt+0x27c/0x3c0
__cleanup_mnt+0x28/0x38
task_work_run+0x10c/0x248
do_notify_resume+0x9d4/0x1188
work_pending+0x8/0x34c
Like the error handling for f2fs_register_sysfs(), we need to wait for
the kobject to be destroyed before returning to prevent a potential
use-after-free.
Fixes: bf9e697ecd42 ("f2fs: expose features to sysfs entry")
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Signed-off-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Pull iomap updates from Darrick Wong:
"There's not a lot of new stuff going on here -- a little bit of code
refactoring to make iomap workable with btrfs' fsync locking model,
cleanups in preparation for adding THP support for filesystems, and
fixing a data corruption issue for blocksize < pagesize filesystems.
Summary:
- Don't WARN_ON weird states that unprivileged users can create.
- Don't invalidate page cache when direct writes want to fall back to
buffered.
- Fix some problems when readahead ios fail.
- Fix a problem where inline data pages weren't getting flushed
during an unshare operation.
- Rework iomap to support arbitrarily many blocks per page in
preparation to support THP for the page cache.
- Fix a bug in the blocksize < pagesize buffered io path where we
could fail to initialize the many-blocks-per-page uptodate bitmap
correctly when the backing page is actually up to date. This could
cause us to forget to write out dirty pages.
- Split out the generic_write_sync at the end of the directio write
path so that btrfs can drop the inode lock before sync'ing the
file.
- Call inode_dio_end before trying to sync the file after a O_DSYNC
direct write (instead of afterwards) to match the behavior of the
old directio code"
* tag 'iomap-5.10-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Call inode_dio_end() before generic_write_sync()
iomap: Allow filesystem to call iomap_dio_complete without i_rwsem
iomap: Set all uptodate bits for an Uptodate page
iomap: Change calling convention for zeroing
iomap: Convert iomap_write_end types
iomap: Convert write_count to write_bytes_pending
iomap: Convert read_count to read_bytes_pending
iomap: Support arbitrarily many blocks per page
iomap: Use bitmap ops to set uptodate bits
iomap: Use kzalloc to allocate iomap_page
fs: Introduce i_blocks_per_page
iomap: Fix misplaced page flushing
iomap: Use round_down/round_up macros in __iomap_write_begin
iomap: Mark read blocks uptodate in write_begin
iomap: Clear page error before beginning a write
iomap: Fix direct I/O write consistency check
iomap: fix WARN_ON_ONCE() from unprivileged users
|
|
virtiofs currently maps various buffers in scatter gather list and it looks
at number of pages (ap->pages) and assumes that same number of pages will
be used both for input and output (sg_count_fuse_req()), and calculates
total number of scatterlist elements accordingly.
But looks like this assumption is not valid in all the cases. For example,
Cai Qian reported that trinity, triggers warning with virtiofs sometimes.
A closer look revealed that if one calls ioctl(fd, 0x5a004000, buf), it
will trigger following warning.
WARN_ON(out_sgs + in_sgs != total_sgs)
In this case, total_sgs = 8, out_sgs=4, in_sgs=3. Number of pages is 2
(ap->pages), but out_sgs are using both the pages but in_sgs are using
only one page. In this case, fuse_do_ioctl() sets different size values
for input and output.
args->in_args[args->in_numargs - 1].size == 6656
args->out_args[args->out_numargs - 1].size == 4096
So current method of calculating how many scatter-gather list elements
will be used is not accurate. Make calculations more precise by parsing
size and ap->descs.
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/linux-fsdevel/5ea77e9f6cb8c2db43b09fbd4158ab2d8c066a0a.camel@redhat.com/
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In current condition check, if it detects writecount, it return -EBUSY
regardless of f_mode of the file. Fixed it.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
check_swap_activate() will lookup block mapping via bmap() one by one, so
its performance is very bad, this patch introduces check_swap_activate_fast()
to use f2fs_fiemap() to boost this process, since f2fs_fiemap() will lookup
block mappings in batch, therefore, it can improve swapon()'s performance
significantly.
Note that this enhancement only works when page size is equal to f2fs' block
size.
Testcase: (backend device: zram)
- touch file
- pin & fallocate file to 8GB
- mkswap file
- swapon file
Before:
real 0m2.999s
user 0m0.000s
sys 0m2.980s
After:
real 0m0.081s
user 0m0.000s
sys 0m0.064s
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|