summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-12-28Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: - large KASAN update to use arm's "software tag-based mode" - a few misc things - sh updates - ocfs2 updates - just about all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits) kernel/fork.c: mark 'stack_vm_area' with __maybe_unused memcg, oom: notify on oom killer invocation from the charge path mm, swap: fix swapoff with KSM pages include/linux/gfp.h: fix typo mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization memory_hotplug: add missing newlines to debugging output mm: remove __hugepage_set_anon_rmap() include/linux/vmstat.h: remove unused page state adjustment macro mm/page_alloc.c: allow error injection mm: migrate: drop unused argument of migrate_page_move_mapping() blkdev: avoid migration stalls for blkdev pages mm: migrate: provide buffer_migrate_page_norefs() mm: migrate: move migrate_page_lock_buffers() mm: migrate: lock buffers before migrate_page_move_mapping() mm: migration: factor out code to compute expected number of page references mm, page_alloc: enable pcpu_drain with zone capability kmemleak: add config to select auto scan mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init ...
2018-12-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "This has been a fairly typical cycle, with the usual sorts of driver updates. Several series continue to come through which improve and modernize various parts of the core code, and we finally are starting to get the uAPI command interface cleaned up. - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5, qib, rxe, usnic - Rework the entire syscall flow for uverbs to be able to run over ioctl(). Finally getting past the historic bad choice to use write() for command execution - More functional coverage with the mlx5 'devx' user API - Start of the HFI1 series for 'TID RDMA' - SRQ support in the hns driver - Support for new IBTA defined 2x lane widths - A big series to consolidate all the driver function pointers into a big struct and have drivers provide a 'static const' version of the struct instead of open coding initialization - New 'advise_mr' uAPI to control device caching/loading of page tables - Support for inline data in SRPT - Modernize how umad uses the driver core and creates cdev's and sysfs files - First steps toward removing 'uobject' from the view of the drivers" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits) RDMA/srpt: Use kmem_cache_free() instead of kfree() RDMA/mlx5: Signedness bug in UVERBS_HANDLER() IB/uverbs: Signedness bug in UVERBS_HANDLER() IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported IB/umad: Start using dev_groups of class IB/umad: Use class_groups and let core create class file IB/umad: Refactor code to use cdev_device_add() IB/umad: Avoid destroying device while it is accessed IB/umad: Simplify and avoid dynamic allocation of class IB/mlx5: Fix wrong error unwind IB/mlx4: Remove set but not used variable 'pd' RDMA/iwcm: Don't copy past the end of dev_name() string IB/mlx5: Fix long EEH recover time with NVMe offloads IB/mlx5: Simplify netdev unbinding IB/core: Move query port to ioctl RDMA/nldev: Expose port_cap_flags2 IB/core: uverbs copy to struct or zero helper IB/rxe: Reuse code which sets port state IB/rxe: Make counters thread safe IB/mlx5: Use the correct commands for UMEM and UCTX allocation ...
2018-12-28Merge tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull aio updates from Jens Axboe: "Flushing out pre-patches for the buffered/polled aio series. Some fixes in here, but also optimizations" * tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block: aio: abstract out io_event filler helper aio: split out iocb copy from io_submit_one() aio: use iocb_put() instead of open coding it aio: only use blk plugs for > 2 depth submissions aio: don't zero entire aio_kiocb aio_get_req() aio: separate out ring reservation from req allocation aio: use assigned completion handler
2018-12-28Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "This is the main pull request for block/storage for 4.21. Larger than usual, it was a busy round with lots of goodies queued up. Most notable is the removal of the old IO stack, which has been a long time coming. No new features for a while, everything coming in this week has all been fixes for things that were previously merged. This contains: - Use atomic counters instead of semaphores for mtip32xx (Arnd) - Cleanup of the mtip32xx request setup (Christoph) - Fix for circular locking dependency in loop (Jan, Tetsuo) - bcache (Coly, Guoju, Shenghui) * Optimizations for writeback caching * Various fixes and improvements - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith) * host and target support for NVMe over TCP * Error log page support * Support for separate read/write/poll queues * Much improved polling * discard OOM fallback * Tracepoint improvements - lightnvm (Hans, Hua, Igor, Matias, Javier) * Igor added packed metadata to pblk. Now drives without metadata per LBA can be used as well. * Fix from Geert on uninitialized value on chunk metadata reads. * Fixes from Hans and Javier to pblk recovery and write path. * Fix from Hua Su to fix a race condition in the pblk recovery code. * Scan optimization added to pblk recovery from Zhoujie. * Small geometry cleanup from me. - Conversion of the last few drivers that used the legacy path to blk-mq (me) - Removal of legacy IO path in SCSI (me, Christoph) - Removal of legacy IO stack and schedulers (me) - Support for much better polling, now without interrupts at all. blk-mq adds support for multiple queue maps, which enables us to have a map per type. This in turn enables nvme to have separate completion queues for polling, which can then be interrupt-less. Also means we're ready for async polled IO, which is hopefully coming in the next release. - Killing of (now) unused block exports (Christoph) - Unification of the blk-rq-qos and blk-wbt wait handling (Josef) - Support for zoned testing with null_blk (Masato) - sx8 conversion to per-host tag sets (Christoph) - IO priority improvements (Damien) - mq-deadline zoned fix (Damien) - Ref count blkcg series (Dennis) - Lots of blk-mq improvements and speedups (me) - sbitmap scalability improvements (me) - Make core inflight IO accounting per-cpu (Mikulas) - Export timeout setting in sysfs (Weiping) - Cleanup the direct issue path (Jianchao) - Export blk-wbt internals in block debugfs for easier debugging (Ming) - Lots of other fixes and improvements" * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits) kyber: use sbitmap add_wait_queue/list_del wait helpers sbitmap: add helpers for add/del wait queue handling block: save irq state in blkg_lookup_create() dm: don't reuse bio for flushes nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1 blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue ...
2018-12-28Merge tag 'y2038-for-4.21' of ↵Linus Torvalds
ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull y2038 updates from Arnd Bergmann: "More syscalls and cleanups This concludes the main part of the system call rework for 64-bit time_t, which has spread over most of year 2018, the last six system calls being - ppoll - pselect6 - io_pgetevents - recvmmsg - futex - rt_sigtimedwait As before, nothing changes for 64-bit architectures, while 32-bit architectures gain another entry point that differs only in the layout of the timespec structure. Hopefully in the next release we can wire up all 22 of those system calls on all 32-bit architectures, which gives us a baseline version for glibc to start using them. This does not include the clock_adjtime, getrusage/waitid, and getitimer/setitimer system calls. I still plan to have new versions of those as well, but they are not required for correct operation of the C library since they can be emulated using the old 32-bit time_t based system calls. Aside from the system calls, there are also a few cleanups here, removing old kernel internal interfaces that have become unused after all references got removed. The arch/sh cleanups are part of this, there were posted several times over the past year without a reaction from the maintainers, while the corresponding changes made it into all other architectures" * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: timekeeping: remove obsolete time accessors vfs: replace current_kernel_time64 with ktime equivalent timekeeping: remove timespec_add/timespec_del timekeeping: remove unused {read,update}_persistent_clock sh: remove board_time_init() callback sh: remove unused rtc_sh_get/set_time infrastructure sh: sh03: rtc: push down rtc class ops into driver sh: dreamcast: rtc: push down rtc class ops into driver y2038: signal: Add compat_sys_rt_sigtimedwait_time64 y2038: signal: Add sys_rt_sigtimedwait_time32 y2038: socket: Add compat_sys_recvmmsg_time64 y2038: futex: Add support for __kernel_timespec y2038: futex: Move compat implementation into futex.c io_pgetevents: use __kernel_timespec pselect6: use __kernel_timespec ppoll: use __kernel_timespec signal: Add restore_user_sigmask() signal: Add set_user_sigmask()
2018-12-28hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate raceMike Kravetz
hugetlbfs page faults can race with truncate and hole punch operations. Current code in the page fault path attempts to handle this by 'backing out' operations if we encounter the race. One obvious omission in the current code is removing a page newly added to the page cache. This is pretty straight forward to address, but there is a more subtle and difficult issue of backing out hugetlb reservations. To handle this correctly, the 'reservation state' before page allocation needs to be noted so that it can be properly backed out. There are four distinct possibilities for reservation state: shared/reserved, shared/no-resv, private/reserved and private/no-resv. Backing out a reservation may require memory allocation which could fail so that needs to be taken into account as well. Instead of writing the required complicated code for this rare occurrence, just eliminate the race. i_mmap_rwsem is now held in read mode for the duration of page fault processing. Hold i_mmap_rwsem longer in truncation and hold punch code to cover the call to remove_inode_hugepages. With this modification, code in remove_inode_hugepages checking for races becomes 'dead' as it can not longer happen. Remove the dead code and expand comments to explain reasoning. Similarly, checks for races with truncation in the page fault path can be simplified and removed. [mike.kravetz@oracle.com: incorporat suggestions from Kirill] Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: migrate: drop unused argument of migrate_page_move_mapping()Jan Kara
All callers of migrate_page_move_mapping() now pass NULL for 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28blkdev: avoid migration stalls for blkdev pagesJan Kara
Currently, block device pages don't provide a ->migratepage callback and thus fallback_migrate_page() is used for them. This handler cannot deal with dirty pages in async mode and also with the case a buffer head is in the LRU buffer head cache (as it has elevated b_count). Thus such page can block memory offlining. Fix the problem by using buffer_migrate_page_norefs() for migrating block device pages. That function takes care of dropping bh LRU in case migration would fail due to elevated buffer refcount to avoid stalls and can also migrate dirty pages without writing them. Link: http://lkml.kernel.org/r/20181211172143.7358-6-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28userfaultfd: clear flag if remap event not enabledPeter Xu
When the process being tracked does mremap() without UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle, we should not generate the remap event, and at the same time we should clear all the uffd flags on the new VMA. Without this patch, we can still have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even the fault handling process does not even know the existance of the VMA. Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Pravin Shedge <pravin.shedge4linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, proc: report PR_SET_THP_DISABLE in procMichal Hocko
David Rientjes has reported that commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") has changed the way how we report THPable VMAs to the userspace. Their monitoring tool is triggering false alarms on PR_SET_THP_DISABLE tasks because it considers an insufficient THP usage as a memory fragmentation resp. memory pressure issue. Before the said commit each newly created VMA inherited VM_NOHUGEPAGE flag and that got exposed to the userspace via /proc/<pid>/smaps file. This implementation had its downsides as explained in the commit message but it is true that the userspace doesn't have any means to query for the process wide THP enabled/disabled status. PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to export in the process wide context rather than per-vma. Introduce a new field to /proc/<pid>/status which export this status. If PR_SET_THP_DISABLE is used then it reports false same as when the THP is not compiled in. It doesn't consider the global THP status because we already export that information via sysfs Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, thp, proc: report THP eligibility for each vmaMichal Hocko
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end calls v2Jérôme Glisse
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28/proc/kpagecount: return 0 for special pages that are never mappedAnthony Yznaga
Certain pages that are never mapped to userspace have a type indicated in the page_type field of their struct pages (e.g. PG_buddy). page_type overlaps with _mapcount so set the count to 0 and avoid calling page_mapcount() for these pages. [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox] Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28userfaultfd: convert userfaultfd_ctx::refcount to refcount_tEric Biggers
Reference counters should use refcount_t rather than atomic_t, since the refcount_t implementation can prevent overflows, reducing the exploitability of reference leak bugs. userfaultfd_ctx::refcount is a reference counter with the usual semantics, so convert it to refcount_t. Note: I replaced the BUG() on incrementing a 0 refcount with just refcount_inc(), since part of the semantics of refcount_t is that that incrementing a 0 refcount is not allowed; with CONFIG_REFCOUNT_FULL, refcount_inc() already checks for it and warns. Link: http://lkml.kernel.org/r/20181115003916.63381-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: reference totalram_pages and managed_pages once per functionArun KS
Patch series "mm: convert totalram_pages, totalhigh_pages and managed pages to atomic", v5. This series converts totalram_pages, totalhigh_pages and zone->managed_pages to atomic variables. totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better to remove the lock and convert variables to atomic. With the change, preventing poteintial store-to-read tearing comes as a bonus. This patch (of 4): This is in preparation to a later patch which converts totalram_pages and zone->managed_pages to atomic variables. Please note that re-reading the value might lead to a different value and as such it could lead to unexpected behavior. There are no known bugs as a result of the current code but it is better to prevent from them in principle. Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ocfs2: don't clear bh uptodate for block readJunxiao Bi
For sync io read in ocfs2_read_blocks_sync(), first clear bh uptodate flag and submit the io, second wait io done, last check whether bh uptodate, if not return io error. If two sync io for the same bh were issued, it could be the first io done and set uptodate flag, but just before check that flag, the second io came in and cleared uptodate, then ocfs2_read_blocks_sync() for the first io will return IO error. Indeed it's not necessary to clear uptodate flag, as the io end handler end_buffer_read_sync() will set or clear it based on io succeed or failed. The following message was found from a nfs server but the underlying storage returned no error. [4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5 [4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5 [4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5 [4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5 [4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5 Same issue in non sync read ocfs2_read_blocks(), fixed it as well. Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ocfs2: clear journal dirty flag after shutdown journalJunxiao Bi
Dirty flag of the journal should be cleared at the last stage of umount, if do it before jbd2_journal_destroy(), then some metadata in uncommitted transaction could be lost due to io error, but as dirty flag of journal was already cleared, we can't find that until run a full fsck. This may cause system panic or other corruption. Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ocfs2: fix panic due to unrecovered local allocJunxiao Bi
mount.ocfs2 ignore the inconsistent error that journal is clean but local alloc is unrecovered. After mount, local alloc not empty, then reserver cluster didn't alloc a new local alloc window, reserveration map is empty(ocfs2_reservation_map.m_bitmap_len = 0), that triggered the following panic. This issue was reported at https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html and was advised to fixed during mount. But this is a very unusual inconsistent state, usually journal dirty flag should be cleared at the last stage of umount until every other things go right. We may need do further debug to check that. Any way to avoid possible futher corruption, mount should be abort and fsck should be run. (mount.ocfs2,1765,1):ocfs2_load_local_alloc:353 ERROR: Local alloc hasn't been recovered! found = 6518, set = 6518, taken = 8192, off = 15912372 ocfs2: Mounting device (202,64) on (node 0, slot 3) with ordered data mode. o2dlm: Joining domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 8 ) 8 nodes ocfs2: Mounting device (202,80) on (node 0, slot 3) with ordered data mode. o2hb: Region 89CEAC63CC4F4D03AC185B44E0EE0F3F (xvdf) is now a quorum device o2net: Accepted connection from node yvwsoa17p (num 7) at 172.22.77.88:7777 o2dlm: Node 7 joins domain 64FE421C8C984E6D96ED12C55FEE2435 ( 0 1 2 3 4 5 6 7 8 ) 9 nodes o2dlm: Node 7 joins domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 7 8 ) 9 nodes ------------[ cut here ]------------ kernel BUG at fs/ocfs2/reservations.c:507! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd grace ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ovmapi ppdev parport_pc parport xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea acpi_cpufreq pcspkr i2c_piix4 i2c_core sg ext4 jbd2 mbcache2 sr_mod cdrom xen_blkfront pata_acpi ata_generic ata_piix floppy dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 4349 Comm: startWebLogic.s Not tainted 4.1.12-124.19.2.el6uek.x86_64 #2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 09/06/2018 task: ffff8803fb04e200 ti: ffff8800ea4d8000 task.ti: ffff8800ea4d8000 RIP: 0010:[<ffffffffa05e96a8>] [<ffffffffa05e96a8>] __ocfs2_resv_find_window+0x498/0x760 [ocfs2] Call Trace: ocfs2_resmap_resv_bits+0x10d/0x400 [ocfs2] ocfs2_claim_local_alloc_bits+0xd0/0x640 [ocfs2] __ocfs2_claim_clusters+0x178/0x360 [ocfs2] ocfs2_claim_clusters+0x1f/0x30 [ocfs2] ocfs2_convert_inline_data_to_extents+0x634/0xa60 [ocfs2] ocfs2_write_begin_nolock+0x1c6/0x1da0 [ocfs2] ocfs2_write_begin+0x13e/0x230 [ocfs2] generic_perform_write+0xbf/0x1c0 __generic_file_write_iter+0x19c/0x1d0 ocfs2_file_write_iter+0x589/0x1360 [ocfs2] __vfs_write+0xb8/0x110 vfs_write+0xa9/0x1b0 SyS_write+0x46/0xb0 system_call_fastpath+0x18/0xd7 Code: ff ff 8b 75 b8 39 75 b0 8b 45 c8 89 45 98 0f 84 e5 fe ff ff 45 8b 74 24 18 41 8b 54 24 1c e9 56 fc ff ff 85 c0 0f 85 48 ff ff ff <0f> 0b 48 8b 05 cf c3 de ff 48 ba 00 00 00 00 00 00 00 10 48 85 RIP __ocfs2_resv_find_window+0x498/0x760 [ocfs2] RSP <ffff8800ea4db668> ---[ end trace 566f07529f2edf3c ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ocfs2: improve ocfs2 MakefileLarry Chen
Included file path was hard-wired in the ocfs2 makefile, which might causes some confusion when compiling ocfs2 as an external module. Say if we compile ocfs2 module as following. cp -r /kernel/tree/fs/ocfs2 /other/dir/ocfs2 cd /other/dir/ocfs2 make -C /path/to/kernel_source M=`pwd` modules Acutally, the compiler wil try to find included file in /kernel/tree/fs/ocfs2, rather than the directory /other/dir/ocfs2. To fix this little bug, we introduce the var $(src) provided by kbuild. $(src) means the absolute path of the running kbuild file. Link: http://lkml.kernel.org/r/20181108085546.15149-1-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ocfs2: remove set but not used variable 'lastzero'zhong jiang
lastzero is not used after setting its value. It is safe to remove the unused variable. Link: http://lkml.kernel.org/r/1540296942-24533-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ocfs2: dlmfs: remove set but not used variable 'status'zhong jiang
status is not used after setting its value. It is safe to remove the unused variable. Link: http://lkml.kernel.org/r/1540300179-26697-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28ocfs2: optimize the reading of heartbeat dataJia Guo
Reading heartbeat data from lowest node rather than from zero, in cases where the node is not defined from zero, can reduce the number of sectors read. Here is a simple test data obtained with 'iostat -dmx dm-5 2', with two nodes in the cluster, node number 10, 20, respectively. Before optimization: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-5 0.00 0.00 0.50 0.50 0.01 0.00 11.00 0.00 1.00 1.00 1.00 1.50 0.15 After the optimization: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-5 0.00 0.00 0.50 0.50 0.00 0.00 6.00 0.00 0.50 1.00 0.00 0.50 0.05 Link: http://lkml.kernel.org/r/99fe4988-69ac-3615-a218-3042fe6fbe72@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-27Merge tag 'locks-v4.21-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "The main change in this set is Neil Brown's work to reduce the thundering herd problem when a heavily-contended file lock is released. Previously we'd always wake up all waiters when this occurred. With this set, we'll now we only wake up waiters that were blocked on the range being released" * tag 'locks-v4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: Use inode_is_open_for_write fs/locks: remove unnecessary white space. fs/locks: merge posix_unblock_lock() and locks_delete_block() fs/locks: create a tree of dependent requests. fs/locks: change all *_conflict() functions to return bool. fs/locks: always delete_block after waiting. fs/locks: allow a lock request to block other requests. fs/locks: use properly initialized file_lock when unlocking. ocfs2: properly initial file_lock used for unlock. gfs2: properly initial file_lock used for unlock. NFS: use locks_copy_lock() to copy locks. fs/locks: split out __locks_wake_up_blocks(). fs/locks: rename some lists and pointers.
2018-12-27Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "All cleanups and bug fixes; most notably, fix some problems discovered in ext4's NFS support, and fix an ioctl (EXT4_IOC_GROUP_ADD) used by old versions of e2fsprogs which we accidentally broke a while back. Also fixed some error paths in ext4's quota and inline data support. Finally, improve tail latency in jbd2's commit code" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: check for shutdown and r/o file system in ext4_write_inode() ext4: force inode writes when nfsd calls commit_metadata() ext4: avoid declaring fs inconsistent due to invalid file handles ext4: include terminating u32 in size of xattr entries when expanding inodes ext4: compare old and new mode before setting update_mode flag ext4: fix EXT4_IOC_GROUP_ADD ioctl ext4: hard fail dax mount on unsupported devices jbd2: update locking documentation for transaction_t ext4: remove redundant condition check jbd2: clean up indentation issue, replace spaces with tab ext4: clean up indentation issues, remove extraneous tabs ext4: missing unlock/put_page() in ext4_try_to_write_inline_data() ext4: fix possible use after free in ext4_quota_enable jbd2: avoid long hold times of j_state_lock while committing a transaction ext4: add ext4_sb_bread() to disambiguate ENOMEM cases
2018-12-27Merge tag 'iomap-4.21-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap update from Darrick Wong: "Fix a memory overflow bug for blocksize < pagesize" * tag 'iomap-4.21-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: don't search past page end in iomap_is_partially_uptodate
2018-12-27Merge tag 'xfs-4.21-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull XFS updates from Darrick Wong: - Fix CoW remapping of extremely fragmented file areas - Fix a zero-length symlink verifier error - Constify some of the rmap owner structures for per-AG metadata - Precalculate inode geometry for later use - Fix scrub counting problems - Don't crash when rtsummary inode is null - Fix x32 ioctl operation - Fix enum->string mappings for ftrace output - Cache realtime summary information in memory * tag 'xfs-4.21-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (24 commits) xfs: reallocate realtime summary cache on growfs xfs: stringify scrub types in ftrace output xfs: stringify btree cursor types in ftrace output xfs: move XFS_INODE_FORMAT_STR mappings to libxfs xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfs xfs: fix symbolic enum printing in ftrace output xfs: fix function pointer type in ftrace format xfs: Fix x32 ioctls when cmd numbers differ from ia32. xfs: Fix bulkstat compat ioctls on x32 userspace. xfs: Align compat attrlist_by_handle with native implementation. xfs: require both realtime inodes to mount xfs: cache minimum realtime summary level xfs: count inode blocks correctly in inobt scrub xfs: precalculate cluster alignment in inodes and blocks xfs: precalculate inodes and blocks per inode cluster xfs: add a block to inode count converter xfs: remove xfs_rmap_ag_owner and friends xfs: const-ify xfs_owner_info arguments xfs: streamline defer op type handling xfs: idiotproof defer op type configuration ...
2018-12-27Merge tag 'fs_for_4.21-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, udf, and quota update from Jan Kara: "Some ext2 cleanups, a fix for UDF crash on corrupted media, and one quota locking fix" * tag 'fs_for_4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. udf: Fix BUG on corrupted inode ext2: change reusable parameter to true when calling mb_cache_entry_create() ext2: remove redundant condition check ext2: avoid unnecessary operation in ext2_error()
2018-12-27Merge tag 'fsnotify_for_v4.21-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "Support for new FAN_OPEN_EXEC event and couple of cleanups around fsnotify" * tag 'fsnotify_for_v4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: Use inode_is_open_for_write fanotify: Make sure to check event_len when copying fsnotify/fdinfo: include fdinfo.h for inotify_show_fdinfo() fanotify: introduce new event mask FAN_OPEN_EXEC_PERM fsnotify: refactor fsnotify_parent()/fsnotify() paired calls when event is on path fanotify: introduce new event mask FAN_OPEN_EXEC fanotify: return only user requested event types in event mask
2018-12-27Merge tag 'dlm-4.21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set is entirely trivial fixes, mainly around correct cleanup on error paths and improved error checks. One patch adds scheduling in a potentially long recovery loop" * tag 'dlm-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: fix invalid cluster name warning dlm: NULL check before some freeing functions is not needed dlm: NULL check before kmem_cache_destroy is not needed dlm: fix missing idr_destroy for recover_idr dlm: memory leaks on error path in dlm_user_request() dlm: lost put_lkb on error path in receive_convert() and receive_unlock() dlm: possible memory leak on error path in create_lkb() dlm: fixed memory leaks after failed ls_remove_names allocation dlm: fix possible call to kfree() for non-initialized pointer dlm: Don't swamp the CPU with callbacks queued during recovery dlm: don't leak kernel pointer to userspace dlm: don't allow zero length names dlm: fix invalid free
2018-12-27Merge tag 'for-4.21-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - swapfile support - after a long time it's here, with some limitations where COW design does not work well with the swap implementation (nodatacow file, no compression, cannot be snapshotted, not possible on multiple devices, ...), as this is the most restricted but working setup, we'll try to improve that in the future - metadata uuid - an optional incompat feature to assign a new filesystem UUID without overwriting all metadata blocks, stored only in superblock - more balance messages are printed to system log, initial is in the format of the command line that would be used to start it Fixes: - tag pages of a snapshot to better separate pages that are involved in the snapshot (and need to get synced) from newly dirtied pages that could slow down or even livelock the snapshot operation - improved check of filesystem id associated with a device during scan to detect duplicate devices that could be mixed up during mount - fix device replace state transitions, eg. when it ends up interrupted and reboot tries to restart balance too, or when start/cancel ioctls race - fix a crash due to a race when quotas are enabled during snapshot creation - GFP_NOFS/memalloc_nofs_* fixes due to GFP_KERNEL allocations in transaction context - fix fsync of files with multiple hard links in new directories - fix race of send with transaction commits that create snapshots Core changes: - cleanups: * further removals of now-dead fsync code * core function for finding free extent has been split and provides a base for further cleanups to make the logic more understandable * removed lot of indirect callbacks for data and metadata inodes * simplified refcounting and locking for cloned extent buffers * removed redundant function arguments * defines converted to enums where appropriate - separate reserve for delayed refs from global reserve, update logic to do less trickery and ad-hoc heuristics, move out some related expensive operations from transaction commit or file truncate - dev-replace switched from custom locking scheme to semaphore - remove first phase of balance that tried to make some space for the relocation by calling shrink and grow, this did not work as expected and only introduced more error states due to potential resize failures, slightly improves the runtime as the chunks on all devices are not needlessly enumerated - clone and deduplication now use generic helper that adds a few more checks that were missing from the original btrfs implementation of the ioctls" * tag 'for-4.21-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (125 commits) btrfs: Fix typos in comments and strings btrfs: improve error handling of btrfs_add_link Btrfs: use generic_remap_file_range_prep() for cloning and deduplication btrfs: Refactor main loop in extent_readpages btrfs: Remove 1st shrink/grow phase from balance Btrfs: send, fix race with transaction commits that create snapshots Btrfs: use nofs context when initializing security xattrs to avoid deadlock btrfs: run delayed items before dropping the snapshot btrfs: catch cow on deleting snapshots btrfs: extent-tree: cleanup one-shot usage of @blocksize in do_walk_down Btrfs: scrub, move setup of nofs contexts higher in the stack btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex btrfs: scrub: pass fs_info to scrub_setup_ctx btrfs: fix truncate throttling btrfs: don't run delayed refs in the end transaction logic btrfs: rework btrfs_check_space_for_delayed_refs btrfs: add new flushing states for the delayed refs rsv btrfs: update may_commit_transaction to use the delayed refs rsv btrfs: introduce delayed_refs_rsv btrfs: only track ref_heads in delayed_ref_updates ...
2018-12-27Merge tag 'gfs2-4.21.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Bob Peterson: - Enhancements and performance improvements to journal replay (Abhi Das) - Cleanup of gfs2_is_ordered and gfs2_is_writeback (Andreas Gruenbacher) - Fix a potential double-free in inode creation (Andreas Gruenbacher) - Fix the bitmap search loop that was searching too far (Andreas Gruenbacher) - Various cleanups (Andreas Gruenbacher, Bob Peterson) - Implement Steve Whitehouse's patch to dump nrpages for inodes (Bob Peterson) - Fix a withdraw bug where stuffed journaled data files didn't allocate enough journal space to be grown (Bob Peterson) * tag 'gfs2-4.21.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: take jdata unstuff into account in do_grow gfs2: Dump nrpages for inodes and their glocks gfs2: Fix loop in gfs2_rbm_find gfs2: Get rid of potential double-freeing in gfs2_create_inode gfs2: Remove vestigial bd_ops gfs2: read journal in large chunks to locate the head gfs2: add a helper function to get_log_header that can be used elsewhere gfs2: changes to gfs2_log_XXX_bio gfs2: add more timing info to journal recovery process gfs2: Fix the gfs2_invalidatepage description gfs2: Clean up gfs2_is_{ordered,writeback}
2018-12-27Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add 1472-byte test to tcrypt for IPsec - Reintroduced crypto stats interface with numerous changes - Support incremental algorithm dumps Algorithms: - Add xchacha12/20 - Add nhpoly1305 - Add adiantum - Add streebog hash - Mark cts(cbc(aes)) as FIPS allowed Drivers: - Improve performance of arm64/chacha20 - Improve performance of x86/chacha20 - Add NEON-accelerated nhpoly1305 - Add SSE2 accelerated nhpoly1305 - Add AVX2 accelerated nhpoly1305 - Add support for 192/256-bit keys in gcmaes AVX - Add SG support in gcmaes AVX - ESN for inline IPsec tx in chcr - Add support for CryptoCell 703 in ccree - Add support for CryptoCell 713 in ccree - Add SM4 support in ccree - Add SM3 support in ccree - Add support for chacha20 in caam/qi2 - Add support for chacha20 + poly1305 in caam/jr - Add support for chacha20 + poly1305 in caam/qi2 - Add AEAD cipher support in cavium/nitrox" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (130 commits) crypto: skcipher - remove remnants of internal IV generators crypto: cavium/nitrox - Fix build with !CONFIG_DEBUG_FS crypto: salsa20-generic - don't unnecessarily use atomic walk crypto: skcipher - add might_sleep() to skcipher_walk_virt() crypto: x86/chacha - avoid sleeping under kernel_fpu_begin() crypto: cavium/nitrox - Added AEAD cipher support crypto: mxc-scc - fix build warnings on ARM64 crypto: api - document missing stats member crypto: user - remove unused dump functions crypto: chelsio - Fix wrong error counter increments crypto: chelsio - Reset counters on cxgb4 Detach crypto: chelsio - Handle PCI shutdown event crypto: chelsio - cleanup:send addr as value in function argument crypto: chelsio - Use same value for both channel in single WR crypto: chelsio - Swap location of AAD and IV sent in WR crypto: chelsio - remove set but not used variable 'kctx_len' crypto: ux500 - Use proper enum in hash_set_dma_transfer crypto: ux500 - Use proper enum in cryp_set_dma_transfer crypto: aesni - Add scatter/gather avx stubs, and use them in C crypto: aesni - Introduce partial block macro ..
2018-12-27Merge tag 'pstore-v4.21-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "Improvements and refactorings: - Improve compression handling - Refactor argument handling during initialization - Avoid needless locking for saner EFI backend handling - Add more kern-doc and improve debugging output" * tag 'pstore-v4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Avoid NULL deref in ftrace merging failure path pstore: Convert buf_lock to semaphore pstore: Fix bool initialization/comparison pstore/ram: Do not treat empty buffers as valid pstore/ram: Simplify ramoops_get_next_prz() arguments pstore: Map PSTORE_TYPE_* to strings pstore: Replace open-coded << with BIT() pstore: Improve and update some comments and status output pstore/ram: Add kern-doc for struct persistent_ram_zone pstore/ram: Report backend assignments with finer granularity pstore/ram: Standardize module name in ramoops pstore: Avoid duplicate call of persistent_ram_zap() pstore: Remove needless lock during console writes pstore: Do not use crash buffer for decompression
2018-12-26Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The biggest RCU changes in this cycle were: - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar. - Replace calls of RCU-bh and RCU-sched update-side functions to their vanilla RCU counterparts. This series is a step towards complete removal of the RCU-bh and RCU-sched update-side functions. ( Note that some of these conversions are going upstream via their respective maintainers. ) - Documentation updates, including a number of flavor-consolidation updates from Joel Fernandes. - Miscellaneous fixes. - Automate generation of the initrd filesystem used for rcutorture testing. - Convert spin_is_locked() assertions to instead use lockdep. ( Note that some of these conversions are going upstream via their respective maintainers. ) - SRCU updates, especially including a fix from Dennis Krein for a bag-on-head-class bug. - RCU torture-test updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits) rcutorture: Don't do busted forward-progress testing rcutorture: Use 100ms buckets for forward-progress callback histograms rcutorture: Recover from OOM during forward-progress tests rcutorture: Print forward-progress test age upon failure rcutorture: Print time since GP end upon forward-progress failure rcutorture: Print histogram of CB invocation at OOM time rcutorture: Print GP age upon forward-progress failure rcu: Print per-CPU callback counts for forward-progress failures rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings rcutorture: Dump grace-period diagnostics upon forward-progress OOM rcutorture: Prepare for asynchronous access to rcu_fwd_startat torture: Remove unnecessary "ret" variables rcutorture: Affinity forward-progress test to avoid housekeeping CPUs rcutorture: Break up too-long rcu_torture_fwd_prog() function rcutorture: Remove cbflood facility torture: Bring any extra CPUs online during kernel startup rcutorture: Add call_rcu() flooding forward-progress tests rcutorture/formal: Replace synchronize_sched() with synchronize_rcu() tools/kernel.h: Replace synchronize_sched() with synchronize_rcu() net/decnet: Replace rcu_barrier_bh() with rcu_barrier() ...
2018-12-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-nextLinus Torvalds
Pull sparc updates from David Miller: - Automatic system call table generation, from Firoz Khan. - Clean up accesses to the OF device names by using full_name instead of path_component_name. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next: ALSA: sparc: Use of_node_name_eq for node name comparisons sbus: Use of_node_name_eq for node name comparisons sparc: generate uapi header and system call table files sparc: add system call table generation support sparc: add __NR_syscalls along with NR_syscalls sparc: move __IGNORE* entries to non uapi header sparc: Use DT node full_name instead of name for resources sparc: Remove unused leon_trans_init sparc: Use device_type helpers to access the node type sparc: Use of_node_name_eq for node name comparisons sparc: Convert to using %pOFn instead of device_node.name sparc: prom: use property "name" directly to construct node names of: Drop full path from full_name for PDT systems sparc: Convert to using %pOF instead of full_name fs/openpromfs: Use of_node_name_eq for node name comparisons fs/openpromfs: use full_name instead of path_component_name
2018-12-25Merge tag 'mtd/for-4.21' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull mtd updates from Boris Brezillon: "SPI NOR Core changes: - Parse the 4BAIT SFDP section - Add a bunch of SPI NOR entries to the flash_info table - Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F - A bunch of minor cleanups/comestic changes NAND core changes: - kernel-doc miscellaneous fixes. - Third batch of fixes/cleanup to the raw NAND core impacting various controller drivers (ams-delta, marvell, fsmc, denali, tegra, vf610): * Stop to pass mtd_info objects to internal functions * Reorganize code to avoid forward declarations * Drop useless test in nand_legacy_set_defaults() * Move nand_exec_op() to internal.h * Add nand_[de]select_target() helpers * Pass the CS line to be selected in struct nand_operation * Make ->select_chip() optional when ->exec_op() is implemented * Deprecate the ->select_chip() hook * Move the ->exec_op() method to nand_controller_ops * Move ->setup_data_interface() to nand_controller_ops * Deprecate the dummy_controller field * Fix JEDEC detection * Provide a helper for polling GPIO R/B pin Raw NAND chip drivers changes: - Macronix: * Flag 1.8V AC chips with a broken GET_FEATURES(TIMINGS) Raw NAND controllers drivers changes: - Ams-delta: * Fix the error path * SPDX tag added * May be compiled with COMPILE_TEST=y * Conversion to ->exec_op() interface * Drop .IOADDR_R/W use * Use GPIO API for data I/O - Denali: * Remove denali_reset_banks() * Remove ->dev_ready() hook * Include <linux/bits.h> instead of <linux/bitops.h> * Changes to comply with the above fixes/cleanup done in the core. - FSMC: * Add an SPDX tag to replace the license text * Make conversion from chip to fsmc consistent * Fix unchecked return value in fsmc_read_page_hwecc * Changes to comply with the above fixes/cleanup done in the core. - Marvell: * Prevent timeouts on a loaded machine (fix) * Changes to comply with the above fixes/cleanup done in the core. - OMAP2: * Pass the parent of pdev to dma_request_chan() (fix) - R852: * Use generic DMA API - sh_flctl: * Convert to SPDX identifiers - Sunxi: * Write pageprog related opcodes to the right register: WCMD_SET (fix) - Tegra: * Stop implementing ->select_chip() - VF610: * Add an SPDX tag to replace the license text * Changes to comply with the above fixes/cleanup done in the core. - Various trivial/spelling/coding style fixes. SPI-NAND drivers changes: - Remove the depreacated mt29f_spinand driver from staging. - Add support for: * Toshiba TC58CVG2S0H * GigaDevice GD5FxGQ4xA * Winbond W25N01GV JFFS2 changes: - Fix a lockdep issue MTD changes: - Rework the physmap driver to merge gpio-addr-flash and physmap_of in it - Add a new compatible for RedBoot partitions - Make sub-partitions RW if the parent partition was RO because of a mis-alignment - Add pinctrl support to the - Addition of /* fall-through */ comments where appropriate - Various minor fixes and cleanups Other changes: - Update my email address" * tag 'mtd/for-4.21' of git://git.infradead.org/linux-mtd: (108 commits) mtd: rawnand: sunxi: Write pageprog related opcodes to WCMD_SET MAINTAINERS: Update my email address mtd: rawnand: marvell: prevent timeouts on a loaded machine mtd: rawnand: omap2: Pass the parent of pdev to dma_request_chan() mtd: rawnand: Fix JEDEC detection mtd: spi-nor: Add support for is25lp016d mtd: spi-nor: parse SFDP 4-byte Address Instruction Table mtd: spi-nor: Add 4B_OPCODES flag to is25lp256 mtd: spi-nor: Add an SPDX tag to spi-nor.{c,h} mtd: spi-nor: Make the enable argument passed to set_byte() a bool mtd: spi-nor: Stop passing flash_info around mtd: spi-nor: Avoid forward declaration of internal functions mtd: spi-nor: Drop inline on all internal helpers mtd: spi-nor: Add a post BFPT fixup for MX25L25635E mtd: spi-nor: Add a post BFPT parsing fixup hook mtd: spi-nor: Add the SNOR_F_4B_OPCODES flag mtd: spi-nor: cast to u64 to avoid uint overflows mtd: spi-nor: Add support for IS25LP032/064 mtd: spi-nor: add entry for mt35xu512aba flash mtd: spi-nor: add macros related to MICRON flash ...
2018-12-25Merge tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm updates from Dave Airlie: "Core: - shared fencing staging removal - drop transactional atomic helpers and move helpers to new location - DP/MST atomic cleanup - Leasing cleanups and drop EXPORT_SYMBOL - Convert drivers to atomic helpers and generic fbdev. - removed deprecated obj_ref/unref in favour of get/put - Improve dumb callback documentation - MODESET_LOCK_BEGIN/END helpers panels: - CDTech panels, Banana Pi Panel, DLC1010GIG, - Olimex LCD-O-LinuXino, Samsung S6D16D0, Truly NT35597 WQXGA, - Himax HX8357D, simulated RTSM AEMv8. - GPD Win2 panel - AUO G101EVN010 vgem: - render node support ttm: - move global init out of drivers - fix LRU handling for ghost objects - Support for simultaneous submissions to multiple engines scheduler: - timeout/fault handling changes to help GPU recovery - helpers for hw with preemption support i915: - Scaler/Watermark fixes - DP MST + powerwell fixes - PSR fixes - Break long get/put shmemfs pages - Icelake fixes - Icelake DSI video mode enablement - Engine workaround improvements amdgpu: - freesync support - GPU reset enabled on CI, VI, SOC15 dGPUs - ABM support in DC - KFD support for vega12/polaris12 - SDMA paging queue on vega - More amdkfd code sharing - DCC scanout on GFX9 - DC kerneldoc - Updated SMU firmware for GFX8 chips - XGMI PSP + hive reset support - GPU reset - DC trace support - Powerplay updates for newer Polaris - Cursor plane update fast path - kfd dma-buf support virtio-gpu: - add EDID support vmwgfx: - pageflip with damage support nouveau: - Initial Turing TU104/TU106 modesetting support msm: - a2xx gpu support for apq8060 and imx5 - a2xx gpummu support - mdp4 display support for apq8060 - DPU fixes and cleanups - enhanced profiling support - debug object naming interface - get_iova/page pinning decoupling tegra: - Tegra194 host1x, VIC and display support enabled - Audio over HDMI for Tegra186 and Tegra194 exynos: - DMA/IOMMU refactoring - plane alpha + blend mode support - Color format fixes for mixer driver rcar-du: - R8A7744 and R8A77470 support - R8A77965 LVDS support imx: - fbdev emulation fix - multi-tiled scalling fixes - SPDX identifiers rockchip - dw_hdmi support - dw-mipi-dsi + dual dsi support - mailbox read size fix qxl: - fix cursor pinning vc4: - YUV support (scaling + cursor) v3d: - enable TFU (Texture Formatting Unit) mali-dp: - add support for linear tiled formats sun4i: - Display Engine 3 support - H6 DE3 mixer 0 support - H6 display engine support - dw-hdmi support - H6 HDMI phy support - implicit fence waiting - BGRX8888 support meson: - Overlay plane support - implicit fence waiting - HDMI 1.4 4k modes bridge: - i2c fixes for sii902x" * tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drm: (1403 commits) drm/amd/display: Add fast path for cursor plane updates drm/amdgpu: Enable GPU recovery by default for CI drm/amd/display: Fix duplicating scaling/underscan connector state drm/amd/display: Fix unintialized max_bpc state values Revert "drm/amd/display: Set RMX_ASPECT as default" drm/amdgpu: Fix stub function name drm/msm/dpu: Fix clock issue after bind failure drm/msm/dpu: Clean up dpu_media_info.h static inline functions drm/msm/dpu: Further cleanups for static inline functions drm/msm/dpu: Cleanup the debugfs functions drm/msm/dpu: Remove dpu_irq and unused functions drm/msm: Make irq_postinstall optional drm/msm/dpu: Cleanup callers of dpu_hw_blk_init drm/msm/dpu: Remove unused functions drm/msm/dpu: Remove dpu_crtc_is_enabled() drm/msm/dpu: Remove dpu_crtc_get_mixer_height drm/msm/dpu: Remove dpu_dbg drm/msm: dpu: Remove crtc_lock drm/msm: dpu: Remove vblank_requested flag from dpu_crtc drm/msm: dpu: Separate crtc assignment from vblank enable ...
2018-12-23Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "A couple of fixes - no common topic ;-)" [ The aio spectre patch also came in from Jens, so now we have that doubly fixed .. ] * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: proc/sysctl: don't return ENOMEM on lookup when a table is unregistering aio: fix spectre gadget in lookup_ioctx
2018-12-22Revert "vfs: Allow userns root to call mknod on owned filesystems."Christian Brauner
This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd. commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: https://github.com/systemd/systemd/pull/9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-21xfs: reallocate realtime summary cache on growfsOmar Sandoval
At mount time, we allocate m_rsum_cache with the number of realtime bitmap blocks. However, xfs_growfs_rt() can increase the number of realtime bitmap blocks. Using the cache after this happens may access out of the bounds of the cache. Fix it by reallocating the cache in this case. Fixes: 355e3532132b ("xfs: cache minimum realtime summary level") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-21Merge tag '4.20-rc7-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb3 fix from Steve French: "An important smb3 fix for an regression to some servers introduced by compounding optimization to rmdir. This fix has been tested by multiple developers (including me) with the usual private xfstesting, but also by the new cifs/smb3 "buildbot" xfstest VMs (thank you Ronnie and Aurelien for good work on this automation). The automated testing has been updated so that it will catch problems like this in the future. Note that Pavel discovered (very recently) some unrelated but extremely important bugs in credit handling (smb3 flow control problem that can lead to disconnects/reconnects) when compounding, that I would have liked to send in ASAP but the complete testing of those two fixes may not be done in time and have to wait for 4.21" * tag '4.20-rc7-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: Fix rmdir compounding regression to strict servers
2018-12-21iomap: don't search past page end in iomap_is_partially_uptodateEric Sandeen
iomap_is_partially_uptodate() is intended to check wither blocks within the selected range of a not-uptodate page are uptodate; if the range we care about is up to date, it's an optimization. However, the iomap implementation continues to check all blocks up to from+count, which is beyond the page, and can even be well beyond the iop->uptodate bitmap. I think the worst that will happen is that we may eventually find a zero bit and return "not partially uptodate" when it would have otherwise returned true, and skip the optimization. Still, it's clearly an invalid memory access that must be fixed. So: fix this by limiting the search to within the page as is done in the non-iomap variant, block_is_partially_uptodate(). Zorro noticed thiswhen KASAN went off for 512 byte blocks on a 64k page system: BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0 Read of size 8 at addr ffff800120c3a318 by task fsstress/22337 Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-20Merge tag 'upstream-4.20-rc7' of git://git.infradead.org/linux-ubifsLinus Torvalds
Pull UBI/UBIFS fixes from Richard Weinberger: - Kconfig dependency fixes for our new auth feature - Fix for selecting the right compressor when creating a fs - Bugfix for a bug in UBIFS's O_TMPFILE implementation - Refcounting fixes for UBI * tag 'upstream-4.20-rc7' of git://git.infradead.org/linux-ubifs: ubifs: Handle re-linking of inodes correctly while recovery ubi: Do not drop UBI device reference before using ubi: Put MTD device after it is not used ubifs: Fix default compression selection in ubifs ubifs: Fix memory leak on error condition ubifs: auth: Add CONFIG_KEYS dependency ubifs: CONFIG_UBIFS_FS_AUTHENTICATION should depend on UBIFS_FS ubifs: replay: Fix high stack usage
2018-12-20iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()"Dave Chinner
This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3. The reverted commit added page reference counting to iomap page structures that are used to track block size < page size state. This was supposed to align the code with page migration page accounting assumptions, but what it has done instead is break XFS filesystems. Every fstests run I've done on sub-page block size XFS filesystems has since picking up this commit 2 days ago has failed with bad page state errors such as: # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038" .... SECTION -- xfs FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+ MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc MOUNT_OPTIONS -- /dev/sdc /mnt/scratch generic/038 454s ... run fstests generic/038 at 2018-12-20 18:43:05 XFS (sdc): Unmounting Filesystem XFS (sdc): Mounting V5 Filesystem XFS (sdc): Ending clean mount BUG: Bad page state in process kswapd0 pfn:3a7fa page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1 flags: 0xfffffc0000000() raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360 raw: 0000000000000001 0000000000000000 00000000ffffffff page dumped because: non-NULL mapping CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 Call Trace: dump_stack+0x67/0x90 bad_page.cold.116+0x8a/0xbd free_pcppages_bulk+0x4bf/0x6a0 free_unref_page_list+0x10f/0x1f0 shrink_page_list+0x49d/0xf50 shrink_inactive_list+0x19d/0x3b0 shrink_node_memcg.constprop.77+0x398/0x690 ? shrink_slab.constprop.81+0x278/0x3f0 shrink_node+0x7a/0x2f0 kswapd+0x34b/0x6d0 ? node_reclaim+0x240/0x240 kthread+0x11f/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x24/0x30 Disabling lock debugging due to kernel taint .... The failures are from anyway that frees pages and empties the per-cpu page magazines, so it's not a predictable failure or an easy to debug failure. generic/038 is a reliable reproducer of this problem - it has a 9 in 10 failure rate on one of my test machines. Failure on other machines have been at random points in fstests runs but every run has ended up tripping this problem. Hence generic/038 was used to bisect the failure because it was the most reliable failure. It is too close to the 4.20 release (not to mention holidays) to try to diagnose, fix and test the underlying cause of the problem, so reverting the commit is the only option we have right now. The revert has been tested against a current tot 4.20-rc7+ kernel across multiple machines running sub-page block size XFs filesystems and none of the bad page state failures have been seen. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-19xfs: stringify scrub types in ftrace outputDarrick J. Wong
Use __print_symbolic to print the scrub type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: stringify btree cursor types in ftrace outputDarrick J. Wong
Use __print_symbolic to print the btree type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: move XFS_INODE_FORMAT_STR mappings to libxfsDarrick J. Wong
Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it updated, and add necessary TRACE_DEFINE_ENUM. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfsDarrick J. Wong
Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to keep it updated, and TRACE_DEFINE_ENUM the values while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: fix symbolic enum printing in ftrace outputDarrick J. Wong
ftrace's __print_symbolic() has a (very poorly documented) requirement that any enum values used in the symbol to string translation table be wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in the ftrace ring buffer. Fix this unsatisfied requirement. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>