Age | Commit message (Collapse) | Author |
|
Remove the linked list use in favour of the vma iterator.
Link: https://lkml.kernel.org/r/20220906194824.2110408-67-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the VMA iterator instead.
Link: https://lkml.kernel.org/r/20220906194824.2110408-40-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the VMA iterator instead. Since VMA can no longer be NULL in the
loop, then deal with out-of-memory outside the loop. This means a
slightly longer run time in the failure case (-ENOMEM) - it will run to
the end of the VMAs before erroring instead of in the middle of the loop.
Link: https://lkml.kernel.org/r/20220906194824.2110408-37-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the VMA iterator instead.
Link: https://lkml.kernel.org/r/20220906194824.2110408-36-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the VMA iterator instead.
Link: https://lkml.kernel.org/r/20220906194824.2110408-35-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the VMA iterator instead.
Link: https://lkml.kernel.org/r/20220906194824.2110408-34-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the VMA iterator instead.
Link: https://lkml.kernel.org/r/20220906194824.2110408-33-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Rework for_each_mte_vma() to use a VMA iterator instead of an explicit
linked-list.
Link: https://lkml.kernel.org/r/20220906194824.2110408-32-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220218023650.672072-1-Liam.Howlett@oracle.com
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the VMA iterator instead.
Link: https://lkml.kernel.org/r/20220906194824.2110408-31-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove the RB tree and start using the maple tree for vm_area_struct
tracking.
Drop validate_mm() calls in expand_upwards() and expand_downwards() as the
lock is not held.
Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Start tracking the VMAs with the new maple tree structure in parallel with
the rb_tree. Add debug and trace events for maple tree operations and
duplicate the rb_tree that is created on forks into the maple tree.
The maple tree is added to the mm_struct including the mm_init struct,
added support in required mm/mmap functions, added tracking in kernel/fork
for process forking, and used to find the unmapped_area and checked
against what the rbtree finds.
This also moves the mmap_lock() in exit_mmap() since the oom reaper call
does walk the VMAs. Otherwise lockdep will be unhappy if oom happens.
When splitting a vma fails due to allocations of the maple tree nodes,
the error path in __split_vma() calls new->vm_ops->close(new). The page
accounting for hugetlb is actually in the close() operation, so it
accounts for the removal of 1/2 of the VMA which was not adjusted. This
results in a negative exit value. To avoid the negative charge, set
vm_start = vm_end and vm_pgoff = 0.
There is also a potential accounting issue in special mappings from
insert_vm_struct() failing to allocate, so reverse the charge there in
the failure scenario.
Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Some architectures support the accessed bit in non-leaf PMD entries, e.g.,
x86 sets the accessed bit in a non-leaf PMD entry when using it as part of
linear address translation [1]. Page table walkers that clear the
accessed bit may use this capability to reduce their search space.
Note that:
1. Although an inline function is preferable, this capability is added
as a configuration option for consistency with the existing macros.
2. Due to the little interest in other varieties, this capability was
only tested on Intel and AMD CPUs.
Thanks to the following developers for their efforts [2][3].
Randy Dunlap <rdunlap@infradead.org>
Stephen Rothwell <sfr@canb.auug.org.au>
[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (June 2021), section 4.8
[2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
[3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/
Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Multi-Gen LRU Framework", v14.
What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
machines. The old direct reclaim backoff, which tries to enforce a
minimum fairness among all eligible memcgs, over-swapped by about
(total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
pulls the plug on swapping once the target is met, trades some
fairness for curtailed latency:
https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
3. Fixed minior build warnings and conflicts. More comments and nits.
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.
Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.
03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
its sole caller"
Minor refactors to improve readability for the following patches.
05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.
06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.
07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.
08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.
09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.
10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.
11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.
13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.
Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra Memcached
Apache Hadoop MongoDB
Apache Spark PostgreSQL
MariaDB (MySQL) Redis
An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.
On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.
On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.
Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
fs_fio_bench_hdd_mq pft
fs_lmbench pgsql-hammerdb
fs_parallelio redis
fs_postmark stream
hackbench sysbenchthread
kernbench tpcc_spark
memcached unixbench
multichase vm-scalability
mutilate will-it-scale
nginx
[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.
1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had not a
single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.
Vaibhav from IBM reported [12]:
In a synthetic MongoDB Benchmark, seeing an average of ~19%
throughput improvement on POWER10(Radix MMU + 64K Page Size) with
MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
three different request distributions, namely, Exponential, Uniform
and Zipfan.
Shuang from U of Rochester reported [13]:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2.
Daniel from Michigan Tech reported [14]:
With Memcached allocating ~100GB of byte-addressable Optante,
performance improvement in terms of throughput (measured as queries
per second) was about 10% for a series of workloads.
Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.
The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]
[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org
Summary
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have been
demonstrated similar effects.
Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
will likely be put to use for both personal computers and data
centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It'd be more difficult if not
impossible to achieve similar effects with other approaches.
This patch (of 14):
Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).
Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.
Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).
Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We used to have swapfile_maximum_size() fetching a maximum value of
swapfile size per-arch.
As the caller of max_swapfile_size() grows, this patch introduce a
variable "swapfile_maximum_size" and cache the value of old
max_swapfile_size(), so that we don't need to calculate the value every
time.
Caching the value in swapfile_init() is safe because when reaching the
phase we should have initialized all the relevant information. Here the
major arch to take care of is x86, which defines the max swapfile size
based on L1TF mitigation.
Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
when reaching swapfile_init(). As a reference, the code path looks like
this for x86:
- start_kernel
- setup_arch
- early_cpu_init
- early_identify_cpu --> setup X86_BUG_L1TF
- parse_early_param
- l1tf_cmdline --> set l1tf_mitigation
- check_bugs
- l1tf_select_mitigation --> set l1tf_mitigation
- arch_call_rest_init
- rest_init
- kernel_init
- kernel_init_freeable
- do_basic_setup
- do_initcalls --> calls swapfile_init() (initcall level 4)
The swapfile size only depends on swp pte format on non-x86 archs, so
caching it is safe too.
Since at it, rename max_swapfile_size() to arch_max_swapfile_size()
because arch can define its own function, so it's more straightforward to
have "arch_" as its prefix. At the meantime, export swapfile_maximum_size
to replace the old usages of max_swapfile_size().
[peterx@redhat.com: declare arch_max_swapfile_size) in swapfile.h]
Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local
Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We've got a bunch of special swap entries that stores PFN inside the swap
offset fields. To fetch the PFN, normally the user just calls
swp_offset() assuming that'll be the PFN.
Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
max possible length of a PFN on the host, meanwhile doing proper check
with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the
PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
One reason to do so is we never tried to sanitize whether swap offset can
really fit for storing PFN. At the meantime, this patch also prepares us
with the future possibility to store more information inside the swp
offset field, so assuming "swp_offset(entry)" to be the PFN will not stand
any more very soon.
Replace many of the swp_offset() callers to use swp_offset_pfn() where
proper. Note that many of the existing users are not candidates for the
replacement, e.g.:
(1) When the swap entry is not a pfn swap entry at all, or,
(2) when we wanna keep the whole swp_offset but only change the swp type.
For the latter, it can happen when fork() triggered on a write-migration
swap entry pte, we may want to only change the migration type from
write->read but keep the rest, so it's not "fetching PFN" but "changing
swap type only". They're left aside so that when there're more
information within the swp offset they'll be carried over naturally in
those cases.
Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
the new swp_offset_pfn() is about.
Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: Remember a/d bits for migration entries", v4.
Problem
=======
When migrating a page, right now we always mark the migrated page as old &
clean.
However that could lead to at least two problems:
(1) We lost the real hot/cold information while we could have persisted.
That information shouldn't change even if the backing page is changed
after the migration,
(2) There can be always extra overhead on the immediate next access to
any migrated page, because hardware MMU needs cycles to set the young
bit again for reads, and dirty bits for write, as long as the
hardware MMU supports these bits.
Many of the recent upstream works showed that (2) is not something trivial
and actually very measurable. In my test case, reading 1G chunk of memory
- jumping in page size intervals - could take 99ms just because of the
extra setting on the young bit on a generic x86_64 system, comparing to
4ms if young set.
This issue is originally reported by Andrea Arcangeli.
Solution
========
To solve this problem, this patchset tries to remember the young/dirty
bits in the migration entries and carry them over when recovering the
ptes.
We have the chance to do so because in many systems the swap offset is not
really fully used. Migration entries use swp offset to store PFN only,
while the PFN is normally not as large as swp offset and normally smaller.
It means we do have some free bits in swp offset that we can use to store
things like A/D bits, and that's how this series tried to approach this
problem.
max_swapfile_size() is used here to detect per-arch offset length in swp
entries. We'll automatically remember the A/D bits when we find that we
have enough swp offset field to keep both the PFN and the extra bits.
Since max_swapfile_size() can be slow, the last two patches cache the
results for it and also swap_migration_ad_supported as a whole.
Known Issues / TODOs
====================
We still haven't taught madvise() to recognize the new A/D bits in
migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon
a migration entry. It's not clear yet on whether we should clear the A
bit, or we should just drop the entry directly.
We didn't teach idle page tracking on the new migration entries, because
it'll need larger rework on the tree on rmap pgtable walk. However it
should make it already better because before this patchset page will be
old page after migration, so the series will fix potential false negative
of idle page tracking when pages were migrated before observing.
The other thing is migration A/D bits will not start to working for
private device swap entries. The code is there for completeness but since
private device swap entries do not yet have fields to store A/D bits, even
if we'll persistent A/D across present pte switching to migration entry,
we'll lose it again when the migration entry converted to private device
swap entry.
Tests
=====
After the patchset applied, the immediate read access test [1] of above 1G
chunk after migration can shrink from 99ms to 4ms. The test is done by
moving 1G pages from node 0->1->0 then read it in page size jumps. The
test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
Similar effect can also be measured when writting the memory the 1st time
after migration.
After applying the patchset, both initial immediate read/write after page
migrated will perform similarly like before migration happened.
Patch Layout
============
Patch 1-2: Cleanups from either previous versions or on swapops.h macros.
Patch 3-4: Prepare for the introduction of migration A/D bits
Patch 5: The core patch to remember young/dirty bit in swap offsets.
Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast.
[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
This patch (of 7):
Replace all the magic "5" with the macro.
Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed
to skip any checks where the length is known at compile time as a
reasonable heuristic to avoid "likely known-good" cases. However, it can
only do this when the copy_*_user() helpers are, themselves, inline too.
Using find_vmap_area() requires taking a spinlock. The
check_object_size() helper can call find_vmap_area() when the destination
is in vmap memory. If show_regs() is called in interrupt context, it will
attempt a call to copy_from_user_nmi(), which may call check_object_size()
and then find_vmap_area(). If something in normal context happens to be
in the middle of calling find_vmap_area() (with the spinlock held), the
interrupt handler will hang forever.
The copy_from_user_nmi() call is actually being called with a fixed-size
length, so check_object_size() should never have been called in the first
place. Given the narrow constraints, just replace the
__copy_from_user_inatomic() call with an open-coded version that calls
only into the sanitizers and not check_object_size(), followed by a call
to raw_copy_from_user().
[akpm@linux-foundation.org: no instrument_copy_from_user() in my tree...]
Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org
Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com
Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Yu Zhao <yuzhao@google.com>
Reported-by: Florian Lehner <dev@der-flo.net>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Florian Lehner <dev@der-flo.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
move to use RCU instead of disabling local interrupts in fast-GUP. Using
an IPI is the old-styled way of serializing against fast-GUP although it
still works as expected now.
And fast-GUP now fixed the potential race with THP collapse by checking
whether PMD is changed or not. So IPI broadcast in radix pmd collapse
flush is not necessary anymore. But it is still needed for hash TLB.
Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.com
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When pud-sized hugepages were introduced for s390, the generic version of
follow_huge_pud() was using pte_page() instead of pud_page(). This would
be wrong for s390, see also commit 97534127012f ("mm/hugetlb: use
pmd_page() in follow_huge_pmd()"). Therefore, and probably because not
all archs were supporting pud_page() at that time, a private version of
follow_huge_pud() was added for s390, correctly using pud_page().
Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
follow_huge_pud() aware of non-present pud entry"), the generic version of
follow_huge_pud() is now also using pud_page(), and in general behaves
similar to follow_huge_pmd().
Therefore we can now switch to the generic version and get rid of the
s390-specific follow_huge_pud().
Link: https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Haiyue Wang <haiyue.wang@intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This Kconfig option is used by individual arch to set its desired
MAX_ORDER. Rename it to reflect its actual use.
Link: https://lkml.kernel.org/r/20220815143959.1511278-1-zi.yan@sent.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: Guo Ren <guoren@kernel.org> [csky]
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Huacai Chen <chenhuacai@kernel.org> [LoongArch]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Taichi Sugaya <sugaya.taichi@socionext.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>
Cc: Qin Jian <qinjian@cqplus1.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This idea was introduced by David Rientjes[1].
Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
a synchronous collapse of memory at their own expense.
The benefits of this approach are:
* CPU is charged to the process that wants to spend the cycles for the
THP
* Avoid unpredictable timing of khugepaged collapse
Semantics
This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others. This implies a hugepage cannot cross a VMA boundary. If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.
The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range. The memory ranges must span at least one
hugepage-sized region.
All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage. Unmapped pages will have their data directly
initialized to 0 in the new hugepage. However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).
Allocation for the new hugepage may enter direct reclaim and/or
compaction, regardless of VMA flags. When the system has multiple NUMA
nodes, the hugepage will be allocated from the node providing the most
native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how
pages will be mapped, constructed, or faulted in the future
Return Value
If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful. On success, process_madvise(2)
returns the number of bytes advised, and madvise(2) returns 0. Else, -1
is returned and errno is set to indicate the error for the most-recently
attempted hugepage collapse. Note that many failures might have occurred,
since the operation may continue to collapse in the event a single
hugepage-sized/aligned region fails.
ENOMEM Memory allocation failed or VMA not found
EBUSY Memcg charging failed
EAGAIN Required resource temporarily unavailable. Try again
might succeed.
EINVAL Other error: No PMD found, subpage doesn't have Present
bit set, "Special" page no backed by struct page, VMA
incorrectly sized, address not page-aligned, ...
Most notable here is ENOMEM and EBUSY (new to madvise) which are intended
to provide the caller with actionable feedback so they may take an
appropriate fallback measure.
Use Cases
An immediate user of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED;
zapping the pmd. Later, when the memory is hot, the implementation could
madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
coverage and dTLB performance. TCMalloc is such an implementation that
could benefit from this[2].
Only privately-mapped anon memory is supported for now, but additional
support for file, shmem, and HugeTLB high-granularity mappings[2] is
expected. File and tmpfs/shmem support would permit:
* Backing executable text by THPs. Current support provided by
CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
might impair services from serving at their full rated load after
(re)starting. Tricks like mremap(2)'ing text onto anonymous memory to
immediately realize iTLB performance prevents page sharing and demand
paging, both of which increase steady state memory footprint. With
MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
and lower RAM footprints.
* Backing guest memory by hugapages after the memory contents have been
migrated in native-page-sized chunks to a new host, in a
userfaultfd-based live-migration stack.
[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc
[jrdr.linux@gmail.com: avoid possible memory leak in failure path]
Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
[zokeefe@google.com add missing kfree() to madvise_collapse()]
Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
[zokeefe@google.com: delay computation of hpage boundaries until use]]
Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
Suggested-by: David Rientjes <rientjes@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform")
moved accesses to protection_map[] from mem_encrypt_amd.c to pgprot.c. As
a result, the accesses are now targets of KASAN (and other
instrumentations), leading to the crash during the boot process.
Disable the instrumentations for pgprot.c like commit 67bb8e999e0a
("x86/mm: Disable various instrumentations of mm/mem_encrypt.c and
mm/tlb.c").
Before this patch, my AMD machine cannot boot since v6.0-rc1 with KASAN
enabled, without anything printed. After the change, it successfully
boots up.
Fixes: 4867fbbdd6b3 ("x86/mm: move protection_map[] inside the platform")
Link: https://lkml.kernel.org/r/20220824084726.2174758-1-naohiro.aota@wdc.com
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix PAT on Xen, which caused i915 driver failures
- Fix compat INT 80 entry crash on Xen PV guests
- Fix 'MMIO Stale Data' mitigation status reporting on older Intel CPUs
- Fix RSB stuffing regressions
- Fix ORC unwinding on ftrace trampolines
- Add Intel Raptor Lake CPU model number
- Fix (work around) a SEV-SNP bootloader bug providing bogus values in
boot_params->cc_blob_address, by ignoring the value on !SEV-SNP
bootups.
- Fix SEV-SNP early boot failure
- Fix the objtool list of noreturn functions and annotate snp_abort(),
which bug confused objtool on gcc-12.
- Fix the documentation for retbleed
* tag 'x86-urgent-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/ABI: Mention retbleed vulnerability info file for sysfs
x86/sev: Mark snp_abort() noreturn
x86/sev: Don't use cc_platform_has() for early SEV-SNP calls
x86/boot: Don't propagate uninitialized boot_params->cc_blob_address
x86/cpu: Add new Raptor Lake CPU model number
x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
x86/nospec: Fix i386 RSB stuffing
x86/nospec: Unwreck the RSB stuffing
x86/bugs: Add "unknown" reporting for MMIO Stale Data
x86/entry: Fix entry_INT80_compat for Xen PV guests
x86/PAT: Have pat_enabled() properly reflect state when running on Xen
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Ingo Molnar:
"Misc fixes: an Arch-LBR fix, a PEBS enumeration fix, an Intel DS fix,
PEBS constraints fix on Alder Lake CPUs and an Intel uncore PMU fix"
* tag 'perf-urgent-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix broken read_counter() for SNB IMC PMU
perf/x86/intel: Fix pebs event constraints for ADL
perf/x86/intel/ds: Fix precise store latency handling
perf/x86/core: Set pebs_capable and PMU_FL_PEBS_ALL for the Baseline
perf/x86/lbr: Enable the branch type for the Arch LBR by default
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix double free of guarded storage and runtime instrumentation
control blocks on fork() failure
- Fix triggering write fault when VMA does not allow VM_WRITE
* tag 's390-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/mm: do not trigger write fault when vma does not allow VM_WRITE
s390: fix double free of GS and RI CBs on fork() failure
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- two minor cleanups
- a fix of the xen/privcmd driver avoiding a possible NULL dereference
in an error case
* tag 'for-linus-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: fix error exit of privcmd_ioctl_dm_op()
xen: move from strlcpy with unused retval to strscpy
xen: x86: remove setting the obsolete config XEN_MAX_DOMAIN_MEMORY
|
|
Some architectures define their own arch_test_bit and they also need
arch_test_bit_acquire, otherwise they won't compile. We also clean up
the code by using the generic test_bit if that is equivalent to the
arch-specific version.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 8238b4579866 ("wait_on_bit: add an acquire memory barrier")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Existing code was generating bogus counts for the SNB IMC bandwidth counters:
$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
1.000327813 1,024.03 MiB uncore_imc/data_reads/
1.000327813 20.73 MiB uncore_imc/data_writes/
2.000580153 261,120.00 MiB uncore_imc/data_reads/
2.000580153 23.28 MiB uncore_imc/data_writes/
The problem was introduced by commit:
07ce734dd8ad ("perf/x86/intel/uncore: Clean up client IMC")
Where the read_counter callback was replace to point to the generic
uncore_mmio_read_counter() function.
The SNB IMC counters are freerunnig 32-bit counters laid out contiguously in
MMIO. But uncore_mmio_read_counter() is using a readq() call to read from
MMIO therefore reading 64-bit from MMIO. Although this is okay for the
uncore_perf_event_update() function because it is shifting the value based
on the actual counter width to compute a delta, it is not okay for the
uncore_pmu_event_start() which is simply reading the counter and therefore
priming the event->prev_count with a bogus value which is responsible for
causing bogus deltas in the perf stat command above.
The fix is to reintroduce the custom callback for read_counter for the SNB
IMC PMU and use readl() instead of readq(). With the change the output of
perf stat is back to normal:
$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
1.000120987 296.94 MiB uncore_imc/data_reads/
1.000120987 138.42 MiB uncore_imc/data_writes/
2.000403144 175.91 MiB uncore_imc/data_reads/
2.000403144 68.50 MiB uncore_imc/data_writes/
Fixes: 07ce734dd8ad ("perf/x86/intel/uncore: Clean up client IMC")
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20220803160031.1379788-1-eranian@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"A bumper crop of arm64 fixes for -rc3.
The largest change is fixing our parsing of the 'rodata=full' command
line option, which kstrtobool() started treating as 'rodata=false'.
The fix actually makes the parsing of that option much less fragile
and updates the documentation at the same time.
We still have a boot issue pending when KASLR is disabled at compile
time, but there's a fresh fix on the list which I'll send next week if
it holds up to testing.
Summary:
- Fix workaround for Cortex-A76 erratum #1286807
- Add workaround for AMU erratum #2457168 on Cortex-A510
- Drop reference to removed CONFIG_ARCH_RANDOM #define
- Fix parsing of the "rodata=full" cmdline option
- Fix a bunch of issues in the SME register state switching and sigframe code
- Fix incorrect extraction of the CTR_EL0.CWG register field
- Fix ACPI cache topology probing when the PPTT is not present
- Trivial comment and whitespace fixes"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/sme: Don't flush SVE register state when handling SME traps
arm64/sme: Don't flush SVE register state when allocating SME storage
arm64/signal: Flush FPSIMD register state when disabling streaming mode
arm64/signal: Raise limit on stack frames
arm64/cache: Fix cache_type_cwg() for register generation
arm64/sysreg: Guard SYS_FIELD_ macros for asm
arm64/sysreg: Directly include bitfield.h
arm64: cacheinfo: Fix incorrect assignment of signed error value to unsigned fw_level
arm64: errata: add detection for AMEVCNTR01 incrementing incorrectly
arm64: fix rodata=full
arm64: Fix comment typo
docs/arm64: elf_hwcaps: unify newlines in HWCAP lists
arm64: adjust KASLR relocation after ARCH_RANDOM removal
arm64: Fix match_list for erratum 1286807 on Arm Cortex-A76
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A handful of fixes for the Microchip device trees
- A pair of fixes to eliminate build warnings
* tag 'riscv-for-linus-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: dts: microchip: mpfs: remove pci axi address translation property
riscv: dts: microchip: mpfs: remove bogus card-detect-delay
riscv: dts: microchip: mpfs: remove ti,fifo-depth property
riscv: dts: microchip: mpfs: fix incorrect pcie child node name
riscv: traps: add missing prototype
riscv: signal: fix missing prototype warning
riscv: dts: microchip: correct L2 cache interrupts
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Fix a bunch of build errors/warnings, a poweroff error and an
unbalanced locking in do_page_fault()"
* tag 'loongarch-fixes-6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: mm: Avoid unnecessary page fault retires on shared memory types
LoongArch: Add subword xchg/cmpxchg emulation
LoongArch: Cleanup headers to avoid circular dependency
LoongArch: Cleanup reset routines with new API
LoongArch: Fix build warnings in VDSO
LoongArch: Select PCI_QUIRKS to avoid build error
|
|
There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read).
On architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and
they may return invalid data.
Fix this class of bugs by introducing a new function "test_bit_acquire"
that works like test_bit, but has acquire memory ordering semantics.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux.git into fixes
This contains a pair of fixes for build-time warnings.
* 'riscv-variable_fixes_without_kvm' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux.git:
riscv: traps: add missing prototype
riscv: signal: fix missing prototype warning
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/conor/linux.git into fixes
Microchip RISC-V devicetree fixes for 6.0-rc3
Two sets of fixes this time around:
- A fix for the interrupt ordering of the l2-cache controller. If the
driver is enabled, it would spam the console /constantly/, rendering
the system useless.
- General cleanup for some bogus properties in the dt, part of my quest
for zero dtbs_check warnings.
On that note, the interrupt ordering adds a dtbs_check warning - but I
considered that fixing the potentially useless system was more of a
priority.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'dt-fixes-for-palmer-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/conor/linux.git:
riscv: dts: microchip: mpfs: remove pci axi address translation property
riscv: dts: microchip: mpfs: remove bogus card-detect-delay
riscv: dts: microchip: mpfs: remove ti,fifo-depth property
riscv: dts: microchip: mpfs: fix incorrect pcie child node name
riscv: dts: microchip: correct L2 cache interrupts
|
|
Mark both the function prototype and definition as noreturn in order to
prevent the compiler from doing transformations which confuse objtool
like so:
vmlinux.o: warning: objtool: sme_enable+0x71: unreachable instruction
This triggers with gcc-12.
Add it and sev_es_terminate() to the objtool noreturn tracking array
too. Sort it while at it.
Suggested-by: Michael Matz <matz@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220824152420.20547-1-bp@alien8.de
|
|
For non-protection pXd_none() page faults in do_dat_exception(), we
call do_exception() with access == (VM_READ | VM_WRITE | VM_EXEC).
In do_exception(), vma->vm_flags is checked against that before
calling handle_mm_fault().
Since commit 92f842eac7ee3 ("[S390] store indication fault optimization"),
we call handle_mm_fault() with FAULT_FLAG_WRITE, when recognizing that
it was a write access. However, the vma flags check is still only
checking against (VM_READ | VM_WRITE | VM_EXEC), and therefore also
calling handle_mm_fault() with FAULT_FLAG_WRITE in cases where the vma
does not allow VM_WRITE.
Fix this by changing access check in do_exception() to VM_WRITE only,
when recognizing write access.
Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com
Fixes: 92f842eac7ee3 ("[S390] store indication fault optimization")
Cc: <stable@vger.kernel.org>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
The pointers for guarded storage and runtime instrumentation control
blocks are stored in the thread_struct of the associated task. These
pointers are initially copied on fork() via arch_dup_task_struct()
and then cleared via copy_thread() before fork() returns. If fork()
happens to fail after the initial task dup and before copy_thread(),
the newly allocated task and associated thread_struct memory are
freed via free_task() -> arch_release_task_struct(). This results in
a double free of the guarded storage and runtime info structs
because the fields in the failed task still refer to memory
associated with the source task.
This problem can manifest as a BUG_ON() in set_freepointer() (with
CONFIG_SLAB_FREELIST_HARDENED enabled) or KASAN splat (if enabled)
when running trinity syscall fuzz tests on s390x. To avoid this
problem, clear the associated pointer fields in
arch_dup_task_struct() immediately after the new task is copied.
Note that the RI flag is still cleared in copy_thread() because it
resides in thread stack memory and that is where stack info is
copied.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Fixes: 8d9047f8b967c ("s390/runtime instrumentation: simplify task exit handling")
Fixes: 7b83c6297d2fc ("s390/guarded storage: simplify task exit handling")
Cc: <stable@vger.kernel.org> # 4.15
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20220816155407.537372-1-bfoster@redhat.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Commit c70727a5bc18 ("xen: allow more than 512 GB of RAM for 64 bit
pv-domains") from July 2015 replaces the config XEN_MAX_DOMAIN_MEMORY with
a new config XEN_512GB, but misses to adjust arch/x86/configs/xen.config.
As XEN_512GB defaults to yes, there is no need to explicitly set any config
in xen.config.
Just remove setting the obsolete config XEN_MAX_DOMAIN_MEMORY.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20220817044333.22310-1-lukas.bulwahn@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Commit d92725256b4f22d0 ("mm: avoid unnecessary page fault retires on
shared memory types") modifies do_page_fault() to handle the VM_FAULT_
COMPLETED case, but forget to change for LoongArch, so fix it as other
architectures does.
Fixes: d92725256b4f22d0 ("mm: avoid unnecessary page fault retires on shared memory types")
Reviewed-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
LoongArch only support 32-bit/64-bit xchg/cmpxchg in native. But percpu
operation, qspinlock and some drivers need 8-bit/16-bit xchg/cmpxchg. We
add subword xchg/cmpxchg emulation in this patch because the emulation
has better performance than the generic implementation (on NUMA system),
and it can fix some build errors meanwhile [1].
LoongArch's guarantee for forward progress (avoid many ll/sc happening
at the same time and no one succeeds):
We have the "exclusive access (with timeout) of ll" feature to avoid
simultaneous ll (which also blocks other memory load/store on the same
address), and the "random delay of sc" feature to avoid simultaneous
sc. It is a mandatory requirement for multi-core LoongArch processors
to implement such features, only except those single-core and dual-core
processors (they also don't support multi-chip interconnection).
Feature bits are introduced in CPUCFG3, bit 3 and bit 4 [2].
[1] https://lore.kernel.org/loongarch/CAAhV-H6vvkuOzy8OemWdYK3taj5Jn3bFX0ZTwE=twM8ywpBUYA@mail.gmail.com/T/#t
[2] https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#_cpucfg
Reported-by: Sudip Mukherjee (Codethink) <sudipm.mukherjee@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rui Wang <wangrui@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When enable GENERIC_IOREMAP, there will be circular dependency to cause
build errors. The root cause is that pgtable.h shouldn't include io.h
but pgtable.h need some macros defined in io.h. So cleanup those macros
and remove the unnecessary inclusions, as other architectures do.
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Cleanup reset routines by using new do_kernel_power_off() instead of old
pm_power_off(), and then simplify the whole file (reset.c) organization
by inlining some functions. This cleanup also fix a poweroff error if EFI
runtime is disabled.
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Fix build warnings in VDSO as below:
arch/loongarch/vdso/vgettimeofday.c:9:5: warning: no previous prototype for '__vdso_clock_gettime' [-Wmissing-prototypes]
9 | int __vdso_clock_gettime(clockid_t clock,
| ^~~~~~~~~~~~~~~~~~~~
arch/loongarch/vdso/vgettimeofday.c:15:5: warning: no previous prototype for '__vdso_gettimeofday' [-Wmissing-prototypes]
15 | int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
| ^~~~~~~~~~~~~~~~~~~
arch/loongarch/vdso/vgettimeofday.c:21:5: warning: no previous prototype for '__vdso_clock_getres' [-Wmissing-prototypes]
21 | int __vdso_clock_getres(clockid_t clock_id,
| ^~~~~~~~~~~~~~~~~~~
arch/loongarch/vdso/vgetcpu.c:27:5: warning: no previous prototype for '__vdso_getcpu' [-Wmissing-prototypes]
27 | int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
PCI_LOONGSON is a mandatory for LoongArch and it is selected in Kconfig
unconditionally, but its dependency PCI_QUIRKS is missing and may cause
a build error when "make randconfig":
arch/loongarch/pci/acpi.c: In function 'pci_acpi_setup_ecam_mapping':
>> arch/loongarch/pci/acpi.c:103:29: error: 'loongson_pci_ecam_ops' undeclared (first use in this function)
103 | ecam_ops = &loongson_pci_ecam_ops;
| ^~~~~~~~~~~~~~~~~~~~~
arch/loongarch/pci/acpi.c:103:29: note: each undeclared identifier is reported only once for each function it appears in
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for PCI_LOONGSON
Depends on [n]: PCI [=y] && (MACH_LOONGSON64 [=y] || COMPILE_TEST [=y]) && (OF [=y] || ACPI [=y]) && PCI_QUIRKS [=n]
Selected by [y]:
- LOONGARCH [=y]
Fix it by selecting PCI_QUIRKS unconditionally, too.
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When running identity-mapped and depending on the kernel configuration,
it is possible that the compiler uses jump tables when generating code
for cc_platform_has().
This causes a boot failure because the jump table uses un-mapped kernel
virtual addresses, not identity-mapped addresses. This has been seen
with CONFIG_RETPOLINE=n.
Similar to sme_encrypt_kernel(), use an open-coded direct check for the
status of SNP rather than trying to eliminate the jump table. This
preserves any code optimization in cc_platform_has() that can be useful
post boot. It also limits the changes to SEV-specific files so that
future compiler features won't necessarily require possible build changes
just because they are not compatible with running identity-mapped.
[ bp: Massage commit message. ]
Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Reported-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 5.19.x
Link: https://lore.kernel.org/all/YqfabnTRxFSM+LoX@google.com/
|
|
In some cases, bootloaders will leave boot_params->cc_blob_address
uninitialized rather than zeroing it out. This field is only meant to be
set by the boot/compressed kernel in order to pass information to the
uncompressed kernel when SEV-SNP support is enabled.
Therefore, there are no cases where the bootloader-provided values
should be treated as anything other than garbage. Otherwise, the
uncompressed kernel may attempt to access this bogus address, leading to
a crash during early boot.
Normally, sanitize_boot_params() would be used to clear out such fields
but that happens too late: sev_enable() may have already initialized
it to a valid value that should not be zeroed out. Instead, have
sev_enable() zero it out unconditionally beforehand.
Also ensure this happens for !CONFIG_AMD_MEM_ENCRYPT as well by also
including this handling in the sev_enable() stub function.
[ bp: Massage commit message and comments. ]
Fixes: b190a043c49a ("x86/sev: Add SEV-SNP feature detection/setup")
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Reported-by: watnuss@gmx.de
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216387
Link: https://lore.kernel.org/r/20220823160734.89036-1-michael.roth@amd.com
|
|
An AXI master address translation table property was inadvertently
added to the device tree & this was not caught by dtbs_check at the
time. Remove the property - it should not be in mpfs.dtsi anyway as
it would be more suitable in -fabric.dtsi nor does it actually apply
to the version of the reference design we are using for upstream.
Link: https://www.microsemi.com/document-portal/doc_download/1245812-polarfire-fpga-and-polarfire-soc-fpga-pci-express-user-guide # Section 1.3.3
Fixes: 528a5b1f2556 ("riscv: dts: microchip: add new peripherals to icicle kit device tree")
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
|
|
Recent versions of dt-schema warn about a previously undetected
undocumented property:
arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dtb: mmc@20008000: Unevaluated properties are not allowed ('card-detect-delay' was unexpected)
From schema: Documentation/devicetree/bindings/mmc/cdns,sdhci.yaml
There are no GPIOs connected to MSSIO6B4 pin K3 so adding the common
cd-debounce-delay-ms property makes no sense. The Cadence IP has a
register that sets the card detect delay as "DP * tclk". On MPFS, this
clock frequency is not configurable (it must be 200 MHz) & the FPGA
comes out of reset with this register already set.
Fixes: bc47b2217f24 ("riscv: dts: microchip: add the sundance polarberry")
Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE board")
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
|
|
Recent versions of dt-schema warn about a previously undetected
undocument property on the icicle & polarberry devicetrees:
arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dtb: ethernet@20112000: ethernet-phy@8: Unevaluated properties are not allowed ('ti,fifo-depth' was unexpected)
From schema: Documentation/devicetree/bindings/net/cdns,macb.yaml
I know what you're thinking, the binding doesn't look to be the problem
and I agree. I am not sure why a TI vendor property was ever actually
added since it has no meaning... just get rid of it.
Fixes: bc47b2217f24 ("riscv: dts: microchip: add the sundance polarberry")
Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE board")
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
|