Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- Documentation/admin-guide/LSM/index.rst 1
-rw-r--r-- Documentation/admin-guide/LSM/landlock.rst 158
-rw-r--r-- Documentation/admin-guide/README.rst 4
-rw-r--r-- Documentation/admin-guide/abi-obsolete-files.rst 7
-rw-r--r-- Documentation/admin-guide/abi-obsolete.rst 6
-rw-r--r-- Documentation/admin-guide/abi-removed-files.rst 7
-rw-r--r-- Documentation/admin-guide/abi-removed.rst 6
-rw-r--r-- Documentation/admin-guide/abi-stable-files.rst 7
-rw-r--r-- Documentation/admin-guide/abi-stable.rst 6
-rw-r--r-- Documentation/admin-guide/abi-testing-files.rst 7
-rw-r--r-- Documentation/admin-guide/abi-testing.rst 6
-rw-r--r-- Documentation/admin-guide/abi.rst 18
-rw-r--r-- Documentation/admin-guide/blockdev/zram.rst 36
-rw-r--r-- Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst 4
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memory.rst 5
-rw-r--r-- Documentation/admin-guide/cgroup-v2.rst 48
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-crypt.rst 5
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-integrity.rst 5
-rw-r--r-- Documentation/admin-guide/device-mapper/verity.rst 20
-rw-r--r-- Documentation/admin-guide/ext4.rst 7
-rw-r--r-- Documentation/admin-guide/gpio/gpio-sim.rst 2
-rw-r--r-- Documentation/admin-guide/gpio/gpio-virtuser.rst 2
-rw-r--r-- Documentation/admin-guide/highuid.rst 80
-rw-r--r-- Documentation/admin-guide/hw-vuln/index.rst 1
-rw-r--r-- Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst 8
-rw-r--r-- Documentation/admin-guide/hw-vuln/rsb.rst 268
-rw-r--r-- Documentation/admin-guide/hw-vuln/srso.rst 13
-rw-r--r-- Documentation/admin-guide/index.rst 1
-rw-r--r-- Documentation/admin-guide/iostats.rst 89
-rw-r--r-- Documentation/admin-guide/kdump/kdump.rst 4
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt 110
-rw-r--r-- Documentation/admin-guide/kernel-per-CPU-kthreads.rst 7
-rw-r--r-- Documentation/admin-guide/laptops/index.rst 1
-rw-r--r-- Documentation/admin-guide/laptops/samsung-galaxybook.rst 174
-rw-r--r-- Documentation/admin-guide/media/cec.rst 2
-rw-r--r-- Documentation/admin-guide/media/mgb4.rst 4
-rw-r--r-- Documentation/admin-guide/mm/cma_debugfs.rst 10
-rw-r--r-- Documentation/admin-guide/mm/damon/usage.rst 87
-rw-r--r-- Documentation/admin-guide/mm/hugetlbpage.rst 10
-rw-r--r-- Documentation/admin-guide/mm/pagemap.rst 21
-rw-r--r-- Documentation/admin-guide/mm/zswap.rst 10
-rw-r--r-- Documentation/admin-guide/pm/cpufreq.rst 17
-rw-r--r-- Documentation/admin-guide/pm/cpuidle.rst 29
-rw-r--r-- Documentation/admin-guide/pm/intel_idle.rst 18
-rw-r--r-- Documentation/admin-guide/pm/intel_pstate.rst 3
-rw-r--r-- Documentation/admin-guide/pnp.rst 3
-rw-r--r-- Documentation/admin-guide/serial-console.rst 4
-rw-r--r-- Documentation/admin-guide/sysctl/fs.rst 25
-rw-r--r-- Documentation/admin-guide/sysctl/kernel.rst 11
-rw-r--r-- Documentation/admin-guide/sysctl/vm.rst 9
-rw-r--r-- Documentation/admin-guide/tainted-kernels.rst 5
-rw-r--r-- Documentation/admin-guide/thunderbolt.rst 2
-rw-r--r-- Documentation/admin-guide/workload-tracing.rst 2
53 files changed, 1101 insertions, 294 deletions
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst
index ce63be6d64ad..b44ef68f6e4d 100644
--- a/Documentation/admin-guide/LSM/index.rst
+++ b/Documentation/admin-guide/LSM/index.rst
@@ -48,3 +48,4 @@ subdirectories.
Yama
SafeSetID
ipe
+ landlock
diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst
new file mode 100644
index 000000000000..9e61607def08
--- /dev/null
+++ b/Documentation/admin-guide/LSM/landlock.rst
@@ -0,0 +1,158 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright © 2025 Microsoft Corporation
+
+================================
+Landlock: system-wide management
+================================
+
+:Author: Mickaël Salaün
+:Date: March 2025
+
+Landlock can leverage the audit framework to log events.
+
+User space documentation can be found here:
+Documentation/userspace-api/landlock.rst.
+
+Audit
+=====
+
+Denied access requests are logged by default for a sandboxed program if `audit`
+is enabled. This default behavior can be changed with the
+sys_landlock_restrict_self() flags (cf.
+Documentation/userspace-api/landlock.rst). Landlock logs can also be masked
+with audit rules. Landlock can generate two audit record types.
+
+Record types
+------------
+
+AUDIT_LANDLOCK_ACCESS
+ This record type identifies a denied access request to a kernel resource.
+ The ``domain`` field indicates the ID of the domain that blocked the
+ request. The ``blockers`` field indicates the cause(s) of this denial
+ (separated by a comma), and the following fields identify the kernel object
+ (similar to SELinux). There may be more than one of this record type per
+ audit event.
+
+ Example with a file link request generating two records in the same event::
+
+ domain=195ba459b blockers=fs.refer path="/usr/bin" dev="vda2" ino=351
+ domain=195ba459b blockers=fs.make_reg,fs.refer path="/usr/local" dev="vda2" ino=365
+
+AUDIT_LANDLOCK_DOMAIN
+ This record type describes the status of a Landlock domain. The ``status``
+ field can be either ``allocated`` or ``deallocated``.
+
+ The ``allocated`` status is part of the same audit event and follows
+ the first logged ``AUDIT_LANDLOCK_ACCESS`` record of a domain. It identifies
+ Landlock domain information at the time of the sys_landlock_restrict_self()
+ call with the following fields:
+
+ - the ``domain`` ID
+ - the enforcement ``mode``
+ - the domain creator's ``pid``
+ - the domain creator's ``uid``
+ - the domain creator's executable path (``exe``)
+ - the domain creator's command line (``comm``)
+
+ Example::
+
+ domain=195ba459b status=allocated mode=enforcing pid=300 uid=0 exe="/root/sandboxer" comm="sandboxer"
+
+ The ``deallocated`` status is an event on its own and it identifies a
+ Landlock domain release. After such an event, it is guaranteed that the
+ related domain ID will never be reused during the lifetime of the system.
+ The ``domain`` field indicates the ID of the domain which is released, and
+ the ``denials`` field indicates the total number of denied access requests,
+ which might not have been logged according to the audit rules and
+ sys_landlock_restrict_self()'s flags.
+
+ Example::
+
+ domain=195ba459b status=deallocated denials=3
+
+
+Event samples
+--------------
+
+Here are two examples of log events (see serial numbers).
+
+In this example a sandboxed program (``kill``) tries to send a signal to the
+init process, which is denied because of the signal scoping restriction
+(``LL_SCOPED=s``)::
+
+ $ LL_FS_RO=/ LL_FS_RW=/ LL_SCOPED=s LL_FORCE_LOG=1 ./sandboxer kill 1
+
+This command generates two events, each identified with a unique serial
+number following a timestamp (``msg=audit(1729738800.268:30)``). The first
+event (serial ``30``) contains 4 records. The first record
+(``type=LANDLOCK_ACCESS``) shows an access denied by the domain `1a6fdc66f`.
+The cause of this denial is the signal scoping restriction
+(``blockers=scope.signal``). The process that would have received this signal
+is the init process (``opid=1 ocomm="systemd"``).
+
+The second record (``type=LANDLOCK_DOMAIN``) describes (``status=allocated``)
+domain `1a6fdc66f`. This domain was created by process ``286`` executing the
+``/root/sandboxer`` program launched by the root user.
+
+The third record (``type=SYSCALL``) describes the syscall, its provided
+arguments, its result (``success=no exit=-1``), and the process that called it.
+
+The fourth record (``type=PROCTITLE``) shows the command's name as a
+hexadecimal value. This can be translated with ``python -c
+'print(bytes.fromhex("6B696C6C0031"))'``.
+
+Finally, the last record (``type=LANDLOCK_DOMAIN``) is also the only one from
+the second event (serial ``31``). It is not tied to a direct user space action
+but an asynchronous one to free resources tied to a Landlock domain
+(``status=deallocated``). This can be useful to know that the following logs
+will not concern the domain ``1a6fdc66f`` anymore. This record also summarizes
+the number of requests this domain denied (``denials=1``), whether they were
+logged or not.
+
+.. code-block::
+
+ type=LANDLOCK_ACCESS msg=audit(1729738800.268:30): domain=1a6fdc66f blockers=scope.signal opid=1 ocomm="systemd"
+ type=LANDLOCK_DOMAIN msg=audit(1729738800.268:30): domain=1a6fdc66f status=allocated mode=enforcing pid=286 uid=0 exe="/root/sandboxer" comm="sandboxer"
+ type=SYSCALL msg=audit(1729738800.268:30): arch=c000003e syscall=62 success=no exit=-1 [...] ppid=272 pid=286 auid=0 uid=0 gid=0 [...] comm="kill" [...]
+ type=PROCTITLE msg=audit(1729738800.268:30): proctitle=6B696C6C0031
+ type=LANDLOCK_DOMAIN msg=audit(1729738800.324:31): domain=1a6fdc66f status=deallocated denials=1
+
+Here is another example showcasing filesystem access control::
+
+ $ LL_FS_RO=/ LL_FS_RW=/tmp LL_FORCE_LOG=1 ./sandboxer sh -c "echo > /etc/passwd"
+
+The related audit logs contain 8 records from 3 different events (serials 33,
+34 and 35) created by the same domain `1a6fdc679`::
+
+ type=LANDLOCK_ACCESS msg=audit(1729738800.221:33): domain=1a6fdc679 blockers=fs.write_file path="/dev/tty" dev="devtmpfs" ino=9
+ type=LANDLOCK_DOMAIN msg=audit(1729738800.221:33): domain=1a6fdc679 status=allocated mode=enforcing pid=289 uid=0 exe="/root/sandboxer" comm="sandboxer"
+ type=SYSCALL msg=audit(1729738800.221:33): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...]
+ type=PROCTITLE msg=audit(1729738800.221:33): proctitle=7368002D63006563686F203E202F6574632F706173737764
+ type=LANDLOCK_ACCESS msg=audit(1729738800.221:34): domain=1a6fdc679 blockers=fs.write_file path="/etc/passwd" dev="vda2" ino=143821
+ type=SYSCALL msg=audit(1729738800.221:34): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...]
+ type=PROCTITLE msg=audit(1729738800.221:34): proctitle=7368002D63006563686F203E202F6574632F706173737764
+ type=LANDLOCK_DOMAIN msg=audit(1729738800.261:35): domain=1a6fdc679 status=deallocated denials=2
+
+
+Event filtering
+---------------
+
+If you get spammed with audit logs related to Landlock, this is either an
+attack attempt or a bug in the security policy. Filters can limit this noise
+in two complementary ways:
+
+- with sys_landlock_restrict_self()'s flags if we can fix the sandboxed
+ programs,
+- or with audit rules (see :manpage:`auditctl(8)`).
+
+Additional documentation
+========================
+
+* `Linux Audit Documentation`_
+* Documentation/userspace-api/landlock.rst
+* Documentation/security/landlock.rst
+* https://landlock.io
+
+.. Links
+.. _Linux Audit Documentation:
+ https://github.com/linux-audit/audit-documentation/wiki
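The record layout described in the new landlock.rst can be explored with a small parser. The following is a minimal sketch, not part of the patch: it assumes records arrive one per line in the ``type=... msg=audit(<timestamp>:<serial>): key=value ...`` form shown in the event samples, and the helper names (``parse_record``, ``group_by_serial``, ``decode_proctitle``) are hypothetical. Elided ``[...]`` tokens are simply ignored.

```python
import re
from collections import defaultdict

# Record header, e.g. "type=LANDLOCK_ACCESS msg=audit(1729738800.268:30): ..."
HEADER = re.compile(r"type=(\S+) msg=audit\((\d+\.\d+):(\d+)\): ?(.*)")
# key=value pairs; the value is either a quoted string or a bare token.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_record(line):
    """Split one audit line into (serial, type, fields); None if it does not match."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    rec_type, _timestamp, serial, body = m.groups()
    fields = {key: value.strip('"') for key, value in FIELD.findall(body)}
    return int(serial), rec_type, fields


def group_by_serial(lines):
    """Group parsed records by audit event serial number."""
    events = defaultdict(list)
    for line in lines:
        parsed = parse_record(line)
        if parsed is not None:
            serial, rec_type, fields = parsed
            events[serial].append((rec_type, fields))
    return dict(events)


def decode_proctitle(hex_value):
    """Decode a hex-encoded PROCTITLE value into its NUL-separated arguments."""
    return bytes.fromhex(hex_value).decode("utf-8", "replace").split("\x00")


# Records copied from the first event sample in the documentation.
SAMPLE = [
    'type=LANDLOCK_ACCESS msg=audit(1729738800.268:30): domain=1a6fdc66f blockers=scope.signal opid=1 ocomm="systemd"',
    'type=PROCTITLE msg=audit(1729738800.268:30): proctitle=6B696C6C0031',
    'type=LANDLOCK_DOMAIN msg=audit(1729738800.324:31): domain=1a6fdc66f status=deallocated denials=1',
]

events = group_by_serial(SAMPLE)
```

Grouping by serial mirrors how the text reads the samples: records sharing a serial belong to one event, and the ``deallocated`` record stands alone in its own event.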
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index eb9452668909..70b02f30013a 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -165,7 +165,7 @@ Configuring the kernel
"make xconfig" Qt based configuration tool.
- "make gconfig" GTK+ based configuration tool.
+ "make gconfig" GTK based configuration tool.
"make oldconfig" Default all questions based on the contents of
your existing ./.config file and asking about
@@ -176,7 +176,7 @@ Configuring the kernel
values without prompting.
"make defconfig" Create a ./.config file by using the default
- symbol values from either arch/$ARCH/defconfig
+ symbol values from either arch/$ARCH/configs/defconfig
or arch/$ARCH/configs/${PLATFORM}_defconfig,
depending on the architecture.
diff --git a/Documentation/admin-guide/abi-obsolete-files.rst b/Documentation/admin-guide/abi-obsolete-files.rst
new file mode 100644
index 000000000000..3061a916b4b5
--- /dev/null
+++ b/Documentation/admin-guide/abi-obsolete-files.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Obsolete ABI Files
+==================
+
+.. kernel-abi:: obsolete
+ :no-symbols:
diff --git a/Documentation/admin-guide/abi-obsolete.rst b/Documentation/admin-guide/abi-obsolete.rst
index 594e697aa1b2..640f3903e847 100644
--- a/Documentation/admin-guide/abi-obsolete.rst
+++ b/Documentation/admin-guide/abi-obsolete.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
ABI obsolete symbols
====================
@@ -7,5 +9,5 @@ marked to be removed at some later point in time.
The description of the interface will document the reason why it is
obsolete and when it can be expected to be removed.
-.. kernel-abi:: ABI/obsolete
- :rst:
+.. kernel-abi:: obsolete
+ :no-files:
diff --git a/Documentation/admin-guide/abi-removed-files.rst b/Documentation/admin-guide/abi-removed-files.rst
new file mode 100644
index 000000000000..f1bdfadd2ec4
--- /dev/null
+++ b/Documentation/admin-guide/abi-removed-files.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Removed ABI Files
+=================
+
+.. kernel-abi:: removed
+ :no-symbols:
diff --git a/Documentation/admin-guide/abi-removed.rst b/Documentation/admin-guide/abi-removed.rst
index f9e000c81828..88832d3eacd6 100644
--- a/Documentation/admin-guide/abi-removed.rst
+++ b/Documentation/admin-guide/abi-removed.rst
@@ -1,5 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+
ABI removed symbols
===================
-.. kernel-abi:: ABI/removed
- :rst:
+.. kernel-abi:: removed
+ :no-files:
diff --git a/Documentation/admin-guide/abi-stable-files.rst b/Documentation/admin-guide/abi-stable-files.rst
new file mode 100644
index 000000000000..f867738fc178
--- /dev/null
+++ b/Documentation/admin-guide/abi-stable-files.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Stable ABI Files
+================
+
+.. kernel-abi:: stable
+ :no-symbols:
diff --git a/Documentation/admin-guide/abi-stable.rst b/Documentation/admin-guide/abi-stable.rst
index fc3361d847b1..528c68401f4b 100644
--- a/Documentation/admin-guide/abi-stable.rst
+++ b/Documentation/admin-guide/abi-stable.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
ABI stable symbols
==================
@@ -10,5 +12,5 @@ for at least 2 years.
Most interfaces (like syscalls) are expected to never change and always
be available.
-.. kernel-abi:: ABI/stable
- :rst:
+.. kernel-abi:: stable
+ :no-files:
diff --git a/Documentation/admin-guide/abi-testing-files.rst b/Documentation/admin-guide/abi-testing-files.rst
new file mode 100644
index 000000000000..1da868e42fdb
--- /dev/null
+++ b/Documentation/admin-guide/abi-testing-files.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Testing ABI Files
+=================
+
+.. kernel-abi:: testing
+ :no-symbols:
diff --git a/Documentation/admin-guide/abi-testing.rst b/Documentation/admin-guide/abi-testing.rst
index 19767926b344..6153ebd38e2d 100644
--- a/Documentation/admin-guide/abi-testing.rst
+++ b/Documentation/admin-guide/abi-testing.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
ABI testing symbols
===================
@@ -16,5 +18,5 @@ Programs that use these interfaces are strongly encouraged to add their
name to the description of these interfaces, so that the kernel
developers can easily notify them if any changes occur.
-.. kernel-abi:: ABI/testing
- :rst:
+.. kernel-abi:: testing
+ :no-files:
diff --git a/Documentation/admin-guide/abi.rst b/Documentation/admin-guide/abi.rst
index bcab3ef2597c..c6039359e585 100644
--- a/Documentation/admin-guide/abi.rst
+++ b/Documentation/admin-guide/abi.rst
@@ -1,7 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
=====================
Linux ABI description
=====================
+.. kernel-abi:: README
+
+ABI symbols
+-----------
+
.. toctree::
:maxdepth: 2
@@ -9,3 +16,14 @@ Linux ABI description
abi-testing
abi-obsolete
abi-removed
+
+ABI files
+---------
+
+.. toctree::
+ :maxdepth: 2
+
+ abi-stable-files
+ abi-testing-files
+ abi-obsolete-files
+ abi-removed-files
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 1576fb93f06c..9bdb30901a93 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -54,7 +54,7 @@ The list of possible return codes:
If you use 'echo', the returned value is set by the 'echo' utility,
and, in general case, something like::
- echo 3 > /sys/block/zram0/max_comp_streams
+ echo foo > /sys/block/zram0/comp_algorithm
if [ $? -ne 0 ]; then
handle_error
fi
@@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3}
num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1.
-2) Set max number of compression streams
-========================================
-
-Regardless of the value passed to this attribute, ZRAM will always
-allocate multiple compression streams - one per online CPU - thus
-allowing several concurrent compression operations. The number of
-allocated compression streams goes down when some of the CPUs
-become offline. There is no single-compression-stream mode anymore,
-unless you are running a UP system or have only 1 CPU online.
-
-To find out how many streams are currently available::
-
- cat /sys/block/zram0/max_comp_streams
-
-3) Select compression algorithm
+2) Select compression algorithm
===============================
Using comp_algorithm device attribute one can see available and
@@ -107,7 +93,7 @@ Examples::
For the time being, the `comp_algorithm` content shows only compression
algorithms that are supported by zram.
-4) Set compression algorithm parameters: Optional
+3) Set compression algorithm parameters: Optional
=================================================
Compression algorithms may support specific parameters which can be
@@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some
algorithms), for other algorithms `level` is acceleration level (the higher
the value the lower the compression ratio).
-5) Set Disksize
+4) Set Disksize
===============
Set disk size by writing the value to sysfs node 'disksize'.
@@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
-6) Set memory limit: Optional
+5) Set memory limit: Optional
=============================
Set memory limit by writing the value to sysfs node 'mem_limit'.
@@ -177,7 +163,7 @@ Examples::
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
-7) Activate
+6) Activate
===========
::
@@ -188,7 +174,7 @@ Examples::
mkfs.ext4 /dev/zram1
mount /dev/zram1 /tmp
-8) Add/remove zram devices
+7) Add/remove zram devices
==========================
zram provides a control interface, which enables dynamic (on-demand) device
@@ -208,7 +194,7 @@ execute::
echo X > /sys/class/zram-control/hot_remove
-9) Stats
+8) Stats
========
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
@@ -228,8 +214,6 @@ mem_limit WO specifies the maximum amount of memory ZRAM can
writeback_limit WO specifies the maximum amount of write IO zram
can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature
-max_comp_streams RW the number of possible concurrent compress
- operations
comp_algorithm RW show and change the compression algorithm
algorithm_params WO setup compression algorithm parameters
compact WO trigger memory compaction
@@ -310,7 +294,7 @@ a single line of text and contains the following stats separated by whitespace:
Unit: 4K bytes
============== =============================================================
-10) Deactivate
+9) Deactivate
==============
::
@@ -318,7 +302,7 @@ a single line of text and contains the following stats separated by whitespace:
swapoff /dev/zram0
umount /dev/zram1
-11) Reset
+10) Reset
=========
Write any positive value to 'reset' sysfs node::
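The zram hunk above replaces the ``max_comp_streams`` write with a ``comp_algorithm`` write in the error-handling example. A rough equivalent outside the shell might look like the sketch below; it is an illustration under assumptions, not part of the patch. The helper name ``write_attr`` is hypothetical, and the function only reports the errno from the failed open or write, without interpreting zram-specific return codes.

```python
import os


def write_attr(path, value):
    """Write value to a sysfs-style attribute file.

    Returns 0 on success, or the errno reported by the failing open/write.
    """
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as err:
        return err.errno
    try:
        os.write(fd, value.encode())
    except OSError as err:
        return err.errno
    finally:
        os.close(fd)
    return 0


# Hypothetical usage on a system with a zram0 device:
#   rc = write_attr("/sys/block/zram0/comp_algorithm", "foo")
#   if rc != 0:
#       handle_error(rc)   # handle_error is a placeholder, as in the shell example
```

Returning the errno instead of a boolean keeps the caller able to distinguish, for example, a missing device from a rejected value.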
diff --git a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
index 582d3427de3f..a964aff373b1 100644
--- a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
+++ b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
@@ -125,3 +125,7 @@ to unfreeze all tasks in the container::
This is the basic mechanism which should do the right thing for user space task
in a simple scenario.
+
+This freezer implementation is affected by shortcomings (see commit
+76f969e8948d8 ("cgroup: cgroup v2 freezer")) and cgroup v2 freezer is
+recommended.
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 286d16fc22eb..d6b1db8cc7eb 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -90,6 +90,7 @@ Brief summary of control files.
used.
memory.swappiness set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
+ Per memcg knob does not exist in cgroup v2.
memory.move_charge_at_immigrate This knob is deprecated.
memory.oom_control set/show oom controls.
This knob is deprecated and shouldn't be
@@ -609,6 +610,10 @@ memory.stat file includes following statistics:
'rss + mapped_file" will give you resident set size of cgroup.
+ Note that some kernel configurations might account complete larger
+ allocations (e.g., THP) towards 'rss' and 'mapped_file', even if
+ only some, but not all, of that memory is mapped.
+
(Note: file and shmem may be shared among other cgroups. In that case,
mapped_file is accounted only when the memory cgroup is owner of page
cache.)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..1a16ce68a4d7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1076,15 +1076,20 @@ cpufreq governor about the minimum desired frequency which should always be
provided by a CPU, as well as the maximum desired frequency, which should not
be exceeded by a CPU.
-WARNING: cgroup2 doesn't yet support control of realtime processes. For
-a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group
-scheduling of realtime processes, the cpu controller can only be enabled
-when all RT processes are in the root cgroup. This limitation does
-not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system
-management software may already have placed RT processes into nonroot
-cgroups during the system boot process, and these processes may need
-to be moved to the root cgroup before the cpu controller can be enabled
-with a CONFIG_RT_GROUP_SCHED enabled kernel.
+WARNING: cgroup2 cpu controller doesn't yet fully support the control of
+realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option
+enabled for group scheduling of realtime processes, the cpu controller can only
+be enabled when all RT processes are in the root cgroup. Be aware that system
+management software may already have placed RT processes into non-root cgroups
+during the system boot process, and these processes may need to be moved to the
+root cgroup before the cpu controller can be enabled with a
+CONFIG_RT_GROUP_SCHED enabled kernel.
+
+With CONFIG_RT_GROUP_SCHED disabled, this limitation does not apply and some of
+the interface files either affect realtime processes or account for them. See
+the following section for details. Only the cpu controller is affected by
+CONFIG_RT_GROUP_SCHED. Other controllers can be used for the resource control of
+realtime processes irrespective of CONFIG_RT_GROUP_SCHED.
CPU Interface Files
@@ -1440,7 +1445,10 @@ The following nested keys are defined.
anon
Amount of memory used in anonymous mappings such as
- brk(), sbrk(), and mmap(MAP_ANONYMOUS)
+ brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that
+ some kernel configurations might account complete larger
+ allocations (e.g., THP) if only some, but not all the
+ memory of such an allocation is mapped anymore.
file
Amount of memory used to cache filesystem data,
@@ -1483,7 +1491,10 @@ The following nested keys are defined.
Amount of application memory swapped out to zswap.
file_mapped
- Amount of cached filesystem data mapped with mmap()
+ Amount of cached filesystem data mapped with mmap(). Note
+ that some kernel configurations might account complete
+ larger allocations (e.g., THP) if only some, but not
+ all the memory of such an allocation is mapped.
file_dirty
Amount of cached filesystem data that was modified but
@@ -1555,6 +1566,12 @@ The following nested keys are defined.
workingset_nodereclaim
Number of times a shadow node has been reclaimed
+ pswpin (npn)
+ Number of pages swapped into memory
+
+ pswpout (npn)
+ Number of pages swapped out of memory
+
pgscan (npn)
Amount of scanned pages (in an inactive LRU list)
@@ -1570,6 +1587,9 @@ The following nested keys are defined.
pgscan_khugepaged (npn)
Amount of scanned pages by khugepaged (in an inactive LRU list)
+ pgscan_proactive (npn)
+ Amount of pages scanned proactively (in an inactive LRU list)
+
pgsteal_kswapd (npn)
Amount of reclaimed pages by kswapd
@@ -1579,6 +1599,9 @@ The following nested keys are defined.
pgsteal_khugepaged (npn)
Amount of reclaimed pages by khugepaged
+ pgsteal_proactive (npn)
+ Amount of pages reclaimed proactively
+
pgfault (npn)
Total number of page faults incurred
@@ -1656,6 +1679,9 @@ The following nested keys are defined.
pgdemote_khugepaged
Number of pages demoted by khugepaged.
+ pgdemote_proactive
+ Number of pages demoted proactively.
+
hugetlb
Amount of memory used by hugetlb pages. This metric only shows
up if hugetlb usage is accounted for in memory.current (i.e.
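The counters added to the cgroup v2 ``memory.stat`` description (``pswpin``, ``pswpout``, ``pgscan_proactive``, ``pgsteal_proactive``, ``pgdemote_proactive``) are plain ``key value`` lines, so they can be picked out with a few lines of code. This is a minimal sketch, assuming the flat-keyed format; the function names and the sample numbers are made up for illustration and do not come from the patch.

```python
# Keys introduced or documented by the hunk above.
NEW_KEYS = ("pswpin", "pswpout", "pgscan_proactive",
            "pgsteal_proactive", "pgdemote_proactive")


def parse_memory_stat(text):
    """Parse flat-keyed 'name value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def select_new_counters(stats):
    """Return only the newly documented counters that are present."""
    return {key: stats[key] for key in NEW_KEYS if key in stats}


# Illustrative content only; real values come from <cgroup>/memory.stat.
SAMPLE = """anon 1048576
file 4096
pswpin 12
pswpout 34
pgscan_proactive 5
pgsteal_proactive 3
"""

counters = select_new_counters(parse_memory_stat(SAMPLE))
```

Missing keys are skipped rather than treated as zero, since older kernels may not export all of these counters.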
diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst
index 9f8139ff97d6..4467f6d4b632 100644
--- a/Documentation/admin-guide/device-mapper/dm-crypt.rst
+++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst
@@ -146,6 +146,11 @@ integrity:<bytes>:<type>
integrity for the encrypted device. The additional space is then
used for storing authentication tag (and persistent IV if needed).
+integrity_key_size:<bytes>
+ Optionally set the integrity key size if it differs from the digest size.
+ It allows the use of wrapped key algorithms where the key size is
+ independent of the cryptographic key size.
+
sector_size:<bytes>
Use <bytes> as the encryption unit instead of 512 bytes sectors.
This option can be in range 512 - 4096 bytes and must be power of two.
diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst
index d8a5f14d0e3c..c2e18ecc065c 100644
--- a/Documentation/admin-guide/device-mapper/dm-integrity.rst
+++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst
@@ -92,6 +92,11 @@ Target arguments:
allowed. This mode is useful for data recovery if the
device cannot be activated in any of the other standard
modes.
+ I - inline mode - in this mode, dm-integrity will store integrity
+ data directly in the underlying device sectors.
+ The underlying device must have an integrity profile that
+ allows storing user integrity data and provides enough
+ space for the selected integrity tag.
5. the number of additional arguments
diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
index a65c1602cb23..8c3f1f967a3c 100644
--- a/Documentation/admin-guide/device-mapper/verity.rst
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -87,6 +87,15 @@ panic_on_corruption
Panic the device when a corrupted block is discovered. This option is
not compatible with ignore_corruption and restart_on_corruption.
+restart_on_error
+ Restart the system when an I/O error is detected.
+ This option can be combined with the restart_on_corruption option.
+
+panic_on_error
+ Panic the device when an I/O error is detected. This option is
+ not compatible with the restart_on_error option but can be combined
+ with the panic_on_corruption option.
+
ignore_zero_blocks
Do not verify blocks that are expected to contain zeroes and always return
zeroes instead. This may be useful if the partition contains unused blocks
@@ -142,8 +151,15 @@ root_hash_sig_key_desc <key_description>
already in the secondary trusted keyring.
try_verify_in_tasklet
- If verity hashes are in cache, verify data blocks in kernel tasklet instead
- of workqueue. This option can reduce IO latency.
+ If verity hashes are in cache and the IO size does not exceed the limit,
+ verify data blocks in bottom half instead of workqueue. This option can
+ reduce IO latency. The size limits can be configured via
+ /sys/module/dm_verity/parameters/use_bh_bytes. The four parameters
+ correspond to limits for IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT,
+ IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn.
+ For example:
+ <none>,<rt>,<be>,<idle>
+ 4096,4096,4096,4096
Theory of operation
===================
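The ``use_bh_bytes`` parameter described in the verity hunk above holds four comma-separated limits. The sketch below shows one way to label them; it assumes the order stated in the text (none, rt, be, idle) and that the file contains exactly four integers. The function name ``parse_use_bh_bytes`` is hypothetical.

```python
# Order as given in the documentation text: none, rt, be, idle.
IOPRIO_CLASSES = (
    "IOPRIO_CLASS_NONE",
    "IOPRIO_CLASS_RT",
    "IOPRIO_CLASS_BE",
    "IOPRIO_CLASS_IDLE",
)


def parse_use_bh_bytes(text):
    """Map the four comma-separated size limits to their I/O priority classes."""
    values = [int(item) for item in text.strip().split(",")]
    if len(values) != len(IOPRIO_CLASSES):
        raise ValueError(
            "expected %d limits, got %d" % (len(IOPRIO_CLASSES), len(values))
        )
    return dict(zip(IOPRIO_CLASSES, values))


# Example value taken from the documentation text.
limits = parse_use_bh_bytes("4096,4096,4096,4096\n")
```

On a running system the input would typically be read from /sys/module/dm_verity/parameters/use_bh_bytes.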
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index 2418b0c2d3df..b857eb6ca1b6 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -238,11 +238,10 @@ When mounting an ext4 filesystem, the following option are accepted:
configured using tune2fs)
data_err=ignore(*)
- Just print an error message if an error occurs in a file data buffer in
- ordered mode.
+ Just print an error message if an error occurs in a file data buffer.
+
data_err=abort
- Abort the journal if an error occurs in a file data buffer in ordered
- mode.
+ Abort the journal if an error occurs in a file data buffer.
grpid | bsdgroups
New objects have the group ID of their parent.
diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst
index 1cc5567a4bbe..35d49ccd49e0 100644
--- a/Documentation/admin-guide/gpio/gpio-sim.rst
+++ b/Documentation/admin-guide/gpio/gpio-sim.rst
@@ -71,7 +71,7 @@ specific lines. The name of those subdirectories must take the form of:
``'line<offset>'`` (e.g. ``'line0'``, ``'line20'``, etc.) as the name will be
used by the module to assign the config to the specific line at given offset.
-Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in
+Once the configuration is complete, the ``'live'`` attribute must be set to 1 in
order to instantiate the chip. It can be set back to 0 to destroy the simulated
chip. The module will synchronously wait for the new simulated device to be
successfully probed and if this doesn't happen, writing to ``'live'`` will
diff --git a/Documentation/admin-guide/gpio/gpio-virtuser.rst b/Documentation/admin-guide/gpio/gpio-virtuser.rst
index 2aca70db9f3b..7e7c0df51640 100644
--- a/Documentation/admin-guide/gpio/gpio-virtuser.rst
+++ b/Documentation/admin-guide/gpio/gpio-virtuser.rst
@@ -92,7 +92,7 @@ struct. The first two take string values as arguments:
Activating GPIO consumers
-------------------------
-Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in
+Once the configuration is complete, the ``'live'`` attribute must be set to 1 in
order to instantiate the consumer. It can be set back to 0 to destroy the
virtual device. The module will synchronously wait for the new simulated device
to be successfully probed and if this doesn't happen, writing to ``'live'`` will
diff --git a/Documentation/admin-guide/highuid.rst b/Documentation/admin-guide/highuid.rst
deleted file mode 100644
index 6ee70465c0ea..000000000000
--- a/Documentation/admin-guide/highuid.rst
+++ /dev/null
@@ -1,80 +0,0 @@
-===================================================
-Notes on the change from 16-bit UIDs to 32-bit UIDs
-===================================================
-
-:Author: Chris Wing <wingc@umich.edu>
-:Last updated: January 11, 2000
-
-- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t
- when communicating between user and kernel space in an ioctl or data
- structure.
-
-- kernel code should use uid_t and gid_t in kernel-private structures and
- code.
-
-What's left to be done for 32-bit UIDs on all Linux architectures:
-
-- Disk quotas have an interesting limitation that is not related to the
- maximum UID/GID. They are limited by the maximum file size on the
- underlying filesystem, because quota records are written at offsets
- corresponding to the UID in question.
- Further investigation is needed to see if the quota system can cope
- properly with huge UIDs. If it can deal with 64-bit file offsets on all
- architectures, this should not be a problem.
-
-- Decide whether or not to keep backwards compatibility with the system
- accounting file, or if we should break it as the comments suggest
- (currently, the old 16-bit UID and GID are still written to disk, and
- part of the former pad space is used to store separate 32-bit UID and
- GID)
-
-- Need to validate that OS emulation calls the 16-bit UID
- compatibility syscalls, if the OS being emulated used 16-bit UIDs, or
- uses the 32-bit UID system calls properly otherwise.
-
- This affects at least:
-
- - iBCS on Intel
-
- - sparc32 emulation on sparc64
- (need to support whatever new 32-bit UID system calls are added to
- sparc32)
-
-- Validate that all filesystems behave properly.
-
- At present, 32-bit UIDs _should_ work for:
-
- - ext2
- - ufs
- - isofs
- - nfs
- - coda
- - udf
-
- Ioctl() fixups have been made for:
-
- - ncpfs
- - smbfs
-
- Filesystems with simple fixups to prevent 16-bit UID wraparound:
-
- - minix
- - sysv
- - qnx4
-
- Other filesystems have not been checked yet.
-
-- The ncpfs and smpfs filesystems cannot presently use 32-bit UIDs in
- all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but
- more are needed. (as well as new user<->kernel data structures)
-
-- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k,
- sh, and sparc32. Fixing this is probably not that important, but would
- require adding a new ELF section.
-
-- The ioctl()s used to control the in-kernel NFS server only support
- 16-bit UIDs on arm, i386, m68k, sh, and sparc32.
-
-- make sure that the UID mapping feature of AX25 networking works properly
- (it should be safe because it's always used a 32-bit integer to
- communicate between user and kernel)
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ff0b440ef2dc..451874b8135d 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -22,3 +22,4 @@ are configurable at compile, boot or run time.
srso
gather_data_sampling
reg-file-data-sampling
+ rsb
diff --git a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
index 0585d02b9a6c..ad15417d39f9 100644
--- a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
+++ b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
@@ -29,14 +29,6 @@ Below is the list of affected Intel processors [#f1]_:
RAPTORLAKE_S 06_BFH
=================== ============
-As an exception to this table, Intel Xeon E family parts ALDERLAKE(06_97H) and
-RAPTORLAKE(06_B7H) codenamed Catlow are not affected. They are reported as
-vulnerable in Linux because they share the same family/model with an affected
-part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or
-CPUID.HYBRID. This information could be used to distinguish between the
-affected and unaffected parts, but it is deemed not worth adding complexity as
-the reporting is fixed automatically when these parts enumerate RFDS_NO.
-
Mitigation
==========
Intel released a microcode update that enables software to clear sensitive
diff --git a/Documentation/admin-guide/hw-vuln/rsb.rst b/Documentation/admin-guide/hw-vuln/rsb.rst
new file mode 100644
index 000000000000..21dbf9cf25f8
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/rsb.rst
@@ -0,0 +1,268 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+RSB-related mitigations
+=======================
+
+.. warning::
+ Please keep this document up-to-date, otherwise you will be
+ volunteered to update it and convert it to a very long comment in
+ bugs.c!
+
+Since 2018 there have been many Spectre CVEs related to the Return Stack
+Buffer (RSB) (sometimes referred to as the Return Address Stack (RAS) or
+Return Address Predictor (RAP) on AMD).
+
+Information about these CVEs and how to mitigate them is scattered
+amongst a myriad of microarchitecture-specific documents.
+
+This document attempts to consolidate all the relevant information in
+one place and clarify the reasoning behind the current RSB-related
+mitigations. It's meant to be as concise as possible, focused only on
+the current kernel mitigations: what are the RSB-related attack vectors
+and how are they currently being mitigated?
+
+It's *not* meant to describe how the RSB mechanism operates or how the
+exploits work. More details about those can be found in the references
+below.
+
+Rather, this is basically a glorified comment, but too long to actually
+be one. So when the next CVE comes along, a kernel developer can
+quickly refer to this as a refresher to see what we're actually doing
+and why.
+
+At a high level, there are two classes of RSB attacks: RSB poisoning
+(Intel and AMD) and RSB underflow (Intel only). They must each be
+considered individually for each attack vector (and microarchitecture
+where applicable).
+
+----
+
+RSB poisoning (Intel and AMD)
+=============================
+
+SpectreRSB
+~~~~~~~~~~
+
+RSB poisoning is a technique used by SpectreRSB [#spectre-rsb]_ where
+an attacker poisons an RSB entry to cause a victim's return instruction
+to speculate to an attacker-controlled address. This can happen when
+there are unbalanced CALLs/RETs after a context switch or VMEXIT.
+
+* All attack vectors can potentially be mitigated by flushing out any
+ poisoned RSB entries using an RSB filling sequence
+ [#intel-rsb-filling]_ [#amd-rsb-filling]_ when transitioning between
+ untrusted and trusted domains. But this has a performance impact and
+ should be avoided whenever possible.
+
+ .. DANGER::
+ **FIXME**: Currently we're flushing 32 entries. However, some CPU
+ models have more than 32 entries. The loop count needs to be
+ increased for those. More detailed information is needed about RSB
+ sizes.
+
+* On context switch, the user->user mitigation requires ensuring the
+ RSB gets filled or cleared whenever IBPB gets written [#cond-ibpb]_
+ during a context switch:
+
+ * AMD:
+ On Zen 4+, IBPB (or SBPB [#amd-sbpb]_ if used) clears the RSB.
+ This is indicated by IBPB_RET in CPUID [#amd-ibpb-rsb]_.
+
+ On Zen < 4, the RSB filling sequence [#amd-rsb-filling]_ must
+ always be done in addition to IBPB [#amd-ibpb-no-rsb]_. This is
+ indicated by X86_BUG_IBPB_NO_RET.
+
+ * Intel:
+ IBPB always clears the RSB:
+
+ "Software that executed before the IBPB command cannot control
+ the predicted targets of indirect branches executed after the
+ command on the same logical processor. The term indirect branch
+ in this context includes near return instructions, so these
+ predicted targets may come from the RSB." [#intel-ibpb-rsb]_
+
+* On context switch, user->kernel attacks are prevented by SMEP. User
+ space can only insert user space addresses into the RSB. Even
+ non-canonical addresses can't be inserted due to the page gap at the
+ end of the user canonical address space reserved by TASK_SIZE_MAX.
+ A SMEP #PF at instruction fetch prevents the kernel from speculatively
+ executing user space.
+
+ * AMD:
+ "Finally, branches that are predicted as 'ret' instructions get
+ their predicted targets from the Return Address Predictor (RAP).
+ AMD recommends software use a RAP stuffing sequence (mitigation
+ V2-3 in [2]) and/or Supervisor Mode Execution Protection (SMEP)
+ to ensure that the addresses in the RAP are safe for
+ speculation. Collectively, we refer to these mitigations as "RAP
+ Protection"." [#amd-smep-rsb]_
+
+ * Intel:
+ "On processors with enhanced IBRS, an RSB overwrite sequence may
+ not suffice to prevent the predicted target of a near return
+ from using an RSB entry created in a less privileged predictor
+ mode. Software can prevent this by enabling SMEP (for
+ transitions from user mode to supervisor mode) and by having
+ IA32_SPEC_CTRL.IBRS set during VM exits." [#intel-smep-rsb]_
+
+* On VMEXIT, guest->host attacks are mitigated by eIBRS (and PBRSB
+ mitigation if needed):
+
+ * AMD:
+ "When Automatic IBRS is enabled, the internal return address
+ stack used for return address predictions is cleared on VMEXIT."
+ [#amd-eibrs-vmexit]_
+
+ * Intel:
+ "On processors with enhanced IBRS, an RSB overwrite sequence may
+ not suffice to prevent the predicted target of a near return
+ from using an RSB entry created in a less privileged predictor
+ mode. Software can prevent this by enabling SMEP (for
+ transitions from user mode to supervisor mode) and by having
+ IA32_SPEC_CTRL.IBRS set during VM exits. Processors with
+ enhanced IBRS still support the usage model where IBRS is set
+ only in the OS/VMM for OSes that enable SMEP. To do this, such
+ processors will ensure that guest behavior cannot control the
+ RSB after a VM exit once IBRS is set, even if IBRS was not set
+ at the time of the VM exit." [#intel-eibrs-vmexit]_
+
+ Note that some Intel CPUs are susceptible to Post-barrier Return
+ Stack Buffer Predictions (PBRSB) [#intel-pbrsb]_, where the last
+ CALL from the guest can be used to predict the first unbalanced RET.
+ In this case the PBRSB mitigation is needed in addition to eIBRS.
+
+AMD RETBleed / SRSO / Branch Type Confusion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On AMD, poisoned RSB entries can also be created by the AMD RETBleed
+variant [#retbleed-paper]_ [#amd-btc]_ or by Speculative Return Stack
+Overflow [#amd-srso]_ (Inception [#inception-paper]_). The kernel
+protects itself by replacing every RET in the kernel with a branch to a
+single safe RET.
+
+----
+
+RSB underflow (Intel only)
+==========================
+
+RSB Alternate (RSBA) ("Intel Retbleed")
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some Intel Skylake-generation CPUs are susceptible to the Intel variant
+of RETBleed [#retbleed-paper]_ (Return Stack Buffer Underflow
+[#intel-rsbu]_). If a RET is executed when the RSB buffer is empty due
+to mismatched CALLs/RETs or returning from a deep call stack, the branch
+predictor can fall back to using the Branch Target Buffer (BTB). If a
+user forces a BTB collision then the RET can speculatively branch to a
+user-controlled address.
+
+* Note that RSB filling doesn't fully mitigate this issue. If there
+ are enough unbalanced RETs, the RSB may still underflow and fall back
+ to using a poisoned BTB entry.
+
+* On context switch, user->user underflow attacks are mitigated by the
+ conditional IBPB [#cond-ibpb]_ on context switch which effectively
+ clears the BTB:
+
+ * "The indirect branch predictor barrier (IBPB) is an indirect branch
+ control mechanism that establishes a barrier, preventing software
+ that executed before the barrier from controlling the predicted
+ targets of indirect branches executed after the barrier on the same
+ logical processor." [#intel-ibpb-btb]_
+
+* On context switch and VMEXIT, user->kernel and guest->host RSB
+ underflows are mitigated by IBRS or eIBRS:
+
+ * "Enabling IBRS (including enhanced IBRS) will mitigate the "RSBU"
+ attack demonstrated by the researchers. As previously documented,
+ Intel recommends the use of enhanced IBRS, where supported. This
+ includes any processor that enumerates RRSBA but not RRSBA_DIS_S."
+ [#intel-rsbu]_
+
+ However, note that eIBRS and IBRS do not mitigate intra-mode attacks.
+ Like RRSBA below, this is mitigated by clearing the BHB on kernel
+ entry.
+
+ As an alternative to classic IBRS, call depth tracking (combined with
+ retpolines) can be used to track kernel returns and fill the RSB when
+ it gets close to being empty.
+
+Restricted RSB Alternate (RRSBA)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some newer Intel CPUs have Restricted RSB Alternate (RRSBA) behavior,
+which, similar to RSBA described above, also falls back to using the BTB
+on RSB underflow. The only difference is that the predicted targets are
+restricted to the current domain when eIBRS is enabled:
+
+* "Restricted RSB Alternate (RRSBA) behavior allows alternate branch
+ predictors to be used by near RET instructions when the RSB is
+ empty. When eIBRS is enabled, the predicted targets of these
+ alternate predictors are restricted to those belonging to the
+ indirect branch predictor entries of the current prediction domain."
+ [#intel-eibrs-rrsba]_
+
+When a CPU with RRSBA is vulnerable to Branch History Injection
+[#bhi-paper]_ [#intel-bhi]_, an RSB underflow could be used for an
+intra-mode BTI attack. This is mitigated by clearing the BHB on
+kernel entry.
+
+However, if the kernel uses retpolines instead of eIBRS, it needs to
+disable RRSBA:
+
+* "Where software is using retpoline as a mitigation for BHI or
+ intra-mode BTI, and the processor both enumerates RRSBA and
+ enumerates RRSBA_DIS controls, it should disable this behavior."
+ [#intel-retpoline-rrsba]_
+
+----
+
+References
+==========
+
+.. [#spectre-rsb] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://arxiv.org/pdf/1807.07940.pdf>`_
+
+.. [#intel-rsb-filling] "Empty RSB Mitigation on Skylake-generation" in `Retpoline: A Branch Target Injection Mitigation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html#inpage-nav-5-1>`_
+
+.. [#amd-rsb-filling] "Mitigation V2-3" in `Software Techniques for Managing Speculation <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf>`_
+
+.. [#cond-ibpb] Whether IBPB is written depends on whether the prev and/or next task is protected from Spectre attacks. It typically requires opting in per task or system-wide. For more details see the documentation for the ``spectre_v2_user`` cmdline option in Documentation/admin-guide/kernel-parameters.txt.
+
+.. [#amd-sbpb] IBPB without flushing of branch type predictions. Only exists for AMD.
+
+.. [#amd-ibpb-rsb] "Function 8000_0008h -- Processor Capacity Parameters and Extended Feature Identification" in `AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf>`_. SBPB behaves the same way according to `this email <https://lore.kernel.org/5175b163a3736ca5fd01cedf406735636c99a>`_.
+
+.. [#amd-ibpb-no-rsb] `Breaking the Barrier: Post-Barrier Spectre Attacks <https://comsec.ethz.ch/wp-content/files/ibpb_sp25.pdf>`_
+
+.. [#intel-ibpb-rsb] "Introduction" in `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_
+
+.. [#amd-smep-rsb] "Existing Mitigations" in `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_
+
+.. [#intel-smep-rsb] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_
+
+.. [#amd-eibrs-vmexit] "Extended Feature Enable Register (EFER)" in `AMD64 Architecture Programmer's Manual Volume 2: System Programming <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_
+
+.. [#intel-eibrs-vmexit] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_
+
+.. [#intel-pbrsb] `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_
+
+.. [#retbleed-paper] `RETBleed: Arbitrary Speculative Code Execution with Return Instruction <https://comsec.ethz.ch/wp-content/files/retbleed_sec22.pdf>`_
+
+.. [#amd-btc] `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_
+
+.. [#amd-srso] `Technical Update Regarding Speculative Return Stack Overflow <https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf>`_
+
+.. [#inception-paper] `Inception: Exposing New Attack Surfaces with Training in Transient Execution <https://comsec.ethz.ch/wp-content/files/inception_sec23.pdf>`_
+
+.. [#intel-rsbu] `Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_
+
+.. [#intel-ibpb-btb] `Indirect Branch Predictor Barrier <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html>`_
+
+.. [#intel-eibrs-rrsba] "Guidance for RSBU" in `Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_
+
+.. [#bhi-paper] `Branch History Injection: On the Effectiveness of Hardware Mitigations Against Cross-Privilege Spectre-v2 Attacks <http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf>`_
+
+.. [#intel-bhi] `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_
+
+.. [#intel-retpoline-rrsba] "Retpoline" in `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_
diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst
index 2ad1c05b8c88..66af95251a3d 100644
--- a/Documentation/admin-guide/hw-vuln/srso.rst
+++ b/Documentation/admin-guide/hw-vuln/srso.rst
@@ -104,7 +104,20 @@ The possible values in this file are:
(spec_rstack_overflow=ibpb-vmexit)
+ * 'Mitigation: Reduced Speculation':
+ This mitigation gets automatically enabled when the above one "IBPB on
+ VMEXIT" has been selected and the CPU supports the BpSpecReduce bit.
+
+ It gets automatically enabled on machines which have the
+ SRSO_USER_KERNEL_NO=1 CPUID bit. In that case, the code logic is to switch
+ to the above =ibpb-vmexit mitigation because the user/kernel boundary is
+ not affected anymore and thus "safe RET" is not needed.
+
+ After enabling the IBPB on VMEXIT mitigation option, the BpSpecReduce bit
+ is detected (functionality present on all such machines) and that
+ practically overrides IBPB on VMEXIT as it has a lot less performance
+ impact and takes care of the guest->host attack vector too.
In order to exploit vulnerability, an attacker needs to:
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index c8af32a8f800..259d79fbeb94 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -187,7 +187,6 @@ A few hard-to-categorize and generally obsolete documents.
.. toctree::
:maxdepth: 1
- highuid
ldm
unicode
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
index 609a3201fd4e..9453196ade51 100644
--- a/Documentation/admin-guide/iostats.rst
+++ b/Documentation/admin-guide/iostats.rst
@@ -2,62 +2,39 @@
I/O statistics fields
=====================
-Since 2.4.20 (and some versions before, with patches), and 2.5.45,
-more extensive disk statistics have been introduced to help measure disk
-activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
-the work for you, but in case you are interested in creating your own
-tools, the fields are explained here.
-
-In 2.4 now, the information is found as additional fields in
-``/proc/partitions``. In 2.6 and upper, the same information is found in two
-places: one is in the file ``/proc/diskstats``, and the other is within
-the sysfs file system, which must be mounted in order to obtain
-the information. Throughout this document we'll assume that sysfs
-is mounted on ``/sys``, although of course it may be mounted anywhere.
-Both ``/proc/diskstats`` and sysfs use the same source for the information
-and so should not differ.
-
-Here are examples of these different formats::
-
- 2.4:
- 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
- 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030
-
- 2.6+ sysfs:
- 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
- 35486 38030 38030 38030
-
- 2.6+ diskstats:
- 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
- 3 1 hda1 35486 38030 38030 38030
-
- 4.18+ diskstats:
- 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0
-
-On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
-a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.
-
-The advantage of one over the other is that the sysfs choice works well
-if you are watching a known, small set of disks. ``/proc/diskstats`` may
-be a better choice if you are watching a large number of disks because
-you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
-each snapshot of your disk statistics.
-
-In 2.4, the statistics fields are those after the device name. In
-the above example, the first field of statistics would be 446216.
-By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
-find just the 15 fields, beginning with 446216. If you look at
-``/proc/diskstats``, the 15 fields will be preceded by the major and
-minor device numbers, and device name. Each of these formats provides
-15 fields of statistics, each meaning exactly the same things.
-All fields except field 9 are cumulative since boot. Field 9 should
-go to zero as I/Os complete; all others only increase (unless they
-overflow and wrap). Wrapping might eventually occur on a very busy
-or long-lived system; so applications should be prepared to deal with
-it. Regarding wrapping, the types of the fields are either unsigned
-int (32 bit) or unsigned long (32-bit or 64-bit, depending on your
-machine) as noted per-field below. Unless your observations are very
-spread in time, these fields should not wrap twice before you notice it.
+The kernel exposes disk statistics via ``/proc/diskstats`` and
+``/sys/block/<device>/stat``. These stats are usually accessed via tools
+such as ``sar`` and ``iostat``.
+
+Here are examples using a disk with two partitions::
+
+ /proc/diskstats:
+ 259 0 nvme0n1 255999 814 12369153 47919 996852 81 36123024 425995 0 301795 580470 0 0 0 0 60602 106555
+ 259 1 nvme0n1p1 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0
+ 259 2 nvme0n1p2 255401 1 12343477 47799 996004 0 36014736 425784 0 344336 473584 0 0 0 0 0 0
+
+ /sys/block/nvme0n1/stat:
+ 255999 814 12369153 47919 996858 81 36123056 426009 0 301809 580491 0 0 0 0 60605 106562
+
+ /sys/block/nvme0n1/nvme0n1p1/stat:
+ 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0
+
+Both files contain the same 17 statistics. ``/sys/block/<device>/stat``
+contains the fields for ``<device>``. In ``/proc/diskstats`` the fields
+are prefixed with the major and minor device numbers and the device
+name. In the example above, the first stat value for ``nvme0n1`` is
+255999 in both files.
+
+The sysfs ``stat`` file is efficient for monitoring a small, known set
+of disks. If you're tracking a large number of devices,
+``/proc/diskstats`` is often the better choice since it avoids the
+overhead of opening and closing multiple files for each snapshot.
+
+All fields are cumulative, monotonic counters, except for field 9, which
+resets to zero as I/Os complete. The remaining fields reset at boot, on
+device reattachment or reinitialization, or when the underlying counter
+overflows. Applications reading these counters should detect and handle
+resets when comparing stat snapshots.
Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all up.
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 5376890adbeb..1f7f14c6e184 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -180,10 +180,6 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
1) On i386, enable high memory support under "Processor type and
features"::
- CONFIG_HIGHMEM64G=y
-
- or::
-
CONFIG_HIGHMEM4G
2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..d9fd26b95b34 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -416,10 +416,6 @@
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
- For X86-32, this can also be used to specify an APIC
- driver name.
- Format: apic=driver_name
- Examples: apic=bigsmp
apic_extnmi= [APIC,X86,EARLY] External NMI delivery setting
Format: { bsp (default) | all | none }
@@ -1411,7 +1407,8 @@
earlyprintk=serial[,0x...[,baudrate]]
earlyprintk=ttySn[,baudrate]
earlyprintk=dbgp[debugController#]
- earlyprintk=pciserial[,force],bus:device.function[,baudrate]
+ earlyprintk=mmio32,membase[,{nocfg|baudrate}]
+ earlyprintk=pciserial[,force],bus:device.function[,{nocfg|baudrate}]
earlyprintk=xdbc[xhciController#]
earlyprintk=bios
@@ -1419,6 +1416,9 @@
the normal console is initialized. It is not enabled by
default because it has some cosmetic problems.
+ Use "nocfg" to skip UART configuration and assume that
+ BIOS/firmware has configured the UART correctly.
+
Append ",keep" to not disable it when the real console
takes over.
@@ -1785,7 +1785,9 @@
allocation boundaries as a proactive defense
against bounds-checking flaws in the kernel's
copy_to_user()/copy_from_user() interface.
- on Perform hardened usercopy checks (default).
+ The default is determined by
+ CONFIG_HARDENED_USERCOPY_DEFAULT_ON.
+ on Perform hardened usercopy checks.
off Disable hardened usercopy checks.
hardlockup_all_cpu_backtrace=
@@ -1861,7 +1863,7 @@
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
- hugepages= [HW] Number of HugeTLB pages to allocate at boot.
+ hugepages= [HW,EARLY] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
@@ -1873,15 +1875,24 @@
<node>:<integer>[,<node>:<integer>]
hugepagesz=
- [HW] The size of the HugeTLB pages. This is used in
- conjunction with hugepages (above) to allocate huge
- pages of a specific size at boot. The pair
- hugepagesz=X hugepages=Y can be specified once for
- each supported huge page size. Huge page sizes are
- architecture dependent. See also
+ [HW,EARLY] The size of the HugeTLB pages. This is
+ used in conjunction with hugepages (above) to
+ allocate huge pages of a specific size at boot. The
+ pair hugepagesz=X hugepages=Y can be specified once
+ for each supported huge page size. Huge page sizes
+ are architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
+ hugepage_alloc_threads=
+ [HW] The number of threads that should be used to
+ allocate hugepages during boot. This option can be
+ used to improve system bootup time when allocating
+ a large number of huge pages.
+ The default value is 25% of the available hardware threads.
+
+ Note that this parameter only applies to non-gigantic huge pages.
+
hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation
of gigantic hugepages. Or using node format, the size
of a CMA area per node can be specified.
@@ -1892,6 +1903,13 @@
hugepages using the CMA allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.
+ hugetlb_cma_only=
+ [HW,CMA,EARLY] When allocating new HugeTLB pages, only
+ try to allocate from the CMA areas.
+
+ This option does nothing if hugetlb_cma= is not also
+ specified.
+
hugetlb_free_vmemmap=
[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
@@ -1933,6 +1951,12 @@
which allow the hypervisor to 'idle' the guest
on lock contention.
+ hw_protection= [HW]
+ Format: reboot | shutdown
+
+ Hardware protection action taken on critical events like
+ overtemperature or imminent voltage loss.
+
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
@@ -2316,6 +2340,9 @@
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
+ no_cas
+ Do not enable capacity-aware scheduling (CAS) on
+ hybrid systems
intremap= [X86-64,Intel-IOMMU,EARLY]
on enable Interrupt Remapping (default)
@@ -3116,6 +3143,8 @@
* max_sec_lba48: Set or clear transfer size limit to
65535 sectors.
+ * external: Mark port as external (hotplug-capable).
+
* [no]lpm: Enable or disable link power management.
* [no]setxfer: Indicate if transfer speed mode setting
@@ -4233,10 +4262,10 @@
nosmp [SMP,EARLY] Tells an SMP kernel to act as a UP kernel,
and disable the IO APIC. legacy for "maxcpus=0".
- nosmt [KNL,MIPS,PPC,S390,EARLY] Disable symmetric multithreading (SMT).
+ nosmt [KNL,MIPS,PPC,EARLY] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
- [KNL,X86,PPC] Disable symmetric multithreading (SMT).
+ [KNL,X86,PPC,S390] Disable symmetric multithreading (SMT).
nosmt=force: Force disable SMT, cannot be undone
via the sysfs control file.
@@ -5017,6 +5046,14 @@
Format: <bool>
default: 0 (auto_verbose is enabled)
+ printk.debug_non_panic_cpus=
+ Allows storing messages from non-panic CPUs into
+ the printk log buffer during panic(). They are
+ flushed to consoles by the panic-CPU on
+ a best-effort basis.
+ Format: <bool> (1/Y/y=enable, 0/N/n=disable)
+ Default: disabled
+
printk.devkmsg={on,off,ratelimit}
Control writing to /dev/kmsg.
on - unlimited logging to /dev/kmsg from userspace
@@ -5758,6 +5795,11 @@
rcutorture.test_boost_duration= [KNL]
Duration (s) of each individual boost test.
+ rcutorture.test_boost_holdoff= [KNL]
+ Holdoff time (s) from start of test to the start
+ of RCU priority-boost testing. Defaults to zero,
+ that is, no holdoff.
+
rcutorture.test_boost_interval= [KNL]
Interval (s) between each boost test.
@@ -6082,7 +6124,7 @@
is assumed to be I/O ports; otherwise it is memory.
reserve_mem= [RAM]
- Format: nn[KNG]:<align>:<label>
+ Format: nn[KMG]:<align>:<label>
Reserve physical memory and label it with a name that
other subsystems can use to access it. This is typically
used for systems that do not wipe the RAM, and this command
@@ -6582,6 +6624,8 @@
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
+ Selecting a specific mitigation does not force-enable
+ the user mitigations.
Selecting 'off' will disable both the kernel and
the user space protections.
@@ -7241,6 +7285,8 @@
This is just one of many ways that can clear memory. Make sure your system
keeps the content of memory across reboots before relying on this option.
+ NB: Both the mapped address and size must be page aligned for the architecture.
+
See also Documentation/trace/debugging.rst
@@ -7279,6 +7325,15 @@
See also "Event triggers" in Documentation/trace/events.rst
+ traceoff_after_boot
+ [FTRACE] Sometimes tracing is used to debug issues
+ during the boot process. Since the trace buffer has a
+ limited amount of storage, it may be prudent to
+ disable tracing after the boot is finished, otherwise
+ the critical information may be overwritten. With this
+ option, the main tracing buffer will be turned off at
+ the end of the boot process.
+
traceoff_on_warning
[FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can
@@ -7477,6 +7532,22 @@
Note that genuine overcurrent events won't be
reported either.
+ unaligned_scalar_speed=
+ [RISCV]
+ Format: {slow | fast | unsupported}
+ Allow skipping scalar unaligned access speed tests. This
+ is useful for testing alternative code paths and to skip
+ the tests in environments where they run too slowly. All
+ CPUs must have the same scalar unaligned access speed.
+
+ unaligned_vector_speed=
+ [RISCV]
+ Format: {slow | fast | unsupported}
+ Allow skipping vector unaligned access speed tests. This
+ is useful for testing alternative code paths and to skip
+ the tests in environments where they run too slowly. All
+ CPUs must have the same vector unaligned access speed.
+
unknown_nmi_panic
[X86] Cause panic on unknown NMI.
@@ -7672,13 +7743,6 @@
16 - SIGBUS faults
Example: user_debug=31
- userpte=
- [X86,EARLY] Flags controlling user PTE allocations.
-
- nohigh = do not allocate PTE pages in
- HIGHMEM regardless of setting
- of CONFIG_HIGHPTE.
-
vdso= [X86,SH,SPARC]
On X86_32, this is an alias for vdso32=. Otherwise:
diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
index ea7fa2a8bbf0..ee9a6c94f383 100644
--- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
+++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -278,12 +278,7 @@ To reduce its OS jitter, do any of the following:
due to the rtas_event_scan() function.
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
- e. If running on Cell Processor, build your kernel with
- CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
- spu_gov_work().
- WARNING: Please check your CPU specifications to
- make sure that this is safe on your particular system.
- f. If running on PowerMAC, build your kernel with
+ e. If running on PowerMAC, build your kernel with
CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
avoiding OS jitter from rackmeter_do_timer().
diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst
index cd9a1c2695fd..e71c8984c23e 100644
--- a/Documentation/admin-guide/laptops/index.rst
+++ b/Documentation/admin-guide/laptops/index.rst
@@ -11,6 +11,7 @@ Laptop Drivers
disk-shock-protection
laptop-mode
lg-laptop
+ samsung-galaxybook
sony-laptop
sonypi
thinkpad-acpi
diff --git a/Documentation/admin-guide/laptops/samsung-galaxybook.rst b/Documentation/admin-guide/laptops/samsung-galaxybook.rst
new file mode 100644
index 000000000000..752b8f1a4a74
--- /dev/null
+++ b/Documentation/admin-guide/laptops/samsung-galaxybook.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+==========================
+Samsung Galaxy Book Driver
+==========================
+
+Joshua Grisham <josh@joshuagrisham.com>
+
+This is a Linux x86 platform driver for Samsung Galaxy Book series notebook
+devices which utilizes Samsung's ``SCAI`` ACPI device in order to control
+extra features and receive various notifications.
+
+Supported devices
+=================
+
+Any device with one of the supported ACPI device IDs should be supported. This
+covers most of the "Samsung Galaxy Book" series notebooks that are currently
+available as of this writing, and could include other Samsung notebook devices
+as well.
+
+Status
+======
+
+The following features are currently supported:
+
+- :ref:`Keyboard backlight <keyboard-backlight>` control
+- :ref:`Performance mode <performance-mode>` control implemented using the
+ platform profile interface
+- :ref:`Battery charge control end threshold
+ <battery-charge-control-end-threshold>` (stop charging battery at given
+ percentage value) implemented as a battery hook
+- :ref:`Firmware Attributes <firmware-attributes>` to allow control of various
+ device settings
+- :ref:`Handling of Fn hotkeys <keyboard-hotkey-actions>` for various actions
+- :ref:`Handling of ACPI notifications and hotkeys
+ <acpi-notifications-and-hotkey-actions>`
+
+Because different models of these devices can vary in their features, there is
+logic built within the driver which attempts to test each implemented feature
+for a valid response before enabling its support (registering additional devices
+or extensions, adding sysfs attributes, etc). Therefore, it can be important to
+note that not all features may be supported for your particular device.
+
+The following features might be possible to implement but will require
+additional investigation and are therefore not supported at this time:
+
+- "Dolby Atmos" mode for the speakers
+- "Outdoor Mode" for increasing screen brightness on models with ``SAM0427``
+- "Silent Mode" on models with ``SAM0427``
+
+.. _keyboard-backlight:
+
+Keyboard backlight
+==================
+
+A new LED class named ``samsung-galaxybook::kbd_backlight`` is created which
+will then expose the device using the standard sysfs-based LED interface at
+``/sys/class/leds/samsung-galaxybook::kbd_backlight``. Brightness can be
+controlled by writing the desired value to the ``brightness`` sysfs attribute or
+with any other desired userspace utility.
+
+.. note::
+ Most of these devices have an ambient light sensor which also turns
+ off the keyboard backlight under well-lit conditions. This behavior does not
+ seem possible to control at this time, but can be good to be aware of.
+
+.. _performance-mode:
+
+Performance mode
+================
+
+This driver implements the
+Documentation/userspace-api/sysfs-platform_profile.rst interface for working
+with the "performance mode" function of the Samsung ACPI device.
+
+Mapping of each Samsung "performance mode" to its respective platform profile is
+performed dynamically by the driver, as not all models support all of the same
+performance modes. Your device might have one or more of the following mappings:
+
+- "Silent" maps to ``low-power``
+- "Quiet" maps to ``quiet``
+- "Optimized" maps to ``balanced``
+- "High performance" maps to ``performance``
+
+The result of the mapping can be printed in the kernel log when the module is
+loaded. Supported profiles can also be retrieved from
+``/sys/firmware/acpi/platform_profile_choices``, while
+``/sys/firmware/acpi/platform_profile`` can be used to read or write the
+currently selected profile.
+
+The ``balanced`` platform profile will be set during module load if no profile
+has been previously set.
+
+.. _battery-charge-control-end-threshold:
+
+Battery charge control end threshold
+====================================
+
+This platform driver will add the ability to set the battery's charge control
+end threshold, but does not have the ability to set a start threshold.
+
+This feature is typically called "Battery Saver" by the various Samsung
+applications in Windows, but in Linux we have implemented the standardized
+"charge control threshold" sysfs interface on the battery device to allow for
+controlling this functionality from the userspace.
+
+The sysfs attribute
+``/sys/class/power_supply/BAT1/charge_control_end_threshold`` can be used to
+read or set the desired charge end threshold.
+
+If you wish to maintain interoperability with the Samsung Settings application
+in Windows, then you should set the value to 100 to represent "off", or enable
+the feature using only one of the following values: 50, 60, 70, 80, or 90.
+Otherwise, the driver will accept any value between 1 and 100 as the percentage
+that you wish the battery to stop charging at.
+
+.. note::
+ Some devices have been observed as automatically "turning off" the charge
+ control end threshold if an input value of less than 30 is given.
+
+.. _firmware-attributes:
+
+Firmware Attributes
+===================
+
+The following enumeration-typed firmware attributes are set up by this driver
+and should be accessible under
+``/sys/class/firmware-attributes/samsung-galaxybook/attributes/`` if your device
+supports them:
+
+- ``power_on_lid_open`` (device should power on when the lid is opened)
+- ``usb_charging`` (USB ports can deliver power to connected devices even when
+ the device is powered off or in a low sleep state)
+- ``block_recording`` (blocks access to camera and microphone)
+
+All of these attributes are simple boolean-like enumeration values which use 0
+to represent "off" and 1 to represent "on". Use the ``current_value`` attribute
+to get or change the setting on the device.
+
+Note that when ``block_recording`` is updated, the input device "Samsung Galaxy
+Book Lens Cover" will receive a ``SW_CAMERA_LENS_COVER`` switch event which
+reflects the current state.
+
+.. _keyboard-hotkey-actions:
+
+Keyboard hotkey actions (i8042 filter)
+======================================
+
+The i8042 filter will swallow the keyboard events for the Fn+F9 hotkey (Multi-
+level keyboard backlight toggle) and Fn+F10 hotkey (Block recording toggle)
+and instead execute their actions within the driver itself.
+
+Fn+F9 will cycle through the brightness levels of the keyboard backlight. A
+notification will be sent using ``led_classdev_notify_brightness_hw_changed``
+so that the userspace can be aware of the change. This mimics the behavior of
+other existing devices where the brightness level is cycled internally by the
+embedded controller and then reported via a notification.
+
+Fn+F10 will toggle the value of the "block recording" setting, which blocks
+or allows usage of the built-in camera and microphone (and generates the same
+Lens Cover switch event mentioned above).
+
+.. _acpi-notifications-and-hotkey-actions:
+
+ACPI notifications and hotkey actions
+=====================================
+
+ACPI notifications will generate ACPI netlink events under the device class
+``samsung-galaxybook`` and bus ID matching the Samsung ACPI device ID found on
+your device. The events can be received using userspace tools such as
+``acpi_listen`` and ``acpid``.
+
+The Fn+F11 Performance mode hotkey will be handled by the driver; each keypress
+will cycle to the next available platform profile.
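The value rules from the "Battery charge control end threshold" section of the new samsung-galaxybook.rst can be summarised in a small sketch. The helper name ``classify_end_threshold`` and its return labels are hypothetical and are not part of the driver. The sketch only classifies a value and does not touch sysfs.

```python
# Values the Samsung Settings application in Windows understands,
# as listed in the documentation text above.
INTEROPERABLE_ON_VALUES = (50, 60, 70, 80, 90)
OFF_VALUE = 100


def classify_end_threshold(percent: int) -> str:
    """Classify a charge_control_end_threshold value (hypothetical helper)."""
    # The documentation states the driver accepts any value from 1 to 100.
    if not 1 <= percent <= 100:
        raise ValueError("threshold must be between 1 and 100")
    if percent == OFF_VALUE:
        return "off"
    if percent in INTEROPERABLE_ON_VALUES:
        return "interoperable"
    # Accepted by the driver, but not a value Samsung Settings uses.
    return "linux-only"
```

A value that passes this check would then be written to ``/sys/class/power_supply/BAT1/charge_control_end_threshold``. Note that, per the documentation, some devices turn the feature off for inputs below 30.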
diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst
index 92690e1f2183..b2e7a300494a 100644
--- a/Documentation/admin-guide/media/cec.rst
+++ b/Documentation/admin-guide/media/cec.rst
@@ -451,7 +451,7 @@ configure the CEC devices for HDMI Input and the HDMI Outputs manually.
---------------------
A three character manufacturer name that is used in the EDID for the HDMI
-Input. If not set, then userspace is reponsible for configuring an EDID.
+Input. If not set, then userspace is responsible for configuring an EDID.
If set, then the driver will update the EDID automatically based on the
resolutions supported by the connected displays, and it will not be possible
anymore to manually set the EDID for the HDMI Input.
diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst
index b9da127c074d..f69d331e3cb1 100644
--- a/Documentation/admin-guide/media/mgb4.rst
+++ b/Documentation/admin-guide/media/mgb4.rst
@@ -22,7 +22,9 @@ Global (PCI card) parameters
| 0 - No module present
| 1 - FPDL3
- | 2 - GMSL
+ | 2 - GMSL (one serializer, two daisy chained deserializers)
+ | 3 - GMSL (one serializer, two deserializers)
+ | 4 - GMSL (two deserializers with two daisy chain outputs)
**module_version** (R):
Module version number. Zero in case of a missing module.
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 7367e6294ef6..4120e9cb0cd5 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -12,10 +12,16 @@ its CMA name like below:
The structure of the files created under that directory is as follows:
- - [RO] base_pfn: The base PFN (Page Frame Number) of the zone.
+ - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area.
+ This is the same as ranges/0/base_pfn.
- [RO] count: Amount of memory in the CMA area.
- [RO] order_per_bit: Order of pages represented by one bit.
- - [RO] bitmap: The bitmap of page states in the zone.
+ - [RO] bitmap: The bitmap of allocated pages in the area.
+ This is the same as ranges/0/bitmap.
+ - [RO] ranges/N/base_pfn: The base PFN of contiguous range N
+ in the CMA area.
+ - [RO] ranges/N/bitmap: The bitmap of allocated pages in
+ range N in the CMA area.
- [WO] alloc: Allocate N pages from that CMA area. For example::
echo 5 > <debugfs>/cma/<cma_name>/alloc
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 47a44bd348ab..ced2013db3df 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -64,6 +64,7 @@ comma (",").
│ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
│ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+ │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us
│ │ │ │ │ │ nr_regions/min,max
│ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
│ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
@@ -82,8 +83,8 @@ comma (",").
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
- │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
- │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx
+ │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
+ │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds
│ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
@@ -132,6 +133,11 @@ Users can write below commands for the kdamond to the ``state`` file.
- ``off``: Stop running.
- ``commit``: Read the user inputs in the sysfs files except ``state`` file
again.
+- ``update_tuned_intervals``: Update the contents of the ``sample_us`` and
+ ``aggr_us`` files of the kdamond with the auto-tuned sampling interval and
+ aggregation interval. Please refer to the
+ :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>`
+ for more details.
- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
:ref:`quota goals <sysfs_schemes_quota_goals>`.
- ``update_schemes_stats``: Update the contents of stats files for each
@@ -213,6 +219,25 @@ writing to and reading from the files.
For more details about the intervals and monitoring regions range, please refer
to the Design document (:doc:`/mm/damon/design`).
+.. _damon_usage_sysfs_monitoring_intervals_goal:
+
+contexts/<N>/monitoring_attrs/intervals/intervals_goal/
+-------------------------------------------------------
+
+Under the ``intervals`` directory, there is also an ``intervals_goal``
+directory for automated tuning of ``sample_us`` and ``aggr_us``. It contains
+four files for controlling the auto-tuning: ``access_bp``, ``aggrs``,
+``min_sample_us`` and ``max_sample_us``. Please refer to the :ref:`design
+document of the feature <damon_design_monitoring_intervals_autotuning>` for
+the internals of the tuning mechanism. Reading and writing the four files
+shows and updates the tuning parameters that are described in the
+:ref:`design doc <damon_design_monitoring_intervals_autotuning>` under the
+same names. The tuning starts with the user-set ``sample_us`` and ``aggr_us``.
+The tuning-applied current values of the two intervals can be read from the
+``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to
+the ``state`` file.
+
.. _sysfs_targets:
contexts/<N>/targets/
@@ -282,9 +307,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
schemes/<N>/
------------
-In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
-(``action``, ``target_nid`` and ``apply_interval``) exist.
+In each scheme directory, eight directories (``access_pattern``, ``quotas``,
+``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and
+``tried_regions``) and three files (``action``, ``target_nid`` and
+``apply_interval``) exist.
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
@@ -395,33 +421,43 @@ The ``interval`` should be written in microseconds.
.. _sysfs_filters:
-schemes/<N>/filters/
---------------------
+schemes/<N>/{core\_,ops\_,}filters/
+-----------------------------------
-The directory for the :ref:`filters <damon_design_damos_filters>` of the given
+Directories for :ref:`filters <damon_design_damos_filters>` of the given
DAMON-based operation scheme.
-In the beginning, this directory has only one file, ``nr_filters``. Writing a
+The ``core_filters`` and ``ops_filters`` directories are for the filters
+handled by the DAMON core layer and the operations set layer, respectively.
+The ``filters`` directory can be used for installing filters regardless of the
+layer that handles them. Filters requested through ``core_filters`` and
+``ops_filters`` are installed before those of ``filters``. All three
+directories have the same files.
+
+Using the ``filters`` directory can make it a bit confusing to predict the
+evaluation order of the filters from the files under the directory. Users are
+hence recommended to use the ``core_filters`` and ``ops_filters`` directories.
+The ``filters`` directory could be deprecated in the future.
+
+In the beginning, the directory has only one file, ``nr_filters``. Writing a
number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.
-Each filter directory contains seven files, namely ``type``, ``matching``,
-``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``.
-To ``type`` file, you can write one of five special keywords: ``anon`` for
-anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young
-pages, ``addr`` for specific address range (an open-ended interval), or
-``target`` for specific DAMON monitoring target filtering. Meaning of the
-types are same to the description on the :ref:`design doc
-<damon_design_damos_filters>`.
-
-In case of the memory cgroup filtering, you can specify the memory cgroup of
-the interest by writing the path of the memory cgroup from the cgroups mount
-point to ``memcg_path`` file. In case of the address range filtering, you can
-specify the start and end address of the range to ``addr_start`` and
-``addr_end`` files, respectively. For the DAMON monitoring target filtering,
-you can specify the index of the target between the list of the DAMON context's
-monitoring targets list to ``target_idx`` file.
+Each filter directory contains nine files, namely ``type``, ``matching``,
+``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max``
+and ``target_idx``. You can write the type of the filter to the ``type`` file.
+Refer to :ref:`the design doc <damon_design_damos_filters>` for the available
+type names, their meanings, and the layer on which each is handled.
+
+For the ``memcg`` type, you can specify the memory cgroup of interest by
+writing the path of the memory cgroup from the cgroups mount point to the
+``memcg_path`` file. For the ``addr`` type, you can specify the start and end
+address of the range (an open-ended interval) in the ``addr_start`` and
+``addr_end`` files, respectively. For the ``hugepage_size`` type, you can
+specify the minimum and maximum size of the range (a closed interval) in the
+``min`` and ``max`` files, respectively. For the ``target`` type, you can
+specify the index of the target in the DAMON context's list of monitoring
+targets in the ``target_idx`` file.
You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter
is for memory that matches the ``type``. You can write ``Y`` or ``N`` to
@@ -431,6 +467,7 @@ the ``type`` and ``matching`` should be allowed or not.
For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
+ # cd ops_filters/
# echo 2 > nr_filters
# # disallow anonymous pages
echo anon > 0/type
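The sequence of sysfs writes in the example above can also be generated programmatically. The following is a minimal sketch, assuming the working directory is a scheme directory that contains ``ops_filters``. The helper ``filter_writes`` and its dict-based filter description are hypothetical. It only builds the list of writes and performs no I/O.

```python
def filter_writes(base_dir, filters):
    """Return ordered (path, value) pairs that would install `filters`.

    Each filter is a dict with "type", "matching" and "allow" keys and an
    optional "memcg_path" key. This is an illustrative helper only.
    """
    # Writing N to nr_filters creates child directories 0..N-1.
    writes = [(f"{base_dir}/nr_filters", str(len(filters)))]
    for idx, flt in enumerate(filters):
        prefix = f"{base_dir}/{idx}"
        writes.append((f"{prefix}/type", flt["type"]))
        if "memcg_path" in flt:
            writes.append((f"{prefix}/memcg_path", flt["memcg_path"]))
        writes.append((f"{prefix}/matching", "Y" if flt["matching"] else "N"))
        writes.append((f"{prefix}/allow", "Y" if flt["allow"] else "N"))
    return writes


# Disallow anonymous pages, then disallow pages of one memory cgroup.
example = filter_writes("ops_filters", [
    {"type": "anon", "matching": True, "allow": False},
    {"type": "memcg", "memcg_path": "/having_care_already",
     "matching": True, "allow": False},
])
```

Because filters are evaluated in numeric order, the order of the input list determines the evaluation order.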
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f34a0d798d5b..67a941903fd2 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,7 +145,17 @@ hugepages
It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
If the node number is invalid, the parameter will be ignored.
+hugepage_alloc_threads
+ Specify the number of threads that should be used to allocate hugepages
+ during boot. This parameter can be used to improve system bootup time
+ when allocating a large number of huge pages.
+ The default value is 25% of the available hardware threads.
+ Example to use 8 allocation threads::
+
+ hugepage_alloc_threads=8
+
+ Note that this parameter only applies to non-gigantic huge pages.
default_hugepagesz
Specify the default huge page size. This parameter can
only be specified once on the command line. default_hugepagesz can
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index caba0f52dd36..afce291649dd 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -21,7 +21,8 @@ There are four components to pagemap:
* Bit 56 page exclusively mapped (since 4.2)
* Bit 57 pte is uffd-wp write-protected (since 5.13) (see
Documentation/admin-guide/mm/userfaultfd.rst)
- * Bits 58-60 zero
+ * Bit 58 pte is a guard region (since 6.15) (see the madvise(2) man page)
+ * Bits 59-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
@@ -37,12 +38,28 @@ There are four components to pagemap:
precisely which pages are mapped (or in swap) and comparing mapped
pages between processes.
+ Traditionally, bit 56 indicates that a page is mapped exactly once and bit
+ 56 is clear when a page is mapped multiple times, even when mapped in the
+ same process multiple times. In some kernel configurations, the semantics
+ for pages that are part of a larger allocation (e.g., THP) can differ: bit 56
+ is set if all pages of the corresponding large allocation are *certainly*
+ mapped in the same process, even if the page is mapped multiple times in that
+ process. Bit 56 is clear when any page of the larger allocation
+ is *maybe* mapped in a different process. In some cases, a large allocation
+ might be treated as "maybe mapped by multiple processes" even though this
+ is no longer the case.
+
Efficient users of this interface will use ``/proc/pid/maps`` to
determine which areas of memory are actually mapped and llseek to
skip over unmapped regions.
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
- times each page is mapped, indexed by PFN.
+ times each page is mapped, indexed by PFN. Some kernel configurations do
+ not track the precise number of times a page that is part of a larger allocation
+ (e.g., THP) is mapped. In these configurations, the average number of
+ mappings per page in this larger allocation is returned instead. However,
+ if any page of the large allocation is mapped, the returned value will
+ be at least 1.
The page-types tool in the tools/mm directory can be used to query the
number of times a page is mapped.
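The flag bits listed in the pagemap hunk above, including the new bit 58, can be decoded from a 64-bit entry as in the following sketch. The function name and the dictionary keys are hypothetical. Only the bits named in this excerpt are decoded, and the caller is assumed to have already read one 64-bit entry from ``/proc/pid/pagemap``.

```python
def decode_pagemap_entry(entry: int) -> dict:
    """Decode the flag bits of one 64-bit pagemap entry (illustrative)."""

    def bit(n: int) -> bool:
        return bool((entry >> n) & 1)

    return {
        "exclusive": bit(56),            # page exclusively mapped
        "uffd_wp": bit(57),              # pte is uffd-wp write-protected
        "guard_region": bit(58),         # pte is a guard region (since 6.15)
        "file_or_shared_anon": bit(61),  # file-page or shared-anon
        "swapped": bit(62),              # page swapped
        "present": bit(63),              # page present
    }
```

As the text above notes, the meaning of the ``exclusive`` flag can differ for pages that are part of a larger allocation, so a clear bit 56 does not always prove that another process maps the page.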
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..fd3370aa43fe 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -60,15 +60,13 @@ accessed. The compressed memory pool grows on demand and shrinks as compressed
pages are freed. The pool is not preallocated. By default, a zpool
of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
but it can be overridden at boot time by setting the ``zpool`` attribute,
-e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
+e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs
``zpool`` attribute, e.g.::
- echo zbud > /sys/module/zswap/parameters/zpool
+ echo zsmalloc > /sys/module/zswap/parameters/zpool
-The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
-means the compression ratio will always be 2:1 or worse (because of half-full
-zbud pages). The zsmalloc type zpool has a more complex compressed page
-storage method, and it can achieve greater storage densities.
+The zsmalloc type zpool has a complex compressed page storage method, and it
+can achieve great storage densities.
When a swap page is passed from swapout to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index a21369eba034..3950583f2b15 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -248,6 +248,20 @@ are the following:
If that frequency cannot be determined, this attribute should not
be present.
+``cpuinfo_avg_freq``
+ An average frequency (in KHz) of all CPUs belonging to a given policy,
+ derived from a hardware provided feedback and reported on a time frame
+ spanning at most few milliseconds.
+
+ This is expected to be based on the frequency the hardware actually runs
+ at and, as such, might require specialised hardware support (such as AMU
+ extension on ARM). If one cannot be determined, this attribute should
+ not be present.
+
+ Note, that failed attempt to retrieve current frequency for a given
+ CPU(s) will result in an appropriate error, i.e: EAGAIN for CPU that
+ remains idle (raised on ARM).
+
``cpuinfo_max_freq``
Maximum possible operating frequency the CPUs belonging to this policy
can run at (in kHz).
@@ -293,7 +307,8 @@ are the following:
Some architectures (e.g. ``x86``) may attempt to provide information
more precisely reflecting the current CPU frequency through this
attribute, but that still may not be the exact current CPU frequency as
- seen by the hardware at the moment.
+ seen by the hardware at the moment. This behavior, though, is only
+ available via the :c:macro:`CPUFREQ_ARCH_CUR_FREQ` option.
``scaling_driver``
The scaling driver currently in use.
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index eb58d7a5affd..0c090b076224 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -275,20 +275,25 @@ values and, when predicting the idle duration next time, it computes the average
and variance of them. If the variance is small (smaller than 400 square
milliseconds) or it is small relative to the average (the average is greater
than 6 times the standard deviation), the average is regarded as the "typical
-interval" value. Otherwise, the longest of the saved observed idle duration
+interval" value. Otherwise, either the longest or the shortest (depending on
+which one is farther from the average) of the saved observed idle duration
values is discarded and the computation is repeated for the remaining ones.
+
Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
-interval" is determined or too many data points are disregarded, in which case
-the "typical interval" is assumed to equal "infinity" (the maximum unsigned
-integer value).
-
-If the "typical interval" computed this way is long enough, the governor obtains
-the time until the closest timer event with the assumption that the scheduler
-tick will be stopped. That time, referred to as the *sleep length* in what follows,
-is the upper bound on the time before the next CPU wakeup. It is used to determine
-the sleep length range, which in turn is needed to get the sleep length correction
-factor.
+interval" is determined or too many data points are disregarded. In the latter
+case, if the size of the set of data points still under consideration is
+sufficiently large, the next idle duration is not likely to be above the largest
+idle duration value still in that set, so that value is taken as the predicted
+next idle duration. Finally, if the set of data points still under
+consideration is too small, no prediction is made.
+
+If the preliminary prediction of the next idle duration computed this way is
+long enough, the governor obtains the time until the closest timer event with
+the assumption that the scheduler tick will be stopped. That time, referred to
+as the *sleep length* in what follows, is the upper bound on the time before the
+next CPU wakeup. It is used to determine the sleep length range, which in turn
+is needed to get the sleep length correction factor.
The ``menu`` governor maintains an array containing several correction factor
values that correspond to different sleep length ranges organized so that each
@@ -302,7 +307,7 @@ to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
The sleep length is multiplied by the correction factor for the range that it
falls into to obtain an approximation of the predicted idle duration that is
compared to the "typical interval" determined previously and the minimum of
-the two is taken as the idle duration prediction.
+the two is taken as the final idle duration prediction.
If the "typical interval" value is small, which means that the CPU is likely
to be woken up soon enough, the sleep length computation is skipped as it may
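The "typical interval" procedure described in the cpuidle hunk above can be sketched as follows. This is an illustration of the documented steps, not the kernel's implementation. The function name is hypothetical, and the ``max_discard`` and ``min_for_max`` thresholds are assumed values standing in for "too many data points" and "sufficiently large", which the text does not quantify. Samples are assumed to be in the unit that matches ``var_limit``.

```python
from math import sqrt
from statistics import fmean, pvariance


def typical_interval(samples, var_limit=400.0, ratio=6.0,
                     max_discard=2, min_for_max=4):
    """Return a predicted idle duration, or None if no prediction is made."""
    data = list(samples)
    discarded = 0
    while data:
        avg = fmean(data)
        var = pvariance(data, mu=avg)
        # Small variance, or average greater than `ratio` standard deviations.
        if var < var_limit or avg > ratio * sqrt(var):
            return avg
        if discarded >= max_discard:
            break  # too many data points disregarded (assumed threshold)
        # Drop the longest or shortest value, whichever is farther from avg.
        outlier = max(data, key=lambda v: abs(v - avg))
        data.remove(outlier)
        discarded += 1
    # Enough points left: use the largest remaining value as the prediction.
    if len(data) >= min_for_max:
        return max(data)
    return None
```

For instance, with seven samples of 10 and one of 1000, the outlier is discarded on the first pass and the average of the remaining values is returned.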
diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst
index 39bd6ecce7de..5940528146eb 100644
--- a/Documentation/admin-guide/pm/intel_idle.rst
+++ b/Documentation/admin-guide/pm/intel_idle.rst
@@ -192,11 +192,19 @@ even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
-The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
-if the kernel has been configured with ACPI support) can be set to make the
-driver ignore the system's ACPI tables entirely or use them for all of the
-recognized processor models, respectively (they both are unset by default and
-``use_acpi`` has no effect if ``no_acpi`` is set).
+The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are
+recognized by ``intel_idle`` if the kernel has been configured with ACPI
+support. If ACPI is not configured, these parameters have no effect.
+
+``no_acpi`` - Do not use ACPI at all. Only native mode is available, no
+ACPI mode.
+
+``use_acpi`` - No-op in ACPI mode; in native mode, the driver will consult
+the ACPI tables for the C-states on/off status.
+
+``no_native`` - Work only in ACPI mode, no native mode available (ignore
+all custom tables).
The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index bf13ad25a32f..78fc83ed2a7e 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -696,6 +696,9 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
Use per-logical-CPU P-State limits (see `Coordination of P-state
Limits`_ for details).
+``no_cas``
+ Do not enable capacity-aware scheduling (CAS), which is enabled by
+ default on hybrid systems.
Diagnostics and Tuning
======================
diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst
index 3eda08191d13..24d80e3eb309 100644
--- a/Documentation/admin-guide/pnp.rst
+++ b/Documentation/admin-guide/pnp.rst
@@ -129,9 +129,6 @@ pnp_put_protocol
pnp_register_protocol
use this to register a new PnP protocol
-pnp_unregister_protocol
- use this function to remove a PnP protocol from the Plug and Play Layer
-
pnp_register_driver
adds a PnP driver to the Plug and Play Layer
diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst
index a3dfc2c66e01..1609e7479249 100644
--- a/Documentation/admin-guide/serial-console.rst
+++ b/Documentation/admin-guide/serial-console.rst
@@ -78,7 +78,9 @@ If no console device is specified, the first device found capable of
acting as a system console will be used. At this time, the system
first looks for a VGA card and then for a serial port. So if you don't
have a VGA card in your system the first serial port will automatically
-become the console.
+become the console, unless the kernel is configured with the
+CONFIG_NULL_TTY_DEFAULT_CONSOLE option, in which case it will default to
+using the ttynull device.
You will need to create a new device to use ``/dev/console``. The official
``/dev/console`` is now character device 5,1.
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 08e89e031714..6c54718c9d04 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -347,3 +347,28 @@ filesystems:
``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for
setting/getting the maximum number of pages that can be used for servicing
requests in FUSE.
+
+``/proc/sys/fs/fuse/default_request_timeout`` is a read/write file for
+setting/getting the default timeout (in seconds) for a fuse server to
+reply to a kernel-issued request in the event that the server did not
+specify a timeout at mount. If the server set a timeout,
+then default_request_timeout will be ignored. The default value of
+default_request_timeout is 0, which indicates no default timeout.
+The maximum value that can be set is 65535.
+
+``/proc/sys/fs/fuse/max_request_timeout`` is a read/write file for
+setting/getting the maximum timeout (in seconds) for a fuse server to
+reply to a kernel-issued request. A value greater than 0 automatically opts
+the server into a timeout that will be set to at most "max_request_timeout",
+even if the server did not specify a timeout and default_request_timeout is
+set to 0. If max_request_timeout is greater than 0 and the server set a timeout
+greater than max_request_timeout or default_request_timeout is set to a value
+greater than max_request_timeout, the system will use max_request_timeout as the
+timeout. A value of 0 indicates no max request timeout. The maximum value
+that can be set is 65535.
+
+For timeouts, if the server does not respond to the request by the time
+the set timeout elapses, then the connection to the fuse server will be aborted.
+Please note that the timeouts are not 100% precise (e.g., you may set 60 seconds but
+the timeout may kick in after 70 seconds). The upper margin of error for the
+timeout is roughly FUSE_TIMEOUT_TIMER_FREQ seconds.
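The interaction between the server's own timeout and the two sysctls can be sketched as follows. This is an illustration of the rules stated above, not the kernel's implementation; the function name is hypothetical, and modelling "the server did not specify a timeout" as 0 is an assumption.

```python
def effective_request_timeout(server_timeout, default_request_timeout,
                              max_request_timeout):
    """Return the timeout (seconds) that would apply; 0 means no timeout.

    Hypothetical helper: a server_timeout of 0 is assumed to mean that
    the server did not specify a timeout at mount.
    """
    # The default applies only when the server did not set a timeout.
    timeout = server_timeout if server_timeout > 0 else default_request_timeout
    # A non-zero maximum opts every server into a timeout and caps it.
    if max_request_timeout > 0 and (timeout == 0 or timeout > max_request_timeout):
        timeout = max_request_timeout
    return timeout
```

For example, with ``max_request_timeout`` set to 60, a server that asked for 120 seconds and a server that asked for nothing both end up with a 60 second timeout, while a server that asked for 20 seconds keeps its own value.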
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index a43b78b4b646..dd49a89a62d3 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -212,6 +212,17 @@ pid>/``).
This value defaults to 0.
+core_sort_vma
+=============
+
+The default coredump writes VMAs in address order. By setting
+``core_sort_vma`` to 1, VMAs will be written from smallest size
+to largest size. This is known to break at least elfutils, but
+can be handy when dealing with very large (and truncated)
+coredumps where the more useful debugging details are included
+in the smaller VMAs.
+
+
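The difference between the two orderings can be illustrated with a short sketch. It only models the ordering described above and is not the kernel's coredump code; the function name and the representation of a VMA as a ``(start, end)`` address pair are assumptions.

```python
def order_vmas(vmas, core_sort_vma):
    """Order hypothetical (start, end) address pairs for writing.

    core_sort_vma == 0: address order (the default described above).
    core_sort_vma == 1: smallest size first.
    """
    if core_sort_vma:
        # Size of a VMA is taken here as end minus start.
        return sorted(vmas, key=lambda vma: vma[1] - vma[0])
    return sorted(vmas, key=lambda vma: vma[0])
```

With three VMAs of sizes 0x8000, 0x1000 and 0x4000 laid out in that address order, setting the flag moves the 0x1000 VMA to the front and the 0x8000 VMA to the end, so a truncated dump would still contain the smaller mappings.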
core_uses_pid
=============
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f48eaa98d22d..8290177b4f75 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
+- defrag_mode
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -145,6 +146,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it has occurred, can be long-lasting or even permanent.
dirty_background_bytes
======================
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 700aa72eecb1..a0cc017e4424 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
18 _/N 262144 an in-kernel test has been run
+ 19 _/J 524288 userspace used a mutating debug operation in fwctl
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@@ -184,3 +185,7 @@ More detailed explanation for tainting
build time.
18) ``N`` if an in-kernel test, such as a KUnit test, has been run.
+
+ 19) ``J`` if userspace opened /dev/fwctl/* and performed a FWCTL_RPC_DEBUG_WRITE
+ to use the device's debugging features. Device debugging features could
+ cause the device to malfunction in undefined ways.
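A numeric taint value can be checked for this flag by testing bit 19, as in the following minimal sketch. The constant and function names are hypothetical; the bit number and its decimal value (524288) come from the table above.

```python
TAINT_FWCTL_BIT = 19  # bit number of the ``J`` flag, per the table above


def fwctl_tainted(taint_value):
    """Return True if bit 19 is set in the given numeric taint value."""
    return bool(taint_value & (1 << TAINT_FWCTL_BIT))
```

On a running system the value could be taken from ``/proc/sys/kernel/tainted``, for instance ``fwctl_tainted(int(open("/proc/sys/kernel/tainted").read()))``.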
diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst
index 2ed79f41a411..d0502691dfa1 100644
--- a/Documentation/admin-guide/thunderbolt.rst
+++ b/Documentation/admin-guide/thunderbolt.rst
@@ -28,7 +28,7 @@ should be a userspace tool that handles all the low-level details, keeps
a database of the authorized devices and prompts users for new connections.
More details about the sysfs interface for Thunderbolt devices can be
-found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``.
+found in Documentation/ABI/testing/sysfs-bus-thunderbolt.
Those users who just want to connect any device without any sort of
manual work can add following line to
diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst
index 6be38c1b9c5b..d6313890ee41 100644
--- a/Documentation/admin-guide/workload-tracing.rst
+++ b/Documentation/admin-guide/workload-tracing.rst
@@ -82,7 +82,7 @@ Install tools to build Linux kernel and tools in kernel repository.
scripts/ver_linux is a good way to check if your system already has
the necessary tools::
- sudo apt-get build-essentials flex bison yacc
+ sudo apt-get install build-essential flex bison yacc
sudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev
cscope is a good tool to browse kernel sources. Let's install it now::