Diffstat (limited to 'Documentation/admin-guide')
165 files changed, 7893 insertions, 2426 deletions
diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst index 520a1c2c6fd2..cdd65164ca96 100644 --- a/Documentation/admin-guide/LSM/SELinux.rst +++ b/Documentation/admin-guide/LSM/SELinux.rst @@ -2,6 +2,17 @@ SELinux ======= +Information about the SELinux kernel subsystem can be found at the +following links: + + https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git/tree/README.md + + https://github.com/selinuxproject/selinux-kernel/wiki + +Information about the SELinux userspace can be found at: + + https://github.com/SELinuxProject/selinux/wiki + If you want to use SELinux, chances are you will want to use the distro-provided policies, or install the latest reference policy release from diff --git a/Documentation/admin-guide/LSM/SafeSetID.rst b/Documentation/admin-guide/LSM/SafeSetID.rst index 0ec34863c674..6d439c987563 100644 --- a/Documentation/admin-guide/LSM/SafeSetID.rst +++ b/Documentation/admin-guide/LSM/SafeSetID.rst @@ -41,7 +41,7 @@ namespace). The higher level goal is to allow for uid-based sandboxing of system services without having to give out CAP_SETUID all over the place just so that non-root programs can drop to even-lesser-privileged uids. This is especially relevant when one non-root daemon on the system should be allowed to spawn other -processes as different uids, but its undesirable to give the daemon a +processes as different uids, but it's undesirable to give the daemon a basically-root-equivalent CAP_SETUID. diff --git a/Documentation/admin-guide/LSM/Smack.rst b/Documentation/admin-guide/LSM/Smack.rst index 6d44f4fdbf59..c5ed775f2d10 100644 --- a/Documentation/admin-guide/LSM/Smack.rst +++ b/Documentation/admin-guide/LSM/Smack.rst @@ -601,10 +601,15 @@ specification. Task Attribute ~~~~~~~~~~~~~~ -The Smack label of a process can be read from /proc/<pid>/attr/current. A -process can read its own Smack label from /proc/self/attr/current. 
A +The Smack label of a process can be read from ``/proc/<pid>/attr/current``. A +process can read its own Smack label from ``/proc/self/attr/current``. A privileged process can change its own Smack label by writing to -/proc/self/attr/current but not the label of another process. +``/proc/self/attr/current`` but not the label of another process. + +Format of writing is : only the label or the label followed by one of the +3 trailers: ``\n`` (by common agreement for ``/proc/...`` interfaces), +``\0`` (because some applications incorrectly include it), +``\n\0`` (because we think some applications may incorrectly include it). File Attribute ~~~~~~~~~~~~~~ @@ -696,6 +701,11 @@ sockets. A privileged program may set this to match the label of another task with which it hopes to communicate. +UNIX domain socket (UDS) with a BSD address functions both as a file in a +filesystem and as a socket. As a file, it carries the SMACK64 attribute. This +attribute is not involved in Smack security enforcement and is immutably +assigned the label "*". + Smack Netlabel Exceptions ~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst index ce63be6d64ad..b44ef68f6e4d 100644 --- a/Documentation/admin-guide/LSM/index.rst +++ b/Documentation/admin-guide/LSM/index.rst @@ -48,3 +48,4 @@ subdirectories. Yama SafeSetID ipe + landlock diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst index f93a467db628..a756d8158531 100644 --- a/Documentation/admin-guide/LSM/ipe.rst +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -95,7 +95,20 @@ languages when these scripts are invoked by passing these program files to the interpreter. This is because the way interpreters execute these files; the scripts themselves are not evaluated as executable code through one of IPE's hooks, but they are merely text files that are read -(as opposed to compiled executables) [#interpreters]_. 
+(as opposed to compiled executables). However, with the introduction of the +``AT_EXECVE_CHECK`` flag (:doc:`AT_EXECVE_CHECK </userspace-api/check_exec>`), +interpreters can use it to signal the kernel that a script file will be executed, +and request the kernel to perform LSM security checks on it. + +IPE's EXECUTE operation enforcement differs between compiled executables and +interpreted scripts: For compiled executables, enforcement is triggered +automatically by the kernel during ``execve()``, ``execveat()``, ``mmap()`` +and ``mprotect()`` syscalls when loading executable content. For interpreted +scripts, enforcement requires explicit interpreter integration using +``execveat()`` with ``AT_EXECVE_CHECK`` flag. Unlike exec syscalls that IPE +intercepts during the execution process, this mechanism needs the interpreter +to take the initiative, and existing interpreters won't be automatically +supported unless the signal call is added. Threat Model ------------ @@ -423,7 +436,7 @@ Field descriptions: Event Example:: - type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 errno=0 type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E @@ -433,24 +446,55 @@ This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record Field descriptions: 
-+----------------+------------+-----------+---------------------------------------------------+ -| Field | Value Type | Optional? | Description of Value | -+================+============+===========+===================================================+ -| policy_name | string | No | The policy_name | -+----------------+------------+-----------+---------------------------------------------------+ -| policy_version | string | No | The policy_version | -+----------------+------------+-----------+---------------------------------------------------+ -| policy_digest | string | No | The policy hash | -+----------------+------------+-----------+---------------------------------------------------+ -| auid | integer | No | The login user ID | -+----------------+------------+-----------+---------------------------------------------------+ -| ses | integer | No | The login session ID | -+----------------+------------+-----------+---------------------------------------------------+ -| lsm | string | No | The lsm name associated with the event | -+----------------+------------+-----------+---------------------------------------------------+ -| res | integer | No | The result of the audited operation(success/fail) | -+----------------+------------+-----------+---------------------------------------------------+ - ++----------------+------------+-----------+-------------------------------------------------------------+ +| Field | Value Type | Optional? 
| Description of Value | ++================+============+===========+=============================================================+ +| policy_name | string | Yes | The policy_name | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_version | string | Yes | The policy_version | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_digest | string | Yes | The policy hash | ++----------------+------------+-----------+-------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++----------------+------------+-----------+-------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++----------------+------------+-----------+-------------------------------------------------------------+ +| errno | integer | No | Error code from policy loading operations (see table below) | ++----------------+------------+-----------+-------------------------------------------------------------+ + +Policy error codes (errno): + +The following table lists the error codes that may appear in the errno field while loading or updating the policy: + ++----------------+--------------------------------------------------------+ +| Error Code | Description | ++================+========================================================+ +| 0 | Success | ++----------------+--------------------------------------------------------+ +| -EPERM | Insufficient permission | 
++----------------+--------------------------------------------------------+ +| -EEXIST | Same name policy already deployed | ++----------------+--------------------------------------------------------+ +| -EBADMSG | Policy is invalid | ++----------------+--------------------------------------------------------+ +| -ENOMEM | Out of memory (OOM) | ++----------------+--------------------------------------------------------+ +| -ERANGE | Policy version number overflow | ++----------------+--------------------------------------------------------+ +| -EINVAL | Policy version parsing error | ++----------------+--------------------------------------------------------+ +| -ENOKEY | Key used to sign the IPE policy not found in keyring | ++----------------+--------------------------------------------------------+ +| -EKEYREJECTED | Policy signature verification failed | ++----------------+--------------------------------------------------------+ +| -ESTALE | Attempting to update an IPE policy with older version | ++----------------+--------------------------------------------------------+ +| -ENOENT | Policy was deleted while updating | ++----------------+--------------------------------------------------------+ 1404 AUDIT_MAC_STATUS ^^^^^^^^^^^^^^^^^^^^^ @@ -775,8 +819,6 @@ A: .. [#digest_cache_lsm] https://lore.kernel.org/lkml/20240415142436.2545003-1-roberto.sassu@huaweicloud.com/ -.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/20220321161557.495388-1-mic@digikod.net/>`_. - .. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on this topic. diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst new file mode 100644 index 000000000000..9923874e2156 --- /dev/null +++ b/Documentation/admin-guide/LSM/landlock.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. 
Copyright © 2025 Microsoft Corporation + +================================ +Landlock: system-wide management +================================ + +:Author: Mickaël Salaün +:Date: January 2026 + +Landlock can leverage the audit framework to log events. + +User space documentation can be found here: +Documentation/userspace-api/landlock.rst. + +Audit +===== + +Denied access requests are logged by default for a sandboxed program if `audit` +is enabled. This default behavior can be changed with the +sys_landlock_restrict_self() flags (cf. +Documentation/userspace-api/landlock.rst). Landlock logs can also be masked +thanks to audit rules. Landlock can generate 2 audit record types. + +Record types +------------ + +AUDIT_LANDLOCK_ACCESS + This record type identifies a denied access request to a kernel resource. + The ``domain`` field indicates the ID of the domain that blocked the + request. The ``blockers`` field indicates the cause(s) of this denial + (separated by a comma), and the following fields identify the kernel object + (similar to SELinux). There may be more than one of this record type per + audit event. 
+ + Example with a file link request generating two records in the same event:: + + domain=195ba459b blockers=fs.refer path="/usr/bin" dev="vda2" ino=351 + domain=195ba459b blockers=fs.make_reg,fs.refer path="/usr/local" dev="vda2" ino=365 + + + The ``blockers`` field uses dot-separated prefixes to indicate the type of + restriction that caused the denial: + + **fs.*** - Filesystem access rights (ABI 1+): + - fs.execute, fs.write_file, fs.read_file, fs.read_dir + - fs.remove_dir, fs.remove_file + - fs.make_char, fs.make_dir, fs.make_reg, fs.make_sock + - fs.make_fifo, fs.make_block, fs.make_sym + - fs.refer (ABI 2+) + - fs.truncate (ABI 3+) + - fs.ioctl_dev (ABI 5+) + + **net.*** - Network access rights (ABI 4+): + - net.bind_tcp - TCP port binding was denied + - net.connect_tcp - TCP connection was denied + + **scope.*** - IPC scoping restrictions (ABI 6+): + - scope.abstract_unix_socket - Abstract UNIX socket connection denied + - scope.signal - Signal sending denied + + Multiple blockers can appear in a single event (comma-separated) when + multiple access rights are missing. For example, creating a regular file + in a directory that lacks both ``make_reg`` and ``refer`` rights would show + ``blockers=fs.make_reg,fs.refer``. + + The object identification fields (path, dev, ino for filesystem; opid, + ocomm for signals) depend on the type of access being blocked and provide + context about what resource was involved in the denial. + + +AUDIT_LANDLOCK_DOMAIN + This record type describes the status of a Landlock domain. The ``status`` + field can be either ``allocated`` or ``deallocated``. + + The ``allocated`` status is part of the same audit event and follows + the first logged ``AUDIT_LANDLOCK_ACCESS`` record of a domain. 
It identifies + Landlock domain information at the time of the sys_landlock_restrict_self() + call with the following fields: + + - the ``domain`` ID + - the enforcement ``mode`` + - the domain creator's ``pid`` + - the domain creator's ``uid`` + - the domain creator's executable path (``exe``) + - the domain creator's command line (``comm``) + + Example:: + + domain=195ba459b status=allocated mode=enforcing pid=300 uid=0 exe="/root/sandboxer" comm="sandboxer" + + The ``deallocated`` status is an event on its own and it identifies a + Landlock domain release. After such an event, it is guaranteed that the + related domain ID will never be reused during the lifetime of the system. + The ``domain`` field indicates the ID of the domain which is released, and + the ``denials`` field indicates the total number of denied access requests, + which might not have been logged according to the audit rules and + sys_landlock_restrict_self()'s flags. + + Example:: + + domain=195ba459b status=deallocated denials=3 + + +Event samples +-------------- + +Here are two examples of log events (see serial numbers). + +In this example a sandboxed program (``kill``) tries to send a signal to the +init process, which is denied because of the signal scoping restriction +(``LL_SCOPED=s``):: + + $ LL_FS_RO=/ LL_FS_RW=/ LL_SCOPED=s LL_FORCE_LOG=1 ./sandboxer kill 1 + +This command generates two events, each identified with a unique serial +number following a timestamp (``msg=audit(1729738800.268:30)``). The first +event (serial ``30``) contains 4 records. The first record +(``type=LANDLOCK_ACCESS``) shows an access denied by the domain `1a6fdc66f`. +The cause of this denial is the signal scoping restriction +(``blockers=scope.signal``). The process that would have received this signal +is the init process (``opid=1 ocomm="systemd"``). + +The second record (``type=LANDLOCK_DOMAIN``) describes (``status=allocated``) +domain `1a6fdc66f`. 
This domain was created by process ``286`` executing the +``/root/sandboxer`` program launched by the root user. + +The third record (``type=SYSCALL``) describes the syscall, its provided +arguments, its result (``success=no exit=-1``), and the process that called it. + +The fourth record (``type=PROCTITLE``) shows the command's name as a +hexadecimal value. This can be translated with ``python -c +'print(bytes.fromhex("6B696C6C0031"))'``. + +Finally, the last record (``type=LANDLOCK_DOMAIN``) is also the only one from +the second event (serial ``31``). It is not tied to a direct user space action +but an asynchronous one to free resources tied to a Landlock domain +(``status=deallocated``). This can be useful to know that the following logs +will not concern the domain ``1a6fdc66f`` anymore. This record also summarizes +the number of requests this domain denied (``denials=1``), whether they were +logged or not. + +.. code-block:: + + type=LANDLOCK_ACCESS msg=audit(1729738800.268:30): domain=1a6fdc66f blockers=scope.signal opid=1 ocomm="systemd" + type=LANDLOCK_DOMAIN msg=audit(1729738800.268:30): domain=1a6fdc66f status=allocated mode=enforcing pid=286 uid=0 exe="/root/sandboxer" comm="sandboxer" + type=SYSCALL msg=audit(1729738800.268:30): arch=c000003e syscall=62 success=no exit=-1 [..] ppid=272 pid=286 auid=0 uid=0 gid=0 [...] comm="kill" [...] 
+ type=PROCTITLE msg=audit(1729738800.268:30): proctitle=6B696C6C0031 + type=LANDLOCK_DOMAIN msg=audit(1729738800.324:31): domain=1a6fdc66f status=deallocated denials=1 + +Here is another example showcasing filesystem access control:: + + $ LL_FS_RO=/ LL_FS_RW=/tmp LL_FORCE_LOG=1 ./sandboxer sh -c "echo > /etc/passwd" + +The related audit logs contain 8 records from 3 different events (serials 33, +34 and 35) created by the same domain `1a6fdc679`:: + + type=LANDLOCK_ACCESS msg=audit(1729738800.221:33): domain=1a6fdc679 blockers=fs.write_file path="/dev/tty" dev="devtmpfs" ino=9 + type=LANDLOCK_DOMAIN msg=audit(1729738800.221:33): domain=1a6fdc679 status=allocated mode=enforcing pid=289 uid=0 exe="/root/sandboxer" comm="sandboxer" + type=SYSCALL msg=audit(1729738800.221:33): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...] + type=PROCTITLE msg=audit(1729738800.221:33): proctitle=7368002D63006563686F203E202F6574632F706173737764 + type=LANDLOCK_ACCESS msg=audit(1729738800.221:34): domain=1a6fdc679 blockers=fs.write_file path="/etc/passwd" dev="vda2" ino=143821 + type=SYSCALL msg=audit(1729738800.221:34): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...] + type=PROCTITLE msg=audit(1729738800.221:34): proctitle=7368002D63006563686F203E202F6574632F706173737764 + type=LANDLOCK_DOMAIN msg=audit(1729738800.261:35): domain=1a6fdc679 status=deallocated denials=2 + + +Event filtering +--------------- + +If you get spammed with audit logs related to Landlock, this is either an +attack attempt or a bug in the security policy. We can put in place some +filters to limit noise in two complementary ways: + +- with sys_landlock_restrict_self()'s flags if we can fix the sandboxed + programs, +- or with audit rules (see :manpage:`auditctl(8)`). 
+ +Additional documentation +======================== + +* `Linux Audit Documentation`_ +* Documentation/userspace-api/landlock.rst +* Documentation/security/landlock.rst +* https://landlock.io + +.. Links +.. _Linux Audit Documentation: + https://github.com/linux-audit/audit-documentation/wiki diff --git a/Documentation/admin-guide/RAS/main.rst b/Documentation/admin-guide/RAS/main.rst index 7ac1d4ccc509..5a45db32c49b 100644 --- a/Documentation/admin-guide/RAS/main.rst +++ b/Documentation/admin-guide/RAS/main.rst @@ -253,7 +253,7 @@ interface. Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA engines, fabric switches, main data path switches, interconnections, and various other hardware data paths. If the hardware -reports it, then a edac_device device probably can be constructed to +reports it, then an edac_device device probably can be constructed to harvest and present that to userspace. @@ -406,24 +406,8 @@ index of the MC:: |->mc2 .... -Under each ``mcX`` directory each ``csrowX`` is again represented by a -``csrowX``, where ``X`` is the csrow index:: - - .../mc/mc0/ - | - |->csrow0 - |->csrow2 - |->csrow3 - .... - -Notice that there is no csrow1, which indicates that csrow0 is composed -of a single ranked DIMMs. This should also apply in both Channels, in -order to have dual-channel mode be operational. Since both csrow2 and -csrow3 are populated, this indicates a dual ranked set of DIMMs for -channels 0 and 1. - -Within each of the ``mcX`` and ``csrowX`` directories are several EDAC -control and attribute files. +Within each of the ``mcX`` directory are several EDAC control and +attribute files. ``mcX`` directories ------------------- @@ -569,7 +553,7 @@ this ``X`` memory module: - Unbuffered-DDR .. [#f5] On some systems, the memory controller doesn't have any logic - to identify the memory module. On such systems, the directory is called ``rankX`` and works on a similar way as the ``csrowX`` directories. 
+ to identify the memory module. On such systems, the directory is called ``rankX``. On modern Intel memory controllers, the memory controller identifies the memory modules directly. On such systems, the directory is called ``dimmX``. @@ -577,126 +561,6 @@ this ``X`` memory module: symlinks inside the sysfs mapping that are automatically created by the sysfs subsystem. Currently, they serve no purpose. -``csrowX`` directories ----------------------- - -When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX`` -directories. As this API doesn't work properly for Rambus, FB-DIMMs and -modern Intel Memory Controllers, this is being deprecated in favor of -``dimmX`` directories. - -In the ``csrowX`` directories are EDAC control and attribute files for -this ``X`` instance of csrow: - - -- ``ue_count`` - Total Uncorrectable Errors count attribute file - - This attribute file displays the total count of uncorrectable - errors that have occurred on this csrow. If panic_on_ue is set - this counter will not have a chance to increment, since EDAC - will panic the system. - - -- ``ce_count`` - Total Correctable Errors count attribute file - - This attribute file displays the total count of correctable - errors that have occurred on this csrow. This count is very - important to examine. CEs provide early indications that a - DIMM is beginning to fail. This count field should be - monitored for non-zero values and report such information - to the system administrator. - - -- ``size_mb`` - Total memory managed by this csrow attribute file - - This attribute file displays, in count of megabytes, the memory - that this csrow contains. - - -- ``mem_type`` - Memory Type attribute file - - This attribute file will display what type of memory is currently - on this csrow. Normally, either buffered or unbuffered memory. 
- Examples: - - - Registered-DDR - - Unbuffered-DDR - - -- ``edac_mode`` - EDAC Mode of operation attribute file - - This attribute file will display what type of Error detection - and correction is being utilized. - - -- ``dev_type`` - Device type attribute file - - This attribute file will display what type of DRAM device is - being utilized on this DIMM. - Examples: - - - x1 - - x2 - - x4 - - x8 - - -- ``ch0_ce_count`` - Channel 0 CE Count attribute file - - This attribute file will display the count of CEs on this - DIMM located in channel 0. - - -- ``ch0_ue_count`` - Channel 0 UE Count attribute file - - This attribute file will display the count of UEs on this - DIMM located in channel 0. - - -- ``ch0_dimm_label`` - Channel 0 DIMM Label control file - - - This control file allows this DIMM to have a label assigned - to it. With this label in the module, when errors occur - the output can provide the DIMM label in the system log. - This becomes vital for panic events to isolate the - cause of the UE event. - - DIMM Labels must be assigned after booting, with information - that correctly identifies the physical slot with its - silk screen label. This information is currently very - motherboard specific and determination of this information - must occur in userland at this time. - - -- ``ch1_ce_count`` - Channel 1 CE Count attribute file - - - This attribute file will display the count of CEs on this - DIMM located in channel 1. - - -- ``ch1_ue_count`` - Channel 1 UE Count attribute file - - - This attribute file will display the count of UEs on this - DIMM located in channel 0. - - -- ``ch1_dimm_label`` - Channel 1 DIMM Label control file - - This control file allows this DIMM to have a label assigned - to it. With this label in the module, when errors occur - the output can provide the DIMM label in the system log. - This becomes vital for panic events to isolate the - cause of the UE event. 
- - DIMM Labels must be assigned after booting, with information - that correctly identifies the physical slot with its - silk screen label. This information is currently very - motherboard specific and determination of this information - must occur in userland at this time. - System Logging -------------- diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index eb9452668909..77fec1de6dc8 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -53,7 +53,7 @@ Documentation these typically contain kernel-specific installation notes for some drivers for example. Please read the :ref:`Documentation/process/changes.rst <changes>` file, as it - contains information about the problems, which may result by upgrading + contains information about the problems which may result from upgrading your kernel. Installing the kernel source @@ -165,7 +165,7 @@ Configuring the kernel "make xconfig" Qt based configuration tool. - "make gconfig" GTK+ based configuration tool. + "make gconfig" GTK based configuration tool. "make oldconfig" Default all questions based on the contents of your existing ./.config file and asking about @@ -176,7 +176,7 @@ Configuring the kernel values without prompting. "make defconfig" Create a ./.config file by using the default - symbol values from either arch/$ARCH/defconfig + symbol values from either arch/$ARCH/configs/defconfig or arch/$ARCH/configs/${PLATFORM}_defconfig, depending on the architecture. @@ -259,7 +259,7 @@ Configuring the kernel Compiling the kernel -------------------- - - Make sure you have at least gcc 5.1 available. + - Make sure you have at least gcc 8.1 available. For more information, refer to :ref:`Documentation/process/changes.rst <changes>`. - Do a ``make`` to create a compressed kernel image. 
It is also possible to do diff --git a/Documentation/admin-guide/abi-obsolete-files.rst b/Documentation/admin-guide/abi-obsolete-files.rst new file mode 100644 index 000000000000..3061a916b4b5 --- /dev/null +++ b/Documentation/admin-guide/abi-obsolete-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Obsolete ABI Files +================== + +.. kernel-abi:: obsolete + :no-symbols: diff --git a/Documentation/admin-guide/abi-obsolete.rst b/Documentation/admin-guide/abi-obsolete.rst index 594e697aa1b2..640f3903e847 100644 --- a/Documentation/admin-guide/abi-obsolete.rst +++ b/Documentation/admin-guide/abi-obsolete.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI obsolete symbols ==================== @@ -7,5 +9,5 @@ marked to be removed at some later point in time. The description of the interface will document the reason why it is obsolete and when it can be expected to be removed. -.. kernel-abi:: ABI/obsolete - :rst: +.. kernel-abi:: obsolete + :no-files: diff --git a/Documentation/admin-guide/abi-removed-files.rst b/Documentation/admin-guide/abi-removed-files.rst new file mode 100644 index 000000000000..f1bdfadd2ec4 --- /dev/null +++ b/Documentation/admin-guide/abi-removed-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Removed ABI Files +================= + +.. kernel-abi:: removed + :no-symbols: diff --git a/Documentation/admin-guide/abi-removed.rst b/Documentation/admin-guide/abi-removed.rst index f9e000c81828..88832d3eacd6 100644 --- a/Documentation/admin-guide/abi-removed.rst +++ b/Documentation/admin-guide/abi-removed.rst @@ -1,5 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI removed symbols =================== -.. kernel-abi:: ABI/removed - :rst: +.. 
kernel-abi:: removed + :no-files: diff --git a/Documentation/admin-guide/abi-stable-files.rst b/Documentation/admin-guide/abi-stable-files.rst new file mode 100644 index 000000000000..f867738fc178 --- /dev/null +++ b/Documentation/admin-guide/abi-stable-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Stable ABI Files +================ + +.. kernel-abi:: stable + :no-symbols: diff --git a/Documentation/admin-guide/abi-stable.rst b/Documentation/admin-guide/abi-stable.rst index fc3361d847b1..528c68401f4b 100644 --- a/Documentation/admin-guide/abi-stable.rst +++ b/Documentation/admin-guide/abi-stable.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI stable symbols ================== @@ -10,5 +12,5 @@ for at least 2 years. Most interfaces (like syscalls) are expected to never change and always be available. -.. kernel-abi:: ABI/stable - :rst: +.. kernel-abi:: stable + :no-files: diff --git a/Documentation/admin-guide/abi-testing-files.rst b/Documentation/admin-guide/abi-testing-files.rst new file mode 100644 index 000000000000..1da868e42fdb --- /dev/null +++ b/Documentation/admin-guide/abi-testing-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Testing ABI Files +================= + +.. kernel-abi:: testing + :no-symbols: diff --git a/Documentation/admin-guide/abi-testing.rst b/Documentation/admin-guide/abi-testing.rst index 19767926b344..6153ebd38e2d 100644 --- a/Documentation/admin-guide/abi-testing.rst +++ b/Documentation/admin-guide/abi-testing.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI testing symbols =================== @@ -16,5 +18,5 @@ Programs that use these interfaces are strongly encouraged to add their name to the description of these interfaces, so that the kernel developers can easily notify them if any changes occur. -.. kernel-abi:: ABI/testing - :rst: +.. 
kernel-abi:: testing + :no-files: diff --git a/Documentation/admin-guide/abi.rst b/Documentation/admin-guide/abi.rst index bcab3ef2597c..c6039359e585 100644 --- a/Documentation/admin-guide/abi.rst +++ b/Documentation/admin-guide/abi.rst @@ -1,7 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + ===================== Linux ABI description ===================== +.. kernel-abi:: README + +ABI symbols +----------- + .. toctree:: :maxdepth: 2 @@ -9,3 +16,14 @@ Linux ABI description abi-testing abi-obsolete abi-removed + +ABI files +--------- + +.. toctree:: + :maxdepth: 2 + + abi-stable-files + abi-testing-files + abi-obsolete-files + abi-removed-files diff --git a/Documentation/admin-guide/aoe/index.rst b/Documentation/admin-guide/aoe/index.rst index d71c5df15922..564354bbce57 100644 --- a/Documentation/admin-guide/aoe/index.rst +++ b/Documentation/admin-guide/aoe/index.rst @@ -8,10 +8,3 @@ ATA over Ethernet (AoE) aoe todo examples - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/aoe/udev.txt b/Documentation/admin-guide/aoe/udev.txt index 5fb756466bc7..d55ecb411c21 100644 --- a/Documentation/admin-guide/aoe/udev.txt +++ b/Documentation/admin-guide/aoe/udev.txt @@ -2,7 +2,7 @@ # They may be installed along the following lines. Check the section # 8 udev manpage to see whether your udev supports SUBSYSTEM, and # whether it uses one or two equal signs for SUBSYSTEM and KERNEL. 
-# +# # ecashin@makki ~$ su # Password: # bash# find /etc -type f -name udev.conf @@ -13,7 +13,7 @@ # 10-wacom.rules 50-udev.rules # bash# cp /path/to/linux/Documentation/admin-guide/aoe/udev.txt \ # /etc/udev/rules.d/60-aoe.rules -# +# # aoe char devices SUBSYSTEM=="aoe", KERNEL=="discover", NAME="etherd/%k", GROUP="disk", MODE="0220" @@ -22,5 +22,5 @@ SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="02 SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220" SUBSYSTEM=="aoe", KERNEL=="flush", NAME="etherd/%k", GROUP="disk", MODE="0220" -# aoe block devices +# aoe block devices KERNEL=="etherd*", GROUP="disk" diff --git a/Documentation/admin-guide/auxdisplay/index.rst b/Documentation/admin-guide/auxdisplay/index.rst index e466f0595248..31eae08255fd 100644 --- a/Documentation/admin-guide/auxdisplay/index.rst +++ b/Documentation/admin-guide/auxdisplay/index.rst @@ -7,10 +7,3 @@ Auxiliary Display Support ks0108.rst cfag12864b.rst - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst index 6fdb495ac466..325816edbdab 100644 --- a/Documentation/admin-guide/bcache.rst +++ b/Documentation/admin-guide/bcache.rst @@ -17,8 +17,7 @@ The latest bcache kernel code can be found from mainline Linux kernel: It's designed around the performance characteristics of SSDs - it only allocates in erase block sized buckets, and it uses a hybrid btree/log to track cached extents (which can be anywhere from a single sector to the bucket size). It's -designed to avoid random writes at all costs; it fills up an erase block -sequentially, then issues a discard before reusing it. +designed to avoid random writes at all costs. Both writethrough and writeback caching are supported. Writeback defaults to off, but can be switched on and off arbitrarily at runtime. 
Bcache goes to @@ -618,19 +617,11 @@ bucket_size cache_replacement_policy One of either lru, fifo or random. -discard - Boolean; if on a discard/TRIM will be issued to each bucket before it is - reused. Defaults to off, since SATA TRIM is an unqueued command (and thus - slow). - freelist_percent - Size of the freelist as a percentage of nbuckets. Can be written to to + Size of the freelist as a percentage of nbuckets. Can be written to increase the number of buckets kept on the freelist, which lets you artificially reduce the size of the cache at runtime. Mostly for testing - purposes (i.e. testing how different size caches affect your hit rate), but - since buckets are discarded when they move on to the freelist will also make - the SSD's garbage collection easier by effectively giving it more reserved - space. + purposes (i.e. testing how different size caches affect your hit rate). io_errors Number of errors that have occurred, decayed by io_error_halflife. diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst index 957ccf617797..3262397ebe8f 100644 --- a/Documentation/admin-guide/blockdev/index.rst +++ b/Documentation/admin-guide/blockdev/index.rst @@ -11,6 +11,7 @@ Block Devices nbd paride ramdisk + zoned_loop zram drbd/index diff --git a/Documentation/admin-guide/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst index e85ad37cc0e5..b2f627d4c2f8 100644 --- a/Documentation/admin-guide/blockdev/paride.rst +++ b/Documentation/admin-guide/blockdev/paride.rst @@ -118,7 +118,7 @@ and high-level drivers that you would use: ================ ============ ======== All parports and all protocol drivers are probed automatically unless probe=0 -parameter is used. So just "modprobe epat" is enough for a Imation SuperDisk +parameter is used. So just "modprobe epat" is enough for an Imation SuperDisk drive to work. 
Manual device creation:: diff --git a/Documentation/admin-guide/blockdev/zoned_loop.rst b/Documentation/admin-guide/blockdev/zoned_loop.rst new file mode 100644 index 000000000000..f4f1f3121bf9 --- /dev/null +++ b/Documentation/admin-guide/blockdev/zoned_loop.rst @@ -0,0 +1,190 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Zoned Loop Block Device +======================= + +.. Contents: + + 1) Overview + 2) Creating a Zoned Device + 3) Deleting a Zoned Device + 4) Example + + +1) Overview +----------- + +The zoned loop block device driver (zloop) allows a user to create a zoned block +device using one regular file per zone as backing storage. This driver does not +directly control any hardware and uses read, write and truncate operations on +regular files of a file system to emulate a zoned block device. + +Using zloop, zoned block devices with a configurable capacity, zone size and +number of conventional zones can be created. The storage for each zone of the +device is implemented using a regular file with a maximum size equal to the zone +size. The size of a file backing a conventional zone is always equal to the zone +size. The size of a file backing a sequential zone indicates the amount of data +sequentially written to the file, that is, the size of the file directly +indicates the position of the write pointer of the zone. + +When resetting a sequential zone, its backing file size is truncated to zero. +Conversely, for a zone finish operation, the backing file is truncated to the +zone size. With this, the maximum capacity of a zloop zoned block device +can be configured to be larger than the storage space available on the +backing file system. Of course, for such a configuration, writing more data than +the storage space available on the backing file system will result in write +errors. + +The zoned loop block device driver implements a complete zone transition state +machine. 
That is, zones can be empty, implicitly opened, explicitly opened, +closed or full. The current implementation does not support any limits on the +maximum number of open and active zones. + +No user tools are necessary to create and delete zloop devices. + +2) Creating a Zoned Device +-------------------------- + +Once the zloop module is loaded (or if zloop is compiled in the kernel), the +character device file /dev/zloop-control can be used to add a zloop device. +This is done by writing an "add" command directly to the /dev/zloop-control +device:: + + $ modprobe zloop + $ ls -l /dev/zloop* + crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control + + $ mkdir -p <base directory>/<device ID> + $ echo "add [options]" > /dev/zloop-control + +The options available for the add command can be listed by reading the +/dev/zloop-control device:: + + $ cat /dev/zloop-control + add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,max_open_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io,zone_append=%u,ordered_zone_append,discard_write_cache + remove id=%d + +In more detail, the options that can be used with the "add" command are as +follows. + +=================== ========================================================= +id Device number (the X in /dev/zloopX). + Default: automatically assigned. +capacity_mb Device total capacity in MiB. This is always rounded up + to the nearest higher multiple of the zone size. + Default: 16384 MiB (16 GiB). +zone_size_mb Device zone size in MiB. Default: 256 MiB. +zone_capacity_mb Device zone capacity (must always be equal to or lower + than the zone size). Default: zone size. +conv_zones Total number of conventional zones starting from + sector 0. + Default: 8 +max_open_zones Maximum number of open sequential write required zones + (0 for no limit). + Default: 0 +base_dir Path to the base directory where to create the directory + containing the zone files of the device. + Default: /var/local/zloop. 
+ The device directory containing the zone files is always + named with the device ID. E.g. the default zone file + directory for /dev/zloop0 is /var/local/zloop/0. +nr_queues Number of I/O queues of the zoned block device. This + value is always capped by the number of online CPUs. + Default: 1 +queue_depth Maximum I/O queue depth per I/O queue. + Default: 64 +buffered_io Do buffered IOs instead of direct IOs (default: false) +zone_append Enable or disable a zloop device native zone append + support. + Default: 1 (enabled). + If native zone append support is disabled, the block layer + will emulate this operation using regular write + operations. +ordered_zone_append Enable zloop mitigation of zone append reordering. + Default: disabled. + This is useful for testing file systems' file data mapping + (extents), as when enabled, this can significantly reduce + the number of data extents needed for a file data + mapping. +discard_write_cache Discard all data that was not explicitly persisted using a + flush operation when the device is removed by truncating + each zone file to the size recorded during the last flush + operation. This simulates power fail events where + uncommitted data is lost. +=================== ========================================================= + +3) Deleting a Zoned Device +-------------------------- + +Deleting an unused zoned loop block device is done by issuing the "remove" +command to /dev/zloop-control, specifying the ID of the device to remove:: + + $ echo "remove id=X" > /dev/zloop-control + +The remove command does not take any options. + +A zoned device that was removed can be re-added without any change to the +state of the device zones: the device zones are restored to their last state +before the device was removed. Adding again a zoned device after it was removed +must always be done using the same configuration as when the device was first +added. 
If a zone configuration change is detected, an error will be returned and +the zoned device will not be created. + +To fully delete a zoned device, after executing the remove operation, the device +base directory containing the backing files of the device zones must be deleted. + +4) Example +---------- + +The following sequence of commands creates a 2GB zoned device with zones of 64 +MB and a zone capacity of 63 MB:: + + $ modprobe zloop + $ mkdir -p /var/local/zloop/0 + $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity_mb=63" > /dev/zloop-control + +For the device created (/dev/zloop0), the zone backing files are all created +under the default base directory (/var/local/zloop):: + + $ ls -l /var/local/zloop/0 + total 0 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000008 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000009 + ... 
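Since the size of a sequential zone file tracks the zone write pointer, the zone state can be inspected from the backing files alone. The following is a minimal sketch only, assuming the ``seq-NNNNNN`` file naming shown in the listing above; ``zloop_wp_report`` is a hypothetical helper name and is not part of zloop.

```shell
#!/bin/sh
# Hypothetical helper (not part of zloop): for each sequential zone file in a
# zloop device directory, print the write pointer as a count of 512-byte
# sectors relative to the start of the zone.
# $1: device directory, e.g. /var/local/zloop/0
zloop_wp_report() {
    dir=$1
    for f in "$dir"/seq-*; do
        [ -e "$f" ] || continue      # skip if no sequential zone file exists
        size=$(wc -c < "$f")         # file size in bytes == bytes written
        echo "$(basename "$f") wptr_sectors=$((size / 512))"
    done
}
```

For example, ``zloop_wp_report /var/local/zloop/0`` would print one line per sequential zone of the device created above.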
+ +The zoned device created (/dev/zloop0) can then be used normally:: + + $ lsblk -z + NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN + zloop0 host-managed 64M 32 0 0 1M 4K + $ blkzone report /dev/zloop0 + start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + ... + +Deleting this device is done using the command:: + + $ echo "remove id=0" > /dev/zloop-control + +The removed device can be re-added again using the same "add" command as when +the device was first created. 
To fully delete a zoned device, its backing files +should also be deleted after executing the remove command:: + + $ rm -r /var/local/zloop/0 diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 1576fb93f06c..60b07a7e30cd 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -54,7 +54,7 @@ The list of possible return codes: If you use 'echo', the returned value is set by the 'echo' utility, and, in general case, something like:: - echo 3 > /sys/block/zram0/max_comp_streams + echo foo > /sys/block/zram0/comp_algorithm if [ $? -ne 0 ]; then handle_error fi @@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3} num_devices parameter is optional and tells zram how many devices should be pre-created. Default: 1. -2) Set max number of compression streams -======================================== - -Regardless of the value passed to this attribute, ZRAM will always -allocate multiple compression streams - one per online CPU - thus -allowing several concurrent compression operations. The number of -allocated compression streams goes down when some of the CPUs -become offline. There is no single-compression-stream mode anymore, -unless you are running a UP system or have only 1 CPU online. - -To find out how many streams are currently available:: - - cat /sys/block/zram0/max_comp_streams - -3) Select compression algorithm +2) Select compression algorithm =============================== Using comp_algorithm device attribute one can see available and @@ -107,7 +93,7 @@ Examples:: For the time being, the `comp_algorithm` content shows only compression algorithms that are supported by zram. 
-4) Set compression algorithm parameters: Optional +3) Set compression algorithm parameters: Optional ================================================= Compression algorithms may support specific parameters which can be @@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some algorithms), for other algorithms `level` is acceleration level (the higher the value the lower the compression ratio). -5) Set Disksize +4) Set Disksize =============== Set disk size by writing the value to sysfs node 'disksize'. @@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful. -6) Set memory limit: Optional +5) Set memory limit: Optional ============================= Set memory limit by writing the value to sysfs node 'mem_limit'. @@ -177,7 +163,7 @@ Examples:: # To disable memory limit echo 0 > /sys/block/zram0/mem_limit -7) Activate +6) Activate =========== :: @@ -188,7 +174,7 @@ Examples:: mkfs.ext4 /dev/zram1 mount /dev/zram1 /tmp -8) Add/remove zram devices +7) Add/remove zram devices ========================== zram provides a control interface, which enables dynamic (on-demand) device @@ -208,7 +194,7 @@ execute:: echo X > /sys/class/zram-control/hot_remove -9) Stats +8) Stats ======== Per-device statistics are exported as various nodes under /sys/block/zram<id>/ @@ -228,8 +214,9 @@ mem_limit WO specifies the maximum amount of memory ZRAM can writeback_limit WO specifies the maximum amount of write IO zram can write out to backing device as 4KB unit writeback_limit_enable RW show and set writeback_limit feature -max_comp_streams RW the number of possible concurrent compress - operations +writeback_batch_size RW show and set maximum number of in-flight + writeback operations +compressed_writeback RW show and set compressed writeback feature comp_algorithm RW show 
and change the compression algorithm algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction @@ -238,7 +225,6 @@ backing_dev RW set up backend storage for zram to write out idle WO mark allocated slot as idle ====================== ====== =============================================== - User space is advised to use the following files to read the device statistics. File /sys/block/zram<id>/stat @@ -310,7 +296,7 @@ a single line of text and contains the following stats separated by whitespace: Unit: 4K bytes ============== ============================================================= -10) Deactivate +9) Deactivate ============== :: @@ -318,7 +304,7 @@ a single line of text and contains the following stats separated by whitespace: swapoff /dev/zram0 umount /dev/zram1 -11) Reset +10) Reset ========= Write any positive value to 'reset' sysfs node:: @@ -333,6 +319,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. 
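The age-based form takes a value in seconds, so a small wrapper can make the intent easier to read. This is a minimal sketch under stated assumptions: ``zram_mark_idle_older_than`` and its arguments are hypothetical, and the age-based write is only expected to work when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled.

```shell
#!/bin/sh
# Hypothetical helper: mark as idle all pages not accessed for N hours.
# $1: zram sysfs directory (e.g. /sys/block/zram0)
# $2: age in hours
zram_mark_idle_older_than() {
    dev=$1
    hours=$2
    secs=$((hours * 3600))           # the idle attribute expects seconds
    # A numeric value needs CONFIG_ZRAM_TRACK_ENTRY_ACTIME; writing "all"
    # instead marks every allocated page as idle.
    echo "$secs" > "$dev/idle"
}
```

With these assumptions, ``zram_mark_idle_older_than /sys/block/zram0 24`` writes 86400, matching the one-day example above.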
+ writeback --------- @@ -347,24 +353,7 @@ If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: - - echo all > /sys/block/zramX/idle - -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be -marked as idle based on how long (in seconds) it's been since they were -last accessed:: - - echo 86400 > /sys/block/zramX/idle - -In this example all pages which haven't been accessed in more than 86400 -seconds (one day) will be marked idle. - -Admin can request writeback of those idle pages at right timing via:: +Admin can request writeback of idle pages at right timing via:: echo idle > /sys/block/zramX/writeback @@ -385,6 +374,23 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. 
Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. @@ -430,13 +436,33 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. +By default zram stores written back pages in decompressed (raw) form, which +means that writeback operation involves decompression of the page before +writing it to the backing device. This behavior can be changed by enabling +`compressed_writeback` feature, which causes zram to write compressed pages +to the backing device, thus avoiding decompression overhead. To enable +this feature, execute:: + + $ echo yes > /sys/block/zramX/compressed_writeback + +Note that this feature should be configured before the `zramX` device is +initialized. + +Depending on backing device storage type, writeback operation may benefit +from a higher number of in-flight write requests (batched writes). The +number of maximum in-flight writeback operations can be configured via +`writeback_batch_size` attribute. To change the default value (which is 32), +execute:: + + $ echo 64 > /sys/block/zramX/writeback_batch_size + If admin wants to measure writeback count in a certain period, they could know it via /sys/block/zram0/bd_stat's 3rd column. 
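Several page ranges can be sent in a single write by repeating the ``page_indexes`` parameter, as described above. The following sketch builds such a request from a list of ``LOW-HIGH`` ranges; ``zram_writeback_ranges`` is a hypothetical helper name, and the ``page_indexes`` syntax is assumed to be available only on kernels that include the reworked interface.

```shell
#!/bin/sh
# Hypothetical helper: request writeback of one or more page ranges with a
# single write to the writeback attribute.
# $1: zram sysfs directory (e.g. /sys/block/zram0)
# remaining arguments: LOW-HIGH page ranges
zram_writeback_ranges() {
    dev=$1
    shift
    req=""
    for r in "$@"; do
        # Join the parameters with single spaces: "page_indexes=A page_indexes=B"
        req="${req:+$req }page_indexes=$r"
    done
    [ -n "$req" ] || return 1        # nothing to write back
    echo "$req" > "$dev/writeback"
}
```

Issuing one write for all ranges, rather than one write per page, is what reduces the number of syscalls mentioned above.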
recompression ------------- -With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative +With `CONFIG_ZRAM_MULTI_COMP`, zram can recompress pages using alternative (secondary) compression algorithms. The basic idea is that alternative compression algorithm can provide better compression ratio at a price of (potentially) slower compression/decompression speeds. Alternative compression @@ -445,7 +471,7 @@ that default algorithm failed to compress). Another application is idle pages recompression - pages that are cold and sit in the memory can be recompressed using more effective algorithm and, hence, reduce zsmalloc memory usage. -With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms: +With `CONFIG_ZRAM_MULTI_COMP`, zram supports up to 4 compression algorithms: one primary and up to 3 secondary ones. Primary zram compressor is explained in "3) Select compression algorithm", secondary algorithms are configured using recomp_algorithm device attribute. @@ -469,58 +495,43 @@ configuration::: #select deflate recompression algorithm, priority 2 echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm -Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress, +Another device attribute that `CONFIG_ZRAM_MULTI_COMP` enables is `recompress`, which controls recompression. 
Examples::: #IDLE pages recompression is activated by `idle` mode - echo "type=idle" > /sys/block/zramX/recompress + echo "type=idle priority=1" > /sys/block/zramX/recompress #HUGE pages recompression is activated by `huge` mode - echo "type=huge" > /sys/block/zram0/recompress + echo "type=huge priority=2" > /sys/block/zram0/recompress #HUGE_IDLE pages recompression is activated by `huge_idle` mode - echo "type=huge_idle" > /sys/block/zramX/recompress + echo "type=huge_idle priority=1" > /sys/block/zramX/recompress The number of idle pages can be significant, so user-space can pass a size threshold (in bytes) to the recompress knob: zram will recompress only pages of equal or greater size::: #recompress all pages larger than 3000 bytes - echo "threshold=3000" > /sys/block/zramX/recompress + echo "threshold=3000 priority=1" > /sys/block/zramX/recompress #recompress idle pages larger than 2000 bytes - echo "type=idle threshold=2000" > /sys/block/zramX/recompress + echo "type=idle threshold=2000 priority=1" > \ + /sys/block/zramX/recompress It is also possible to limit the number of pages zram re-compression will attempt to recompress::: - echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress - -Recompression of idle pages requires memory tracking. - -During re-compression for every page, that matches re-compression criteria, -ZRAM iterates the list of registered alternative compression algorithms in -order of their priorities. ZRAM stops either when re-compression was -successful (re-compressed object is smaller in size than the original one) -and matches re-compression criteria (e.g. size threshold) or when there are -no secondary algorithms left to try. If none of the secondary algorithms can -successfully re-compressed the page such a page is marked as incompressible, -so ZRAM will not attempt to re-compress it in the future. 
- -This re-compression behaviour, when it iterates through the list of -registered compression algorithms, increases our chances of finding the -algorithm that successfully compresses a particular page. Sometimes, however, -it is convenient (and sometimes even necessary) to limit recompression to -only one particular algorithm so that it will not try any other algorithms. -This can be achieved by providing a `algo` or `priority` parameter::: - - #use zstd algorithm only (if registered) - echo "type=huge algo=zstd" > /sys/block/zramX/recompress + echo "type=huge_idle priority=1 max_pages=42" > \ + /sys/block/zramX/recompress - #use zstd algorithm only (if zstd was registered under priority 1) - echo "type=huge priority=1" > /sys/block/zramX/recompress +It is advised to always specify `priority` parameter. While it is also +possible to specify `algo` parameter, so that `zram` will use algorithm's +name to determine the priority, it is not recommended, since it can lead to +unexpected results when the same algorithm is configured with different +priorities (e.g. different parameters). `priority` is the only way to +guarantee that the expected algorithm will be used. memory tracking =============== diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index 91339efdcb54..f712758472d5 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -20,18 +20,26 @@ Config File Syntax The boot config syntax is a simple structured key-value. Each key consists of dot-connected-words, and key and value are connected by ``=``. The value -has to be terminated by semi-colon (``;``) or newline (``\n``). -For array value, array entries are separated by comma (``,``). :: - - KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] - -Unlike the kernel command line syntax, spaces are OK around the comma and ``=``. +string has to be terminated by the following delimiters described below. 
Each key word must contain only alphabets, numbers, dash (``-``) or underscore (``_``). And each value only contains printable characters or spaces except for delimiters such as semi-colon (``;``), new-line (``\n``), comma (``,``), hash (``#``) and closing brace (``}``). +If the ``=`` is followed by whitespace up to one of these delimiters, the +key is assigned an empty value. + +For arrays, the array values are comma (``,``) separated, and comments and +line breaks with newline (``\n``) are allowed between array values for +readability. Thus the first entry of the array must be on the same line as +the key.:: + + KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] + +Unlike the kernel command line syntax, white spaces (including tabs) are +ignored around the comma and ``=``. + If you want to use those delimiters in a value, you can use either double- quotes (``"VALUE"``) or single-quotes (``'VALUE'``) to quote it. Note that you can not escape these quotes. @@ -138,8 +146,8 @@ This is parsed as below:: foo = value bar = 1, 2, 3 -Note that you can not put a comment between value and delimiter(``,`` or -``;``). This means following config has a syntax error :: +Note that you can NOT put a comment or a newline between value and delimiter +(``,`` or ``;``). This means following config has a syntax error :: key = 1 # comment ,2 @@ -265,7 +273,7 @@ The final kernel cmdline will be the following:: Config File Limitation ====================== -Currently the maximum config size size is 32KB and the total key-words (not +Currently the maximum config size is 32KB and the total key-words (not key-value entries) must be under 1024 nodes. Note: this is not the number of entries but nodes, an entry must consume more than 2 nodes (a key-word and a value). 
So theoretically, it will be diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index ce6f4e8ca487..3901b43c96df 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -52,14 +52,14 @@ line is usually required to identify and handle the bug. Along this chapter, we'll refer to "Oops" for all kinds of stack traces that need to be analyzed. If the kernel is compiled with ``CONFIG_DEBUG_INFO``, you can enhance the -quality of the stack trace by using file:`scripts/decode_stacktrace.sh`. +quality of the stack trace by using ``scripts/decode_stacktrace.sh``. Modules linked in ----------------- Modules that are tainted or are being loaded or unloaded are marked with "(...)", where the taint flags are described in -file:`Documentation/admin-guide/tainted-kernels.rst`, "being loaded" is +Documentation/admin-guide/tainted-kernels.rst, "being loaded" is annotated with "+", and "being unloaded" is annotated with "-". @@ -196,7 +196,7 @@ will see the assembler code for the routine shown, but if your kernel has debug symbols the C code will also be available. (Debug symbols can be enabled in the kernel hacking menu of the menu configuration.) For example:: - $ objdump -r -S -l --disassemble net/dccp/ipv4.o + $ objdump -r -S -l --disassemble net/ipv4/tcp.o .. note:: @@ -235,7 +235,7 @@ Dave Miller):: mov 0x8(%ebp), %ebx ! %ebx = skb->sk mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt -file:`scripts/decodecode` can be used to automate most of this, depending +``scripts/decodecode`` can be used to automate most of this, depending on what CPU architecture is being debugged. 
Reporting the bug @@ -252,7 +252,7 @@ For example, if you find a bug at the gspca's sonixj.c file, you can get its maintainers with:: $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c - Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) + Hans Verkuil <hverkuil@kernel.org> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%) Bhaktipriya Shridhar <bhaktipriya96@gmail.com> (commit_signer:1/1=100%,authored:1/1=100%,added_lines:4/4=100%,removed_lines:9/9=100%) diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index a3e2edb3d274..463f98453323 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -13,7 +13,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Modified by Paul Jackson <pj@sgi.com> -Modified by Christoph Lameter <cl@linux.com> +Modified by Christoph Lameter <cl@gentwo.org> .. CONTENTS: diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index f401af5e2f09..c7909e5ac136 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -10,7 +10,7 @@ Written by Simon.Derr@bull.net - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. 
- Modified by Paul Jackson <pj@sgi.com> -- Modified by Christoph Lameter <cl@linux.com> +- Modified by Christoph Lameter <cl@gentwo.org> - Modified by Paul Menage <menage@google.com> - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> diff --git a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst index 582d3427de3f..a964aff373b1 100644 --- a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst +++ b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst @@ -125,3 +125,7 @@ to unfreeze all tasks in the container:: This is the basic mechanism which should do the right thing for user space task in a simple scenario. + +This freezer implementation is affected by shortcomings (see commit +76f969e8948d8 ("cgroup: cgroup v2 freezer")) and cgroup v2 freezer is +recommended. diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst index 493a8e386700..b5f3873b7d3a 100644 --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst @@ -77,7 +77,7 @@ control group and enforces the limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to fault in HugeTLB pages beyond its limit. Therefore the application needs to know exactly how many -HugeTLB pages it uses before hand, and the sysadmin needs to make sure that +HugeTLB pages it uses beforehand, and the sysadmin needs to make sure that there are enough available on the machine for all the users to avoid processes getting SIGBUS. @@ -91,23 +91,23 @@ getting SIGBUS. 
hugetlb.<hugepagesize>.rsvd.usage_in_bytes hugetlb.<hugepagesize>.rsvd.failcnt -The HugeTLB controller allows to limit the HugeTLB reservations per control +The HugeTLB controller allows limiting the HugeTLB reservations per control group and enforces the controller limit at reservation time and at the fault of HugeTLB memory for which no reservation exists. Since reservation limits are -enforced at reservation time (on mmap or shget), reservation limits never causes -the application to get SIGBUS signal if the memory was reserved before hand. For +enforced at reservation time (on mmap or shget), reservation limits never cause +the application to get SIGBUS signal if the memory was reserved beforehand. For MAP_NORESERVE allocations, the reservation limit behaves the same as the fault limit, enforcing memory usage at fault time and causing the application to receive a SIGBUS if it's crossing its limit. Reservation limits are superior to page fault limits described above, since reservation limits are enforced at reservation time (on mmap or shget), and -never causes the application to get SIGBUS signal if the memory was reserved -before hand. This allows for easier fallback to alternatives such as +never cause the application to get SIGBUS signal if the memory was reserved +beforehand. This allows for easier fallback to alternatives such as non-HugeTLB memory for example. In the case of page fault accounting, it's very -hard to avoid processes getting SIGBUS since the sysadmin needs precisely know -the HugeTLB usage of all the tasks in the system and make sure there is enough -pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommited +hard to avoid processes getting SIGBUS since the sysadmin needs to precisely know +the HugeTLB usage of all the tasks in the system and make sure there are enough +pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommitted systems is practically impossible with page fault accounting. 
diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst index 99fbc8a64ba9..14897a8d32b3 100644 --- a/Documentation/admin-guide/cgroup-v1/index.rst +++ b/Documentation/admin-guide/cgroup-v1/index.rst @@ -22,10 +22,3 @@ Control Groups version 1 net_prio pids rdma - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 9f8e27355cba..7c7cd457cf69 100644 --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -47,21 +47,19 @@ Please note that implementation details can be changed. Called when swp_entry's refcnt goes down to 0. A charge against swap disappears. -3. charge-commit-cancel +3. charge-commit ======================= Memcg pages are charged in two steps: - mem_cgroup_try_charge() - - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge() + - commit_charge() At try_charge(), there are no flags to say "this page is charged". at this point, usage += PAGE_SIZE. At commit(), the page is associated with the memcg. - At cancel(), simply usage -= PAGE_SIZE. - Under below explanation, we assume CONFIG_SWAP=y. 4. Anonymous diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 286d16fc22eb..7db63c002922 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -90,6 +90,7 @@ Brief summary of control files. used. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) + Per memcg knob does not exist in cgroup v2. memory.move_charge_at_immigrate This knob is deprecated. memory.oom_control set/show oom controls. 
This knob is deprecated and shouldn't be @@ -310,9 +311,8 @@ Lock order is as follows:: folio_lock mm->page_table_lock or split pte_lock - folio_memcg_lock (memcg->move_lock) - mapping->i_pages lock - lruvec->lru_lock. + mapping->i_pages lock + lruvec->lru_lock. Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; the folio LRU flag is cleared before @@ -609,6 +609,10 @@ memory.stat file includes following statistics: 'rss + mapped_file" will give you resident set size of cgroup. + Note that some kernel configurations might account complete larger + allocations (e.g., THP) towards 'rss' and 'mapped_file', even if + only some, but not all that memory is mapped. + (Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index cb1b4e759b7e..6efd0095ed99 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -15,6 +15,9 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou .. CONTENTS + [Whenever any new section is added to this document, please also add + an entry here.] + 1. Introduction 1-1. Terminology 1-2. What is cgroup? @@ -25,9 +28,10 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 2-2-2. Threads 2-3. [Un]populated Notification 2-4. Controlling Controllers - 2-4-1. Enabling and Disabling - 2-4-2. Top-down Constraint - 2-4-3. No Internal Process Constraint + 2-4-1. Availability + 2-4-2. Enabling and Disabling + 2-4-3. Top-down Constraint + 2-4-4. No Internal Process Constraint 2-5. Delegation 2-5-1. Model of Delegation 2-5-2. Delegation Containment @@ -49,7 +53,8 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-2. Memory 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines - 5-2-3. 
Memory Ownership + 5-2-3. Reclaim Protection + 5-2-4. Memory Ownership 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -61,14 +66,15 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-4-1. PID Interface Files 5-5. Cpuset 5.5-1. Cpuset Interface Files - 5-6. Device + 5-6. Device controller 5-7. RDMA 5-7-1. RDMA Interface Files 5-8. DMEM + 5-8-1. DMEM Interface Files 5-9. HugeTLB 5.9-1. HugeTLB Interface Files 5-10. Misc - 5.10-1 Miscellaneous cgroup Interface Files + 5.10-1 Misc Interface Files 5.10-2 Migration and Ownership 5-11. Others 5-11-1. perf_event @@ -214,7 +220,7 @@ cgroup v2 currently supports the following mount options. memory_hugetlb_accounting Count HugeTLB memory usage towards the cgroup's overall memory usage for the memory controller (for the purpose of - statistics reporting and memory protetion). This is a new + statistics reporting and memory protection). This is a new behavior that could regress existing setups, so it must be explicitly opted in with this mount option. @@ -435,6 +441,15 @@ both cgroups. Controlling Controllers ----------------------- +Availability +~~~~~~~~~~~~ + +A controller is available in a cgroup when it is supported by the kernel (i.e., +compiled in, not disabled and not attached to a v1 hierarchy) and listed in the +"cgroup.controllers" file. Availability means the controller's interface files +are exposed in the cgroup’s directory, allowing the distribution of the target +resource to be observed or controlled within that cgroup. + Enabling and Disabling ~~~~~~~~~~~~~~~~~~~~~~ @@ -722,9 +737,6 @@ combinations are invalid and should be rejected. Also, if the resource is mandatory for execution of processes, process migrations may be rejected. -"cpu.rt.max" hard-allocates realtime slices and is an example of this -type. - Interface Files =============== @@ -992,6 +1004,24 @@ All cgroup core files are prefixed with "cgroup." Total number of dying cgroup subsystems (e.g. 
memory cgroup) at and beneath the current cgroup. + cgroup.stat.local + A read-only flat-keyed file which exists in non-root cgroups. + The following entry is defined: + + frozen_usec + Cumulative time that this cgroup has spent between freezing and + thawing, regardless of whether by self or ancestor groups. + NB: (not) reaching "frozen" state is not accounted here. + + Using the following ASCII representation of a cgroup's freezer + state, :: + + 1 _____ + frozen 0 __/ \__ + ab cd + + the duration being measured is the span between a and c. + cgroup.freeze A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0". @@ -1076,33 +1106,53 @@ cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU. -WARNING: cgroup2 doesn't yet support control of realtime processes. For -a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group -scheduling of realtime processes, the cpu controller can only be enabled -when all RT processes are in the root cgroup. This limitation does -not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system -management software may already have placed RT processes into nonroot -cgroups during the system boot process, and these processes may need -to be moved to the root cgroup before the cpu controller can be enabled -with a CONFIG_RT_GROUP_SCHED enabled kernel. +WARNING: cgroup2 cpu controller doesn't yet support the (bandwidth) control of +realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option +enabled for group scheduling of realtime processes, the cpu controller can only +be enabled when all RT processes are in the root cgroup. 
Be aware that system +management software may already have placed RT processes into non-root cgroups +during the system boot process, and these processes may need to be moved to the +root cgroup before the cpu controller can be enabled with a +CONFIG_RT_GROUP_SCHED enabled kernel. + +With CONFIG_RT_GROUP_SCHED disabled, this limitation does not apply and some of +the interface files either affect realtime processes or account for them. See +the following section for details. Only the cpu controller is affected by +CONFIG_RT_GROUP_SCHED. Other controllers can be used for the resource control of +realtime processes irrespective of CONFIG_RT_GROUP_SCHED. CPU Interface Files ~~~~~~~~~~~~~~~~~~~ -All time durations are in microseconds. +The interaction of a process with the cpu controller depends on its scheduling +policy and the underlying scheduler. From the point of view of the cpu controller, +processes can be categorized as follows: + +* Processes under the fair-class scheduler +* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback +* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler + without the ``cgroup_set_weight`` callback + +For details on when a process is under the fair-class scheduler or a BPF scheduler, +check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`. + +For each of the following interface files, the above categories +will be referred to. All time durations are in microseconds. cpu.stat A read-only flat-keyed file. This file exists whether the controller is enabled or not. 
- It always reports the following three stats: + It always reports the following three stats, which account for all the + processes in the cgroup: - usage_usec - user_usec - system_usec - and the following five when the controller is enabled: + and the following five when the controller is enabled, which account for + only the processes under the fair-class scheduler: - nr_periods - nr_throttled @@ -1120,6 +1170,10 @@ All time durations are in microseconds. If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1), then the weight will show as a 0. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.weight.nice A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1132,6 +1186,10 @@ All time durations are in microseconds. granularity is coarser for the nice values, the read value is the closest approximation of the current weight. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.max A read-write two value file which exists on non-root cgroups. The default is "max 100000". @@ -1144,43 +1202,55 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + This file affects only processes under the fair-class scheduler. + cpu.max.burst A read-write single value file which exists on non-root cgroups. The default is "0". The burst in the range [0, $MAX]. + This file affects only processes under the fair-class scheduler. + cpu.pressure A read-write nested-keyed file. Shows pressure stall information for CPU. See :ref:`Documentation/accounting/psi.rst <psi>` for details. + This file accounts for all the processes in the cgroup. 
+ cpu.uclamp.min - A read-write single value file which exists on non-root cgroups. - The default is "0", i.e. no utilization boosting. + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no utilization boosting. - The requested minimum utilization (protection) as a percentage - rational number, e.g. 12.34 for 12.34%. + The requested minimum utilization (protection) as a percentage + rational number, e.g. 12.34 for 12.34%. - This interface allows reading and setting minimum utilization clamp - values similar to the sched_setattr(2). This minimum utilization - value is used to clamp the task specific minimum utilization clamp. + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp, + including those of realtime processes. - The requested minimum utilization (protection) is always capped by - the current value for the maximum utilization (limit), i.e. - `cpu.uclamp.max`. + The requested minimum utilization (protection) is always capped by + the current value for the maximum utilization (limit), i.e. + `cpu.uclamp.max`. + + This file affects all the processes in the cgroup. cpu.uclamp.max - A read-write single value file which exists on non-root cgroups. - The default is "max". i.e. no utilization capping + A read-write single value file which exists on non-root cgroups. + The default is "max". i.e. no utilization capping - The requested maximum utilization (limit) as a percentage rational - number, e.g. 98.76 for 98.76%. + The requested maximum utilization (limit) as a percentage rational + number, e.g. 98.76 for 98.76%. - This interface allows reading and setting maximum utilization clamp - values similar to the sched_setattr(2). This maximum utilization - value is used to clamp the task specific maximum utilization clamp. 
+ This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp, + including those of realtime processes. + + This file affects all the processes in the cgroup. cpu.idle A read-write single value file which exists on non-root cgroups. @@ -1192,7 +1262,7 @@ All time durations are in microseconds. own relative priorities, but the cgroup itself will be treated as very low priority relative to its peers. - + This file affects only processes under the fair-class scheduler. Memory ------ @@ -1245,7 +1315,7 @@ PAGE_SIZE multiple when read back. smaller overages. Effective min boundary is limited by memory.min values of - all ancestor cgroups. If there is memory.min overcommitment + ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its @@ -1254,9 +1324,6 @@ PAGE_SIZE multiple when read back. Putting more memory than generally available under this protection is discouraged and may lead to constant OOMs. - If a memory cgroup is not populated with processes, - its memory.min is ignored. - memory.low A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1271,7 +1338,7 @@ PAGE_SIZE multiple when read back. smaller overages. Effective low boundary is limited by memory.low values of - all ancestor cgroups. If there is memory.low overcommitment + ancestor cgroups. If there is memory.low overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its @@ -1294,6 +1361,18 @@ PAGE_SIZE multiple when read back. monitors the limited cgroup to alleviate heavy reclaim pressure. 
+ If memory.high is opened with O_NONBLOCK then the synchronous + reclaim is bypassed. This is useful for admin processes that + need to dynamically adjust the job's memory limits without + expending their own CPU resources on memory reclamation. The + job will trigger the reclaim and/or get throttled on its + next charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.max A read-write single value file which exists on non-root cgroups. The default is "max". @@ -1311,6 +1390,18 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. + If memory.max is opened with O_NONBLOCK, then the synchronous + reclaim and oom-kill are bypassed. This is useful for admin + processes that need to dynamically adjust the job's memory limits + without expending their own CPU resources on memory reclamation. + The job will trigger the reclaim and/or oom-kill on its next + charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.reclaim A write-only nested-keyed file which exists for all cgroups. @@ -1343,6 +1434,9 @@ The following nested keys are defined. same semantics as vm.swappiness applied to memcg reclaim with all the existing limitations and potential future extensions. + The valid range for swappiness is [0-200, max], setting + swappiness=max exclusively reclaims anonymous memory. + memory.peak A read-write single value file which exists on non-root cgroups. @@ -1416,6 +1510,10 @@ The following nested keys are defined. 
oom_group_kill The number of times a group OOM has occurred. + sock_throttled + The number of times network sockets associated with + this cgroup are throttled. + memory.events.local Similar to memory.events but the fields in the file are local to the cgroup i.e. not hierarchical. The file modified event @@ -1440,7 +1538,10 @@ The following nested keys are defined. anon Amount of memory used in anonymous mappings such as - brk(), sbrk(), and mmap(MAP_ANONYMOUS) + brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that + some kernel configurations might account complete larger + allocations (e.g., THP) if only some, but not all the + memory of such an allocation is mapped anymore. file Amount of memory used to cache filesystem data, @@ -1483,7 +1584,10 @@ The following nested keys are defined. Amount of application memory swapped out to zswap. file_mapped - Amount of cached filesystem data mapped with mmap() + Amount of cached filesystem data mapped with mmap(). Note + that some kernel configurations might account complete + larger allocations (e.g., THP) if only some, but not + all the memory of such an allocation is mapped. file_dirty Amount of cached filesystem data that was modified but @@ -1555,6 +1659,12 @@ The following nested keys are defined. workingset_nodereclaim Number of times a shadow node has been reclaimed + pswpin (npn) + Number of pages swapped into memory + + pswpout (npn) + Number of pages swapped out of memory + pgscan (npn) Amount of scanned pages (in an inactive LRU list) @@ -1570,6 +1680,9 @@ The following nested keys are defined. pgscan_khugepaged (npn) Amount of scanned pages by khugepaged (in an inactive LRU list) + pgscan_proactive (npn) + Amount of scanned pages proactively (in an inactive LRU list) + pgsteal_kswapd (npn) Amount of reclaimed pages by kswapd @@ -1579,6 +1692,9 @@ The following nested keys are defined.
pgsteal_khugepaged (npn) Amount of reclaimed pages by khugepaged + pgsteal_proactive (npn) + Amount of reclaimed pages proactively + pgfault (npn) Total number of page faults incurred @@ -1618,6 +1734,11 @@ The following nested keys are defined. zswpwb Number of pages written from zswap to swap. + zswap_incomp + Number of incompressible pages currently stored in zswap + without compression. These pages could not be compressed to + a size smaller than PAGE_SIZE, so they are stored as-is. + thp_fault_alloc (npn) Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE @@ -1656,6 +1777,9 @@ The following nested keys are defined. pgdemote_khugepaged Number of pages demoted by khugepaged. + pgdemote_proactive + Number of pages demoted proactively. + hugetlb Amount of memory used by hugetlb pages. This metric only shows up if hugetlb usage is accounted for in memory.current (i.e. @@ -1814,6 +1938,27 @@ memory - is necessary to determine whether a workload needs more memory; unfortunately, memory pressure monitoring mechanism isn't implemented yet. +Reclaim Protection +~~~~~~~~~~~~~~~~~~ + +The protection configured with "memory.low" or "memory.min" applies relatively +to the target of the reclaim (i.e. any of memory cgroup limits, proactive +memory.reclaim or global reclaim apparently located in the root cgroup). +The protection value configured for B applies unchanged to the reclaim +targeting A (i.e. caused by competition with the sibling E):: + + root - ... - A - B - C + \ ` D + ` E + +When the reclaim targets ancestors of A, the effective protection of B is +capped by the protection value configured for A (and any other intermediate +ancestors between A and the target). + +To express indifference about relative sibling protection, it is suggested to +use memory_recursiveprot.
Configuring all descendants of a parent with finite +protection to "max" works but it may unnecessarily skew memory.events:low +field. Memory Ownership ~~~~~~~~~~~~~~~~ @@ -2418,10 +2563,10 @@ Cpuset Interface Files Users can manually set it to a value that is different from "cpuset.cpus". One constraint in setting it is that the list of CPUs must be exclusive with respect to "cpuset.cpus.exclusive" - of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup - isn't set, its "cpuset.cpus" value, if set, cannot be a subset - of it to leave at least one CPU available when the exclusive - CPUs are taken away. + and "cpuset.cpus.exclusive.effective" of its siblings. Another + constraint is that it cannot be a superset of "cpuset.cpus" + of its sibling in order to leave at least one CPU available to + that sibling when the exclusive CPUs are taken away. For a parent cgroup, any one of its exclusive CPUs can only be distributed to at most one of its child cgroups. Having an @@ -2441,9 +2586,9 @@ Cpuset Interface Files of this file will always be a subset of its parent's "cpuset.cpus.exclusive.effective" if its parent is not the root cgroup. It will also be a subset of "cpuset.cpus.exclusive" - if it is set. If "cpuset.cpus.exclusive" is not set, it is - treated to have an implicit value of "cpuset.cpus" in the - formation of local partition. + if it is set. This file should only be non-empty if either + "cpuset.cpus.exclusive" is set or when the current cpuset is + a valid partition root. cpuset.cpus.isolated A read-only and root cgroup only multiple values file. @@ -2475,13 +2620,22 @@ Cpuset Interface Files There are two types of partitions - local and remote. A local partition is one whose parent cgroup is also a valid partition root. A remote partition is one whose parent cgroup is not a - valid partition root itself. 
Writing to "cpuset.cpus.exclusive" - is optional for the creation of a local partition as its - "cpuset.cpus.exclusive" file will assume an implicit value that - is the same as "cpuset.cpus" if it is not set. Writing the - proper "cpuset.cpus.exclusive" values down the cgroup hierarchy - before the target partition root is mandatory for the creation - of a remote partition. + valid partition root itself. + + Writing to "cpuset.cpus.exclusive" is optional for the creation + of a local partition as its "cpuset.cpus.exclusive" file will + assume an implicit value that is the same as "cpuset.cpus" if it + is not set. Writing the proper "cpuset.cpus.exclusive" values + down the cgroup hierarchy before the target partition root is + mandatory for the creation of a remote partition. + + Not all the CPUs requested in "cpuset.cpus.exclusive" can be + used to form a new partition. Only those that were present + in its parent's "cpuset.cpus.exclusive.effective" control + file can be used. For partitions created without setting + "cpuset.cpus.exclusive", exclusive CPUs specified in sibling's + "cpuset.cpus.exclusive" or "cpuset.cpus.exclusive.effective" + also cannot be used. Currently, a remote partition cannot be created under a local partition. All the ancestors of a remote partition root except @@ -2489,6 +2643,10 @@ Cpuset Interface Files The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member". + Even though the "cpuset.cpus.exclusive*" and "cpuset.cpus" + control files are not present in the root cgroup, they are + implicitly the same as the "/sys/devices/system/cpu/possible" + sysfs file. When set to "root", the current cgroup is the root of a new partition or scheduling domain. 
The set of exclusive CPUs is @@ -2673,7 +2831,7 @@ DMEM Interface Files HugeTLB ------- -The HugeTLB controller allows to limit the HugeTLB usage per control group and +The HugeTLB controller allows limiting the HugeTLB usage per control group and enforces the controller limit during page fault. HugeTLB Interface Files @@ -2993,7 +3151,7 @@ Filesystem Support for Writeback -------------------------------- A filesystem can support cgroup writeback by updating -address_space_operations->writepage[s]() to annotate bio's using the +address_space_operations->writepages() to annotate bio's using the following two functions. wbc_init_bio(@wbc, @bio) diff --git a/Documentation/admin-guide/cifs/index.rst b/Documentation/admin-guide/cifs/index.rst index fad5268635f5..58ab58a71a82 100644 --- a/Documentation/admin-guide/cifs/index.rst +++ b/Documentation/admin-guide/cifs/index.rst @@ -12,10 +12,3 @@ CIFS todo changes authors - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst index c09674a75a9e..d989ae5778ba 100644 --- a/Documentation/admin-guide/cifs/usage.rst +++ b/Documentation/admin-guide/cifs/usage.rst @@ -270,6 +270,8 @@ configured for Unix Extensions (and the client has not disabled illegal Windows/NTFS/SMB characters to a remap range (this mount parameter is the default for SMB3). This remap (``mapposix``) range is also compatible with Mac (and "Services for Mac" on some older Windows). +When POSIX Extensions for SMB 3.1.1 are negotiated, remapping is automatically +disabled. CIFS VFS Mount Options ====================== diff --git a/Documentation/admin-guide/cpu-isolation.rst b/Documentation/admin-guide/cpu-isolation.rst new file mode 100644 index 000000000000..8c65d03fd28c --- /dev/null +++ b/Documentation/admin-guide/cpu-isolation.rst @@ -0,0 +1,357 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +============= +CPU Isolation +============= + +Introduction +============ + +"CPU Isolation" means leaving a CPU exclusive to a given workload +without any undesired code interference from the kernel. + +Those interferences, commonly pointed out as "noise", can be triggered +by asynchronous events (interrupts, timers, scheduler preemption by +workqueues and kthreads, ...) or synchronous events (syscalls and page +faults). + +Such noise usually goes unnoticed. After all, synchronous events are a +component of the requested kernel service. And asynchronous events are +either sufficiently well-distributed by the scheduler when executed +as tasks or reasonably fast when executed as interrupts. The timer +interrupt can even execute 1024 times per second without a significant +and measurable impact most of the time. + +However, some rare and extreme workloads can be quite sensitive to +those kinds of noise. This is the case, for example, with high +bandwidth network processing that can't afford losing a single packet +or very low latency network processing. Typically those use cases +involve DPDK, bypassing the kernel networking stack and performing +direct access to the networking device from userspace. + +In order to run a CPU without or with limited kernel noise, the +related housekeeping work needs to be either shut down, migrated or +offloaded. + +Housekeeping +============ + +In the CPU isolation terminology, housekeeping is the work, often +asynchronous, that the kernel needs to process in order to maintain +all its services. It matches the noises and disturbances enumerated +above except when at least one CPU is isolated. Then housekeeping may +make use of further coping mechanisms if CPU-tied work must be +offloaded. + +Housekeeping CPUs are the non-isolated CPUs to which the kernel noise +is moved from the isolated CPUs.
+ +The isolation can be implemented in several ways depending on the +nature of the noise: + +- Unbound work, where "unbound" means not tied to any CPU, can be + simply migrated away from isolated CPUs to housekeeping CPUs. + This is the case of unbound workqueues, kthreads and timers. + +- Bound work, where "bound" means tied to a specific CPU, usually + can't be moved away as-is by nature. Either: + + - The work must switch to a locked implementation. E.g.: + This is the case of RCU with CONFIG_RCU_NOCB_CPU. + + - The related feature must be shut down and considered + incompatible with isolated CPUs. E.g.: Lockup watchdog, + unreliable clocksources, etc... + + - An elaborate and heavyweight coping mechanism stands as a + replacement. E.g.: the timer tick is shut down on nohz_full + CPUs but with the constraint of running a single task on + them. A significant cost penalty is added on kernel entry/exit + and a residual 1Hz scheduler tick is offloaded to housekeeping + CPUs. + +In any case, housekeeping work has to be handled, which is why there +must be at least one housekeeping CPU in the system, preferably more +if the machine runs a lot of CPUs. For example one per node on NUMA +systems. + +Also CPU isolation often means a tradeoff between noise-free isolated +CPUs and added overhead on housekeeping CPUs, sometimes even on +isolated CPUs entering the kernel. + +Isolation features +================== + +Different levels of isolation can be configured in the kernel, each of +which has its own drawbacks and tradeoffs. + +Scheduler domain isolation +-------------------------- + +This feature isolates a CPU from the scheduler topology. As a result, +the target isn't part of the load balancing. Tasks won't migrate +either from or to it unless affined explicitly. + +As a side effect the CPU is also isolated from unbound workqueues and +unbound kthreads. 
+ +Requirements +~~~~~~~~~~~~ + +- CONFIG_CPUSETS=y for the cpusets-based interface + +Tradeoffs +~~~~~~~~~ + +By nature, the system load is overall less distributed since some CPUs +are extracted from the global load balancing. + +Interfaces +~~~~~~~~~~ + +- Documentation/admin-guide/cgroup-v2.rst cpuset isolated partitions are recommended + because they are tunable at runtime. + +- The 'isolcpus=' kernel boot parameter with the 'domain' flag is a + less flexible alternative that doesn't allow for runtime + reconfiguration. + +IRQs isolation +-------------- + +Isolate the IRQs whenever possible, so that they don't fire on the +target CPUs. + +Interfaces +~~~~~~~~~~ + +- The file /proc/irq/\*/smp_affinity as explained in detail in + Documentation/core-api/irq/irq-affinity.rst page. + +- The "irqaffinity=" kernel boot parameter for a default setting. + +- The "managed_irq" flag in the "isolcpus=" kernel boot parameter + tries a best effort affinity override for managed IRQs. + +Full Dynticks (aka nohz_full) +----------------------------- + +Full dynticks extends the dynticks idle mode, which stops the tick when +the CPU is idle, to CPUs running a single task in userspace. That is, +the timer tick is stopped if the environment allows it. + +Global timer callbacks are also isolated from the nohz_full CPUs. + +Requirements +~~~~~~~~~~~~ + +- CONFIG_NO_HZ_FULL=y + +Constraints +~~~~~~~~~~~ + +- The isolated CPUs must run a single task only. Multitask requires + the tick to maintain preemption. This is usually fine since the + workload usually can't stand the latency of random context switches. + +- No call to the kernel from isolated CPUs, at the risk of triggering + random noise. + +- No use of POSIX CPU timers on isolated CPUs. + +- Architecture must have a stable and reliable clocksource (no + unreliable TSC that requires the watchdog). + + +Tradeoffs +~~~~~~~~~ + +In terms of cost, this is the most invasive isolation feature. 
It is +assumed to be used when the workload spends most of its time in +userspace and doesn't rely on the kernel except for preparatory +work because: + +- RCU adds more overhead due to the locked, offloaded and threaded + callbacks processing (the same that would be obtained with "rcu_nocbs" + boot parameter). + +- Kernel entry/exit through syscalls, exceptions and IRQs are more + costly due to fully ordered RmW operations that maintain userspace + as RCU extended quiescent state. Also the CPU time is accounted on + kernel boundaries instead of periodically from the tick. + +- Housekeeping CPUs must run a 1Hz residual remote scheduler tick + on behalf of the isolated CPUs. + +Checklist +========= + +You have set up each of the above isolation features but you still +observe jitters that trash your workload? Make sure to check a few +elements before proceeding. + +Some of these checklist items are similar to those of real-time +workloads: + +- Use mlock() to prevent your pages from being swapped away. Page + faults are usually not compatible with jitter sensitive workloads. + +- Avoid SMT to prevent your hardware thread from being "preempted" + by another one. + +- CPU frequency changes may induce subtle sorts of jitter in a + workload. Cpufreq should be used and tuned with caution. + +- Deep C-states may result in latency issues upon wake-up. If this + happens to be a problem, C-states can be limited via kernel boot + parameters such as processor.max_cstate or intel_idle.max_cstate. + More finegrained tunings are described in + Documentation/admin-guide/pm/cpuidle.rst page + +- Your system may be subject to firmware-originating interrupts - x86 has + System Management Interrupts (SMIs) for example. Check your system BIOS + to disable such interference, and with some luck your vendor will have + a BIOS tuning guidance for low-latency operations. 
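Some of the checklist items above can be inspected from sysfs before rebooting with new parameters. The sketch below is only an illustration: ``smt_is_active`` is a hypothetical helper name, it assumes the running kernel exposes ``/sys/devices/system/cpu/smt/active``, and the optional argument exists only so the function can be pointed at another directory for testing.

```shell
# Hypothetical helper: succeed if the kernel reports SMT as active.
# $1 optionally overrides the CPU sysfs directory (assumption: the
# default path is present on the running kernel).
smt_is_active() {
    cpu_dir=${1:-/sys/devices/system/cpu}
    [ -r "$cpu_dir/smt/active" ] && [ "$(cat "$cpu_dir/smt/active")" = "1" ]
}

if smt_is_active; then
    echo "SMT is active: consider booting with nosmt"
else
    echo "SMT is inactive or not reported by this kernel"
fi
```

A missing file is treated the same as inactive SMT, so the check errs on the quiet side on kernels that lack the interface.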
+ + +Full isolation example +====================== + +In this example, the system has 8 CPUs and the 8th is to be fully +isolated. Since CPUs start from 0, the 8th CPU is CPU 7. + +Kernel parameters +----------------- + +Set the following kernel boot parameters to disable SMT and set up tick +and IRQ isolation: + +- Full dynticks: nohz_full=7 + +- IRQs isolation: irqaffinity=0-6 + +- Managed IRQs isolation: isolcpus=managed_irq,7 + +- Prevent SMT: nosmt + +The full command line is then: + + nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt + +CPUSET configuration (cgroup v2) +-------------------------------- + +Assuming cgroup v2 is mounted to /sys/fs/cgroup, the following script +isolates CPU 7 from scheduler domains. + +:: + + cd /sys/fs/cgroup + # Activate the cpuset subsystem + echo +cpuset > cgroup.subtree_control + # Create partition to be isolated + mkdir test + cd test + echo +cpuset > cgroup.subtree_control + # Isolate CPU 7 + echo 7 > cpuset.cpus + echo "isolated" > cpuset.cpus.partition + +The userspace workload +---------------------- + +To fake a pure userspace workload, the program below runs a dummy +userspace loop on the isolated CPU 7. + +:: + + #include <stdio.h> + #include <fcntl.h> + #include <unistd.h> + int main(void) + { + // Move the current task to the isolated cpuset (bind to CPU 7) + int fd = open("/sys/fs/cgroup/test/cgroup.procs", O_WRONLY); + if (fd < 0) { + // perror() appends the error string and a newline itself + perror("Can't open cpuset file"); + return 1; + } + + // Writing "0" moves the writing process itself + if (write(fd, "0\n", 2) < 0) { + perror("Can't move task to cpuset"); + close(fd); + return 1; + } + close(fd); + + // Run an endless dummy loop until the launcher kills us + while (1) + ; + + return 0; + } + +Build it and save it for a later step: + +:: + + # gcc user_loop.c -o user_loop + +The launcher +------------ + +The launcher below runs the program above for 10 seconds and traces +the noise resulting from preempting tasks and IRQs.
+ +:: + + TRACING=/sys/kernel/tracing/ + # Make sure tracing is off for now + echo 0 > $TRACING/tracing_on + # Flush previous traces + echo > $TRACING/trace + # Record disturbance from other tasks + echo 1 > $TRACING/events/sched/sched_switch/enable + # Record disturbance from interrupts + echo 1 > $TRACING/events/irq_vectors/enable + # Now we can start tracing + echo 1 > $TRACING/tracing_on + # Run the dummy user_loop for 10 seconds on CPU 7 + ./user_loop & + USER_LOOP_PID=$! + sleep 10 + kill $USER_LOOP_PID + # Disable tracing and save traces from CPU 7 in a file + echo 0 > $TRACING/tracing_on + cat $TRACING/per_cpu/cpu7/trace > trace.7 + +If no specific problem arose, the output of trace.7 should look like +the following: + +:: + + <idle>-0 [007] d..2. 1980.976624: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=user_loop next_pid=1553 next_prio=120 + user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253 + user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253 + +That is, no specific noise was triggered between the first trace and the +second during the 10 seconds when user_loop was running. + +Debugging +========= + +Of course things are never so easy, especially on this matter. +Chances are that actual noise will be observed in the aforementioned +trace.7 file. + +The best way to investigate further is to enable finer grained +tracepoints such as those of subsystems producing asynchronous +events: workqueue, timer, irq_vector, etc... It can also be +interesting to enable the tick_stop event to diagnose why the tick is +retained when that happens. + +Some tools may also be useful for higher level analysis: + +- Documentation/tools/rtla/rtla.rst provides a suite of tools to analyze + latency and noise in the system. For example Documentation/tools/rtla/rtla-osnoise.rst + runs a kernel tracer that analyzes the noise and outputs a summary of it.
+ +- dynticks-testing does something similar to rtla-osnoise but in userspace. It is available + at git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst index 4d667228e744..a1e673c0e782 100644 --- a/Documentation/admin-guide/device-mapper/delay.rst +++ b/Documentation/admin-guide/device-mapper/delay.rst @@ -3,7 +3,7 @@ dm-delay ======== Device-Mapper's "delay" target delays reads and/or writes -and/or flushs and optionally maps them to different devices. +and/or flushes and optionally maps them to different devices. Arguments:: @@ -18,7 +18,7 @@ Table line has to either have 3, 6 or 9 arguments: to write and flush operations on optionally different write_device with optionally different sector offset -9: same as 6 arguments plus define flush_offset and flush_delay explicitely +9: same as 6 arguments plus define flush_offset and flush_delay explicitly on/with optionally different flush_device/flush_offset. Offsets are specified in sectors. @@ -40,7 +40,7 @@ Example scripts #!/bin/sh # # Create mapped device delaying write and flush operations for 400ms and - # splitting reads to device $1 but writes and flushs to different device $2 + # splitting reads to device $1 but writes and flushes to different device $2 # to different offsets of 2048 and 4096 sectors respectively. # dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400" @@ -48,7 +48,7 @@ Example scripts :: #!/bin/sh # - # Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms + # Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms # onto the same backing device at offset 0 sectors. 
# dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333" diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index 9f8139ff97d6..4467f6d4b632 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -146,6 +146,11 @@ integrity:<bytes>:<type> integrity for the encrypted device. The additional space is then used for storing authentication tag (and persistent IV if needed). +integrity_key_size:<bytes> + Optionally set the integrity key size if it differs from the digest size. + It allows the use of wrapped key algorithms where the key size is + independent of the cryptographic key size. + sector_size:<bytes> Use <bytes> as the encryption unit instead of 512 bytes sectors. This option can be in range 512 - 4096 bytes and must be power of two. diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst index d8a5f14d0e3c..c2e18ecc065c 100644 --- a/Documentation/admin-guide/device-mapper/dm-integrity.rst +++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst @@ -92,6 +92,11 @@ Target arguments: allowed. This mode is useful for data recovery if the device cannot be activated in any of the other standard modes. + I - inline mode - in this mode, dm-integrity will store integrity + data directly in the underlying device sectors. + The underlying device must have an integrity profile that + allows storing user integrity data and provides enough + space for the selected integrity tag. 5. the number of additional arguments diff --git a/Documentation/admin-guide/device-mapper/dm-pcache.rst b/Documentation/admin-guide/device-mapper/dm-pcache.rst new file mode 100644 index 000000000000..09d327ef4b14 --- /dev/null +++ b/Documentation/admin-guide/device-mapper/dm-pcache.rst @@ -0,0 +1,202 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================================= +dm-pcache — Persistent Cache +================================= + +*Author: Dongsheng Yang <dongsheng.yang@linux.dev>* + +This document describes *dm-pcache*, a Device-Mapper target that lets a +byte-addressable *DAX* (persistent-memory, “pmem”) region act as a +high-performance, crash-persistent cache in front of a slower block +device. The code lives in `drivers/md/dm-pcache/`. + +Quick feature summary +===================== + +* *Write-back* caching (only mode currently supported). +* *16 MiB segments* allocated on the pmem device. +* *Data CRC32* verification (optional, per cache). +* Crash-safe: every metadata structure is duplicated (`PCACHE_META_INDEX_MAX + == 2`) and protected with CRC+sequence numbers. +* *Multi-tree indexing* (indexing trees sharded by logical address) for high PMem parallelism +* Pure *DAX path* I/O – no extra BIO round-trips +* *Log-structured write-back* that preserves backend crash-consistency + + +Constructor +=========== + +:: + + pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>] + +========================= ==================================================== +``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…). + All metadata *and* cached blocks are stored here. + +``backing_dev`` The slow block device to be cached. + +``cache_mode`` Optional, Only ``writeback`` is accepted at the + moment. + +``data_crc`` Optional, default to ``false`` + + * ``true`` – store CRC32 for every cached entry + and verify on reads + * ``false`` – skip CRC (faster) +========================= ==================================================== + +Example +------- + +.. 
code-block:: shell + + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + +The first time a pmem device is used, dm-pcache formats it automatically +(super-block, cache_info, etc.). + + +Status line +=========== + +``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints: + +:: + + <sb_flags> <seg_total> <cache_segs> <segs_used> \ + <gc_percent> <cache_flags> \ + <key_head_seg>:<key_head_off> \ + <dirty_tail_seg>:<dirty_tail_off> \ + <key_tail_seg>:<key_tail_off> + +Field meanings +-------------- + +=============================== ============================================= +``sb_flags`` Super-block flags (e.g. endian marker). + +``seg_total`` Number of physical *pmem* segments. + +``cache_segs`` Number of segments used for cache. + +``segs_used`` Segments currently allocated (bitmap weight). + +``gc_percent`` Current GC high-water mark (0-90). + +``cache_flags`` Bit 0 – DATA_CRC enabled + Bit 1 – INIT_DONE (cache initialised) + Bits 2-5 – cache mode (0 == WB). + +``key_head`` Where new key-sets are being written. + +``dirty_tail`` First dirty key-set that still needs + write-back to the backing device. + +``key_tail`` First key-set that may be reclaimed by GC. +=============================== ============================================= + + +Messages +======== + +*Change GC trigger* + +:: + + dmsetup message <dev> 0 gc_percent <0-90> + + +Theory of operation +=================== + +Sub-devices +----------- + +==================== ========================================================= +backing_dev Any block device (SSD/HDD/loop/LVM, etc.). +cache_dev DAX device; must expose direct-access memory. +==================== ========================================================= + +Segments and key-sets +--------------------- + +* The pmem space is divided into *16 MiB segments*. +* Each write allocates space from a per-CPU *data_head* inside a segment. 
+* A *cache-key* records a logical range on the origin and where it lives + inside pmem (segment + offset + generation). +* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem + and are themselves crash-safe (CRC). +* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets. + +Write-back +---------- + +Dirty keys are queued into a tree; a background worker copies data +back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the +upper layers forces an immediate metadata commit. + +Garbage collection +------------------ + +GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks +from *key_tail*, frees segments whose every key has been invalidated, and +advances *key_tail*. + +CRC verification +---------------- + +If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data +range when it is inserted and stores it in the on-media key. Reads +validate the CRC before copying to the caller. + + +Failure handling +================ + +* *pmem media errors* – all metadata copies are read with + ``copy_mc_to_kernel``; an uncorrectable error is logged and aborts initialisation. +* *Cache full* – if no free segment can be found, writes return ``-EBUSY``; + dm-pcache retries internally (request deferral). +* *System crash* – on attach, the driver replays ksets from *key_tail* to + rebuild the in-core trees; every segment’s generation guards against + use-after-free keys. + + +Limitations & TODO +================== + +* Only *write-back* mode; other modes planned. +* Only FIFO cache invalidation; others (LRU, ARC...) planned. +* Table reload is not supported currently. +* Discard planned. + + +Example workflow +================ + +.. code-block:: shell + + # 1. Create devices + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + + # 2.
Put a filesystem on top + mkfs.ext4 /dev/mapper/pcache_sdb + mount /dev/mapper/pcache_sdb /mnt + + # 3. Tune GC threshold to 80 % + dmsetup message pcache_sdb 0 gc_percent 80 + + # 4. Observe status + watch -n1 'dmsetup status pcache_sdb' + + # 5. Shutdown + umount /mnt + dmsetup remove pcache_sdb + + +``dm-pcache`` is under active development; feedback, bug reports and patches +are very welcome! diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst index bb17e26e3c1b..3780f6e6b6bb 100644 --- a/Documentation/admin-guide/device-mapper/dm-raid.rst +++ b/Documentation/admin-guide/device-mapper/dm-raid.rst @@ -20,10 +20,10 @@ The target is named "raid" and it accepts the following parameters:: raid0 RAID0 striping (no resilience) raid1 RAID1 mirroring raid4 RAID4 with dedicated last parity disk - raid5_n RAID5 with dedicated last parity disk supporting takeover + raid5_n RAID5 with dedicated last parity disk supporting takeover from/to raid1 Same as raid4 - - Transitory layout + - Transitory layout for takeover from/to raid1 raid5_la RAID5 left asymmetric - rotating parity 0 with data continuation @@ -48,8 +48,8 @@ The target is named "raid" and it accepts the following parameters:: raid6_n_6 RAID6 with dedicate parity disks - parity and Q-syndrome on the last 2 disks; - layout for takeover from/to raid4/raid5_n - raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk + layout for takeover from/to raid0/raid4/raid5_n + raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk supporting takeover from/to raid5 - layout for takeover from raid5_la from/to raid6 raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk @@ -173,9 +173,9 @@ The target is named "raid" and it accepts the following parameters:: The delta_disks option value (-251 < N < +251) triggers device removal (negative value) or device addition (positive value) to any reshape supporting raid levels 4/5/6 and 10. 
- RAID levels 4/5/6 allow for addition of devices (metadata - and data device tuple), raid10_near and raid10_offset only - allow for device addition. raid10_far does not support any + RAID levels 4/5/6 allow for addition and removal of devices + (metadata and data device tuple), raid10_near and raid10_offset + only allow for device addition. raid10_far does not support any reshaping at all. A minimum of devices have to be kept to enforce resilience, which is 3 devices for raid4/5 and 4 devices for raid6. @@ -372,6 +372,72 @@ to safely enable discard support for RAID 4/5/6: 'devices_handle_discards_safely' +Takeover/Reshape Support +------------------------ +The target natively supports these two types of MDRAID conversions: + +o Takeover: Converts an array from one RAID level to another + +o Reshape: Changes the internal layout while maintaining the current RAID level + +Each operation is only valid under specific constraints imposed by the existing array's layout and configuration. + + +Takeover: +linear -> raid1 with N >= 2 mirrors +raid0 -> raid4 (add dedicated parity device) +raid0 -> raid5 (add dedicated parity device) +raid0 -> raid10 with near layout and N >= 2 mirror groups (raid0 stripes have to become first member within mirror groups) +raid1 -> linear +raid1 -> raid5 with 2 mirrors +raid4 -> raid5 w/ rotating parity +raid5 with dedicated parity device -> raid4 +raid5 -> raid6 (with dedicated Q-syndrome) +raid6 (with dedicated Q-syndrome) -> raid5 +raid10 with near layout and even number of disks -> raid0 (select any in-sync device from each mirror group) + +Reshape: +linear: not possible +raid0: not possible +raid1: change number of mirrors +raid4: add and remove stripes (minimum 3), change stripesize +raid5: add and remove stripes (minimum 3, special case 2 for raid1 takeover), change rotating parity algorithms, change stripesize +raid6: add and remove stripes (minimum 4), change rotating syndrome algorithms, change stripesize +raid10 near: add stripes 
(minimum 4), change stripesize, no stripe removal possible, change to offset layout +raid10 offset: add stripes, change stripesize, no stripe removal possible, change to near layout +raid10 far: not possible + +Table line examples: + +### raid1 -> raid5 +# +# 2 devices limitation in raid1. +# raid5 personality is able to just map 2 like raid1. +# Reshape after takeover to change to full raid5 layout + + 0 1960886272 raid raid1 3 0 region_size 2048 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + +# dm-0 and dm-2 are e.g. 4MiB large metadata devices, dm-1 and dm-3 have to be at least 1960886272 sectors big. +# +# Table line to take over to raid5 + + 0 1960886272 raid raid5 3 0 region_size 2048 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + +# Add required out-of-place reshape space to the beginning of the given 2 data devices, +# allocate another metadata/data device tuple with the same sizes for the parity space +# and zero the first 4K of the metadata device. +# +# Example table of the out-of-place reshape space addition for one data device, e.g. dm-1 + + 0 8192 linear 8:0 0 1960903888 # <- must be free space segment + 8192 1960886272 linear 8:0 0 2048 # previous data segment + +# Mapping table for e.g. raid5_rs reshape causing the size of the raid device to double once the reshape finishes. +# Check the status output (e.g. "dmsetup status $RaidDev") for progress. + + 0 $((2 * 1960886272)) raid raid5 7 0 region_size 2048 data_offset 8192 delta_disks 1 2 /dev/dm-0 /dev/dm-1 /dev/dm-2 /dev/dm-3 + + Version History --------------- diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst index cc5aec861576..030d854628ac 100644 --- a/Documentation/admin-guide/device-mapper/index.rst +++ b/Documentation/admin-guide/device-mapper/index.rst @@ -18,6 +18,7 @@ Device Mapper dm-integrity dm-io dm-log + dm-pcache dm-queue-length dm-raid dm-service-time @@ -39,10 +40,3 @@ Device Mapper verity writecache zero - -..
only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/device-mapper/thin-provisioning.rst b/Documentation/admin-guide/device-mapper/thin-provisioning.rst index bafebf79da4b..b2fa49a5608a 100644 --- a/Documentation/admin-guide/device-mapper/thin-provisioning.rst +++ b/Documentation/admin-guide/device-mapper/thin-provisioning.rst @@ -80,11 +80,11 @@ less sharing than average you'll need a larger-than-average metadata device. As a guide, we suggest you calculate the number of bytes to use in the metadata device as 48 * $data_dev_size / $data_block_size but round it up -to 2MB if the answer is smaller. If you're creating large numbers of +to 2MiB if the answer is smaller. If you're creating large numbers of snapshots which are recording large amounts of change, you may find you need to increase this. -The largest size supported is 16GB: If the device is larger, +The largest size supported is 16GiB: If the device is larger, a warning will be issued and the excess space will not be used. Reloading a pool table @@ -107,13 +107,13 @@ Using an existing pool device $data_block_size gives the smallest unit of disk space that can be allocated at a time expressed in units of 512-byte sectors. -$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a -multiple of 128 (64KB). $data_block_size cannot be changed after the +$data_block_size must be between 128 (64KiB) and 2097152 (1GiB) and a +multiple of 128 (64KiB). $data_block_size cannot be changed after the thin-pool is created. People primarily interested in thin provisioning -may want to use a value such as 1024 (512KB). People doing lots of -snapshotting may want a smaller value such as 128 (64KB). If you are +may want to use a value such as 1024 (512KiB). People doing lots of +snapshotting may want a smaller value such as 128 (64KiB). If you are not zeroing newly-allocated data, a larger $data_block_size in the -region of 256000 (128MB) is suggested. 
+region of 262144 (128MiB) is suggested. $low_water_mark is expressed in blocks of size $data_block_size. If free space on the data device drops below this level then a dm event @@ -291,7 +291,7 @@ i) Constructor error_if_no_space: Error IOs, instead of queueing, if no space. - Data block size must be between 64KB (128 sectors) and 1GB + Data block size must be between 64KiB (128 sectors) and 1GiB (2097152 sectors) inclusive. diff --git a/Documentation/admin-guide/device-mapper/vdo-design.rst b/Documentation/admin-guide/device-mapper/vdo-design.rst index 3cd59decbec0..faa0ecd4a5ae 100644 --- a/Documentation/admin-guide/device-mapper/vdo-design.rst +++ b/Documentation/admin-guide/device-mapper/vdo-design.rst @@ -600,7 +600,7 @@ lock and return itself to the pool. All storage within vdo is managed as 4KB blocks, but it can accept writes as small as 512 bytes. Processing a write that is smaller than 4K requires a read-modify-write operation that reads the relevant 4K block, copies the -new data over the approriate sectors of the block, and then launches a +new data over the appropriate sectors of the block, and then launches a write operation for the modified data block. The read and write stages of this operation are nearly identical to the normal read and write operations, and a single data_vio is used throughout this operation. diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst index a14e6d3e787c..8a67b320a97b 100644 --- a/Documentation/admin-guide/device-mapper/vdo.rst +++ b/Documentation/admin-guide/device-mapper/vdo.rst @@ -1,5 +1,6 @@ .. 
SPDX-License-Identifier: GPL-2.0-only +====== dm-vdo ====== diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst index a65c1602cb23..eb9475d7e196 100644 --- a/Documentation/admin-guide/device-mapper/verity.rst +++ b/Documentation/admin-guide/device-mapper/verity.rst @@ -87,35 +87,57 @@ panic_on_corruption Panic the device when a corrupted block is discovered. This option is not compatible with ignore_corruption and restart_on_corruption. +restart_on_error + Restart the system when an I/O error is detected. + This option can be combined with the restart_on_corruption option. + +panic_on_error + Panic the device when an I/O error is detected. This option is + not compatible with the restart_on_error option but can be combined + with the panic_on_corruption option. + ignore_zero_blocks Do not verify blocks that are expected to contain zeroes and always return zeroes instead. This may be useful if the partition contains unused blocks that are not guaranteed to contain zeroes. use_fec_from_device <fec_dev> - Use forward error correction (FEC) to recover from corruption if hash - verification fails. Use encoding data from the specified device. This - may be the same device where data and hash blocks reside, in which case - fec_start must be outside data and hash areas. + Use forward error correction (FEC) parity data from the specified device to + try to automatically recover from corruption and I/O errors. + + If this option is given, then <fec_roots> and <fec_blocks> must also be + given. <hash_block_size> must also be equal to <data_block_size>. + + <fec_dev> can be the same as <dev>, in which case <fec_start> must be + outside the data area. It can also be the same as <hash_dev>, in which case + <fec_start> must be outside the hash and optional additional metadata areas. - If the encoding data covers additional metadata, it must be accessible - on the hash device after the hash blocks. 
+ If the data <dev> is encrypted, the <fec_dev> should be too. - Note: block sizes for data and hash devices must match. Also, if the - verity <dev> is encrypted the <fec_dev> should be too. + For more information, see `Forward error correction`_. fec_roots <num> - Number of generator roots. This equals to the number of parity bytes in - the encoding data. For example, in RS(M, N) encoding, the number of roots - is M-N. + The number of parity bytes in each 255-byte Reed-Solomon codeword. The + Reed-Solomon code used will be an RS(255, k) code where k = 255 - fec_roots. + + The supported values are 2 through 24 inclusive. Higher values provide + stronger error correction. However, the minimum value of 2 already provides + strong error correction due to the use of interleaving, so 2 is the + recommended value for most users. fec_roots=2 corresponds to an + RS(255, 253) code, which has a space overhead of about 0.8%. fec_blocks <num> - The number of encoding data blocks on the FEC device. The block size for - the FEC device is <data_block_size>. + The total number of <data_block_size> blocks that are error-checked using + FEC. This must be at least the sum of <num_data_blocks> and the number of + blocks needed by the hash tree. It can include additional metadata blocks, + which are assumed to be accessible on <hash_dev> following the hash blocks. + + Note that this is *not* the number of parity blocks. The number of parity + blocks is inferred from <fec_blocks>, <fec_roots>, and <data_block_size>. fec_start <offset> - This is the offset, in <data_block_size> blocks, from the start of the - FEC device to the beginning of the encoding data. + This is the offset, in <data_block_size> blocks, from the start of <fec_dev> + to the beginning of the parity data. check_at_most_once Verify data blocks only the first time they are read from the data device, @@ -142,8 +164,15 @@ root_hash_sig_key_desc <key_description> already in the secondary trusted keyring. 
try_verify_in_tasklet - If verity hashes are in cache, verify data blocks in kernel tasklet instead - of workqueue. This option can reduce IO latency. + If verity hashes are in cache and the IO size does not exceed the limit, + verify data blocks in bottom half instead of workqueue. This option can + reduce IO latency. The size limits can be configured via + /sys/module/dm_verity/parameters/use_bh_bytes. The four parameters + correspond to limits for IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, + IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn. + For example: + <none>,<rt>,<be>,<idle> + 4096,4096,4096,4096 Theory of operation =================== @@ -164,11 +193,6 @@ per-block basis. This allows for a lightweight hash computation on first read into the page cache. Block hashes are stored linearly, aligned to the nearest block size. -If forward error correction (FEC) support is enabled any recovery of -corrupted data will be verified using the cryptographic hash of the -corresponding data. This is why combining error correction with -integrity checking is essential. - Hash Tree --------- @@ -196,6 +220,80 @@ The tree looks something like: / ... \ / . . . \ / \ blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . blk_32767 +Forward error correction +------------------------ + +dm-verity's optional forward error correction (FEC) support adds strong error +correction capabilities to dm-verity. It allows systems that would be rendered +inoperable by errors to continue operating, albeit with reduced performance. + +FEC uses Reed-Solomon (RS) codes that are interleaved across the entire +device(s), allowing long bursts of corrupt or unreadable blocks to be recovered. + +dm-verity validates any FEC-corrected block against the wanted hash before using +it. Therefore, FEC doesn't affect the security properties of dm-verity. 
+ +The integration of FEC with dm-verity provides significant benefits over a +separate error correction layer: + +- dm-verity invokes FEC only when a block's hash doesn't match the wanted hash + or the block cannot be read at all. As a result, FEC doesn't add overhead to + the common case where no error occurs. + +- dm-verity hashes are also used to identify erasure locations for RS decoding. + This allows correcting twice as many errors. + +FEC uses an RS(255, k) code where k = 255 - fec_roots. fec_roots is usually 2. +This means that each k (usually 253) message bytes have fec_roots (usually 2) +bytes of parity data added to get a 255-byte codeword. (Many external sources +call RS codewords "blocks". Since dm-verity already uses the term "block" to +mean something else, we'll use the clearer term "RS codeword".) + +FEC checks fec_blocks blocks of message data in total, consisting of: + +1. The data blocks from the data device +2. The hash blocks from the hash device +3. Optional additional metadata that follows the hash blocks on the hash device + +dm-verity assumes that the FEC parity data was computed as if the following +procedure were followed: + +1. Concatenate the message data from the above sources. +2. Zero-pad to the next multiple of k blocks. Let msg be the resulting byte + array, and msglen its length in bytes. +3. For 0 <= i < msglen / k (for each RS codeword): + a. Select msg[i + j * msglen / k] for 0 <= j < k. + Consider these to be the 'k' message bytes of an RS codeword. + b. Compute the corresponding 'fec_roots' parity bytes of the RS codeword, + and concatenate them to the FEC parity data. + +Step 3a interleaves the RS codewords across the entire device using an +interleaving degree of data_block_size * ceil(fec_blocks / k). 
This is the +maximal interleaving, such that the message data consists of a region containing +byte 0 of all the RS codewords, then a region containing byte 1 of all the RS +codewords, and so on up to the region for byte 'k - 1'. Note that the number of +codewords is set to a multiple of data_block_size; thus, the regions are +block-aligned, and there is an implicit zero padding of up to 'k - 1' blocks. + +This interleaving allows long bursts of errors to be corrected. It provides +much stronger error correction than storage devices typically provide, while +keeping the space overhead low. + +The cost is slow decoding: correcting a single block usually requires reading +254 extra blocks spread evenly across the device(s). However, that is +acceptable because dm-verity uses FEC only when there is actually an error. + +The list below contains additional details about the RS codes used by +dm-verity's FEC. Userspace programs that generate the parity data need to use +these parameters for the parity data to match exactly: + +- Field used is GF(256) +- Bytes are mapped to/from GF(256) elements in the natural way, where bits 0 + through 7 (low-order to high-order) map to the coefficients of x^0 through x^7 +- Field generator polynomial is x^8 + x^4 + x^3 + x^2 + 1 +- The codes used are systematic, BCH-view codes +- Primitive element alpha is 'x' +- First consecutive root of code generator polynomial is 'x^0' On-disk format ============== @@ -220,8 +318,10 @@ is available at the cryptsetup project's wiki page Status ====== -V (for Valid) is returned if every check performed so far was valid. -If any check failed, C (for Corruption) is returned. +1. V (for Valid) is returned if every check performed so far was valid. + If any check failed, C (for Corruption) is returned. +2. Number of corrected blocks by Forward Error Correction. + '-' if Forward Error Correction is not enabled. 
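The size arithmetic and the interleaving from step 3a of the procedure in the "Forward error correction" section above can be sketched in Python as follows. This is only an illustrative sketch under the rules stated in that section, not dm-verity's or any userspace tool's implementation. The names ``fec_layout`` and ``codeword_message_offsets`` and the default values are hypothetical.

```python
import math

RS_CODEWORD_LEN = 255  # bytes per Reed-Solomon codeword, per the text above


def fec_layout(fec_blocks, fec_roots=2, data_block_size=4096):
    """Derive FEC sizes from the three parameters (illustrative only)."""
    if not 2 <= fec_roots <= 24:
        raise ValueError("fec_roots must be between 2 and 24 inclusive")
    k = RS_CODEWORD_LEN - fec_roots            # message bytes per codeword
    rounds = math.ceil(fec_blocks / k)         # blocks per interleaving region
    num_codewords = data_block_size * rounds   # the interleaving degree
    return {
        "k": k,
        "num_codewords": num_codewords,
        "padded_message_bytes": num_codewords * k,  # msglen after zero padding
        "parity_bytes": num_codewords * fec_roots,  # fec_roots bytes per codeword
        "overhead": fec_roots / k,                  # parity bytes per message byte
    }


def codeword_message_offsets(i, layout):
    """Byte offsets into the zero-padded message for RS codeword i (step 3a)."""
    stride = layout["num_codewords"]  # equals msglen / k
    return [i + j * stride for j in range(layout["k"])]
```

For example, ``fec_layout(1000)`` (1000 checked blocks, fec_roots=2, 4096-byte blocks) gives k = 253, 16384 codewords, and 32768 parity bytes, which is 8 blocks. The overhead of 2/253 matches the "about 0.8%" quoted for fec_roots=2.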
Example ======= diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst index e3776d77374b..b103ba52776a 100644 --- a/Documentation/admin-guide/devices.rst +++ b/Documentation/admin-guide/devices.rst @@ -97,9 +97,12 @@ It is recommended that these links exist on all systems: /dev/bttv0 video0 symbolic Backward compatibility /dev/radio radio0 symbolic Backward compatibility /dev/i2o* /dev/i2o/* symbolic Backward compatibility -/dev/scd? sr? hard Alternate SCSI CD-ROM name =============== =============== =============== =============================== +Suggested earlier ``/dev/scd?`` alternative names for ``/dev/sr?`` +CD-ROM and other optical drives (using SCSI commands) were removed +in ``udev`` version 174 that was released in 2011. + Locally defined links +++++++++++++++++++++ @@ -112,7 +115,6 @@ exist, they should have the following uses. /dev/mouse mouse port symbolic Current mouse device /dev/tape tape device symbolic Current tape device /dev/cdrom CD-ROM device symbolic Current CD-ROM device -/dev/cdwriter CD-writer symbolic Current CD-writer device /dev/scanner scanner symbolic Current scanner device /dev/modem modem port symbolic Current dialout device /dev/root root device symbolic Current root filesystem @@ -126,8 +128,8 @@ exists, ``/dev/modem`` should point to the appropriate primary TTY device For SCSI devices, ``/dev/tape`` and ``/dev/cdrom`` should point to the *cooked* devices (``/dev/st*`` and ``/dev/sr*``, respectively), whereas -``/dev/cdwriter`` and /dev/scanner should point to the appropriate generic -SCSI devices (/dev/sg*). +``/dev/scanner`` should point to the appropriate generic +SCSI device (``/dev/sg*``). ``/dev/mouse`` may point to a primary serial TTY device, a hardware mouse device, or a socket for a mouse driver program (e.g. ``/dev/gpmdata``). 
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt index 94c98be1329a..440633642fea 100644 --- a/Documentation/admin-guide/devices.txt +++ b/Documentation/admin-guide/devices.txt @@ -352,7 +352,7 @@ 216 = /dev/fujitsu/apanel Fujitsu/Siemens application panel 217 = /dev/ni/natmotn National Instruments Motion 218 = /dev/kchuid Inter-process chuid control - 219 = /dev/modems/mwave MWave modem firmware upload + 219 = 220 = /dev/mptctl Message passing technology (MPT) control 221 = /dev/mvista/hssdsi Montavista PICMG hot swap system driver 222 = /dev/mvista/hasi Montavista PICMG high availability @@ -389,11 +389,11 @@ ... 11 block SCSI CD-ROM devices - 0 = /dev/scd0 First SCSI CD-ROM - 1 = /dev/scd1 Second SCSI CD-ROM + 0 = /dev/sr0 First SCSI CD-ROM + 1 = /dev/sr1 Second SCSI CD-ROM ... - The prefix /dev/sr (instead of /dev/scd) has been deprecated. + In the past the prefix /dev/scd (instead of /dev/sr) was used and even recommended. 12 char QIC-02 tape 2 = /dev/ntpqic11 QIC-11, no rewind-on-close diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index 7c036590cd07..095a63892257 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -223,12 +223,13 @@ The flags are:: f Include the function name s Include the source file name l Include line number + d Include call trace For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only the ``p`` flag has meaning, other flags are ignored. -Note the regexp ``^[-+=][fslmpt_]+$`` matches a flags specification. -To clear all flags at once, use ``=_`` or ``-fslmpt``. +Note the regexp ``^[-+=][fslmptd_]+$`` matches a flags specification. +To clear all flags at once, use ``=_`` or ``-fslmptd``. 
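As a minimal sketch, the updated flags regexp quoted above can be used to sanity-check a flags specification before writing it to the control file. The helper name ``is_flags_spec`` is hypothetical, and this is only a syntactic check; the kernel's own parser remains authoritative.

```python
import re

# Mirrors the regexp quoted above; fullmatch() anchors both ends,
# so the explicit ^ and $ are not needed.
FLAGS_SPEC = re.compile(r"[-+=][fslmptd_]+")


def is_flags_spec(spec):
    """Return True if spec looks like a dynamic debug flags specification."""
    return FLAGS_SPEC.fullmatch(spec) is not None
```

Under this check, ``+p``, ``=_`` and ``-fslmptd`` are accepted, while a bare ``p`` (no leading operator) or ``+x`` (unknown flag) is rejected.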
Debug messages during Boot Process diff --git a/Documentation/admin-guide/efi-stub.rst b/Documentation/admin-guide/efi-stub.rst index 090f3a185e18..f8e7407698bd 100644 --- a/Documentation/admin-guide/efi-stub.rst +++ b/Documentation/admin-guide/efi-stub.rst @@ -79,6 +79,9 @@ because the image we're executing is interpreted by the EFI shell, which understands relative paths, whereas the rest of the command line is passed to bzImage.efi. +.. hint:: + It is also possible to provide an initrd using a Linux-specific UEFI + protocol at boot time. See :ref:`pe-coff-entry-point` for details. The "dtb=" option ----------------- diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index 2418b0c2d3df..ac0c709ea9e7 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -238,11 +238,10 @@ When mounting an ext4 filesystem, the following option are accepted: configured using tune2fs) data_err=ignore(*) - Just print an error message if an error occurs in a file data buffer in - ordered mode. + Just print an error message if an error occurs in a file data buffer. + data_err=abort - Abort the journal if an error occurs in a file data buffer in ordered - mode. + Abort the journal if an error occurs in a file data buffer. grpid | bsdgroups New objects have the group ID of their parent. @@ -399,7 +398,7 @@ There are 3 different data modes: * writeback mode In data=writeback mode, ext4 does not journal data at all. This mode provides - a similar level of journaling as that of XFS, JFS, and ReiserFS in its default + a similar level of journaling as that of XFS and JFS in its default mode - metadata journaling. A crash+recovery can cause incorrect data to appear in files which were written shortly before the crash. This mode will typically provide the best ext4 performance. 
diff --git a/Documentation/admin-guide/gpio/gpio-aggregator.rst b/Documentation/admin-guide/gpio/gpio-aggregator.rst index 5cd1e7221756..8374a9df9105 100644 --- a/Documentation/admin-guide/gpio/gpio-aggregator.rst +++ b/Documentation/admin-guide/gpio/gpio-aggregator.rst @@ -69,6 +69,113 @@ write-only attribute files in sysfs. $ echo gpio-aggregator.0 > delete_device +Aggregating GPIOs using Configfs +-------------------------------- + +**Group:** ``/config/gpio-aggregator`` + + This is the root directory of the gpio-aggregator configfs tree. + +**Group:** ``/config/gpio-aggregator/<example-name>`` + + This directory represents a GPIO aggregator device. You can assign any + name to ``<example-name>`` (e.g. ``agg0``), except names starting with + ``_sysfs`` prefix, which are reserved for auto-generated configfs + entries corresponding to devices created via Sysfs. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/live`` + + The ``live`` attribute allows to trigger the actual creation of the device + once it's fully configured. Accepted values are: + + * ``1``, ``yes``, ``true`` : enable the virtual device + * ``0``, ``no``, ``false`` : disable the virtual device + +**Attribute:** ``/config/gpio-aggregator/<example-name>/dev_name`` + + The read-only ``dev_name`` attribute exposes the name of the device as it + will appear in the system on the platform bus (e.g. ``gpio-aggregator.0``). + This is useful for identifying a character device for the newly created + aggregator. If it's ``gpio-aggregator.0``, + ``/sys/devices/platform/gpio-aggregator.0/gpiochipX`` path tells you that the + GPIO device id is ``X``. + +You must create subdirectories for each virtual line you want to +instantiate, named exactly as ``line0``, ``line1``, ..., ``lineY``, when +you want to instantiate ``Y+1`` (Y >= 0) lines. Configure all lines before +activating the device by setting ``live`` to 1. 
+ +**Group:** ``/config/gpio-aggregator/<example-name>/<lineY>/`` + + This directory represents a GPIO line to include in the aggregator. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/key`` + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/offset`` + + The default values after creating the ``<lineY>`` directory are: + + * ``key`` : <empty> + * ``offset`` : -1 + + ``key`` must always be explicitly configured, while ``offset`` depends. + Two configuration patterns exist for each ``<lineY>``: + + (a). For lookup by GPIO line name: + + * Set ``key`` to the line name. + * Ensure ``offset`` remains -1 (the default). + + (b). For lookup by GPIO chip name and the line offset within the chip: + + * Set ``key`` to the chip name. + * Set ``offset`` to the line offset (0 <= ``offset`` < 65535). + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/name`` + + The ``name`` attribute sets a custom name for lineY. If left unset, the + line will remain unnamed. + +Once the configuration is done, the ``'live'`` attribute must be set to 1 +in order to instantiate the aggregator device. It can be set back to 0 to +destroy the virtual device. The module will synchronously wait for the new +aggregator device to be successfully probed and if this doesn't happen, writing +to ``'live'`` will result in an error. This is a different behaviour from the +case when you create it using sysfs ``new_device`` interface. + +.. note:: + + For aggregators created via Sysfs, the configfs entries are + auto-generated and appear as ``/config/gpio-aggregator/_sysfs.<N>/``. You + cannot add or remove line directories with mkdir(2)/rmdir(2). To modify + lines, you must use the "delete_device" interface to tear down the + existing device and reconfigure it from scratch. 
However, you can still + toggle the aggregator with the ``live`` attribute and adjust the + ``key``, ``offset``, and ``name`` attributes for each line when ``live`` + is set to 0 by hand (i.e. it's not waiting for deferred probe). + +Sample configuration commands +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: sh + + # Create a directory for an aggregator device + $ mkdir /sys/kernel/config/gpio-aggregator/agg0 + + # Configure each line + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line0 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line0/key + $ echo 6 > /sys/kernel/config/gpio-aggregator/agg0/line0/offset + $ echo test0 > /sys/kernel/config/gpio-aggregator/agg0/line0/name + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line1 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line1/key + $ echo 7 > /sys/kernel/config/gpio-aggregator/agg0/line1/offset + $ echo test1 > /sys/kernel/config/gpio-aggregator/agg0/line1/name + + # Activate the aggregator device + $ echo 1 > /sys/kernel/config/gpio-aggregator/agg0/live + + Generic GPIO Driver ------------------- diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst index 1cc5567a4bbe..f5135a14ef2e 100644 --- a/Documentation/admin-guide/gpio/gpio-sim.rst +++ b/Documentation/admin-guide/gpio/gpio-sim.rst @@ -50,8 +50,11 @@ the number of lines exposed by this bank. **Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/name`` -This group represents a single line at the offset Y. The 'name' attribute -allows to set the line name as represented by the 'gpio-line-names' property. +**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/valid`` + +This group represents a single line at the offset Y. The ``valid`` attribute +indicates whether the line can be used as GPIO. The ``name`` attribute allows +to set the line name as represented by the 'gpio-line-names' property. 
**Item:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog`` @@ -71,7 +74,7 @@ specific lines. The name of those subdirectories must take the form of: ``'line<offset>'`` (e.g. ``'line0'``, ``'line20'``, etc.) as the name will be used by the module to assign the config to the specific line at given offset. -Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in +Once the configuration is complete, the ``'live'`` attribute must be set to 1 in order to instantiate the chip. It can be set back to 0 to destroy the simulated chip. The module will synchronously wait for the new simulated device to be successfully probed and if this doesn't happen, writing to ``'live'`` will diff --git a/Documentation/admin-guide/gpio/gpio-virtuser.rst b/Documentation/admin-guide/gpio/gpio-virtuser.rst index 2aca70db9f3b..7e7c0df51640 100644 --- a/Documentation/admin-guide/gpio/gpio-virtuser.rst +++ b/Documentation/admin-guide/gpio/gpio-virtuser.rst @@ -92,7 +92,7 @@ struct. The first two take string values as arguments: Activating GPIO consumers ------------------------- -Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in +Once the configuration is complete, the ``'live'`` attribute must be set to 1 in order to instantiate the consumer. It can be set back to 0 to destroy the virtual device. The module will synchronously wait for the new simulated device to be successfully probed and if this doesn't happen, writing to ``'live'`` will diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst index 712f379731cb..082646851029 100644 --- a/Documentation/admin-guide/gpio/index.rst +++ b/Documentation/admin-guide/gpio/index.rst @@ -12,10 +12,3 @@ GPIO gpio-sim gpio-virtuser Obsolete APIs <obsolete> - -.. 
only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/highuid.rst b/Documentation/admin-guide/highuid.rst deleted file mode 100644 index 6ee70465c0ea..000000000000 --- a/Documentation/admin-guide/highuid.rst +++ /dev/null @@ -1,80 +0,0 @@ -=================================================== -Notes on the change from 16-bit UIDs to 32-bit UIDs -=================================================== - -:Author: Chris Wing <wingc@umich.edu> -:Last updated: January 11, 2000 - -- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t - when communicating between user and kernel space in an ioctl or data - structure. - -- kernel code should use uid_t and gid_t in kernel-private structures and - code. - -What's left to be done for 32-bit UIDs on all Linux architectures: - -- Disk quotas have an interesting limitation that is not related to the - maximum UID/GID. They are limited by the maximum file size on the - underlying filesystem, because quota records are written at offsets - corresponding to the UID in question. - Further investigation is needed to see if the quota system can cope - properly with huge UIDs. If it can deal with 64-bit file offsets on all - architectures, this should not be a problem. - -- Decide whether or not to keep backwards compatibility with the system - accounting file, or if we should break it as the comments suggest - (currently, the old 16-bit UID and GID are still written to disk, and - part of the former pad space is used to store separate 32-bit UID and - GID) - -- Need to validate that OS emulation calls the 16-bit UID - compatibility syscalls, if the OS being emulated used 16-bit UIDs, or - uses the 32-bit UID system calls properly otherwise. - - This affects at least: - - - iBCS on Intel - - - sparc32 emulation on sparc64 - (need to support whatever new 32-bit UID system calls are added to - sparc32) - -- Validate that all filesystems behave properly. 
- - At present, 32-bit UIDs _should_ work for: - - - ext2 - - ufs - - isofs - - nfs - - coda - - udf - - Ioctl() fixups have been made for: - - - ncpfs - - smbfs - - Filesystems with simple fixups to prevent 16-bit UID wraparound: - - - minix - - sysv - - qnx4 - - Other filesystems have not been checked yet. - -- The ncpfs and smpfs filesystems cannot presently use 32-bit UIDs in - all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but - more are needed. (as well as new user<->kernel data structures) - -- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k, - sh, and sparc32. Fixing this is probably not that important, but would - require adding a new ELF section. - -- The ioctl()s used to control the in-kernel NFS server only support - 16-bit UIDs on arm, i386, m68k, sh, and sparc32. - -- make sure that the UID mapping feature of AX25 networking works properly - (it should be safe because it's always used a 32-bit integer to - communicate between user and kernel) diff --git a/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst new file mode 100644 index 000000000000..d0bdbd81dcf9 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst @@ -0,0 +1,236 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Attack Vector Controls +====================== + +Attack vector controls provide a simple method to configure only the mitigations +for CPU vulnerabilities which are relevant given the intended use of a system. +Administrators are encouraged to consider which attack vectors are relevant and +disable all others in order to recoup system performance. + +When new relevant CPU vulnerabilities are found, they will be added to these +attack vector controls so administrators will likely not need to reconfigure +their command line parameters as mitigations will continue to be correctly +applied based on the chosen attack vector controls. 
+ +Attack Vectors +-------------- + +There are 5 sets of attack-vector mitigations currently supported by the kernel: + +#. :ref:`user_kernel` +#. :ref:`user_user` +#. :ref:`guest_host` +#. :ref:`guest_guest` +#. :ref:`smt` + +To control the enabled attack vectors, see :ref:`cmdline`. + +.. _user_kernel: + +User-to-Kernel +^^^^^^^^^^^^^^ + +The user-to-kernel attack vector involves a malicious userspace program +attempting to leak kernel data into userspace by exploiting a CPU vulnerability. +The kernel data involved might be limited to certain kernel memory, or include +all memory in the system, depending on the vulnerability exploited. + +If no untrusted userspace applications are being run, such as with single-user +systems, consider disabling user-to-kernel mitigations. + +Note that the CPU vulnerabilities mitigated by Linux have generally not been +shown to be exploitable from browser-based sandboxes. User-to-kernel +mitigations are therefore mostly relevant if unknown userspace applications may +be run by untrusted users. + +*user-to-kernel mitigations are enabled by default* + +.. _user_user: + +User-to-User +^^^^^^^^^^^^ + +The user-to-user attack vector involves a malicious userspace program attempting +to influence the behavior of another unsuspecting userspace program in order to +exfiltrate data. The vulnerability of a userspace program is based on the +program itself and the interfaces it provides. + +If no untrusted userspace applications are being run, consider disabling +user-to-user mitigations. + +Note that because the Linux kernel contains a mapping of all physical memory, +preventing a malicious userspace program from leaking data from another +userspace program requires mitigating user-to-kernel attacks as well for +complete protection. + +*user-to-user mitigations are enabled by default* + +.. _guest_host: + +Guest-to-Host +^^^^^^^^^^^^^ + +The guest-to-host attack vector involves a malicious VM attempting to leak +hypervisor data into the VM. 
The data involved may be limited, or may +potentially include all memory in the system, depending on the vulnerability +exploited. + +If no untrusted VMs are being run, consider disabling guest-to-host mitigations. + +*guest-to-host mitigations are enabled by default if KVM support is present* + +.. _guest_guest: + +Guest-to-Guest +^^^^^^^^^^^^^^ + +The guest-to-guest attack vector involves a malicious VM attempting to influence +the behavior of another unsuspecting VM in order to exfiltrate data. The +vulnerability of a VM is based on the code inside the VM itself and the +interfaces it provides. + +If no untrusted VMs, or only a single VM is being run, consider disabling +guest-to-guest mitigations. + +Similar to the user-to-user attack vector, preventing a malicious VM from +leaking data from another VM requires mitigating guest-to-host attacks as well +due to the Linux kernel phys map. + +*guest-to-guest mitigations are enabled by default if KVM support is present* + +.. _smt: + +Cross-Thread +^^^^^^^^^^^^ + +The cross-thread attack vector involves a malicious userspace program or +malicious VM either observing or attempting to influence the behavior of code +running on the SMT sibling thread in order to exfiltrate data. + +Many cross-thread attacks can only be mitigated if SMT is disabled, which will +result in reduced CPU core count and reduced performance. + +If cross-thread mitigations are fully enabled ('auto,nosmt'), all mitigations +for cross-thread attacks will be enabled. SMT may be disabled depending on +which vulnerabilities are present in the CPU. + +If cross-thread mitigations are partially enabled ('auto'), mitigations for +cross-thread attacks will be enabled but SMT will not be disabled. + +If cross-thread mitigations are disabled, no mitigations for cross-thread +attacks will be enabled. + +Cross-thread mitigation may not be required if core-scheduling or similar +techniques are used to prevent untrusted workloads from running on SMT siblings. 
+ +*cross-thread mitigations default to partially enabled* + +.. _cmdline: + +Command Line Controls +--------------------- + +Attack vectors are controlled through the mitigations= command line option. The +value provided begins with a global option and then may optionally include one +or more options to disable various attack vectors. + +Format: + | ``mitigations=[global]`` + | ``mitigations=[global],[attack vectors]`` + +Global options: + +============ ============================================================= +Option Description +============ ============================================================= +'off' All attack vectors disabled. +'auto' All attack vectors enabled, partial cross-thread mitigations. +'auto,nosmt' All attack vectors enabled, full cross-thread mitigations. +============ ============================================================= + +Attack vector options: + +================= ======================================= +Option Description +================= ======================================= +'no_user_kernel' Disables user-to-kernel mitigations. +'no_user_user' Disables user-to-user mitigations. +'no_guest_host' Disables guest-to-host mitigations. +'no_guest_guest' Disables guest-to-guest mitigations +'no_cross_thread' Disables all cross-thread mitigations. +================= ======================================= + +Multiple attack vector options may be specified in a comma-separated list. If +the global option is not specified, it defaults to 'auto'. The global option +'off' is equivalent to disabling all attack vectors. + +Examples: + | ``mitigations=auto,no_user_kernel`` + + Enable all attack vectors except user-to-kernel. Partial cross-thread + mitigations. + + | ``mitigations=auto,nosmt,no_guest_host,no_guest_guest`` + + Enable all attack vectors and cross-thread mitigations except for + guest-to-host and guest-to-guest mitigations. 
+ + | ``mitigations=,no_cross_thread`` + + Enable all attack vectors but not cross-thread mitigations. + +Interactions with command-line options +-------------------------------------- + +Vulnerability-specific controls (e.g. "retbleed=off") take precedence over all +attack vector controls. Mitigations for individual vulnerabilities may be +turned on or off via their command-line options regardless of the attack vector +controls. + +Summary of attack-vector mitigations +------------------------------------ + +When a vulnerability is mitigated due to an attack-vector control, the default +mitigation option for that particular vulnerability is used. To use a different +mitigation, please use the vulnerability-specific command line option. + +The table below summarizes which vulnerabilities are mitigated when different +attack vectors are enabled and assuming the CPU is vulnerable. + +=============== ============== ============ ============= ============== ============ ======== +Vulnerability User-to-Kernel User-to-User Guest-to-Host Guest-to-Guest Cross-Thread Notes +=============== ============== ============ ============= ============== ============ ======== +BHI X X +ITS X X +GDS X X X X * (Note 1) +L1TF X X * (Note 2) +MDS X X X X * (Note 2) +MMIO X X X X * (Note 2) +Meltdown X +Retbleed X X * (Note 3) +RFDS X X X X +Spectre_v1 X +Spectre_v2 X X +Spectre_v2_user X X * (Note 1) +SRBDS X X X X +SRSO X X X X +SSB X +TAA X X X X * (Note 2) +TSA X X X X +VMSCAPE X +=============== ============== ============ ============= ============== ============ ======== + +Notes: + 1 -- Can be mitigated without disabling SMT. 
+ + 2 -- Disables SMT if cross-thread mitigations are fully enabled and the CPU + is vulnerable + + 3 -- Disables SMT if cross-thread mitigations are fully enabled, the CPU is + vulnerable, and STIBP is not supported + +When an attack-vector is disabled, all mitigations for the vulnerabilities +listed in the above table are disabled, unless mitigation is required for a +different enabled attack-vector or a mitigation is explicitly selected via a +vulnerability-specific command line option. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index ff0b440ef2dc..55d747511f83 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -9,6 +9,7 @@ are configurable at compile, boot or run time. .. toctree:: :maxdepth: 1 + attack_vector_controls spectre l1tf mds @@ -22,3 +23,7 @@ are configurable at compile, boot or run time. srso gather_data_sampling reg-file-data-sampling + rsb + old_microcode + indirect-target-selection + vmscape diff --git a/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst new file mode 100644 index 000000000000..d9ca64108d23 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst @@ -0,0 +1,168 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Indirect Target Selection (ITS) +=============================== + +ITS is a vulnerability in some Intel CPUs that support Enhanced IBRS and were +released before Alder Lake. ITS may allow an attacker to control the prediction +of indirect branches and RETs located in the lower half of a cacheline. + +ITS is assigned CVE-2024-28956 with a CVSS score of 4.7 (Medium). + +Scope of Impact +--------------- +- **eIBRS Guest/Host Isolation**: Indirect branches in KVM/kernel may still be + predicted with unintended target corresponding to a branch in the guest. 
+ +- **Intra-Mode BTI**: In-kernel training such as through cBPF or other native + gadgets. + +- **Indirect Branch Prediction Barrier (IBPB)**: After an IBPB, indirect + branches may still be predicted with targets corresponding to direct branches + executed prior to the IBPB. This is fixed by the IPU 2025.1 microcode, which + should be available via distro updates. Alternatively, microcode can be + obtained from Intel's GitHub repository [#f1]_. + +Affected CPUs +------------- +Below is the list of ITS-affected CPUs [#f2]_ [#f3]_: + + ======================== ============ ==================== =============== + Common name Family_Model eIBRS Intra-mode BTI + Guest/Host Isolation + ======================== ============ ==================== =============== + SKYLAKE_X (step >= 6) 06_55H Affected Affected + ICELAKE_X 06_6AH Not affected Affected + ICELAKE_D 06_6CH Not affected Affected + ICELAKE_L 06_7EH Not affected Affected + TIGERLAKE_L 06_8CH Not affected Affected + TIGERLAKE 06_8DH Not affected Affected + KABYLAKE_L (step >= 12) 06_8EH Affected Affected + KABYLAKE (step >= 13) 06_9EH Affected Affected + COMETLAKE 06_A5H Affected Affected + COMETLAKE_L 06_A6H Affected Affected + ROCKETLAKE 06_A7H Not affected Affected + ======================== ============ ==================== =============== + +- All affected CPUs enumerate the Enhanced IBRS feature. +- IBPB isolation is affected on all ITS-affected CPUs, and needs a microcode + update for mitigation. +- None of the affected CPUs enumerate BHI_CTRL, which was introduced in Golden + Cove (Alder Lake and Sapphire Rapids). This can help guests to determine the + host's affected status. +- Intel Atom CPUs are not affected by ITS. + +Mitigation +---------- +As only the indirect branches and RETs that have their last byte of instruction +in the lower half of the cacheline are vulnerable to ITS, the basic idea behind +the mitigation is to not allow indirect branches in the lower half.
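The placement criterion in the paragraph above can be illustrated with a short Python sketch. It assumes 64-byte cachelines, which the text does not state, and the helper name ``ends_in_lower_half`` is hypothetical. This is not the kernel's implementation; it only restates the "last byte in the lower half of the cacheline" condition as arithmetic.

```python
CACHELINE_SIZE = 64  # assumption: 64-byte cachelines (not stated in the text)


def ends_in_lower_half(insn_addr, insn_len, cacheline_size=CACHELINE_SIZE):
    """True if the instruction's last byte lies in the lower half of its cacheline."""
    if insn_len < 1:
        raise ValueError("instruction length must be at least 1 byte")
    last_byte = insn_addr + insn_len - 1
    return last_byte % cacheline_size < cacheline_size // 2
```

Under that assumption, a 2-byte indirect branch at 0x101e ends at 0x101f (offset 31, lower half) and would be considered ITS-unsafe, while the same branch at 0x101f ends at 0x1020 (offset 32, upper half) and would not.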
+ +This is achieved by relying on existing retpoline support in the kernel, and in +compilers. ITS-vulnerable retpoline sites are runtime patched to point to newly +added ITS-safe thunks. These safe thunks consist of an indirect branch in the +second half of the cacheline. Not all retpoline sites are patched to thunks; if +a retpoline site is evaluated to be ITS-safe, it is replaced with an inline +indirect branch. + +Dynamic thunks +~~~~~~~~~~~~~~ +From a dynamically allocated pool of safe-thunks, each vulnerable site is +replaced with a new thunk, such that they get a unique address. This could +improve the branch prediction accuracy. Also, it is a defense-in-depth measure +against aliasing. + +Note, for simplicity, indirect branches in eBPF programs are always replaced +with a jump to a static thunk in __x86_indirect_its_thunk_array. If required, +in future this can be changed to use dynamic thunks. + +All vulnerable RETs are replaced with a static thunk; they do not use dynamic +thunks. This is because RETs mostly get their prediction from the RSB, which +does not depend on the source address. RETs that underflow the RSB may benefit +from dynamic thunks. But RETs significantly outnumber indirect branches, and any +benefit from a unique source address could be outweighed by the increased icache +footprint and iTLB pressure. + +Retpoline +~~~~~~~~~ +The retpoline sequence also mitigates ITS-unsafe indirect branches. For this +reason, when retpoline is enabled, ITS mitigation only relocates the RETs to +safe thunks, unless the user requested the RSB-stuffing mitigation. + +RSB Stuffing +~~~~~~~~~~~~ +RSB-stuffing via Call Depth Tracking is a mitigation for Retbleed RSB-underflow +attacks. It also mitigates RETs that are vulnerable to ITS. + +Mitigation in guests +^^^^^^^^^^^^^^^^^^^^ +All guests deploy ITS mitigation by default, irrespective of eIBRS enumeration +and Family/Model of the guest. This is because the eIBRS feature could be hidden +from a guest. 
One exception to this is when a guest enumerates BHI_DIS_S, which +indicates that the guest is running on an unaffected host. + +To prevent guests from unnecessarily deploying the mitigation on unaffected +platforms, Intel has defined ITS_NO bit(62) in MSR IA32_ARCH_CAPABILITIES. When +a guest sees this bit set, it should not enumerate the ITS bug. Note, this bit +is not set by any hardware, but is **intended for VMMs to synthesize** it for +guests as per the host's affected status. + +Mitigation options +^^^^^^^^^^^^^^^^^^ +The ITS mitigation can be controlled using the "indirect_target_selection" +kernel parameter. The available options are: + + ======== =================================================================== + on (default) Deploy the "Aligned branch/return thunks" mitigation. + If spectre_v2 mitigation enables retpoline, aligned-thunks are only + deployed for the affected RET instructions. Retpoline mitigates + indirect branches. + + off Disable ITS mitigation. + + vmexit Equivalent to "=on" if the CPU is affected by guest/host isolation + part of ITS. Otherwise, mitigation is not deployed. This option is + useful when host userspace is not in the threat model, and only + attacks from guest to host are considered. + + stuff Deploy RSB-fill mitigation when retpoline is also deployed. + Otherwise, deploy the default mitigation. When retpoline mitigation + is enabled, RSB-stuffing via Call-Depth-Tracking also mitigates + ITS. + + force Force the ITS bug and deploy the default mitigation. + ======== =================================================================== + +Sysfs reporting +--------------- + +The sysfs file showing ITS mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/indirect_target_selection + +Note, microcode mitigation status is not reported in this file. + +The possible values in this file are: + +.. list-table:: + + * - Not affected + - The processor is not vulnerable. 
+ * - Vulnerable + - System is vulnerable and no mitigation has been applied. + * - Vulnerable, KVM: Not affected + - System is vulnerable to intra-mode BTI, but not affected by eIBRS + guest/host isolation. + * - Mitigation: Aligned branch/return thunks + - The mitigation is enabled, affected indirect branches and RETs are + relocated to safe thunks. + * - Mitigation: Retpolines, Stuffing RSB + - The mitigation is enabled using retpoline and RSB stuffing. + +References +---------- +.. [#f1] Microcode repository - https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files + +.. [#f2] Affected Processors list - https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html + +.. [#f3] Affected Processors list (machine readable) - https://github.com/intel/Intel-affected-processor-list diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst index 210020bc3f56..35dc25159b28 100644 --- a/Documentation/admin-guide/hw-vuln/l1d_flush.rst +++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst @@ -31,7 +31,7 @@ specifically opt into the feature to enable it. Mitigation ---------- -When PR_SET_L1D_FLUSH is enabled for a task a flush of the L1D cache is +When PR_SPEC_L1D_FLUSH is enabled for a task a flush of the L1D cache is performed when the task is scheduled out and the incoming task belongs to a different process and therefore to a different address space. diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst index 48c7b0b72aed..754679db0ce8 100644 --- a/Documentation/admin-guide/hw-vuln/mds.rst +++ b/Documentation/admin-guide/hw-vuln/mds.rst @@ -214,7 +214,7 @@ XEON PHI specific considerations command line with the 'ring3mwait=disable' command line option. XEON PHI is not affected by the other MDS variants and MSBDS is mitigated - before the CPU enters a idle state. 
As XEON PHI is not affected by L1TF + before the CPU enters an idle state. As XEON PHI is not affected by L1TF either disabling SMT is not required for full protection. .. _mds_smt_control: diff --git a/Documentation/admin-guide/hw-vuln/old_microcode.rst b/Documentation/admin-guide/hw-vuln/old_microcode.rst new file mode 100644 index 000000000000..6ded8f86b8d0 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/old_microcode.rst @@ -0,0 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Old Microcode +============= + +The kernel keeps a table of released microcode. Systems that had +microcode older than this at boot will say "Vulnerable". This means +that the system was vulnerable to some known CPU issue. It could be +security or functional, the kernel does not know or care. + +You should update the CPU microcode to mitigate any exposure. This is +usually accomplished by updating the files in +/lib/firmware/intel-ucode/ via normal distribution updates. Intel also +distributes these files in a github repo: + + https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files.git + +Just like all the other hardware vulnerabilities, exposure is +determined at boot. Runtime microcode updates do not change the status +of this vulnerability. diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst index 1302fd1b55e8..6dba18dbb9ab 100644 --- a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -157,9 +157,7 @@ This is achieved by using the otherwise unused and obsolete VERW instruction in combination with a microcode update. The microcode clears the affected CPU buffers when the VERW instruction is executed. -Kernel reuses the MDS function to invoke the buffer clearing: - - mds_clear_cpu_buffers() +Kernel does the buffer clearing with x86_clear_cpu_buffers(). 
On MDS affected CPUs, the kernel already invokes CPU buffer clear on kernel/userspace, hypervisor/guest and C-state (idle) transitions. No diff --git a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst index 0585d02b9a6c..ad15417d39f9 100644 --- a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst +++ b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst @@ -29,14 +29,6 @@ Below is the list of affected Intel processors [#f1]_: RAPTORLAKE_S 06_BFH =================== ============ -As an exception to this table, Intel Xeon E family parts ALDERLAKE(06_97H) and -RAPTORLAKE(06_B7H) codenamed Catlow are not affected. They are reported as -vulnerable in Linux because they share the same family/model with an affected -part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or -CPUID.HYBRID. This information could be used to distinguish between the -affected and unaffected parts, but it is deemed not worth adding complexity as -the reporting is fixed automatically when these parts enumerate RFDS_NO. - Mitigation ========== Intel released a microcode update that enables software to clear sensitive diff --git a/Documentation/admin-guide/hw-vuln/rsb.rst b/Documentation/admin-guide/hw-vuln/rsb.rst new file mode 100644 index 000000000000..21dbf9cf25f8 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/rsb.rst @@ -0,0 +1,268 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +RSB-related mitigations +======================= + +.. warning:: + Please keep this document up-to-date, otherwise you will be + volunteered to update it and convert it to a very long comment in + bugs.c! + +Since 2018 there have been many Spectre CVEs related to the Return Stack +Buffer (RSB) (sometimes referred to as the Return Address Stack (RAS) or +Return Address Predictor (RAP) on AMD). 
+ +Information about these CVEs and how to mitigate them is scattered +amongst a myriad of microarchitecture-specific documents. + +This document attempts to consolidate all the relevant information in +one place and clarify the reasoning behind the current RSB-related +mitigations. It's meant to be as concise as possible, focused only on +the current kernel mitigations: what are the RSB-related attack vectors +and how are they currently being mitigated? + +It's *not* meant to describe how the RSB mechanism operates or how the +exploits work. More details about those can be found in the references +below. + +Rather, this is basically a glorified comment, but too long to actually +be one. So when the next CVE comes along, a kernel developer can +quickly refer to this as a refresher to see what we're actually doing +and why. + +At a high level, there are two classes of RSB attacks: RSB poisoning +(Intel and AMD) and RSB underflow (Intel only). They must each be +considered individually for each attack vector (and microarchitecture +where applicable). + +---- + +RSB poisoning (Intel and AMD) +============================= + +SpectreRSB +~~~~~~~~~~ + +RSB poisoning is a technique used by SpectreRSB [#spectre-rsb]_ where +an attacker poisons an RSB entry to cause a victim's return instruction +to speculate to an attacker-controlled address. This can happen when +there are unbalanced CALLs/RETs after a context switch or VMEXIT. + +* All attack vectors can potentially be mitigated by flushing out any + poisoned RSB entries using an RSB filling sequence + [#intel-rsb-filling]_ [#amd-rsb-filling]_ when transitioning between + untrusted and trusted domains. But this has a performance impact and + should be avoided whenever possible. + + .. DANGER:: + **FIXME**: Currently we're flushing 32 entries. However, some CPU + models have more than 32 entries. The loop count needs to be + increased for those. More detailed information is needed about RSB + sizes. 
+ +* On context switch, the user->user mitigation requires ensuring the + RSB gets filled or cleared whenever IBPB gets written [#cond-ibpb]_ + during a context switch: + + * AMD: + On Zen 4+, IBPB (or SBPB [#amd-sbpb]_ if used) clears the RSB. + This is indicated by IBPB_RET in CPUID [#amd-ibpb-rsb]_. + + On Zen < 4, the RSB filling sequence [#amd-rsb-filling]_ must + always be done in addition to IBPB [#amd-ibpb-no-rsb]_. This is + indicated by X86_BUG_IBPB_NO_RET. + + * Intel: + IBPB always clears the RSB: + + "Software that executed before the IBPB command cannot control + the predicted targets of indirect branches executed after the + command on the same logical processor. The term indirect branch + in this context includes near return instructions, so these + predicted targets may come from the RSB." [#intel-ibpb-rsb]_ + +* On context switch, user->kernel attacks are prevented by SMEP. User + space can only insert user space addresses into the RSB. Even + non-canonical addresses can't be inserted due to the page gap at the + end of the user canonical address space reserved by TASK_SIZE_MAX. + A SMEP #PF at instruction fetch prevents the kernel from speculatively + executing user space. + + * AMD: + "Finally, branches that are predicted as 'ret' instructions get + their predicted targets from the Return Address Predictor (RAP). + AMD recommends software use a RAP stuffing sequence (mitigation + V2-3 in [2]) and/or Supervisor Mode Execution Protection (SMEP) + to ensure that the addresses in the RAP are safe for + speculation. Collectively, we refer to these mitigations as "RAP + Protection"." [#amd-smep-rsb]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. 
Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits." [#intel-smep-rsb]_ + +* On VMEXIT, guest->host attacks are mitigated by eIBRS (and PBRSB + mitigation if needed): + + * AMD: + "When Automatic IBRS is enabled, the internal return address + stack used for return address predictions is cleared on VMEXIT." + [#amd-eibrs-vmexit]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits. Processors with + enhanced IBRS still support the usage model where IBRS is set + only in the OS/VMM for OSes that enable SMEP. To do this, such + processors will ensure that guest behavior cannot control the + RSB after a VM exit once IBRS is set, even if IBRS was not set + at the time of the VM exit." [#intel-eibrs-vmexit]_ + + Note that some Intel CPUs are susceptible to Post-barrier Return + Stack Buffer Predictions (PBRSB) [#intel-pbrsb]_, where the last + CALL from the guest can be used to predict the first unbalanced RET. + In this case the PBRSB mitigation is needed in addition to eIBRS. + +AMD RETBleed / SRSO / Branch Type Confusion +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On AMD, poisoned RSB entries can also be created by the AMD RETBleed +variant [#retbleed-paper]_ [#amd-btc]_ or by Speculative Return Stack +Overflow [#amd-srso]_ (Inception [#inception-paper]_). The kernel +protects itself by replacing every RET in the kernel with a branch to a +single safe RET. 
+ +---- + +RSB underflow (Intel only) +========================== + +RSB Alternate (RSBA) ("Intel Retbleed") +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some Intel Skylake-generation CPUs are susceptible to the Intel variant +of RETBleed [#retbleed-paper]_ (Return Stack Buffer Underflow +[#intel-rsbu]_). If a RET is executed when the RSB buffer is empty due +to mismatched CALLs/RETs or returning from a deep call stack, the branch +predictor can fall back to using the Branch Target Buffer (BTB). If a +user forces a BTB collision then the RET can speculatively branch to a +user-controlled address. + +* Note that RSB filling doesn't fully mitigate this issue. If there + are enough unbalanced RETs, the RSB may still underflow and fall back + to using a poisoned BTB entry. + +* On context switch, user->user underflow attacks are mitigated by the + conditional IBPB [#cond-ibpb]_ on context switch which effectively + clears the BTB: + + * "The indirect branch predictor barrier (IBPB) is an indirect branch + control mechanism that establishes a barrier, preventing software + that executed before the barrier from controlling the predicted + targets of indirect branches executed after the barrier on the same + logical processor." [#intel-ibpb-btb]_ + +* On context switch and VMEXIT, user->kernel and guest->host RSB + underflows are mitigated by IBRS or eIBRS: + + * "Enabling IBRS (including enhanced IBRS) will mitigate the "RSBU" + attack demonstrated by the researchers. As previously documented, + Intel recommends the use of enhanced IBRS, where supported. This + includes any processor that enumerates RRSBA but not RRSBA_DIS_S." + [#intel-rsbu]_ + + However, note that eIBRS and IBRS do not mitigate intra-mode attacks. + Like RRSBA below, this is mitigated by clearing the BHB on kernel + entry. + + As an alternative to classic IBRS, call depth tracking (combined with + retpolines) can be used to track kernel returns and fill the RSB when + it gets close to being empty. 
+ +Restricted RSB Alternate (RRSBA) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some newer Intel CPUs have Restricted RSB Alternate (RRSBA) behavior, +which, similar to RSBA described above, also falls back to using the BTB +on RSB underflow. The only difference is that the predicted targets are +restricted to the current domain when eIBRS is enabled: + +* "Restricted RSB Alternate (RRSBA) behavior allows alternate branch + predictors to be used by near RET instructions when the RSB is + empty. When eIBRS is enabled, the predicted targets of these + alternate predictors are restricted to those belonging to the + indirect branch predictor entries of the current prediction domain." + [#intel-eibrs-rrsba]_ + +When a CPU with RRSBA is vulnerable to Branch History Injection +[#bhi-paper]_ [#intel-bhi]_, an RSB underflow could be used for an +intra-mode BTI attack. This is mitigated by clearing the BHB on +kernel entry. + +However if the kernel uses retpolines instead of eIBRS, it needs to +disable RRSBA: + +* "Where software is using retpoline as a mitigation for BHI or + intra-mode BTI, and the processor both enumerates RRSBA and + enumerates RRSBA_DIS controls, it should disable this behavior." + [#intel-retpoline-rrsba]_ + +---- + +References +========== + +.. [#spectre-rsb] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://arxiv.org/pdf/1807.07940.pdf>`_ + +.. [#intel-rsb-filling] "Empty RSB Mitigation on Skylake-generation" in `Retpoline: A Branch Target Injection Mitigation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html#inpage-nav-5-1>`_ + +.. [#amd-rsb-filling] "Mitigation V2-3" in `Software Techniques for Managing Speculation <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf>`_ + +.. 
[#cond-ibpb] Whether IBPB is written depends on whether the prev and/or next task is protected from Spectre attacks. It typically requires opting in per task or system-wide. For more details see the documentation for the ``spectre_v2_user`` cmdline option in Documentation/admin-guide/kernel-parameters.txt. + +.. [#amd-sbpb] IBPB without flushing of branch type predictions. Only exists for AMD. + +.. [#amd-ibpb-rsb] "Function 8000_0008h -- Processor Capacity Parameters and Extended Feature Identification" in `AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf>`_. SBPB behaves the same way according to `this email <https://lore.kernel.org/5175b163a3736ca5fd01cedf406735636c99a>`_. + +.. [#amd-ibpb-no-rsb] `Spectre Attacks: Exploiting Speculative Execution <https://comsec.ethz.ch/wp-content/files/ibpb_sp25.pdf>`_ + +.. [#intel-ibpb-rsb] "Introduction" in `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#amd-smep-rsb] "Existing Mitigations" in `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#intel-smep-rsb] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#amd-eibrs-vmexit] "Extended Feature Enable Register (EFER)" in `AMD64 Architecture Programmer's Manual Volume 2: System Programming <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_ + +.. 
[#intel-eibrs-vmexit] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#intel-pbrsb] `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#retbleed-paper] `RETBleed: Arbitrary Speculative Code Execution with Return Instruction <https://comsec.ethz.ch/wp-content/files/retbleed_sec22.pdf>`_ + +.. [#amd-btc] `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#amd-srso] `Technical Update Regarding Speculative Return Stack Overflow <https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf>`_ + +.. [#inception-paper] `Inception: Exposing New Attack Surfaces with Training in Transient Execution <https://comsec.ethz.ch/wp-content/files/inception_sec23.pdf>`_ + +.. [#intel-rsbu] `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#intel-ibpb-btb] `Indirect Branch Predictor Barrier <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html>`_ + +.. 
[#intel-eibrs-rrsba] "Guidance for RSBU" in `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#bhi-paper] `Branch History Injection: On the Effectiveness of Hardware Mitigations Against Cross-Privilege Spectre-v2 Attacks <http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf>`_ + +.. [#intel-bhi] `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ + +.. [#intel-retpoline-rrsba] "Retpoline" in `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst index 132e0bc6007e..4bb8549bee82 100644 --- a/Documentation/admin-guide/hw-vuln/spectre.rst +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -406,7 +406,7 @@ The possible values in this file are: - Single threaded indirect branch prediction (STIBP) status for protection between different hyper threads. This feature can be controlled through - prctl per process, or through kernel command line options. This is x86 + prctl per process, or through kernel command line options. This is an x86 only feature. For more details see below. ==================== ======================================================== @@ -664,7 +664,7 @@ Intel white papers: .. 
_spec_ref1: -[1] `Intel analysis of speculative execution side channels <https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf>`_. +[1] `Intel analysis of speculative execution side channels <https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/analysis-of-speculative-execution-side-channels-white-paper.pdf>`_. .. _spec_ref2: @@ -682,7 +682,7 @@ AMD white papers: .. _spec_ref5: -[5] `AMD64 technology indirect branch control extension <https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf>`_. +[5] `AMD64 technology indirect branch control extension <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/white-papers/111006-architecture-guidelines-update-amd64-technology-indirect-branch-control-extension.pdf>`_. .. _spec_ref6: @@ -708,7 +708,7 @@ MIPS white paper: .. _spec_ref10: -[10] `MIPS: response on speculative execution and side channel vulnerabilities <https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/>`_. +[10] `MIPS: response on speculative execution and side channel vulnerabilities <https://web.archive.org/web/20220512003005if_/https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/>`_. Academic papers: diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst index 2ad1c05b8c88..66af95251a3d 100644 --- a/Documentation/admin-guide/hw-vuln/srso.rst +++ b/Documentation/admin-guide/hw-vuln/srso.rst @@ -104,7 +104,20 @@ The possible values in this file are: (spec_rstack_overflow=ibpb-vmexit) + * 'Mitigation: Reduced Speculation': + This mitigation gets automatically enabled when the above one "IBPB on + VMEXIT" has been selected and the CPU supports the BpSpecReduce bit. + + It gets automatically enabled on machines which have the + SRSO_USER_KERNEL_NO=1 CPUID bit. 
In that case, the code logic is to switch + to the above =ibpb-vmexit mitigation because the user/kernel boundary is + not affected anymore and thus "safe RET" is not needed. + + After enabling the IBPB on VMEXIT mitigation option, the BpSpecReduce bit + is detected (functionality present on all such machines) and that + practically overrides IBPB on VMEXIT as it has a lot less performance + impact and takes care of the guest->host attack vector too. In order to exploit vulnerability, an attacker needs to: diff --git a/Documentation/admin-guide/hw-vuln/vmscape.rst b/Documentation/admin-guide/hw-vuln/vmscape.rst new file mode 100644 index 000000000000..d9b9a2b6c114 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/vmscape.rst @@ -0,0 +1,110 @@ +.. SPDX-License-Identifier: GPL-2.0 + +VMSCAPE +======= + +VMSCAPE is a vulnerability that may allow a guest to influence the branch +prediction in host userspace. It particularly affects hypervisors like QEMU. + +Even if a hypervisor may not have any sensitive data like disk encryption keys, +guest-userspace may be able to attack the guest-kernel using the hypervisor as +a confused deputy. + +Affected processors +------------------- + +The following CPU families are affected by VMSCAPE: + +**Intel processors:** + - Skylake generation (Parts without Enhanced-IBRS) + - Cascade Lake generation - (Parts affected by ITS guest/host separation) + - Alder Lake and newer (Parts affected by BHI) + +Note that BHI-affected parts that use the BHB clearing software mitigation, e.g. +Icelake, are not vulnerable to VMSCAPE. + +**AMD processors:** + - Zen series (families 0x17, 0x19, 0x1a) + +**Hygon processors:** + - Family 0x18 + +Mitigation +---------- + +Conditional IBPB +---------------- + +Kernel tracks when a CPU has run a potentially malicious guest and issues an +IBPB before the first exit to userspace after VM-exit. If userspace did not run +between VM-exit and the next VM-entry, no IBPB is issued. 
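The conditional IBPB behaviour described above can be pictured with a toy model. This is only a sketch of the stated rule, not kernel code: the class ``CpuState``, its methods, and the ``ibpb_count`` counter are all hypothetical names, and issuing the barrier is represented by incrementing a counter.

```python
class CpuState:
    """Toy per-CPU state for the conditional IBPB rule (hypothetical names)."""

    def __init__(self) -> None:
        self.ibpb_pending = False  # set once the CPU has run a guest
        self.ibpb_count = 0        # counts barriers that would be issued

    def vm_exit(self) -> None:
        # The CPU has just run a potentially malicious guest.
        self.ibpb_pending = True

    def exit_to_userspace(self) -> None:
        # Issue the barrier only on the first exit to userspace after a
        # VM-exit; later exits without an intervening guest run skip it.
        if self.ibpb_pending:
            self.ibpb_count += 1  # stands in for issuing an IBPB
            self.ibpb_pending = False


cpu = CpuState()
cpu.vm_exit()            # guest ran
cpu.vm_exit()            # guest re-entered and exited without userspace running
cpu.exit_to_userspace()  # first exit to userspace: barrier issued
cpu.exit_to_userspace()  # no guest ran in between: no barrier
print(cpu.ibpb_count)
```

In this model, VM-exits that are followed directly by another VM-entry cost nothing extra; the barrier is paid once, when userspace is about to run.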
+ +Note that the existing userspace mitigations against Spectre-v2 are effective in +protecting userspace. They are insufficient to protect userspace VMMs +from a malicious guest. This is because Spectre-v2 mitigations are applied at +context switch time, while the userspace VMM can run after a VM-exit without a +context switch. + +Vulnerability enumeration and mitigation are not applied inside a guest. This is +because nested hypervisors should already be deploying IBPB to isolate +themselves from nested guests. + +SMT considerations +------------------ + +When Simultaneous Multi-Threading (SMT) is enabled, hypervisors can be +vulnerable to cross-thread attacks. For complete protection against VMSCAPE +attacks in SMT environments, STIBP should be enabled. + +The kernel will issue a warning if SMT is enabled without adequate STIBP +protection. Warning is not issued when: + +- SMT is disabled +- STIBP is enabled system-wide +- Intel eIBRS is enabled (which implies STIBP protection) + +System information and options +------------------------------ + +The sysfs file showing VMSCAPE mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/vmscape + +The possible values in this file are: + + * 'Not affected': + + The processor is not vulnerable to VMSCAPE attacks. + + * 'Vulnerable': + + The processor is vulnerable and no mitigation has been applied. + + * 'Mitigation: IBPB before exit to userspace': + + Conditional IBPB mitigation is enabled. The kernel tracks when a CPU has + run a potentially malicious guest and issues an IBPB before the first + exit to userspace after VM-exit. + + * 'Mitigation: IBPB on VMEXIT': + + IBPB is issued on every VM-exit. This occurs when other mitigations like + RETBLEED or SRSO are already issuing IBPB on VM-exit. 
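A monitoring script might read the sysfs file listed above and bucket the reported value. The following is a minimal sketch under that assumption: the function names ``classify`` and ``read_status`` and the bucket labels are hypothetical, and only the sysfs path and the value strings come from the text above.

```python
from pathlib import Path

VMSCAPE_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/vmscape")


def classify(status: str) -> str:
    """Map a sysfs status string to a coarse bucket (labels are illustrative)."""
    status = status.strip()
    if status == "Not affected":
        return "not-affected"
    if status.startswith("Mitigation:"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_status(path: Path = VMSCAPE_STATUS) -> str:
    """Return the raw file contents, or an empty string if it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        # Assumption: the file is absent on kernels without VMSCAPE reporting.
        return ""


print(classify(read_status()))
```

An unreadable or missing file falls into the "unknown" bucket here, so the script does not report a system as safe merely because the kernel predates this reporting.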
+ +Mitigation control on the kernel command line +---------------------------------------------- + +The mitigation can be controlled via the ``vmscape=`` command line parameter: + + * ``vmscape=off``: + + Disable the VMSCAPE mitigation. + + * ``vmscape=ibpb``: + + Enable conditional IBPB mitigation (default when CONFIG_MITIGATION_VMSCAPE=y). + + * ``vmscape=force``: + + Force vulnerability detection and mitigation even on processors that are + not known to be affected. diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index c8af32a8f800..cd28dfe91b06 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -94,6 +94,7 @@ likely to be of interest on almost any system. cgroup-v2 cgroup-v1/index + cpu-isolation cpu-load mm/index module-signing @@ -187,13 +188,5 @@ A few hard-to-categorize and generally obsolete documents. .. toctree:: :maxdepth: 1 - highuid ldm unicode - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/initrd.rst b/Documentation/admin-guide/initrd.rst index 67bbad8806e8..6c1660a4c5cc 100644 --- a/Documentation/admin-guide/initrd.rst +++ b/Documentation/admin-guide/initrd.rst @@ -297,7 +297,7 @@ as follows: 8) now the system is bootable and additional installation tasks can be performed -The key role of initrd here is to re-use the configuration data during +The key role of initrd here is to reuse the configuration data during normal system operation without requiring the use of a bloated "generic" kernel or re-compiling or re-linking the kernel. 
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst index 609a3201fd4e..9453196ade51 100644 --- a/Documentation/admin-guide/iostats.rst +++ b/Documentation/admin-guide/iostats.rst @@ -2,62 +2,39 @@ I/O statistics fields ===================== -Since 2.4.20 (and some versions before, with patches), and 2.5.45, -more extensive disk statistics have been introduced to help measure disk -activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do -the work for you, but in case you are interested in creating your own -tools, the fields are explained here. - -In 2.4 now, the information is found as additional fields in -``/proc/partitions``. In 2.6 and upper, the same information is found in two -places: one is in the file ``/proc/diskstats``, and the other is within -the sysfs file system, which must be mounted in order to obtain -the information. Throughout this document we'll assume that sysfs -is mounted on ``/sys``, although of course it may be mounted anywhere. -Both ``/proc/diskstats`` and sysfs use the same source for the information -and so should not differ. - -Here are examples of these different formats:: - - 2.4: - 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030 - - 2.6+ sysfs: - 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 35486 38030 38030 38030 - - 2.6+ diskstats: - 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 hda1 35486 38030 38030 38030 - - 4.18+ diskstats: - 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0 - -On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have -a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``. 
- -The advantage of one over the other is that the sysfs choice works well -if you are watching a known, small set of disks. ``/proc/diskstats`` may -be a better choice if you are watching a large number of disks because -you'll avoid the overhead of 50, 100, or 500 or more opens/closes with -each snapshot of your disk statistics. - -In 2.4, the statistics fields are those after the device name. In -the above example, the first field of statistics would be 446216. -By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll -find just the 15 fields, beginning with 446216. If you look at -``/proc/diskstats``, the 15 fields will be preceded by the major and -minor device numbers, and device name. Each of these formats provides -15 fields of statistics, each meaning exactly the same things. -All fields except field 9 are cumulative since boot. Field 9 should -go to zero as I/Os complete; all others only increase (unless they -overflow and wrap). Wrapping might eventually occur on a very busy -or long-lived system; so applications should be prepared to deal with -it. Regarding wrapping, the types of the fields are either unsigned -int (32 bit) or unsigned long (32-bit or 64-bit, depending on your -machine) as noted per-field below. Unless your observations are very -spread in time, these fields should not wrap twice before you notice it. +The kernel exposes disk statistics via ``/proc/diskstats`` and +``/sys/block/<device>/stat``. These stats are usually accessed via tools +such as ``sar`` and ``iostat``. 
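For readers who want to consume these files directly rather than through ``sar`` or ``iostat``, the following Python sketch shows one way to parse a ``/proc/diskstats`` line and compare two snapshots. It assumes the layout this patch documents (major number, minor number, and device name followed by the statistics, with field 9 being the non-cumulative in-flight count). The helper names are hypothetical, and treating any backwards-moving counter as a reset is an illustrative policy, not something the kernel mandates.

```python
IN_FLIGHT_INDEX = 8  # field 9 (1-based): I/Os in progress, not cumulative


def parse_diskstats_line(line):
    """Split one /proc/diskstats line into (major, minor, name, stats)."""
    parts = line.split()
    major, minor, name = int(parts[0]), int(parts[1]), parts[2]
    stats = [int(value) for value in parts[3:]]
    return major, minor, name, stats


def snapshot_deltas(previous, current):
    """Per-field differences between two snapshots of one device.

    Returns None if any cumulative counter went backwards, which this
    sketch treats as a reset (reboot, reattach, or overflow).
    """
    deltas = []
    for index, (old, new) in enumerate(zip(previous, current)):
        if index == IN_FLIGHT_INDEX:
            deltas.append(new)  # instantaneous value, reported as-is
            continue
        if new < old:
            return None
        deltas.append(new - old)
    return deltas


if __name__ == "__main__":
    try:
        with open("/proc/diskstats") as handle:
            for raw in handle:
                _, _, name, stats = parse_diskstats_line(raw)
                print(name, len(stats), "fields")
    except OSError:
        print("/proc/diskstats is not available on this system")
```

A caller that receives ``None`` would typically discard the older snapshot and start a new baseline rather than report a negative rate.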
+ +Here are examples using a disk with two partitions:: + + /proc/diskstats: + 259 0 nvme0n1 255999 814 12369153 47919 996852 81 36123024 425995 0 301795 580470 0 0 0 0 60602 106555 + 259 1 nvme0n1p1 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0 + 259 2 nvme0n1p2 255401 1 12343477 47799 996004 0 36014736 425784 0 344336 473584 0 0 0 0 0 0 + + /sys/block/nvme0n1/stat: + 255999 814 12369153 47919 996858 81 36123056 426009 0 301809 580491 0 0 0 0 60605 106562 + + /sys/block/nvme0n1/nvme0n1p1/stat: + 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0 + +Both files contain the same 17 statistics. ``/sys/block/<device>/stat`` +contains the fields for ``<device>``. In ``/proc/diskstats`` the fields +are prefixed with the major and minor device numbers and the device +name. In the example above, the first stat value for ``nvme0n1`` is +255999 in both files. + +The sysfs ``stat`` file is efficient for monitoring a small, known set +of disks. If you're tracking a large number of devices, +``/proc/diskstats`` is often the better choice since it avoids the +overhead of opening and closing multiple files for each snapshot. + +All fields are cumulative, monotonic counters, except for field 9, which +resets to zero as I/Os complete. The remaining fields reset at boot, on +device reattachment or reinitialization, or when the underlying counter +overflows. Applications reading these counters should detect and handle +resets when comparing stat snapshots. Each set of stats only applies to the indicated device; if you want system-wide stats you'll have to find all the devices and sum them all up. diff --git a/Documentation/admin-guide/kdump/index.rst b/Documentation/admin-guide/kdump/index.rst index 8e2ebd0383cd..cf5d7c868b74 100644 --- a/Documentation/admin-guide/kdump/index.rst +++ b/Documentation/admin-guide/kdump/index.rst @@ -11,10 +11,3 @@ information. kdump vmcoreinfo - -.. 
only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 5376890adbeb..7587caadbae1 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -180,10 +180,6 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64) 1) On i386, enable high memory support under "Processor type and features":: - CONFIG_HIGHMEM64G=y - - or:: - CONFIG_HIGHMEM4G 2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel @@ -315,6 +311,27 @@ crashkernel syntax crashkernel=0,low +4) crashkernel=size,cma + + Reserve additional crash kernel memory from CMA. This reservation is + usable by the first system's userspace memory and kernel movable + allocations (memory balloon, zswap). Pages allocated from this memory + range will not be included in the vmcore so this should not be used if + dumping of userspace memory is intended and it has to be expected that + some movable kernel pages may be missing from the dump. + + A standard crashkernel reservation, as described above, is still needed + to hold the crash kernel and initrd. + + This option increases the risk of a kdump failure: DMA transfers + configured by the first kernel may end up corrupting the second + kernel's memory. + + This reservation method is intended for systems that can't afford to + sacrifice enough memory for standard crashkernel reservation and where + less reliable and possibly incomplete kdump is preferable to no kdump at + all. + Boot into System Kernel ----------------------- 1) Update the boot loader (such as grub, yaboot, or lilo) configuration @@ -454,7 +471,7 @@ Notes on loading the dump-capture kernel: performance degradation. To enable multi-cpu support, you should bring up an SMP dump-capture kernel and specify maxcpus/nr_cpus options while loading it. 
-* For s390x there are two kdump modes: If a ELF header is specified with +* For s390x there are two kdump modes: If an ELF header is specified with the elfcorehdr= kernel parameter, it is used by the kdump kernel as it is done on all other architectures. If no elfcorehdr= kernel parameter is specified, the s390x kdump kernel dynamically creates the header. The @@ -551,6 +568,38 @@ from within add_taint() whenever the value set in this bitmask matches with the bit flag being set by add_taint(). This will cause a kdump to occur at the add_taint()->panic() call. +Write the dump file to encrypted disk volume +============================================ + +CONFIG_CRASH_DM_CRYPT can be enabled to support saving the dump file to an +encrypted disk volume (only x86_64 supported for now). User space can interact +with /sys/kernel/config/crash_dm_crypt_keys for setup, + +1. Tell the first kernel what logon keys are needed to unlock the disk volumes, + # Add key #1 + mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720 + # Add key #1's description + echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/description + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 1 + + # Add key #2 in the same way + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 2 + + # To support CPU/memory hot-plugging, reuse keys already saved to reserved + # memory + echo true > /sys/kernel/config/crash_dm_crypt_key/reuse + +2. Load the dump-capture kernel + +3. 
After the dump-capture kernel has booted, restore the keys to the user keyring + echo yes > /sys/kernel/crash_dm_crypt_keys/restore + Contact ======= diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 0f714fc945ac..7663c610fe90 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -141,7 +141,7 @@ nodemask_t The size of a nodemask_t type. Used to compute the number of online nodes. -(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head) +(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_info) ---------------------------------------------------------------------------------- User-space tools compute their values based on the offset of these @@ -325,14 +325,14 @@ NR_FREE_PAGES On linux-2.6.21 or later, the number of free pages is in vm_stat[NR_FREE_PAGES]. Used to get the number of free pages. -PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb ------------------------------------------------------------------------------------------ +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_hwpoison|PG_head_mask +-------------------------------------------------------------------------- Page attributes. These flags are used to filter various unnecessary for dumping pages. -PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline) ------------------------------------------------------------------------------ +PAGE_SLAB_MAPCOUNT_VALUE|PAGE_BUDDY_MAPCOUNT_VALUE|PAGE_OFFLINE_MAPCOUNT_VALUE|PAGE_HUGETLB_MAPCOUNT_VALUE|PAGE_UNACCEPTED_MAPCOUNT_VALUE +------------------------------------------------------------------------------------------------------------------------------------------ More page attributes. These flags are used to filter various unnecessary for dumping pages. 
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index 39d0e7ff0965..02a725536cc5 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + .. _kernelparameters: The kernel's command-line parameters @@ -108,102 +110,7 @@ The parameters listed below are only valid if certain kernel build options were enabled and if respective hardware is present. This list should be kept in alphabetical order. The text in square brackets at the beginning of each description states the restrictions within which a parameter -is applicable:: - - ACPI ACPI support is enabled. - AGP AGP (Accelerated Graphics Port) is enabled. - ALSA ALSA sound support is enabled. - APIC APIC support is enabled. - APM Advanced Power Management support is enabled. - APPARMOR AppArmor support is enabled. - ARM ARM architecture is enabled. - ARM64 ARM64 architecture is enabled. - AX25 Appropriate AX.25 support is enabled. - CLK Common clock infrastructure is enabled. - CMA Contiguous Memory Area support is enabled. - DRM Direct Rendering Management support is enabled. - DYNAMIC_DEBUG Build in debug messages and enable them at runtime - EARLY Parameter processed too early to be embedded in initrd. - EDD BIOS Enhanced Disk Drive Services (EDD) is enabled - EFI EFI Partitioning (GPT) is enabled - EVM Extended Verification Module - FB The frame buffer device is enabled. - FTRACE Function tracing enabled. - GCOV GCOV profiling is enabled. - HIBERNATION HIBERNATION is enabled. - HW Appropriate hardware is enabled. - HYPER_V HYPERV support is enabled. - IMA Integrity measurement architecture is enabled. - IP_PNP IP DHCP, BOOTP, or RARP is enabled. - IPV6 IPv6 support is enabled. - ISAPNP ISA PnP code is enabled. - ISDN Appropriate ISDN support is enabled. - ISOL CPU Isolation is enabled. - JOY Appropriate joystick support is enabled. 
- KGDB Kernel debugger support is enabled. - KVM Kernel Virtual Machine support is enabled. - LIBATA Libata driver is enabled - LOONGARCH LoongArch architecture is enabled. - LOOP Loopback device support is enabled. - LP Printer support is enabled. - M68k M68k architecture is enabled. - These options have more detailed description inside of - Documentation/arch/m68k/kernel-options.rst. - MDA MDA console support is enabled. - MIPS MIPS architecture is enabled. - MOUSE Appropriate mouse support is enabled. - MSI Message Signaled Interrupts (PCI). - MTD MTD (Memory Technology Device) support is enabled. - NET Appropriate network support is enabled. - NFS Appropriate NFS support is enabled. - NUMA NUMA support is enabled. - OF Devicetree is enabled. - PARISC The PA-RISC architecture is enabled. - PCI PCI bus support is enabled. - PCIE PCI Express support is enabled. - PCMCIA The PCMCIA subsystem is enabled. - PNP Plug & Play support is enabled. - PPC PowerPC architecture is enabled. - PPT Parallel port support is enabled. - PS2 Appropriate PS/2 support is enabled. - PV_OPS A paravirtualized kernel is enabled. - RAM RAM disk support is enabled. - RDT Intel Resource Director Technology. - RISCV RISCV architecture is enabled. - S390 S390 architecture is enabled. - SCSI Appropriate SCSI support is enabled. - A lot of drivers have their options described inside - the Documentation/scsi/ sub-directory. - SDW SoundWire support is enabled. - SECURITY Different security models are enabled. - SELINUX SELinux support is enabled. - SERIAL Serial support is enabled. - SH SuperH architecture is enabled. - SMP The kernel is an SMP kernel. - SPARC Sparc architecture is enabled. - SUSPEND System suspend states are enabled. - SWSUSP Software suspend (hibernation) is enabled. - TPM TPM drivers are enabled. - UMS USB Mass Storage support is enabled. - USB USB support is enabled. - USBHID USB Human Interface Device support is enabled. - V4L Video For Linux support is enabled. 
- VGA The VGA console has been enabled. - VMMIO Driver for memory mapped virtio devices is enabled. - VT Virtual terminal support is enabled. - WDT Watchdog support is enabled. - X86-32 X86-32, aka i386 architecture is enabled. - X86-64 X86-64 architecture is enabled. - X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) - X86_UV SGI UV support is enabled. - XEN Xen support is enabled - XTENSA xtensa architecture is enabled. - -In addition, the following text indicates that the option:: - - BOOT Is a boot loader parameter. - BUGS= Relates to possible processor bugs on the said processor. - KNL Is a kernel start-up parameter. +is applicable. Parameters denoted with BOOT are actually interpreted by the boot loader, and have no meaning to the kernel directly. @@ -213,7 +120,7 @@ need or coordination with <Documentation/arch/x86/boot.rst>. There are also arch-specific kernel-parameters not documented here. Note that ALL kernel parameters listed below are CASE SENSITIVE, and that -a trailing = on the name of any parameter states that that parameter will +a trailing = on the name of any parameter states that the parameter will be entered as an environment variable, whereas its absence indicates that it will appear as a kernel argument readable via /proc/cmdline by programs running once the system is up. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index fb8752b42ec8..4d0f545fb3ec 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1,3 +1,101 @@ + ACPI ACPI support is enabled. + AGP AGP (Accelerated Graphics Port) is enabled. + ALSA ALSA sound support is enabled. + APIC APIC support is enabled. + APM Advanced Power Management support is enabled. + APPARMOR AppArmor support is enabled. + ARM ARM architecture is enabled. + ARM64 ARM64 architecture is enabled. + CLK Common clock infrastructure is enabled. 
+ CMA Contiguous Memory Area support is enabled. + DRM Direct Rendering Management support is enabled. + DYNAMIC_DEBUG Build in debug messages and enable them at runtime + EARLY Parameter processed too early to be embedded in initrd. + EDD BIOS Enhanced Disk Drive Services (EDD) is enabled + EFI EFI Partitioning (GPT) is enabled + EVM Extended Verification Module + FB The frame buffer device is enabled. + FTRACE Function tracing enabled. + GCOV GCOV profiling is enabled. + HIBERNATION HIBERNATION is enabled. + HW Appropriate hardware is enabled. + HYPER_V HYPERV support is enabled. + IMA Integrity measurement architecture is enabled. + IP_PNP IP DHCP, BOOTP, or RARP is enabled. + IPV6 IPv6 support is enabled. + ISAPNP ISA PnP code is enabled. + ISDN Appropriate ISDN support is enabled. + ISOL CPU Isolation is enabled. + JOY Appropriate joystick support is enabled. + KGDB Kernel debugger support is enabled. + KVM Kernel Virtual Machine support is enabled. + LIBATA Libata driver is enabled + LOONGARCH LoongArch architecture is enabled. + LOOP Loopback device support is enabled. + LP Printer support is enabled. + M68k M68k architecture is enabled. + These options have more detailed description inside of + Documentation/arch/m68k/kernel-options.rst. + MDA MDA console support is enabled. + MIPS MIPS architecture is enabled. + MOUSE Appropriate mouse support is enabled. + MSI Message Signaled Interrupts (PCI). + MTD MTD (Memory Technology Device) support is enabled. + NET Appropriate network support is enabled. + NFS Appropriate NFS support is enabled. + NUMA NUMA support is enabled. + OF Devicetree is enabled. + PARISC The PA-RISC architecture is enabled. + PCI PCI bus support is enabled. + PCIE PCI Express support is enabled. + PCMCIA The PCMCIA subsystem is enabled. + PNP Plug & Play support is enabled. + PPC PowerPC architecture is enabled. + PPT Parallel port support is enabled. + PS2 Appropriate PS/2 support is enabled. + PV_OPS A paravirtualized kernel is enabled. 
+ RAM RAM disk support is enabled. + RDT Intel Resource Director Technology. + RISCV RISCV architecture is enabled. + S390 S390 architecture is enabled. + SCSI Appropriate SCSI support is enabled. + A lot of drivers have their options described inside + the Documentation/scsi/ sub-directory. + SDW SoundWire support is enabled. + SECURITY Different security models are enabled. + SELINUX SELinux support is enabled. + SERIAL Serial support is enabled. + SH SuperH architecture is enabled. + SMP The kernel is an SMP kernel. + SPARC Sparc architecture is enabled. + SUSPEND System suspend states are enabled. + SWSUSP Software suspend (hibernation) is enabled. + TPM TPM drivers are enabled. + UMS USB Mass Storage support is enabled. + USB USB support is enabled. + NVME NVMe support is enabled + USBHID USB Human Interface Device support is enabled. + V4L Video For Linux support is enabled. + VGA The VGA console has been enabled. + VMMIO Driver for memory mapped virtio devices is enabled. + VT Virtual terminal support is enabled. + WDT Watchdog support is enabled. + X86-32 X86-32, aka i386 architecture is enabled. + X86-64 X86-64 architecture is enabled. + X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) + X86_UV SGI UV support is enabled. + XEN Xen support is enabled + XTENSA xtensa architecture is enabled. + +In addition, the following text indicates that the option + + BOOT Is a boot loader parameter. + BUGS= Relates to possible processor bugs on the said processor. + KNL Is a kernel start-up parameter. + + +Kernel parameters + accept_memory= [MM] Format: { eager | lazy } default: lazy @@ -27,6 +125,8 @@ may result in duplicate corrected error reports. 
nospcr -- disable console in ACPI SPCR table as default _serial_ console on ARM64 + spcr -- enable console in ACPI SPCR table as + default _serial_ console on x86 For ARM64, ONLY "acpi=off", "acpi=on", "acpi=force" or "acpi=nospcr" are available For RISCV64, ONLY "acpi=off", "acpi=on" or "acpi=force" @@ -89,6 +189,14 @@ unusable. The "log_buf_len" parameter may be useful if you need to capture more output. + acpi.poweroff_on_fatal= [ACPI] + {0 | 1} + Causes the system to poweroff when the ACPI bytecode signals + a fatal error. The default value of this setting is 1. + Overriding this value should only be done for diagnosing + ACPI firmware problems, as the system might behave erratically + after having encountered a fatal ACPI error. + acpi_enforce_resources= [ACPI] { strict | lax | no } Check for resource conflicts between native drivers @@ -392,6 +500,13 @@ disable Disable amd-pstate preferred core. + amd_dynamic_epp= + [X86] + disable + Disable amd-pstate dynamic EPP. + enable + Enable amd-pstate dynamic EPP. + amijoy.map= [HW,JOY] Amiga joystick support Map of devices attached to JOY0DAT and JOY1DAT Format: <a>,<b> @@ -416,10 +531,6 @@ Format: { quiet (default) | verbose | debug } Change the amount of debugging information output when initialising the APIC and IO-APIC components. - For X86-32, this can also be used to specify an APIC - driver name. - Format: apic=driver_name - Examples: apic=bigsmp apic_extnmi= [APIC,X86,EARLY] External NMI delivery setting Format: { bsp (default) | all | none } @@ -462,6 +573,9 @@ arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory Set instructions support + arm64.nompam [ARM64] Unconditionally disable Memory Partitioning And + Monitoring support + arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension support @@ -518,23 +632,6 @@ 1 - Enable the BAU. unset - Disable the BAU. 
- baycom_epp= [HW,AX25] - Format: <io>,<mode> - - baycom_par= [HW,AX25] BayCom Parallel Port AX.25 Modem - Format: <io>,<mode> - See header of drivers/net/hamradio/baycom_par.c. - - baycom_ser_fdx= [HW,AX25] - BayCom Serial Port AX.25 Modem (Full Duplex Mode) - Format: <io>,<irq>,<mode>[,<baud>] - See header of drivers/net/hamradio/baycom_ser_fdx.c. - - baycom_ser_hdx= [HW,AX25] - BayCom Serial Port AX.25 Modem (Half Duplex Mode) - Format: <io>,<irq>,<mode> - See header of drivers/net/hamradio/baycom_ser_hdx.c. - bdev_allow_write_mounted= Format: <bool> Control the ability to open a mounted block device @@ -609,6 +706,24 @@ ccw_timeout_log [S390] See Documentation/arch/s390/common_io.rst for details. + cfi= [X86-64] Set Control Flow Integrity checking features + when CONFIG_FINEIBT is enabled. + Format: feature[,feature...] + Default: auto + + auto: Use FineIBT if IBT available, otherwise kCFI. + Under FineIBT, enable "paranoid" mode when + FRED is not available. + off: Turn off CFI checking. + kcfi: Use kCFI (disable FineIBT). + fineibt: Use FineIBT (even if IBT not available). + norand: Do not re-randomize CFI hashes. + paranoid: Add caller hash checking under FineIBT. + bhi: Enable register poisoning to stop speculation + across FineIBT. (Disabled by default.) + warn: Do not enforce CFI checking: warn only. + debug: Report CFI initialization details. + cgroup_disable= [KNL] Disable a particular controller or optional feature Format: {name of the controller(s) or feature(s) to disable} The effects of cgroup_disable=foo are: @@ -634,6 +749,14 @@ named mounts. Specifying both "all" and "named" disables all v1 hierarchies. + cgroup_v1_proc= [KNL] Show also missing controllers in /proc/cgroups + Format: { "true" | "false" } + /proc/cgroups lists only v1 controllers by default. 
+ This compatibility option enables listing also v2 + controllers (whose v1 code is not compiled!), so that + semi-legacy software can check this file to decide + about usage of v2 (sic) controllers. + cgroup_favordynmods= [KNL] Enable or Disable favordynmods. Format: { "true" | "false" } Defaults to the value of CONFIG_CGROUP_FAVOR_DYNMODS. @@ -644,6 +767,14 @@ nokmem -- Disable kernel memory accounting. nobpf -- Disable BPF memory accounting. + check_pages= [MM,EARLY] Enable sanity checking of pages after + allocations / before freeing. This adds checks to catch + double-frees, use-after-frees, and other sources of + page corruption by inspecting page internals (flags, + mapcount/refcount, memcg_data, etc.). + Format: { "0" | "1" } + Default: 0 (1 if CONFIG_DEBUG_VM is set) + checkreqprot= [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } See security/selinux/Kconfig help text. @@ -987,6 +1118,28 @@ 0: to disable low allocation. It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. + crashkernel=size[KMG],cma + [KNL, X86, ppc] Reserve additional crash kernel memory from + CMA. This reservation is usable by the first system's + userspace memory and kernel movable allocations (memory + balloon, zswap). Pages allocated from this memory range + will not be included in the vmcore so this should not + be used if dumping of userspace memory is intended and + it has to be expected that some movable kernel pages + may be missing from the dump. + + A standard crashkernel reservation, as described above, + is still needed to hold the crash kernel and initrd. + + This option increases the risk of a kdump failure: DMA + transfers configured by the first kernel may end up + corrupting the second kernel's memory. 
+ + This reservation method is intended for systems that + can't afford to sacrifice enough memory for standard + crashkernel reservation and where less reliable and + possibly incomplete kdump is preferable to no kdump at + all. cryptomgr.notests [KNL] Disable crypto self-tests @@ -1066,12 +1219,8 @@ debugfs= [KNL,EARLY] This parameter enables what is exposed to userspace and debugfs internal clients. - Format: { on, no-mount, off } + Format: { on, off } on: All functions are enabled. - no-mount: - Filesystem is not registered but kernel clients can - access APIs and a crashkernel can be used to read - its content. There is nothing to mount. off: Filesystem is not registered and clients get a -EPERM as result when trying to register files or directories within debugfs. @@ -1221,6 +1370,13 @@ For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst + dm_verity.keyring_unsealed= + [KNL] When set to 1, leave the dm-verity keyring + unsealed after initialization so userspace can + provision keys. Once the keyring is restricted + it becomes active and is searched during signature + verification. + driver_async_probe= [KNL] List of driver names to be probed asynchronously. * matches with all driver names. If * is specified, the @@ -1411,7 +1567,8 @@ earlyprintk=serial[,0x...[,baudrate]] earlyprintk=ttySn[,baudrate] earlyprintk=dbgp[debugController#] - earlyprintk=pciserial[,force],bus:device.function[,baudrate] + earlyprintk=mmio32,membase[,{nocfg|baudrate}] + earlyprintk=pciserial[,force],bus:device.function[,{nocfg|baudrate}] earlyprintk=xdbc[xhciController#] earlyprintk=bios @@ -1419,6 +1576,9 @@ the normal console is initialized. It is not enabled by default because it has some cosmetic problems. + Use "nocfg" to skip UART configuration, assume + BIOS/firmware has configured UART correctly. + Append ",keep" to not disable it when the real console takes over. @@ -1587,8 +1747,8 @@ fred= [X86-64] Enable/disable Flexible Return and Event Delivery. 
Format: { on | off } - on: enable FRED when it's present. - off: disable FRED, the default setting. + on: enable FRED when it's present, the default setting. + off: disable FRED. ftrace=[tracer] [FTRACE] will set and start the specified tracer @@ -1785,7 +1945,9 @@ allocation boundaries as a proactive defense against bounds-checking flaws in the kernel's copy_to_user()/copy_from_user() interface. - on Perform hardened usercopy checks (default). + The default is determined by + CONFIG_HARDENED_USERCOPY_DEFAULT_ON. + on Perform hardened usercopy checks. off Disable hardened usercopy checks. hardlockup_all_cpu_backtrace= @@ -1793,6 +1955,30 @@ backtraces on all cpus. Format: 0 | 1 + hash_pointers= + [KNL,EARLY] + By default, when pointers are printed to the console + or buffers via the %p format string, that pointer is + "hashed", i.e. obscured by hashing the pointer value. + This is a security feature that hides actual kernel + addresses from unprivileged users, but it also makes + debugging the kernel more difficult since unequal + pointers can no longer be compared. The choices are: + Format: { auto | always | never } + Default: auto + + auto - Hash pointers unless slab_debug is enabled. + always - Always hash pointers (even if slab_debug is + enabled). + never - Never hash pointers. This option should only + be specified when debugging the kernel. Do + not use on production kernels. The boot + param "no_hash_pointers" is an alias for + this mode. + + For controlling hashing dynamically at runtime, + use the "kernel.kptr_restrict" sysctl instead. + hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for 64-bit NUMA, off otherwise. @@ -1826,6 +2012,23 @@ lz4: Select LZ4 compression algorithm to compress/decompress hibernation image. + hibernate.pm_test_delay= + [HIBERNATION] + Sets the number of seconds to remain in a hibernation test + mode before resuming the system (see + /sys/power/pm_test). 
Only available when CONFIG_PM_DEBUG + is set. Default value is 5. + + hibernate_compression_threads= + [HIBERNATION] + Set the number of threads used for compressing or decompressing + hibernation images. + + Format: <integer> + Default: 3 + Minimum: 1 + Example: hibernate_compression_threads=4 + highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact size of <nn>. This works even on boxes that have no highmem otherwise. This also works to reduce highmem @@ -1861,7 +2064,7 @@ hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET registers. Default set by CONFIG_HPET_MMAP_DEFAULT. - hugepages= [HW] Number of HugeTLB pages to allocate at boot. + hugepages= [HW,EARLY] Number of HugeTLB pages to allocate at boot. If this follows hugepagesz (below), it specifies the number of pages of hugepagesz to be allocated. If this is the first HugeTLB parameter on the command @@ -1873,15 +2076,24 @@ <node>:<integer>[,<node>:<integer>] hugepagesz= - [HW] The size of the HugeTLB pages. This is used in - conjunction with hugepages (above) to allocate huge - pages of a specific size at boot. The pair - hugepagesz=X hugepages=Y can be specified once for - each supported huge page size. Huge page sizes are - architecture dependent. See also + [HW,EARLY] The size of the HugeTLB pages. This is + used in conjunction with hugepages (above) to + allocate huge pages of a specific size at boot. The + pair hugepagesz=X hugepages=Y can be specified once + for each supported huge page size. Huge page sizes + are architecture dependent. See also Documentation/admin-guide/mm/hugetlbpage.rst. Format: size[KMG] + hugepage_alloc_threads= + [HW] The number of threads that should be used to + allocate hugepages during boot. This option can be + used to improve system bootup time when allocating + a large amount of huge pages. + The default value is 25% of the available hardware threads. + + Note that this parameter only applies to non-gigantic huge pages. 
+ hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation of gigantic hugepages. Or using node format, the size of a CMA area per node can be specified. @@ -1892,6 +2104,13 @@ hugepages using the CMA allocator. If enabled, the boot-time allocation of gigantic hugepages is skipped. + hugetlb_cma_only= + [HW,CMA,EARLY] When allocating new HugeTLB pages, only + try to allocate from the CMA areas. + + This option does nothing if hugetlb_cma= is not also + specified. + hugetlb_free_vmemmap= [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP enabled. @@ -1913,14 +2132,20 @@ the added memory block itself do not be affected. hung_task_panic= - [KNL] Should the hung task detector generate panics. - Format: 0 | 1 + [KNL] Number of hung tasks to trigger kernel panic. + Format: <int> + + When set to a non-zero value, a kernel panic will be triggered if + the number of detected hung tasks reaches this value. - A value of 1 instructs the kernel to panic when a - hung task is detected. The default value is controlled - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time - option. The value selected by this boot parameter can - be changed later by the kernel.hung_task_panic sysctl. + 0: don't panic + 1: panic immediately on first hung task + N: panic after N hung tasks are detected in a single scan + + The default value is controlled by the + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value + selected by this boot parameter can be changed later by the + kernel.hung_task_panic sysctl. hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) terminal devices. Valid values: 0..8 @@ -1933,6 +2158,12 @@ which allow the hypervisor to 'idle' the guest on lock contention. + hw_protection= [HW] + Format: reboot | shutdown + + Hardware protection action taken on critical events like + overtemperature or imminent voltage loss. 
+ i2c_bus= [HW] Override the default board specific I2C bus speed or register an additional I2C bus that is not registered from board initialization code. @@ -2161,22 +2392,27 @@ [IMA] Define a custom template format. Format: { "field1|...|fieldN" } - ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage - Format: <min_file_size> - Set the minimal file size for using asynchronous hash. - If left unspecified, ahash usage is disabled. - - ahash performance varies for different data sizes on - different crypto accelerators. This option can be used - to achieve the best performance for a particular HW. - - ima.ahash_bufsize= [IMA] Asynchronous hash buffer size - Format: <bufsize> - Set hashing buffer size. Default: 4k. + ima= [IMA] Enable or disable IMA + Format: { "off" | "on" } + Default: "on" + Note that disabling IMA is limited to kdump kernel. + + indirect_target_selection= [X86,Intel] Mitigation control for Indirect + Target Selection(ITS) bug in Intel CPUs. Updated + microcode is also required for a fix in IBPB. + + on: Enable mitigation (default). + off: Disable mitigation. + force: Force the ITS bug and deploy default + mitigation. + vmexit: Only deploy mitigation if CPU is affected by + guest/host isolation part of ITS. + stuff: Deploy RSB-fill mitigation when retpoline is + also deployed. Otherwise, deploy the default + mitigation. - ahash performance varies for different chunk sizes on - different crypto accelerators. This option can be used - to achieve best performance for particular HW. + For details see: + Documentation/admin-guide/hw-vuln/indirect-target-selection.rst init= [KNL] Format: <full_path> @@ -2316,6 +2552,9 @@ per_cpu_perf_limits Allow per-logical-CPU P-State performance control limits using cpufreq sysfs interface + no_cas + Do not enable capacity-aware scheduling (CAS) on + hybrid systems intremap= [X86-64,Intel-IOMMU,EARLY] on enable Interrupt Remapping (default) @@ -2356,15 +2595,11 @@ Intel machines). 
This can be used to prevent the usage of an available hardware IOMMU. - [X86] pt - [X86] nopt - [PPC/POWERNV] - nobypass + nobypass [PPC/POWERNV] Disable IOMMU bypass, using IOMMU for PCI devices. - [X86] AMD Gart HW IOMMU-specific options: <size> @@ -2429,6 +2664,15 @@ 1 - Bypass the IOMMU for DMA. unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH. + iommu.debug_pagealloc= + [KNL,EARLY] When CONFIG_IOMMU_DEBUG_PAGEALLOC is set, this + parameter enables the feature at boot time. By default, it + is disabled and the system behaves the same way as a kernel + built without CONFIG_IOMMU_DEBUG_PAGEALLOC. + Format: { "0" | "1" } + 0 - Sanitizer disabled. + 1 - Sanitizer enabled, expect runtime overhead. + io7= [HW] IO7 for Marvel-based Alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. @@ -2484,11 +2728,23 @@ requires the kernel to be built with CONFIG_ARM64_PSEUDO_NMI. + irqchip.riscv_imsic_noipi + [RISC-V,EARLY] + Force the kernel to not use IMSIC software injected MSIs + as IPIs. Intended for system where IMSIC is trap-n-emulated, + and thus want to reduce MMIO traps when triggering IPIs + to multiple harts. + irqfixup [HW] When an interrupt is not handled search all handlers for it. Intended to get systems with badly broken firmware running. + irqhandler.duration_warn_us= [KNL] + Warn if an IRQ handler exceeds the specified duration + threshold in microseconds. Useful for identifying + long-running IRQs in the system. + irqpoll [HW] When an interrupt is not handled search all handlers for it. Also check all handlers each timer @@ -2659,6 +2915,47 @@ for Movable pages. "nn[KMGTPE]", "nn%", and "mirror" are exclusive, so you cannot specify multiple forms. + kfence.burst= [MM,KFENCE] The number of additional successive + allocations to be attempted through KFENCE for each + sample interval. 
+ Format: <unsigned integer> + Default: 0 + + kfence.check_on_panic= + [MM,KFENCE] Whether to check all KFENCE-managed objects' + canaries on panic. + Format: <bool> + Default: false + + kfence.deferrable= + [MM,KFENCE] Whether to use a deferrable timer to trigger + allocations. This avoids forcing CPU wake-ups if the + system is idle, at the risk of a less predictable + sample interval. + Format: <bool> + Default: CONFIG_KFENCE_DEFERRABLE + + kfence.fault= [MM,KFENCE] Controls the behavior when a KFENCE + error is detected. + report - print the error report and continue (default). + oops - print the error report and oops. + panic - print the error report and panic. + + kfence.sample_interval= + [MM,KFENCE] KFENCE's sample interval in milliseconds. + Format: <unsigned integer> + 0 - Disable KFENCE. + >0 - Enabled KFENCE with given sample interval. + Default: CONFIG_KFENCE_SAMPLE_INTERVAL + + kfence.skip_covered_thresh= + [MM,KFENCE] If pool utilization reaches this threshold + (pool usage%), KFENCE limits currently covered + allocations of the same source from further filling + up the pool. + Format: <unsigned integer> + Default: 75 + kgdbdbgp= [KGDB,HW,EARLY] kgdb over EHCI usb debug port. Format: <Controller#>[,poll interval] The controller # is the number of the ehci usb debug @@ -2698,6 +2995,31 @@ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the kernel debugger at the earliest opportunity. + kho= [KEXEC,EARLY] + Format: { "0" | "1" | "off" | "on" | "y" | "n" } + Enables or disables Kexec HandOver. + "0" | "off" | "n" - kexec handover is disabled + "1" | "on" | "y" - kexec handover is enabled + + kho_scratch= [KEXEC,EARLY] + Format: ll[KMG],mm[KMG],nn[KMG] | nn% + Defines the size of the KHO scratch region. The KHO + scratch regions are physically contiguous memory + ranges that can only be used for non-kernel + allocations. 
That way, even when memory is heavily + fragmented with handed over memory, the kexeced + kernel will always have enough contiguous ranges to + bootstrap itself. + + It is possible to specify the exact amount of + memory in the form of "ll[KMG],mm[KMG],nn[KMG]" + where the first parameter defines the size of a low + memory scratch area, the second parameter defines + the size of a global scratch area and the third + parameter defines the size of additional per-node + scratch areas. The form "nn%" defines scale factor + (in percents) of memory that was used during boot. + kmac= [MIPS] Korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. @@ -2761,6 +3083,26 @@ Default is Y (on). + kvm.enable_pmu=[KVM,X86] + If enabled, KVM will virtualize PMU functionality based + on the virtual CPU model defined by userspace. This + can be overridden on a per-VM basis via + KVM_CAP_PMU_CAPABILITY. + + If disabled, KVM will not virtualize PMU functionality, + e.g. MSRs, PMCs, PMIs, etc., even if userspace defines + a virtual CPU model that contains PMU assets. + + Note, KVM's vPMU support implicitly requires running + with an in-kernel local APIC, e.g. to deliver PMIs to + the guest. Running without an in-kernel local APIC is + not supported, though KVM will allow such a combination + (with severely degraded functionality). + + See also enable_mediated_pmu. + + Default is Y (on). + kvm.enable_virt_at_load=[KVM,ARM64,LOONGARCH,MIPS,RISCV,X86] If enabled, KVM will enable virtualization in hardware when KVM is loaded, and disable virtualization when KVM @@ -2807,6 +3149,35 @@ If the value is 0 (the default), KVM will pick a period based on the ratio, such that a page is zapped after 1 hour on average. + kvm-{amd,intel}.enable_mediated_pmu=[KVM,AMD,INTEL] + If enabled, KVM will provide a mediated virtual PMU, + instead of the default perf-based virtual PMU (if + kvm.enable_pmu is true and PMU is enumerated via the + virtual CPU model). 
+ + With a perf-based vPMU, KVM operates as a user of perf, + i.e. emulates guest PMU counters using perf events. + KVM-created perf events are managed by perf as regular + (guest-only) events, e.g. are scheduled in/out, contend + for hardware resources, etc. Using a perf-based vPMU + allows guest and host usage of the PMU to co-exist, but + incurs non-trivial overhead and can result in silently + dropped guest events (due to resource contention). + + With a mediated vPMU, hardware PMU state is context + switched around the world switch to/from the guest. + KVM mediates which events the guest can utilize, but + gives the guest direct access to all other PMU assets + when possible (KVM may intercept some accesses if the + virtual CPU model provides a subset of hardware PMU + functionality). Using a mediated vPMU significantly + reduces PMU virtualization overhead and eliminates lost + guest events, but is mutually exclusive with using perf + to profile KVM guests and adds latency to most VM-Exits + (to context switch PMU state). + + Default is N (off). + kvm-amd.nested= [KVM,AMD] Control nested virtualization feature in KVM/SVM. Default is 1 (enabled). @@ -2815,6 +3186,27 @@ (enabled). Disable by KVM if hardware lacks support for NPT. + kvm-amd.ciphertext_hiding_asids= + [KVM,AMD] Ciphertext hiding prevents disallowed accesses + to SNP private memory from reading ciphertext. Instead, + reads will see constant default values (0xff). + + If ciphertext hiding is enabled, the joint SEV-ES and + SEV-SNP ASID space is partitioned into separate SEV-ES + and SEV-SNP ASID ranges, with the SEV-SNP range being + [1..max_snp_asid] and the SEV-ES range being + (max_snp_asid..min_sev_asid), where min_sev_asid is + enumerated by CPUID.0x.8000_001F[EDX]. + + A non-zero value enables SEV-SNP ciphertext hiding and + adjusts the ASID ranges for SEV-ES and SEV-SNP guests. + KVM caps the number of SEV-SNP ASIDs at the maximum + possible value, e.g. 
specifying -1u will assign all + joint SEV-ES and SEV-SNP ASIDs to SEV-SNP. Note, + assigning all joint ASIDs to SEV-SNP, i.e. configuring + max_snp_asid == min_sev_asid-1, will effectively make + SEV-ES unusable. + kvm-arm.mode= [KVM,ARM,EARLY] Select one of KVM/arm64's modes of operation. @@ -2837,8 +3229,8 @@ for the host. To force nVHE on VHE hardware, add "arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the command-line. - "nested" is experimental and should be used with - extreme caution. + "nested" and "protected" are experimental and should be + used with extreme caution. kvm-arm.vgic_v3_group0_trap= [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0 @@ -3108,6 +3500,11 @@ * [no]logdir: Enable or disable access to the general purpose log directory. + * max_sec=<sectors>: Set the transfer size limit, in + number of 512-byte sectors, to the value specified in + <sectors>. The value specified in <sectors> has to be + a non-zero positive integer. + * max_sec_128: Set transfer size limit to 128 sectors. * max_sec_1024: Set or clear transfer size limit to @@ -3116,6 +3513,8 @@ * max_sec_lba48: Set or clear transfer size limit to 65535 sectors. + * external: Mark port as external (hotplug-capable). + * [no]lpm: Enable or disable link power management. * [no]setxfer: Indicate if transfer speed mode setting @@ -3131,7 +3530,10 @@ If there are multiple matching configurations changing the same attribute, the last one is used. - load_ramdisk= [RAM] [Deprecated] + liveupdate= [KNL,EARLY] + Format: <bool> + Enable Live Update Orchestrator (LUO). + Default: off. lockd.nlm_grace_period=P [NFS] Assign grace period. Format: <integer> @@ -3556,7 +3958,7 @@ looking for corruption. Enabling this will both detect corruption and prevent the kernel from using the memory being corrupted. 
- However, its intended as a diagnostic tool; if + However, it's intended as a diagnostic tool; if repeatable BIOS-originated corruption always affects the same memory, you can use memmap= to prevent the kernel from using that memory. @@ -3623,8 +4025,16 @@ mga= [HW,DRM] - microcode.force_minrev= [X86] - Format: <bool> + microcode= [X86] Control the behavior of the microcode loader. + Available options, comma separated: + + base_rev=X - with <X> with format: <u32> + Set the base microcode revision of each thread when in + debug mode. + + dis_ucode_ldr: disable the microcode loader + + force_minrev: Enable or disable the microcode minimal revision enforcement for the runtime microcode loader. @@ -3664,6 +4074,7 @@ expose users to several CPU vulnerabilities. Equivalent to: if nokaslr then kpti=0 [ARM64] gather_data_sampling=off [X86] + indirect_target_selection=off [X86] kvm.nx_huge_pages=off [X86] l1tf=off [X86] mds=off [X86] @@ -3683,7 +4094,9 @@ spectre_v2_user=off [X86] srbds=off [X86,INTEL] ssbd=force-off [ARM64] + tsa=off [X86,AMD] tsx_async_abort=off [X86] + vmscape=off [X86] Exceptions: This does not have any effect on @@ -3708,6 +4121,10 @@ mmio_stale_data=full,nosmt [X86] retbleed=auto,nosmt [X86] + [X86] After one of the above options, additionally + supports attack-vector based controls as documented in + Documentation/admin-guide/hw-vuln/attack_vector_controls.rst + mminit_loglevel= [KNL,EARLY] When CONFIG_DEBUG_MEMORY_INIT is set, this parameter allows control of the logging verbosity for @@ -4088,18 +4505,7 @@ no_hash_pointers [KNL,EARLY] - Force pointers printed to the console or buffers to be - unhashed. By default, when a pointer is printed via %p - format string, that pointer is "hashed", i.e. obscured - by hashing the pointer value. This is a security feature - that hides actual kernel addresses from unprivileged - users, but it also makes debugging the kernel more - difficult since unequal pointers can no longer be - compared. 
However, if this command-line option is - specified, then all normal pointers will have their true - value printed. This option should only be specified when - debugging the kernel. Please do not use on production - kernels. + Alias for "hash_pointers=never". nohibernate [HIBERNATION] Disable hibernation and resume. @@ -4135,8 +4541,10 @@ Note that this argument takes precedence over the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option. - noinitrd [RAM] Tells the kernel not to load any configured - initial RAM disk. + noinitrd [Deprecated,RAM] Tells the kernel not to load any configured + initial RAM disk. Currently this parameter applies to + initrd only, not to initramfs. But it applies to both + in EFI mode. nointremap [X86-64,Intel-IOMMU,EARLY] Do not enable interrupt remapping. @@ -4233,10 +4641,10 @@ nosmp [SMP,EARLY] Tells an SMP kernel to act as a UP kernel, and disable the IO APIC. legacy for "maxcpus=0". - nosmt [KNL,MIPS,PPC,S390,EARLY] Disable symmetric multithreading (SMT). + nosmt [KNL,MIPS,PPC,EARLY] Disable symmetric multithreading (SMT). Equivalent to smt=1. - [KNL,X86,PPC] Disable symmetric multithreading (SMT). + [KNL,LOONGARCH,X86,PPC,S390] Disable symmetric multithreading (SMT). nosmt=force: Force disable SMT, cannot be undone via the sysfs control file. @@ -4362,6 +4770,18 @@ This can be set from sysctl after boot. See Documentation/admin-guide/sysctl/vm.rst for details. + nvme.quirks= [NVME] A list of quirk entries to augment the built-in + nvme quirk list. List entries are separated by a + '-' character. + Each entry has the form VendorID:ProductID:quirk_names. + The IDs are 4-digits hex numbers and quirk_names is a + list of quirk names separated by commas. A quirk name + can be prefixed by '^', meaning that the specified + quirk must be disabled. + + Example: + nvme.quirks=7710:2267:bogus_nid,^identify_cns-9900:7711:broken_msi + ohci1394_dma=early [HW,EARLY] enable debugging via the ohci1394 driver. 
See Documentation/core-api/debugging-via-ohci1394.rst for more info. @@ -4444,6 +4864,21 @@ panic_on_warn=1 panic() instead of WARN(). Useful to cause kdump on a WARN(). + panic_force_cpu= + [KNL,SMP] Force panic handling to execute on a specific CPU. + Format: <cpu number> + Some platforms require panic handling to occur on a + specific CPU for the crash kernel to function correctly. + This can be due to firmware limitations, interrupt routing + constraints, or platform-specific requirements where only + a particular CPU can safely enter the crash kernel. + When set, panic() will redirect execution to the specified + CPU before proceeding with the normal panic and kexec flow. + If the target CPU is offline or unavailable, panic proceeds + on the current CPU. + This option should only be used for systems with the above + constraints as it might cause the panic operation to be less reliable. + panic_print= Bitmask for printing system info when panic happens. User can chose combination of the following bits: bit 0: print all tasks info @@ -4451,7 +4886,7 @@ bit 2: print timer info bit 3: print locks info if CONFIG_LOCKDEP is on bit 4: print ftrace buffer - bit 5: print all printk messages in buffer + bit 5: replay all kernel messages on consoles at the end of panic bit 6: print all CPUs backtrace (if available in the arch) bit 7: print only tasks in uninterruptible (blocked) state *Be aware* that this option may print a _lot_ of lines, @@ -4459,6 +4894,25 @@ Use this option carefully, maybe worth to setup a bigger log buffer with "log_buf_len" along with this. + panic_sys_info= A comma separated list of extra information to be dumped + on panic. + Format: val[,val...] 
+ Where @val can be any of the following: + + tasks: print all tasks info + mem: print system memory info + timers: print timers info + locks: print locks info if CONFIG_LOCKDEP is on + ftrace: print ftrace buffer + all_bt: print all CPUs backtrace (if available in the arch) + blocked_tasks: print only tasks in uninterruptible (blocked) state + + This is a human readable alternative to the 'panic_print' option. + + panic_console_replay + When panic happens, replay all kernel messages on + consoles at the end of panic. + parkbd.port= [HW] Parallel port number the keyboard adapter is connected to, default is 0. Format: <parport#> @@ -4918,6 +5372,18 @@ that number, otherwise (e.g., 'pmu_override=on'), MMCR1 remains 0. + pm_async= [PM] + Format: off + This parameter sets the initial value of the + /sys/power/pm_async sysfs knob at boot time. + If set to "off", disables asynchronous suspend and + resume of devices during system-wide power transitions. + This can be useful on platforms where device + dependencies are not well-defined, or for debugging + power management issues. Asynchronous operations are + enabled by default. + + pm_debug_messages [SUSPEND,KNL] Enable suspend/resume debug messages during boot up. @@ -5017,6 +5483,14 @@ Format: <bool> default: 0 (auto_verbose is enabled) + printk.debug_non_panic_cpus= + Allows storing messages from non-panic CPUs into + the printk log buffer during panic(). They are + flushed to consoles by the panic-CPU on + a best-effort basis. + Format: <bool> (1/Y/y=enable, 0/N/n=disable) + Default: disabled + printk.devkmsg={on,off,ratelimit} Control writing to /dev/kmsg. on - unlimited logging to /dev/kmsg from userspace @@ -5054,8 +5528,6 @@ Param: <number> - step/bucket size as a power of 2 for statistical time based profiling. - prompt_ramdisk= [RAM] [Deprecated] - prot_virt= [S390] enable hosting protected virtual machines isolated from the hypervisor (if hardware supports that). 
If enabled, the default kernel base address @@ -5112,7 +5584,7 @@ ramdisk_size= [RAM] Sizes of RAM disks in kilobytes See Documentation/admin-guide/blockdev/ramdisk.rst. - ramdisk_start= [RAM] RAM disk image start address + ramdisk_start= [Deprecated,RAM] RAM disk image start address random.trust_cpu=off [KNL,EARLY] Disable trusting the use of the CPU's @@ -5395,7 +5867,8 @@ echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" - Default is 0. + Default is 1 if num_possible_cpus() <= 16 and it is not explicitly + disabled by the boot parameter passing 0. rcuscale.gp_async= [KNL] Measure performance of asynchronous @@ -5617,6 +6090,31 @@ are zero, rcutorture acts as if is interpreted they are all non-zero. + rcutorture.gpwrap_lag= [KNL] + Enable grace-period wrap lag testing. Setting + to false prevents the gpwrap lag test from + running. Default is true. + + rcutorture.gpwrap_lag_gps= [KNL] + Set the value for grace-period wrap lag during + active lag testing periods. This controls how many + grace periods differences we tolerate between + rdp and rnp's gp_seq before setting overflow flag. + The default is always set to 8. + + rcutorture.gpwrap_lag_cycle_mins= [KNL] + Set the total cycle duration for gpwrap lag + testing in minutes. This is the total time for + one complete cycle of active and inactive + testing periods. Default is 30 minutes. + + rcutorture.gpwrap_lag_active_mins= [KNL] + Set the duration for which gpwrap lag is active + within each cycle, in minutes. During this time, + the grace-period wrap lag will be set to the + value specified by gpwrap_lag_gps. Default is + 5 minutes. + rcutorture.irqreader= [KNL] Run RCU readers from irq handlers, or, more accurately, from a timer handler. Not all RCU @@ -5758,6 +6256,11 @@ rcutorture.test_boost_duration= [KNL] Duration (s) of each individual boost test. 
+ rcutorture.test_boost_holdoff= [KNL] + Holdoff time (s) from start of test to the start + of RCU priority-boost testing. Defaults to zero, + that is, no holdoff. + rcutorture.test_boost_interval= [KNL] Interval (s) between each boost test. @@ -5870,13 +6373,6 @@ dynamically) adjusted. This parameter is intended for use in testing. - rcupdate.rcu_task_ipi_delay= [KNL] - Set time in jiffies during which RCU tasks will - avoid sending IPIs, starting with the beginning - of a given grace period. Setting a large - number avoids disturbing real-time workloads, - but lengthens grace periods. - rcupdate.rcu_task_lazy_lim= [KNL] Number of callbacks on a given CPU that will cancel laziness on that CPU. Use -1 to disable @@ -5920,14 +6416,6 @@ of zero will disable batching. Batching is always disabled for synchronize_rcu_tasks(). - rcupdate.rcu_tasks_trace_lazy_ms= [KNL] - Set timeout in milliseconds RCU Tasks - Trace asynchronous callback batching for - call_rcu_tasks_trace(). A negative value - will take the default. A value of zero will - disable batching. Batching is always disabled - for synchronize_rcu_tasks_trace(). - rcupdate.rcu_self_test= [KNL] Run the RCU early boot self tests @@ -5946,9 +6434,14 @@ rdt= [HW,X86,RDT] Turn on/off individual RDT features. List is: cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, - mba, smba, bmec. + mba, smba, bmec, abmc, sdciae, energy[:guid], + perf[:guid]. E.g. to turn on cmt and turn off mba use: rdt=cmt,!mba + To turn off all energy telemetry monitoring and ensure that + perf telemetry monitoring associated with guid 0x12345 + is enabled use: + rdt=!energy,perf:0x12345 reboot= [KNL] Format (x86 or x86_64): @@ -6082,7 +6575,7 @@ is assumed to be I/O ports; otherwise it is memory. reserve_mem= [RAM] - Format: nn[KNG]:<align>:<label> + Format: nn[KMG]:<align>:<label> Reserve physical memory and label it with a name that other subsystems can use to access it. 
This is typically used for systems that do not wipe the RAM, and this command @@ -6143,7 +6636,7 @@ that don't. off - no mitigation - auto - automatically select a migitation + auto - automatically select a mitigation auto,nosmt - automatically select a mitigation, disabling SMT if necessary for the full mitigation (only on Zen1 @@ -6192,13 +6685,22 @@ replacement properties are not found. See the Kconfig entry for RISCV_ISA_FALLBACK. + riscv_nousercfi= + all Disable user CFI ABI to userspace even if cpu extension + are available. + bcfi Disable user backward CFI ABI to userspace even if + the shadow stack extension is available. + fcfi Disable user forward CFI ABI to userspace even if the + landing pad extension is available. + ro [KNL] Mount root device read-only on boot rodata= [KNL,EARLY] on Mark read-only kernel memory as read-only (default). off Leave read-only kernel memory writable for debugging. - full Mark read-only kernel memory and aliases as read-only - [arm64] + noalias Mark read-only kernel memory as read-only but retain + writable aliases in the direct map for regions outside + of the kernel image. [arm64] rockchip.usb_uart [EARLY] @@ -6208,7 +6710,7 @@ port and the regular usb controller gets disabled. root= [KNL] Root filesystem - Usually this a a block device specifier of some kind, + Usually this is a block device specifier of some kind, see the early_lookup_bdev comment in block/early-lookup.c for details. Alternatively this can be "ram" for the legacy initial @@ -6220,6 +6722,14 @@ rootflags= [KNL] Set root filesystem mount option string + rseq_slice_ext= [KNL] RSEQ based time slice extension + Format: boolean + Control enablement of RSEQ based time slice extension. + Default is 'on'. + + initramfs_options= [KNL] + Specify mount options for the initramfs mount. + rootfstype= [KNL] Set root filesystem type rootwait [KNL] Wait (indefinitely) for root device to show up. 
@@ -6235,6 +6745,15 @@ Memory area to be used by remote processor image, managed by CMA. + rseq_debug= [KNL] Enable or disable restartable sequence + debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE. + Format: <bool> + + rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling + when CONFIG_RT_GROUP_SCHED=y. Defaults to + !CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED. + Format: <bool> + rw [KNL] Mount root device read-write on boot S [KNL] Run init in single mode @@ -6262,6 +6781,11 @@ sa1100ir [NET] See drivers/net/irda/sa1100_ir.c. + sched_proxy_exec= [KNL] + Enables or disables "proxy execution" style + solution to mutex-based priority inversion. + Format: <bool> + sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages. schedstats= [KNL,X86] Enable or disable scheduled statistics. @@ -6433,14 +6957,18 @@ slab_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_debug legacy name also accepted for now) + Using this option implies the "no_hash_pointers" + option which can be undone by adding the + "hash_pointers=always" option. + slab_max_order= [MM] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_max_order legacy name also accepted for now) slab_merge [MM] @@ -6455,13 +6983,14 @@ the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_min_objects legacy name also accepted for now) slab_min_order= [MM] Determines the minimum page order for slabs. Must be lower or equal to slab_max_order. 
For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_min_order legacy name also accepted for now) slab_nomerge [MM] @@ -6475,7 +7004,8 @@ cache (risks via metadata attacks are mostly unchanged). Debug options disable merging on their own. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_nomerge legacy name also accepted for now) slab_strict_numa [MM] @@ -6531,12 +7061,12 @@ softlockup_panic= [KNL] Should the soft-lockup detector generate panics. - Format: 0 | 1 + Format: <int> - A value of 1 instructs the soft-lockup detector - to panic the machine when a soft-lockup occurs. It is - also controlled by the kernel.softlockup_panic sysctl - and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the + A value of non-zero instructs the soft-lockup detector + to panic the machine when a soft-lockup duration exceeds + N thresholds. It is also controlled by the kernel.softlockup_panic + sysctl and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the respective build-time switch to that functionality. softlockup_all_cpu_backtrace= @@ -6582,6 +7112,8 @@ Selecting 'on' will also enable the mitigation against user space to user space task attacks. + Selecting specific mitigation does not force enable + user mitigations. Selecting 'off' will disable both the kernel and the user space protections. @@ -6861,8 +7393,13 @@ consumed by the stack hash table. By default this is set to false. + stack_depot_max_pools= [KNL,EARLY] + Specify the maximum number of pools to use for storing + stack traces. Pools are allocated on-demand up to this + limit. Default value is 8191 pools. + stacktrace [FTRACE] - Enabled the stack tracer on boot up. + Enable the stack tracer on boot up. stacktrace_filter=[function-list] [FTRACE] Limit the functions that the stack tracer @@ -6904,6 +7441,9 @@ them frequently to increase the rate of SLB faults on kernel addresses. 
+ no_slb_preload [PPC,EARLY] + Disables slb preloading for userspace. + sunrpc.min_resvport= sunrpc.max_resvport= [NFS,SUNRPC] @@ -7087,6 +7627,14 @@ causing a major performance hit, and the space where machines are deployed is by other means guarded. + tpm_crb_ffa.busy_timeout_ms= [ARM64,TPM] + Maximum time in milliseconds to retry sending a message + to the TPM service before giving up. This parameter controls + how long the system will continue retrying when the TPM + service is busy. + Format: <unsigned int> + Default: 2000 (2 seconds) + tpm_suspend_pcr=[HW,TPM] Format: integer pcr id Specify that at suspend time, the tpm driver @@ -7143,7 +7691,7 @@ (converted into nanoseconds). Fast, but depending on the architecture, may not be in sync between CPUs. - global - Event time stamps are synchronize across + global - Event time stamps are synchronized across CPUs. May be slower than the local clock, but better for some race conditions. counter - Simple counting of events (1, 2, ..) @@ -7241,6 +7789,8 @@ This is just one of many ways that can clear memory. Make sure your system keeps the content of memory across reboots before relying on this option. + NB: Both the mapped address and size must be page aligned for the architecture. + See also Documentation/trace/debugging.rst @@ -7261,12 +7811,12 @@ section. trace_trigger=[trigger-list] - [FTRACE] Add a event trigger on specific events. + [FTRACE] Add an event trigger on specific events. Set a trigger on top of a specific event, with an optional filter. - The format is is "trace_trigger=<event>.<trigger>[ if <filter>],..." - Where more than one trigger may be specified that are comma deliminated. + The format is "trace_trigger=<event>.<trigger>[ if <filter>],..." + Where more than one trigger may be specified that are comma delimited. 
For example: @@ -7274,11 +7824,20 @@ The above will enable the "stacktrace" trigger on the "sched_switch" event but only trigger it if the "prev_state" of the "sched_switch" - event is "2" (TASK_UNINTERUPTIBLE). + event is "2" (TASK_UNINTERRUPTIBLE). See also "Event triggers" in Documentation/trace/events.rst + traceoff_after_boot + [FTRACE] Sometimes tracing is used to debug issues + during the boot process. Since the trace buffer has a + limited amount of storage, it may be prudent to + disable tracing after the boot is finished, otherwise + the critical information may be overwritten. With this + option, the main tracing buffer will be turned off at + the end of the boot process. + traceoff_on_warning [FTRACE] enable this option to disable tracing when a warning is hit. This turns off "tracing_on". Tracing can @@ -7323,6 +7882,7 @@ - "tee" - "caam" - "dcp" + - "pkwm" If not specified then it defaults to iterating through the trust source list starting with TPM and assigns the first trust source as a backend which is initialized @@ -7350,6 +7910,19 @@ having this key zero'ed is acceptable. E.g. in testing scenarios. + tsa= [X86] Control mitigation for Transient Scheduler + Attacks on AMD CPUs. Search the following in your + favourite search engine for more details: + + "Technical guidance for mitigating transient scheduler + attacks". + + off - disable the mitigation + on - enable the mitigation (default) + user - mitigate only user/kernel transitions + vm - mitigate only guest/host transitions + + tsc= Disable clocksource stability checks for TSC. Format: <string> [x86] reliable: mark tsc clocksource as reliable, this @@ -7372,12 +7945,7 @@ (HPET or PM timer) on systems whose TSC frequency was obtained from HW or FW using either an MSR or CPUID(0x15). Warn if the difference is more than 500 ppm. 
- [x86] watchdog: Use TSC as the watchdog clocksource with - which to check other HW timers (HPET or PM timer), but - only on systems where TSC has been deemed trustworthy. - This will be suppressed by an earlier tsc=nowatchdog and - can be overridden by a later tsc=nowatchdog. A console - message will flag any such suppression or overriding. + [x86] watchdog: Enforce the clocksource watchdog on TSC tsc_early_khz= [X86,EARLY] Skip early TSC calibration and use the given value instead. Useful when the early TSC frequency discovery @@ -7477,6 +8045,22 @@ Note that genuine overcurrent events won't be reported either. + unaligned_scalar_speed= + [RISCV] + Format: {slow | fast | unsupported} + Allow skipping scalar unaligned access speed tests. This + is useful for testing alternative code paths and to skip + the tests in environments where they run too slowly. All + CPUs must have the same scalar unaligned access speed. + + unaligned_vector_speed= + [RISCV] + Format: {slow | fast | unsupported} + Allow skipping vector unaligned access speed tests. This + is useful for testing alternative code paths and to skip + the tests in environments where they run too slowly. All + CPUs must have the same vector unaligned access speed. + unknown_nmi_panic [X86] Cause panic on unknown NMI. @@ -7589,6 +8173,9 @@ p = USB_QUIRK_SHORT_SET_ADDRESS_REQ_TIMEOUT (Reduce timeout of the SET_ADDRESS request from 5000 ms to 500 ms); + q = USB_QUIRK_FORCE_ONE_CONFIG (Device + claims zero configurations, + forcing to 1); Example: quirks=0781:5580:bk,0a5c:5834:gij usbhid.mousepoll= @@ -7672,13 +8259,6 @@ 16 - SIGBUS faults Example: user_debug=31 - userpte= - [X86,EARLY] Flags controlling user PTE allocations. - - nohigh = do not allocate PTE pages in - HIGHMEM regardless of setting - of CONFIG_HIGHPTE. - vdso= [X86,SH,SPARC] On X86_32, this is an alias for vdso32=. Otherwise: @@ -7769,6 +8349,16 @@ vmpoff= [KNL,S390] Perform z/VM CP command after power off. 
Format: <command> + vmscape= [X86] Controls mitigation for VMscape attacks. + VMscape attacks can leak information from a userspace + hypervisor to a guest via speculative side-channels. + + off - disable the mitigation + ibpb - use Indirect Branch Prediction Barrier + (IBPB) mitigation (default) + force - force vulnerability detection even on + unaffected processors + vsyscall= [X86-64,EARLY] Controls the behavior of vsyscalls (i.e. calls to fixed addresses of 0xffffffffff600x00 from legacy @@ -7779,7 +8369,9 @@ emulate Vsyscalls turn into traps and are emulated reasonably safely. The vsyscall page is - readable. + readable. This disables the Linear + Address Space Separation (LASS) security + feature and makes the system less secure. xonly [default] Vsyscalls turn into traps and are emulated reasonably safely. The vsyscall @@ -7872,7 +8464,16 @@ CONFIG_WQ_WATCHDOG. It sets the number times of the stall to trigger panic. - The default is 0, which disables the panic on stall. + The default is set by CONFIG_BOOTPARAM_WQ_STALL_PANIC, + which is 0 (disabled) if not configured. + + workqueue.panic_on_stall_time=<uint> + Panic when a workqueue stall has been continuous for + the specified number of seconds. Unlike panic_on_stall + which counts accumulated stall events, this triggers + based on the duration of a single continuous stall. + + The default is 0, which disables the time-based panic. workqueue.cpu_intensive_thresh_us= Per-cpu work items which run for longer than this @@ -7913,7 +8514,8 @@ workqueue.default_affinity_scope= Select the default affinity scope to use for unbound workqueues. Can be one of "cpu", "smt", "cache", - "numa" and "system". Default is "cache". For more + "cache_shard", "numa" and "system". Default is + "cache_shard". For more information, see the Affinity Scopes section in Documentation/core-api/workqueue.rst. @@ -7950,6 +8552,11 @@ save/restore/migration must be enabled to handle larger domains. 
+ xen_console_io [XEN,EARLY] + Boolean option to enable/disable the usage of the Xen + console_io hypercalls to read and write to the console. + Mostly useful for debugging and development. + xen_emul_unplug= [HW,X86,XEN,EARLY] Unplug Xen emulated devices Format: [unplug0,][unplug1] diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index ea7fa2a8bbf0..ee9a6c94f383 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -278,12 +278,7 @@ To reduce its OS jitter, do any of the following: due to the rtas_event_scan() function. WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - e. If running on Cell Processor, build your kernel with - CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from - spu_gov_work(). - WARNING: Please check your CPU specifications to - make sure that this is safe on your particular system. - f. If running on PowerMAC, build your kernel with + e. If running on PowerMAC, build your kernel with CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, avoiding OS jitter from rackmeter_do_timer(). diff --git a/Documentation/admin-guide/laptops/alienware-wmi.rst b/Documentation/admin-guide/laptops/alienware-wmi.rst new file mode 100644 index 000000000000..e532c60db8e2 --- /dev/null +++ b/Documentation/admin-guide/laptops/alienware-wmi.rst @@ -0,0 +1,127 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Alienware WMI Driver +==================== + +Kurt Borja <kuurtb@gmail.com> + +This is a driver for the "WMAX" WMI device, which is found in most Dell gaming +laptops and controls various special features. + +Before the launch of M-Series laptops (~2018), the "WMAX" device controlled +basic RGB lighting, deep sleep mode, HDMI mode and amplifier status. + +Later, this device was completely repurposed. 
Now it mostly deals with thermal +profiles, sensor monitoring and overclocking. This interface is named "AWCC" and +is known to be used by the AWCC OEM application to control these features. + +The alienware-wmi driver controls both interfaces. + +AWCC Interface +============== + +WMI device documentation: Documentation/wmi/devices/alienware-wmi.rst + +Supported devices +----------------- + +- Alienware M-Series laptops +- Alienware X-Series laptops +- Alienware Aurora Desktops +- Dell G-Series laptops + +If you believe your device supports the AWCC interface and you don't have any of +the features described in this document, try the following alienware-wmi module +parameters: + +- ``force_platform_profile=1``: Forces probing for platform profile support +- ``force_hwmon=1``: Forces probing for HWMON support + +If the module loads successfully with these parameters, consider submitting a +patch adding your model to the ``awcc_dmi_table`` located in +``drivers/platform/x86/dell/alienware-wmi-wmax.c`` or contacting the maintainer +for further guidance. + +Status +------ + +The following features are currently supported: + +- :ref:`Platform Profile <platform-profile>`: + + - Thermal profile control + + - G-Mode toggling + +- :ref:`HWMON <hwmon>`: + + - Sensor monitoring + + - Manual fan control + +.. _platform-profile: + +Platform Profile +---------------- + +The AWCC interface exposes various firmware-defined thermal profiles. These are +exposed to user-space through the Platform Profile class interface. Refer to +:ref:`sysfs-class-platform-profile <abi_file_testing_sysfs_class_platform_profile>` +for more information. + +The name of the platform-profile class device exported by this driver is +"alienware-wmi" and its path can be found with: + +:: + + grep -l "alienware-wmi" /sys/class/platform-profile/platform-profile-*/name | sed 's|/[^/]*$||' + +If the device supports G-Mode, it is also toggled when selecting the +``performance`` profile. + +.. 
note:: + You may set the ``force_gmode`` module parameter to always try to toggle this + feature, without checking if your model supports it. + +.. _hwmon: + +HWMON +----- + +The AWCC interface also supports sensor monitoring and manual fan control. Both +of these features are exposed to user-space through the HWMON interface. + +The name of the hwmon class device exported by this driver is "alienware_wmi" +and its path can be found with: + +:: + + grep -l "alienware_wmi" /sys/class/hwmon/hwmon*/name | sed 's|/[^/]*$||' + +Sensor monitoring is done through the standard HWMON interface. Refer to +:ref:`sysfs-class-hwmon <abi_file_testing_sysfs_class_hwmon>` for more +information. + +Manual fan control, on the other hand, is not exposed directly by the AWCC +interface. Instead, it lets us control a fan `boost` value. This `boost` value +has the following approximate behavior over the fan pwm: + +:: + + pwm = pwm_base + (fan_boost / 255) * (pwm_max - pwm_base) + +Due to the above behavior, the fan `boost` control is exposed to user-space +through the following custom hwmon sysfs attribute: + +=============================== ======= ======================================= +Name Perm Description +=============================== ======= ======================================= +fan[1-4]_boost RW Fan boost value. + + Integer value between 0 and 255 +=============================== ======= ======================================= + +.. note:: + In some devices, manual fan control only works reliably if the ``custom`` + platform profile is selected. diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst index cd9a1c2695fd..c0b911d05c59 100644 --- a/Documentation/admin-guide/laptops/index.rst +++ b/Documentation/admin-guide/laptops/index.rst @@ -7,11 +7,13 @@ Laptop Drivers .. 
toctree:: :maxdepth: 1 + alienware-wmi asus-laptop disk-shock-protection - laptop-mode lg-laptop + samsung-galaxybook sony-laptop sonypi thinkpad-acpi toshiba_haps + uniwill-laptop diff --git a/Documentation/admin-guide/laptops/laptop-mode.rst b/Documentation/admin-guide/laptops/laptop-mode.rst deleted file mode 100644 index b61cc601d298..000000000000 --- a/Documentation/admin-guide/laptops/laptop-mode.rst +++ /dev/null @@ -1,770 +0,0 @@ -=============================================== -How to conserve battery power using laptop-mode -=============================================== - -Document Author: Bart Samwel (bart@samwel.tk) - -Date created: January 2, 2004 - -Last modified: December 06, 2004 - -Introduction ------------- - -Laptop mode is used to minimize the time that the hard disk needs to be spun up, -to conserve battery power on laptops. It has been reported to cause significant -power savings. - -.. Contents - - * Introduction - * Installation - * Caveats - * The Details - * Tips & Tricks - * Control script - * ACPI integration - * Monitoring tool - - -Installation ------------- - -To use laptop mode, you don't need to set any kernel configuration options -or anything. Simply install all the files included in this document, and -laptop mode will automatically be started when you're on battery. For -your convenience, a tarball containing an installer can be downloaded at: - - http://www.samwel.tk/laptop_mode/laptop_mode/ - -To configure laptop mode, you need to edit the configuration file, which is -located in /etc/default/laptop-mode on Debian-based systems, or in -/etc/sysconfig/laptop-mode on other systems. - -Unfortunately, automatic enabling of laptop mode does not work for -laptops that don't have ACPI. On those laptops, you need to start laptop -mode manually. To start laptop mode, run "laptop_mode start", and to -stop it, run "laptop_mode stop". 
(Note: The laptop mode tools package now -has experimental support for APM, you might want to try that first.) - - -Caveats -------- - -* The downside of laptop mode is that you have a chance of losing up to 10 - minutes of work. If you cannot afford this, don't use it! The supplied ACPI - scripts automatically turn off laptop mode when the battery almost runs out, - so that you won't lose any data at the end of your battery life. - -* Most desktop hard drives have a very limited lifetime measured in spindown - cycles, typically about 50.000 times (it's usually listed on the spec sheet). - Check your drive's rating, and don't wear down your drive's lifetime if you - don't need to. - -* If you mount some of your ext3/reiserfs filesystems with the -n option, then - the control script will not be able to remount them correctly. You must set - DO_REMOUNTS=0 in the control script, otherwise it will remount them with the - wrong options -- or it will fail because it cannot write to /etc/mtab. - -* If you have your filesystems listed as type "auto" in fstab, like I did, then - the control script will not recognize them as filesystems that need remounting. - You must list the filesystems with their true type instead. - -* It has been reported that some versions of the mutt mail client use file access - times to determine whether a folder contains new mail. If you use mutt and - experience this, you must disable the noatime remounting by setting the option - DO_REMOUNT_NOATIME to 0 in the configuration file. - - -The Details ------------ - -Laptop mode is controlled by the knob /proc/sys/vm/laptop_mode. This knob is -present for all kernels that have the laptop mode patch, regardless of any -configuration options. When the knob is set, any physical disk I/O (that might -have caused the hard disk to spin up) causes Linux to flush all dirty blocks. 
The -result of this is that after a disk has spun down, it will not be spun up -anymore to write dirty blocks, because those blocks had already been written -immediately after the most recent read operation. The value of the laptop_mode -knob determines the time between the occurrence of disk I/O and when the flush -is triggered. A sensible value for the knob is 5 seconds. Setting the knob to -0 disables laptop mode. - -To increase the effectiveness of the laptop_mode strategy, the laptop_mode -control script increases dirty_expire_centisecs and dirty_writeback_centisecs in -/proc/sys/vm to about 10 minutes (by default), which means that pages that are -dirtied are not forced to be written to disk as often. The control script also -changes the dirty background ratio, so that background writeback of dirty pages -is not done anymore. Combined with a higher commit value (also 10 minutes) for -ext3 or ReiserFS filesystems (also done automatically by the control script), -this results in concentration of disk activity in a small time interval which -occurs only once every 10 minutes, or whenever the disk is forced to spin up by -a cache miss. The disk can then be spun down in the periods of inactivity. - - -Configuration -------------- - -The laptop mode configuration file is located in /etc/default/laptop-mode on -Debian-based systems, or in /etc/sysconfig/laptop-mode on other systems. It -contains the following options: - -MAX_AGE: - -Maximum time, in seconds, of hard drive spindown time that you are -comfortable with. Worst case, it's possible that you could lose this -amount of work if your battery fails while you're in laptop mode. - -MINIMUM_BATTERY_MINUTES: - -Automatically disable laptop mode if the remaining number of minutes of -battery power is less than this value. Default is 10 minutes. - -AC_HD/BATT_HD: - -The idle timeout that should be set on your hard drive when laptop mode -is active (BATT_HD) and when it is not active (AC_HD). 
The defaults are -20 seconds (value 4) for BATT_HD and 2 hours (value 244) for AC_HD. The -possible values are those listed in the manual page for "hdparm" for the -"-S" option. - -HD: - -The devices for which the spindown timeout should be adjusted by laptop mode. -Default is /dev/hda. If you specify multiple devices, separate them by a space. - -READAHEAD: - -Disk readahead, in 512-byte sectors, while laptop mode is active. A large -readahead can prevent disk accesses for things like executable pages (which are -loaded on demand while the application executes) and sequentially accessed data -(MP3s). - -DO_REMOUNTS: - -The control script automatically remounts any mounted journaled filesystems -with appropriate commit interval options. When this option is set to 0, this -feature is disabled. - -DO_REMOUNT_NOATIME: - -When remounting, should the filesystems be remounted with the noatime option? -Normally, this is set to "1" (enabled), but there may be programs that require -access time recording. - -DIRTY_RATIO: - -The percentage of memory that is allowed to contain "dirty" or unsaved data -before a writeback is forced, while laptop mode is active. Corresponds to -the /proc/sys/vm/dirty_ratio sysctl. - -DIRTY_BACKGROUND_RATIO: - -The percentage of memory that is allowed to contain "dirty" or unsaved data -after a forced writeback is done due to an exceeding of DIRTY_RATIO. Set -this nice and low. This corresponds to the /proc/sys/vm/dirty_background_ratio -sysctl. - -Note that the behaviour of dirty_background_ratio is quite different -when laptop mode is active and when it isn't. When laptop mode is inactive, -dirty_background_ratio is the threshold percentage at which background writeouts -start taking place. When laptop mode is active, however, background writeouts -are disabled, and the dirty_background_ratio only determines how much writeback -is done when dirty_ratio is reached. - -DO_CPU: - -Enable CPU frequency scaling when in laptop mode. 
(Requires CPUFreq to be setup. -See Documentation/admin-guide/pm/cpufreq.rst for more info. Disabled by default.) - -CPU_MAXFREQ: - -When on battery, what is the maximum CPU speed that the system should use? Legal -values are "slowest" for the slowest speed that your CPU is able to operate at, -or a value listed in /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies. - - -Tips & Tricks -------------- - -* Bartek Kania reports getting up to 50 minutes of extra battery life (on top - of his regular 3 to 3.5 hours) using a spindown time of 5 seconds (BATT_HD=1). - -* You can spin down the disk while playing MP3, by setting disk readahead - to 8MB (READAHEAD=16384). Effectively, the disk will read a complete MP3 at - once, and will then spin down while the MP3 is playing. (Thanks to Bartek - Kania.) - -* Drew Scott Daniels observed: "I don't know why, but when I decrease the number - of colours that my display uses it consumes less battery power. I've seen - this on powerbooks too. I hope that this is a piece of information that - might be useful to the Laptop Mode patch or its users." - -* In syslog.conf, you can prefix entries with a dash `-` to omit syncing the - file after every logging. When you're using laptop-mode and your disk doesn't - spin down, this is a likely culprit. - -* Richard Atterer observed that laptop mode does not work well with noflushd - (http://noflushd.sourceforge.net/), it seems that noflushd prevents laptop-mode - from doing its thing. - -* If you're worried about your data, you might want to consider using a USB - memory stick or something like that as a "working area". (Be aware though - that flash memory can only handle a limited number of writes, and overuse - may wear out your memory stick pretty quickly. Do _not_ use journalling - filesystems on flash memory sticks.) 
- - -Configuration file for control and ACPI battery scripts -------------------------------------------------------- - -This allows the tunables to be changed for the scripts via an external -configuration file - -It should be installed as /etc/default/laptop-mode on Debian, and as -/etc/sysconfig/laptop-mode on Red Hat, SUSE, Mandrake, and other work-alikes. - -Config file:: - - # Maximum time, in seconds, of hard drive spindown time that you are - # comfortable with. Worst case, it's possible that you could lose this - # amount of work if your battery fails you while in laptop mode. - #MAX_AGE=600 - - # Automatically disable laptop mode when the number of minutes of battery - # that you have left goes below this threshold. - MINIMUM_BATTERY_MINUTES=10 - - # Read-ahead, in 512-byte sectors. You can spin down the disk while playing MP3/OGG - # by setting the disk readahead to 8MB (READAHEAD=16384). Effectively, the disk - # will read a complete MP3 at once, and will then spin down while the MP3/OGG is - # playing. - #READAHEAD=4096 - - # Shall we remount journaled fs. with appropriate commit interval? (1=yes) - #DO_REMOUNTS=1 - - # And shall we add the "noatime" option to that as well? (1=yes) - #DO_REMOUNT_NOATIME=1 - - # Dirty synchronous ratio. At this percentage of dirty pages the process - # which - # calls write() does its own writeback - #DIRTY_RATIO=40 - - # - # Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been - # exceeded, the kernel will wake flusher threads which will then reduce the - # amount of dirty memory to dirty_background_ratio. Set this nice and low, - # so once some writeout has commenced, we do a lot of it. 
- # - #DIRTY_BACKGROUND_RATIO=5 - - # kernel default dirty buffer age - #DEF_AGE=30 - #DEF_UPDATE=5 - #DEF_DIRTY_BACKGROUND_RATIO=10 - #DEF_DIRTY_RATIO=40 - #DEF_XFS_AGE_BUFFER=15 - #DEF_XFS_SYNC_INTERVAL=30 - #DEF_XFS_BUFD_INTERVAL=1 - - # This must be adjusted manually to the value of HZ in the running kernel - # on 2.4, until the XFS people change their 2.4 external interfaces to work in - # centisecs. This can be automated, but it's a work in progress that still - # needs# some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for - # external interfaces, and that is currently always set to 100. So you don't - # need to change this on 2.6. - #XFS_HZ=100 - - # Should the maximum CPU frequency be adjusted down while on battery? - # Requires CPUFreq to be setup. - # See Documentation/admin-guide/pm/cpufreq.rst for more info - #DO_CPU=0 - - # When on battery what is the maximum CPU speed that the system should - # use? Legal values are "slowest" for the slowest speed that your - # CPU is able to operate at, or a value listed in: - # /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies - # Only applicable if DO_CPU=1. - #CPU_MAXFREQ=slowest - - # Idle timeout for your hard drive (man hdparm for valid values, -S option) - # Default is 2 hours on AC (AC_HD=244) and 20 seconds for battery (BATT_HD=4). - #AC_HD=244 - #BATT_HD=4 - - # The drives for which to adjust the idle timeout. Separate them by a space, - # e.g. HD="/dev/hda /dev/hdb". - #HD="/dev/hda" - - # Set the spindown timeout on a hard drive? - #DO_HD=1 - - -Control script --------------- - -Please note that this control script works for the Linux 2.4 and 2.6 series (thanks -to Kiko Piris). 
- -Control script:: - - #!/bin/bash - - # start or stop laptop_mode, best run by a power management daemon when - # ac gets connected/disconnected from a laptop - # - # install as /sbin/laptop_mode - # - # Contributors to this script: Kiko Piris - # Bart Samwel - # Micha Feigin - # Andrew Morton - # Herve Eychenne - # Dax Kelson - # - # Original Linux 2.4 version by: Jens Axboe - - ############################################################################# - - # Source config - if [ -f /etc/default/laptop-mode ] ; then - # Debian - . /etc/default/laptop-mode - elif [ -f /etc/sysconfig/laptop-mode ] ; then - # Others - . /etc/sysconfig/laptop-mode - fi - - # Don't raise an error if the config file is incomplete - # set defaults instead: - - # Maximum time, in seconds, of hard drive spindown time that you are - # comfortable with. Worst case, it's possible that you could lose this - # amount of work if your battery fails you while in laptop mode. - MAX_AGE=${MAX_AGE:-'600'} - - # Read-ahead, in kilobytes - READAHEAD=${READAHEAD:-'4096'} - - # Shall we remount journaled fs. with appropriate commit interval? (1=yes) - DO_REMOUNTS=${DO_REMOUNTS:-'1'} - - # And shall we add the "noatime" option to that as well? (1=yes) - DO_REMOUNT_NOATIME=${DO_REMOUNT_NOATIME:-'1'} - - # Shall we adjust the idle timeout on a hard drive? - DO_HD=${DO_HD:-'1'} - - # Adjust idle timeout on which hard drive? - HD="${HD:-'/dev/hda'}" - - # spindown time for HD (hdparm -S values) - AC_HD=${AC_HD:-'244'} - BATT_HD=${BATT_HD:-'4'} - - # Dirty synchronous ratio. At this percentage of dirty pages the process which - # calls write() does its own writeback - DIRTY_RATIO=${DIRTY_RATIO:-'40'} - - # cpu frequency scaling - # See Documentation/admin-guide/pm/cpufreq.rst for more info - DO_CPU=${CPU_MANAGE:-'0'} - CPU_MAXFREQ=${CPU_MAXFREQ:-'slowest'} - - # - # Allowed dirty background ratio, in percent. 
Once DIRTY_RATIO has been - # exceeded, the kernel will wake flusher threads which will then reduce the - # amount of dirty memory to dirty_background_ratio. Set this nice and low, - # so once some writeout has commenced, we do a lot of it. - # - DIRTY_BACKGROUND_RATIO=${DIRTY_BACKGROUND_RATIO:-'5'} - - # kernel default dirty buffer age - DEF_AGE=${DEF_AGE:-'30'} - DEF_UPDATE=${DEF_UPDATE:-'5'} - DEF_DIRTY_BACKGROUND_RATIO=${DEF_DIRTY_BACKGROUND_RATIO:-'10'} - DEF_DIRTY_RATIO=${DEF_DIRTY_RATIO:-'40'} - DEF_XFS_AGE_BUFFER=${DEF_XFS_AGE_BUFFER:-'15'} - DEF_XFS_SYNC_INTERVAL=${DEF_XFS_SYNC_INTERVAL:-'30'} - DEF_XFS_BUFD_INTERVAL=${DEF_XFS_BUFD_INTERVAL:-'1'} - - # This must be adjusted manually to the value of HZ in the running kernel - # on 2.4, until the XFS people change their 2.4 external interfaces to work in - # centisecs. This can be automated, but it's a work in progress that still needs - # some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for external - # interfaces, and that is currently always set to 100. So you don't need to - # change this on 2.6. - XFS_HZ=${XFS_HZ:-'100'} - - ############################################################################# - - KLEVEL="$(uname -r | - { - IFS='.' read a b c - echo $a.$b - } - )" - case "$KLEVEL" in - "2.4"|"2.6") - ;; - *) - echo "Unhandled kernel version: $KLEVEL ('uname -r' = '$(uname -r)')" >&2 - exit 1 - ;; - esac - - if [ ! -e /proc/sys/vm/laptop_mode ] ; then - echo "Kernel is not patched with laptop_mode patch." >&2 - exit 1 - fi - - if [ ! -w /proc/sys/vm/laptop_mode ] ; then - echo "You do not have enough privileges to enable laptop_mode." >&2 - exit 1 - fi - - # Remove an option (the first parameter) of the form option=<number> from - # a mount options string (the rest of the parameters). 
- parse_mount_opts () { - OPT="$1" - shift - echo ",$*," | sed \ - -e 's/,'"$OPT"'=[0-9]*,/,/g' \ - -e 's/,,*/,/g' \ - -e 's/^,//' \ - -e 's/,$//' - } - - # Remove an option (the first parameter) without any arguments from - # a mount option string (the rest of the parameters). - parse_nonumber_mount_opts () { - OPT="$1" - shift - echo ",$*," | sed \ - -e 's/,'"$OPT"',/,/g' \ - -e 's/,,*/,/g' \ - -e 's/^,//' \ - -e 's/,$//' - } - - # Find out the state of a yes/no option (e.g. "atime"/"noatime") in - # fstab for a given filesystem, and use this state to replace the - # value of the option in another mount options string. The device - # is the first argument, the option name the second, and the default - # value the third. The remainder is the mount options string. - # - # Example: - # parse_yesno_opts_wfstab /dev/hda1 atime atime defaults,noatime - # - # If fstab contains, say, "rw" for this filesystem, then the result - # will be "defaults,atime". - parse_yesno_opts_wfstab () { - L_DEV="$1" - OPT="$2" - DEF_OPT="$3" - shift 3 - L_OPTS="$*" - PARSEDOPTS1="$(parse_nonumber_mount_opts $OPT $L_OPTS)" - PARSEDOPTS1="$(parse_nonumber_mount_opts no$OPT $PARSEDOPTS1)" - # Watch for a default atime in fstab - FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)" - if echo "$FSTAB_OPTS" | grep "$OPT" > /dev/null ; then - # option specified in fstab: extract the value and use it - if echo "$FSTAB_OPTS" | grep "no$OPT" > /dev/null ; then - echo "$PARSEDOPTS1,no$OPT" - else - # no$OPT not found -- so we must have $OPT. - echo "$PARSEDOPTS1,$OPT" - fi - else - # option not specified in fstab -- choose the default. - echo "$PARSEDOPTS1,$DEF_OPT" - fi - } - - # Find out the state of a numbered option (e.g. "commit=NNN") in - # fstab for a given filesystem, and use this state to replace the - # value of the option in another mount options string. The device - # is the first argument, and the option name the second. 
The - # remainder is the mount options string in which the replacement - # must be done. - # - # Example: - # parse_mount_opts_wfstab /dev/hda1 commit defaults,commit=7 - # - # If fstab contains, say, "commit=3,rw" for this filesystem, then the - # result will be "rw,commit=3". - parse_mount_opts_wfstab () { - L_DEV="$1" - OPT="$2" - shift 2 - L_OPTS="$*" - PARSEDOPTS1="$(parse_mount_opts $OPT $L_OPTS)" - # Watch for a default commit in fstab - FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)" - if echo "$FSTAB_OPTS" | grep "$OPT=" > /dev/null ; then - # option specified in fstab: extract the value, and use it - echo -n "$PARSEDOPTS1,$OPT=" - echo ",$FSTAB_OPTS," | sed \ - -e 's/.*,'"$OPT"'=//' \ - -e 's/,.*//' - else - # option not specified in fstab: set it to 0 - echo "$PARSEDOPTS1,$OPT=0" - fi - } - - deduce_fstype () { - MP="$1" - # My root filesystem unfortunately has - # type "unknown" in /etc/mtab. If we encounter - # "unknown", we try to get the type from fstab. - cat /etc/fstab | - grep -v '^#' | - while read FSTAB_DEV FSTAB_MP FSTAB_FST FSTAB_OPTS FSTAB_DUMP FSTAB_DUMP ; do - if [ "$FSTAB_MP" = "$MP" ]; then - echo $FSTAB_FST - exit 0 - fi - done - } - - if [ $DO_REMOUNT_NOATIME -eq 1 ] ; then - NOATIME_OPT=",noatime" - fi - - case "$1" in - start) - AGE=$((100*$MAX_AGE)) - XFS_AGE=$(($XFS_HZ*$MAX_AGE)) - echo -n "Starting laptop_mode" - - if [ -d /proc/sys/vm/pagebuf ] ; then - # (For 2.4 and early 2.6.) - # This only needs to be set, not reset -- it is only used when - # laptop mode is enabled. - echo $XFS_AGE > /proc/sys/vm/pagebuf/lm_flush_age - echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval - elif [ -f /proc/sys/fs/xfs/lm_age_buffer ] ; then - # (A couple of early 2.6 laptop mode patches had these.) - # The same goes for these. 
- echo $XFS_AGE > /proc/sys/fs/xfs/lm_age_buffer - echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval - elif [ -f /proc/sys/fs/xfs/age_buffer ] ; then - # (2.6.6) - # But not for these -- they are also used in normal - # operation. - echo $XFS_AGE > /proc/sys/fs/xfs/age_buffer - echo $XFS_AGE > /proc/sys/fs/xfs/sync_interval - elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then - # (2.6.7 upwards) - # And not for these either. These are in centisecs, - # not USER_HZ, so we have to use $AGE, not $XFS_AGE. - echo $AGE > /proc/sys/fs/xfs/age_buffer_centisecs - echo $AGE > /proc/sys/fs/xfs/xfssyncd_centisecs - echo 3000 > /proc/sys/fs/xfs/xfsbufd_centisecs - fi - - case "$KLEVEL" in - "2.4") - echo 1 > /proc/sys/vm/laptop_mode - echo "30 500 0 0 $AGE $AGE 60 20 0" > /proc/sys/vm/bdflush - ;; - "2.6") - echo 5 > /proc/sys/vm/laptop_mode - echo "$AGE" > /proc/sys/vm/dirty_writeback_centisecs - echo "$AGE" > /proc/sys/vm/dirty_expire_centisecs - echo "$DIRTY_RATIO" > /proc/sys/vm/dirty_ratio - echo "$DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio - ;; - esac - if [ $DO_REMOUNTS -eq 1 ]; then - cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do - PARSEDOPTS="$(parse_mount_opts "$OPTS")" - if [ "$FST" = 'unknown' ]; then - FST=$(deduce_fstype $MP) - fi - case "$FST" in - "ext3"|"reiserfs") - PARSEDOPTS="$(parse_mount_opts commit "$OPTS")" - mount $DEV -t $FST $MP -o remount,$PARSEDOPTS,commit=$MAX_AGE$NOATIME_OPT - ;; - "xfs") - mount $DEV -t $FST $MP -o remount,$OPTS$NOATIME_OPT - ;; - esac - if [ -b $DEV ] ; then - blockdev --setra $(($READAHEAD * 2)) $DEV - fi - done - fi - if [ $DO_HD -eq 1 ] ; then - for THISHD in $HD ; do - /sbin/hdparm -S $BATT_HD $THISHD > /dev/null 2>&1 - /sbin/hdparm -B 1 $THISHD > /dev/null 2>&1 - done - fi - if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then - if [ $CPU_MAXFREQ = 'slowest' ]; then - CPU_MAXFREQ=`cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq` - fi - echo 
$CPU_MAXFREQ > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq - fi - echo "." - ;; - stop) - U_AGE=$((100*$DEF_UPDATE)) - B_AGE=$((100*$DEF_AGE)) - echo -n "Stopping laptop_mode" - echo 0 > /proc/sys/vm/laptop_mode - if [ -f /proc/sys/fs/xfs/age_buffer -a ! -f /proc/sys/fs/xfs/lm_age_buffer ] ; then - # These need to be restored, if there are no lm_*. - echo $(($XFS_HZ*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer - echo $(($XFS_HZ*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/sync_interval - elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then - # These need to be restored as well. - echo $((100*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer_centisecs - echo $((100*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/xfssyncd_centisecs - echo $((100*$DEF_XFS_BUFD_INTERVAL)) > /proc/sys/fs/xfs/xfsbufd_centisecs - fi - case "$KLEVEL" in - "2.4") - echo "30 500 0 0 $U_AGE $B_AGE 60 20 0" > /proc/sys/vm/bdflush - ;; - "2.6") - echo "$U_AGE" > /proc/sys/vm/dirty_writeback_centisecs - echo "$B_AGE" > /proc/sys/vm/dirty_expire_centisecs - echo "$DEF_DIRTY_RATIO" > /proc/sys/vm/dirty_ratio - echo "$DEF_DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio - ;; - esac - if [ $DO_REMOUNTS -eq 1 ] ; then - cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do - # Reset commit and atime options to defaults. 
- if [ "$FST" = 'unknown' ]; then - FST=$(deduce_fstype $MP) - fi - case "$FST" in - "ext3"|"reiserfs") - PARSEDOPTS="$(parse_mount_opts_wfstab $DEV commit $OPTS)" - PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $PARSEDOPTS)" - mount $DEV -t $FST $MP -o remount,$PARSEDOPTS - ;; - "xfs") - PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $OPTS)" - mount $DEV -t $FST $MP -o remount,$PARSEDOPTS - ;; - esac - if [ -b $DEV ] ; then - blockdev --setra 256 $DEV - fi - done - fi - if [ $DO_HD -eq 1 ] ; then - for THISHD in $HD ; do - /sbin/hdparm -S $AC_HD $THISHD > /dev/null 2>&1 - /sbin/hdparm -B 255 $THISHD > /dev/null 2>&1 - done - fi - if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then - echo `cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq` > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq - fi - echo "." - ;; - *) - echo "Usage: $0 {start|stop}" 2>&1 - exit 1 - ;; - - esac - - exit 0 - - -ACPI integration ----------------- - -Dax Kelson submitted this so that the ACPI acpid daemon will -kick off the laptop_mode script and run hdparm. The part that -automatically disables laptop mode when the battery is low was -written by Jan Topinski. - -/etc/acpi/events/ac_adapter:: - - event=ac_adapter - action=/etc/acpi/actions/ac.sh %e - -/etc/acpi/events/battery:: - - event=battery.* - action=/etc/acpi/actions/battery.sh %e - -/etc/acpi/actions/ac.sh:: - - #!/bin/bash - - # ac on/offline event handler - - status=`awk '/^state: / { print $2 }' /proc/acpi/ac_adapter/$2/state` - - case $status in - "on-line") - /sbin/laptop_mode stop - exit 0 - ;; - "off-line") - /sbin/laptop_mode start - exit 0 - ;; - esac - - -/etc/acpi/actions/battery.sh:: - - #! /bin/bash - - # Automatically disable laptop mode when the battery almost runs out. 
- - BATT_INFO=/proc/acpi/battery/$2/state - - if [[ -f /proc/sys/vm/laptop_mode ]] - then - LM=`cat /proc/sys/vm/laptop_mode` - if [[ $LM -gt 0 ]] - then - if [[ -f $BATT_INFO ]] - then - # Source the config file only now that we know we need - if [ -f /etc/default/laptop-mode ] ; then - # Debian - . /etc/default/laptop-mode - elif [ -f /etc/sysconfig/laptop-mode ] ; then - # Others - . /etc/sysconfig/laptop-mode - fi - MINIMUM_BATTERY_MINUTES=${MINIMUM_BATTERY_MINUTES:-'10'} - - ACTION="`cat $BATT_INFO | grep charging | cut -c 26-`" - if [[ ACTION -eq "discharging" ]] - then - PRESENT_RATE=`cat $BATT_INFO | grep "present rate:" | sed "s/.* \([0-9][0-9]* \).*/\1/" ` - REMAINING=`cat $BATT_INFO | grep "remaining capacity:" | sed "s/.* \([0-9][0-9]* \).*/\1/" ` - fi - if (($REMAINING * 60 / $PRESENT_RATE < $MINIMUM_BATTERY_MINUTES)) - then - /sbin/laptop_mode stop - fi - else - logger -p daemon.warning "You are using laptop mode and your battery interface $BATT_INFO is missing. This may lead to loss of data when the battery runs out. Check kernel ACPI support and /proc/acpi/battery folder, and edit /etc/acpi/battery.sh to set BATT_INFO to the correct path." - fi - fi - fi - - -Monitoring tool ---------------- - -Bartek Kania submitted this, it can be used to measure how much time your disk -spends spun up/down. See tools/laptop/dslm/dslm.c diff --git a/Documentation/admin-guide/laptops/lg-laptop.rst b/Documentation/admin-guide/laptops/lg-laptop.rst index 67fd6932cef4..c4dd534f91ed 100644 --- a/Documentation/admin-guide/laptops/lg-laptop.rst +++ b/Documentation/admin-guide/laptops/lg-laptop.rst @@ -48,8 +48,8 @@ This value is reset to 100 when the kernel boots. Fan mode -------- -Writing 1/0 to /sys/devices/platform/lg-laptop/fan_mode disables/enables -the fan silent mode. +Writing 0/1/2 to /sys/devices/platform/lg-laptop/fan_mode sets fan mode to +Optimal/Silent/Performance respectively. 
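For the lg-laptop ``fan_mode`` attribute described just above, the value mapping can be sketched as a small shell helper. This is only an illustration: ``lg_fan_mode`` and its optional directory argument are hypothetical names introduced here, while the sysfs path and the 0/1/2 values are taken from the text.

```shell
#!/bin/sh
# Sketch: set the lg-laptop fan mode by name.
# lg_fan_mode and its optional directory argument are illustrative helpers,
# not part of the driver; the sysfs path and the 0/1/2 values come from the
# documentation text.
lg_fan_mode() {
    mode="$1"
    dir="${2:-/sys/devices/platform/lg-laptop}"
    case "$mode" in
        optimal)     value=0 ;;
        silent)      value=1 ;;
        performance) value=2 ;;
        *)
            echo "usage: lg_fan_mode optimal|silent|performance [sysfs-dir]" >&2
            return 2
            ;;
    esac
    # Writing the real attribute normally requires root privileges.
    echo "$value" > "$dir/fan_mode"
}
```

Calling ``lg_fan_mode silent`` as root would then write 1 to the attribute; the directory argument exists only so the helper can be tried against a scratch directory.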
USB charge diff --git a/Documentation/admin-guide/laptops/samsung-galaxybook.rst b/Documentation/admin-guide/laptops/samsung-galaxybook.rst new file mode 100644 index 000000000000..752b8f1a4a74 --- /dev/null +++ b/Documentation/admin-guide/laptops/samsung-galaxybook.rst @@ -0,0 +1,174 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +========================== +Samsung Galaxy Book Driver +========================== + +Joshua Grisham <josh@joshuagrisham.com> + +This is a Linux x86 platform driver for Samsung Galaxy Book series notebook +devices which utilizes Samsung's ``SCAI`` ACPI device in order to control +extra features and receive various notifications. + +Supported devices +================= + +Any device with one of the supported ACPI device IDs should be supported. This +covers most of the "Samsung Galaxy Book" series notebooks that are currently +available as of this writing, and could include other Samsung notebook devices +as well. + +Status +====== + +The following features are currently supported: + +- :ref:`Keyboard backlight <keyboard-backlight>` control +- :ref:`Performance mode <performance-mode>` control implemented using the + platform profile interface +- :ref:`Battery charge control end threshold + <battery-charge-control-end-threshold>` (stop charging battery at given + percentage value) implemented as a battery hook +- :ref:`Firmware Attributes <firmware-attributes>` to allow control of various + device settings +- :ref:`Handling of Fn hotkeys <keyboard-hotkey-actions>` for various actions +- :ref:`Handling of ACPI notifications and hotkeys + <acpi-notifications-and-hotkey-actions>` + +Because different models of these devices can vary in their features, there is +logic built within the driver which attempts to test each implemented feature +for a valid response before enabling its support (registering additional devices +or extensions, adding sysfs attributes, etc). 
Therefore, it can be important to +note that not all features may be supported for your particular device. + +The following features might be possible to implement but will require +additional investigation and are therefore not supported at this time: + +- "Dolby Atmos" mode for the speakers +- "Outdoor Mode" for increasing screen brightness on models with ``SAM0427`` +- "Silent Mode" on models with ``SAM0427`` + +.. _keyboard-backlight: + +Keyboard backlight +================== + +A new LED class named ``samsung-galaxybook::kbd_backlight`` is created which +will then expose the device using the standard sysfs-based LED interface at +``/sys/class/leds/samsung-galaxybook::kbd_backlight``. Brightness can be +controlled by writing the desired value to the ``brightness`` sysfs attribute or +with any other desired userspace utility. + +.. note:: + Most of these devices have an ambient light sensor which also turns + off the keyboard backlight under well-lit conditions. This behavior does not + seem possible to control at this time, but can be good to be aware of. + +.. _performance-mode: + +Performance mode +================ + +This driver implements the +Documentation/userspace-api/sysfs-platform_profile.rst interface for working +with the "performance mode" function of the Samsung ACPI device. + +Mapping of each Samsung "performance mode" to its respective platform profile is +performed dynamically by the driver, as not all models support all of the same +performance modes. Your device might have one or more of the following mappings: + +- "Silent" maps to ``low-power`` +- "Quiet" maps to ``quiet`` +- "Optimized" maps to ``balanced`` +- "High performance" maps to ``performance`` + +The result of the mapping can be printed in the kernel log when the module is +loaded. 
Supported profiles can also be retrieved from +``/sys/firmware/acpi/platform_profile_choices``, while +``/sys/firmware/acpi/platform_profile`` can be used to read or write the +currently selected profile. + +The ``balanced`` platform profile will be set during module load if no profile +has been previously set. + +.. _battery-charge-control-end-threshold: + +Battery charge control end threshold +==================================== + +This platform driver will add the ability to set the battery's charge control +end threshold, but does not have the ability to set a start threshold. + +This feature is typically called "Battery Saver" by the various Samsung +applications in Windows, but in Linux we have implemented the standardized +"charge control threshold" sysfs interface on the battery device to allow for +controlling this functionality from the userspace. + +The sysfs attribute +``/sys/class/power_supply/BAT1/charge_control_end_threshold`` can be used to +read or set the desired charge end threshold. + +If you wish to maintain interoperability with the Samsung Settings application +in Windows, then you should set the value to 100 to represent "off", or enable +the feature using only one of the following values: 50, 60, 70, 80, or 90. +Otherwise, the driver will accept any value between 1 and 100 as the percentage +that you wish the battery to stop charging at. + +.. note:: + Some devices have been observed as automatically "turning off" the charge + control end threshold if an input value of less than 30 is given. + +.. 
_firmware-attributes: + +Firmware Attributes +=================== + +The following enumeration-typed firmware attributes are set up by this driver +and should be accessible under +``/sys/class/firmware-attributes/samsung-galaxybook/attributes/`` if your device +supports them: + +- ``power_on_lid_open`` (device should power on when the lid is opened) +- ``usb_charging`` (USB ports can deliver power to connected devices even when + the device is powered off or in a low sleep state) +- ``block_recording`` (blocks access to camera and microphone) + +All of these attributes are simple boolean-like enumeration values which use 0 +to represent "off" and 1 to represent "on". Use the ``current_value`` attribute +to get or change the setting on the device. + +Note that when ``block_recording`` is updated, the input device "Samsung Galaxy +Book Lens Cover" will receive a ``SW_CAMERA_LENS_COVER`` switch event which +reflects the current state. + +.. _keyboard-hotkey-actions: + +Keyboard hotkey actions (i8042 filter) +====================================== + +The i8042 filter will swallow the keyboard events for the Fn+F9 hotkey (Multi- +level keyboard backlight toggle) and Fn+F10 hotkey (Block recording toggle) +and instead execute their actions within the driver itself. + +Fn+F9 will cycle through the brightness levels of the keyboard backlight. A +notification will be sent using ``led_classdev_notify_brightness_hw_changed`` +so that the userspace can be aware of the change. This mimics the behavior of +other existing devices where the brightness level is cycled internally by the +embedded controller and then reported via a notification. + +Fn+F10 will toggle the value of the "block recording" setting, which blocks +or allows usage of the built-in camera and microphone (and generates the same +Lens Cover switch event mentioned above). + +.. 
_acpi-notifications-and-hotkey-actions: + +ACPI notifications and hotkey actions +===================================== + +ACPI notifications will generate ACPI netlink events under the device class +``samsung-galaxybook`` and bus ID matching the Samsung ACPI device ID found on +your device. The events can be received using userspace tools such as +``acpi_listen`` and ``acpid``. + +The Fn+F11 Performance mode hotkey will be handled by the driver; each keypress +will cycle to the next available platform profile. diff --git a/Documentation/admin-guide/laptops/sonypi.rst b/Documentation/admin-guide/laptops/sonypi.rst index 190da1234314..7541f56e0007 100644 --- a/Documentation/admin-guide/laptops/sonypi.rst +++ b/Documentation/admin-guide/laptops/sonypi.rst @@ -25,7 +25,7 @@ generate, like: (when available) Those events (see linux/sonypi.h) can be polled using the character device node -/dev/sonypi (major 10, minor auto allocated or specified as a option). +/dev/sonypi (major 10, minor auto allocated or specified as an option). A simple daemon which translates the jogdial movements into mouse wheel events can be downloaded at: <http://popies.net/sonypi/> diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index 4ab0fef7d440..f874db31801d 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -54,6 +54,7 @@ detailed description): - Setting keyboard language - WWAN Antenna type - Auxmac + - Hardware damage detection capability A compatibility table by model and feature is maintained on the web site, http://ibm-acpi.sf.net/. I appreciate any success or failure @@ -1521,6 +1522,27 @@ Currently 2 antenna types are supported as mentioned below: The property is read-only. If the platform doesn't have support the sysfs class is not created. 
+doubletap_enable +---------------- + +sysfs: doubletap_enable + +Controls whether TrackPoint doubletap events are filtered out. Doubletap is a +feature where quickly tapping the TrackPoint twice triggers a special function key event. + +The available commands are:: + + cat /sys/devices/platform/thinkpad_acpi/doubletap_enable + echo 1 | sudo tee /sys/devices/platform/thinkpad_acpi/doubletap_enable + echo 0 | sudo tee /sys/devices/platform/thinkpad_acpi/doubletap_enable + +Values: + + * 1 - doubletap events are processed (default) + * 0 - doubletap events are filtered out (ignored) + + This setting can also be toggled via the Fn+doubletap hotkey. + Auxmac ------ @@ -1576,6 +1598,42 @@ percentage level, above which charging will stop. The exact semantics of the attributes may be found in Documentation/ABI/testing/sysfs-class-power. +Hardware damage detection capability +------------------------------------ + +sysfs attributes: hwdd_status, hwdd_detail + +ThinkPads can detect and report hardware damage. +A sysfs interface exposes the status of the damaged device. +Initial support is available for the USB-C replaceable connector. + +The command to check the damage status is:: + + cat /sys/devices/platform/thinkpad_acpi/hwdd_status + +This value shows whether the device is damaged. + +- 0 = Not Damaged +- 1 = Damaged + +The command to check the location of the damaged device is:: + + cat /sys/devices/platform/thinkpad_acpi/hwdd_detail + +This value shows the location of the damaged device, one line per damaged "item". +For example: + +If no damage is detected: + +- No damage detected + +If damage is detected: + +- TYPE-C: Base, Right side, Center port + +The properties are read-only. If the feature is not supported, the sysfs +attributes are not created. 
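The two hardware damage attributes above could be combined in a small check script along the following lines. This is a minimal sketch: ``report_hwdd`` and its directory argument are hypothetical names, and the exact output format of the attributes is assumed to match the examples in the text.

```shell
#!/bin/sh
# Sketch: report ThinkPad hardware damage status from sysfs.
# report_hwdd and its optional directory argument are illustrative helpers;
# the attribute names and the 0/1 meaning come from the documentation text.
report_hwdd() {
    dir="${1:-/sys/devices/platform/thinkpad_acpi}"
    # The attributes are only created when the feature is supported.
    if [ ! -r "$dir/hwdd_status" ]; then
        echo "hwdd not supported"
        return 2
    fi
    status="$(cat "$dir/hwdd_status")"
    if [ "$status" = "1" ]; then
        echo "damaged:"
        # hwdd_detail is assumed to hold one line per damaged item.
        sed 's/^/  /' "$dir/hwdd_detail"
        return 1
    fi
    echo "not damaged"
    return 0
}
```

The return codes (0 for no damage, 1 for damage, 2 for unsupported) are a design choice of this sketch, so that the helper can be used in a monitoring script.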
+ Multiple Commands, Module Parameters ------------------------------------ diff --git a/Documentation/admin-guide/laptops/toshiba_haps.rst b/Documentation/admin-guide/laptops/toshiba_haps.rst index d28b6c3f2849..0226225b82e1 100644 --- a/Documentation/admin-guide/laptops/toshiba_haps.rst +++ b/Documentation/admin-guide/laptops/toshiba_haps.rst @@ -43,7 +43,7 @@ RSSS Shuts down the HDD protection interface for a few seconds, ==== ===================================================================== Note: - The presence of Solid State Drives (SSD) can make this driver to fail loading, + The presence of Solid State Drives (SSD) can cause this driver to fail loading, given the fact that such drives have no movable parts, and thus, not requiring any "protection" as well as failing during the evaluation of the _STA method found under this device. diff --git a/Documentation/admin-guide/laptops/uniwill-laptop.rst b/Documentation/admin-guide/laptops/uniwill-laptop.rst new file mode 100644 index 000000000000..1f3ca84c7d88 --- /dev/null +++ b/Documentation/admin-guide/laptops/uniwill-laptop.rst @@ -0,0 +1,82 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +Uniwill laptop extra features +============================= + +On laptops manufactured by Uniwill (either directly or as ODM), the ``uniwill-laptop`` driver +handles various platform-specific features. + +Module Loading +-------------- + +The ``uniwill-laptop`` driver relies on a DMI table to automatically load on supported devices. +When using the ``force`` module parameter, this DMI check will be omitted, allowing the driver +to be loaded on unsupported devices for testing purposes. + +Hotkeys +------- + +Usually the FN keys work without a special driver. However as soon as the ``uniwill-laptop`` driver +is loaded, the FN keys need to be handled manually. This is done automatically by the driver itself. 
+ +Keyboard settings +----------------- + +The ``uniwill-laptop`` driver allows the user to enable/disable: + + - the FN lock and super key of the integrated keyboard + - the touchpad toggle functionality of the integrated touchpad + +See Documentation/ABI/testing/sysfs-driver-uniwill-laptop for details. + +Hwmon interface +--------------- + +The ``uniwill-laptop`` driver supports reading of the CPU and GPU temperature and supports up to +two fans. Userspace applications can access sensor readings over the hwmon sysfs interface. + +Platform profile +---------------- + +Support for changing the platform performance mode is currently not implemented. + +Battery Charging Control +------------------------ + +.. warning:: Some devices do not properly implement the charging threshold interface. Forcing + the driver to enable access to said interface on such devices might damage the + battery [1]_. Because of this the driver will not enable said feature even when + using the ``force`` module parameter. + +The ``uniwill-laptop`` driver supports controlling the battery charge limit. This happens over +the standard ``charge_control_end_threshold`` power supply sysfs attribute. All values +between 1 and 100 percent are supported. + +Additionally the driver signals the presence of battery charging issues through the standard +``health`` power supply sysfs attribute. + +It also lets you set whether a USB-C power source should prioritise charging the battery or +delivering immediate power to the cpu. See Documentation/ABI/testing/sysfs-driver-uniwill-laptop for +details. + +Lightbar +-------- + +The ``uniwill-laptop`` driver exposes the lightbar found on some models as a standard multicolor +LED class device. The default name of this LED class device is ``uniwill:multicolor:status``. + +See Documentation/ABI/testing/sysfs-driver-uniwill-laptop for details on how to control the various +animation modes of the lightbar. 
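As a rough illustration of driving the lightbar through the multicolor LED class, the sketch below writes per-channel intensities to the LED named above. ``lightbar_set`` and its directory argument are hypothetical; ``multi_index`` and ``multi_intensity`` are the generic multicolor LED class attributes, and the channel order on a given model is an assumption that should be read back from ``multi_index``.

```shell
#!/bin/sh
# Sketch: set the lightbar color via the multicolor LED class.
# lightbar_set and its directory argument are illustrative helpers.
# multi_index and multi_intensity are generic multicolor LED class
# attributes; whether a given model exposes the lightbar is device-specific.
lightbar_set() {
    intensities="$1"
    dir="${2:-/sys/class/leds/uniwill:multicolor:status}"
    if [ ! -r "$dir/multi_index" ]; then
        echo "no lightbar LED at $dir" >&2
        return 1
    fi
    # multi_index lists the channel order expected by multi_intensity.
    echo "channel order: $(cat "$dir/multi_index")"
    echo "$intensities" > "$dir/multi_intensity"
}
```

For example, ``lightbar_set "255 0 0"`` would request full intensity on the first listed channel; the overall ``brightness`` attribute of the LED still scales the result.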
+ +Configurable TGP +---------------- + +The ``uniwill-laptop`` driver allows to set the configurable TGP for devices with NVIDIA GPUs that +allow it. + +See Documentation/ABI/testing/sysfs-driver-uniwill-laptop for details. + +References +========== + +.. [1] https://www.reddit.com/r/XMG_gg/comments/ld9yyf/battery_limit_hidden_function_discovered_on/ diff --git a/Documentation/admin-guide/lockup-watchdogs.rst b/Documentation/admin-guide/lockup-watchdogs.rst index 3e09284a8b9b..8f245f4a95b7 100644 --- a/Documentation/admin-guide/lockup-watchdogs.rst +++ b/Documentation/admin-guide/lockup-watchdogs.rst @@ -16,7 +16,7 @@ details), and a compile option, "BOOTPARAM_SOFTLOCKUP_PANIC", are provided for this. A 'hardlockup' is defined as a bug that causes the CPU to loop in -kernel mode for more than 10 seconds (see "Implementation" below for +kernel mode for several seconds (see "Implementation" below for details), without letting other interrupts have a chance to run. Similarly to the softlockup case, the current stack trace is displayed upon detection and the system will stay locked up unless the default @@ -30,39 +30,135 @@ timeout is set through the confusingly named "kernel.panic" sysctl), to cause the system to reboot automatically after a specified amount of time. +Configuration +============= + +A kernel knob is provided that allows administrators to configure +this period. The "watchdog_thresh" parameter (default 10 seconds) +controls the threshold. The right value for a particular environment +is a trade-off between fast response to lockups and detection overhead. + Implementation ============== -The soft and hard lockup detectors are built on top of the hrtimer and -perf subsystems, respectively. A direct consequence of this is that, -in principle, they should work in any architecture where these -subsystems are present. - -A periodic hrtimer runs to generate interrupts and kick the watchdog -job. 
An NMI perf event is generated every "watchdog_thresh" -(compile-time initialized to 10 and configurable through sysctl of the -same name) seconds to check for hardlockups. If any CPU in the system -does not receive any hrtimer interrupt during that time the -'hardlockup detector' (the handler for the NMI perf event) will -generate a kernel warning or call panic, depending on the -configuration. - -The watchdog job runs in a stop scheduling thread that updates a -timestamp every time it is scheduled. If that timestamp is not updated -for 2*watchdog_thresh seconds (the softlockup threshold) the +The soft and hard lockup detectors are built around an hrtimer. +In addition, the softlockup detector regularly schedules a job, and +the hard lockup detector might use Perf/NMI events on architectures +that support it. + +Frequency and Heartbeats +------------------------ + +The core of the detectors is an hrtimer. It serves multiple purposes: + +- schedules watchdog job for the softlockup detector +- bumps the interrupt counter for hardlockup detectors (heartbeat) +- detects softlockups +- detects hardlockups in Buddy mode + +The period of this hrtimer is 2*watchdog_thresh/5, which is 4 seconds +by default. The hrtimer has two or three chances to generate an interrupt +(heartbeat) before the hardlockup detector kicks in. + +Softlockup Detector +------------------- + +The watchdog job is scheduled by the hrtimer and runs in a stop scheduling +thread. It updates a timestamp every time it is scheduled. If that timestamp +is not updated for 2*watchdog_thresh seconds (the softlockup threshold) the 'softlockup detector' (coded inside the hrtimer callback function) will dump useful debug information to the system log, after which it will call panic if it was instructed to do so or resume execution of other kernel code. -The period of the hrtimer is 2*watchdog_thresh/5, which means it has -two or three chances to generate an interrupt before the hardlockup -detector kicks in. 
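The timing relationships described above can be sketched as a small calculation. ``watchdog_timings`` is a hypothetical helper, not a kernel interface; it only restates the formulas from the text (hrtimer period of 2*watchdog_thresh/5, softlockup threshold of 2*watchdog_thresh, NMI check every watchdog_thresh seconds).

```shell
#!/bin/sh
# Sketch: derive the lockup detector timings from watchdog_thresh (seconds).
# watchdog_timings is an illustrative helper, not a kernel interface.
watchdog_timings() {
    thresh="$1"
    # hrtimer period is 2*watchdog_thresh/5; computed in milliseconds to
    # avoid integer truncation for thresholds that are not multiples of 5.
    echo "hrtimer_period_ms=$(( 2 * $thresh * 1000 / 5 ))"
    # A timestamp older than 2*watchdog_thresh counts as a softlockup.
    echo "softlockup_threshold_s=$(( 2 * $thresh ))"
    # The NMI perf event checks for hardlockups every watchdog_thresh seconds.
    echo "nmi_check_interval_s=$thresh"
}
```

On a running system the current value could be passed in with ``watchdog_timings "$(cat /proc/sys/kernel/watchdog_thresh)"``; with the default of 10 this gives a 4000 ms hrtimer period and a 20 second softlockup threshold.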
+Hardlockup Detector (NMI/Perf) +------------------------------ + +On architectures that support NMI (Non-Maskable Interrupt) perf events, +a periodic NMI is generated every "watchdog_thresh" seconds. + +If any CPU in the system does not receive any hrtimer interrupt +(heartbeat) during the "watchdog_thresh" window, the 'hardlockup +detector' (the handler for the NMI perf event) will generate a kernel +warning or call panic. + +**Detection Overhead (NMI):** + +The time to detect a lockup can vary depending on when the lockup +occurs relative to the NMI check window. Examples below assume a watchdog_thresh of 10. + +* **Best Case:** The lockup occurs just before the first heartbeat is + due. The detector will notice the missing hrtimer interrupt almost + immediately during the next check. + + :: + + Time 100.0: cpu 1 heartbeat + Time 100.1: hardlockup_check, cpu1 stores its state + Time 103.9: Hard Lockup on cpu1 + Time 104.0: cpu 1 heartbeat never comes + Time 110.1: hardlockup_check, cpu1 checks the state again, should be the same, declares lockup + + Time to detection: ~6 seconds + +* **Worst Case:** The lockup occurs shortly after a valid interrupt + (heartbeat) which itself happened just after the NMI check. The next + NMI check sees that the interrupt count has changed (due to that one + heartbeat), assumes the CPU is healthy, and resets the baseline. The + lockup is only detected at the subsequent check. + + :: + + Time 100.0: hardlockup_check, cpu1 stores its state + Time 100.1: cpu 1 heartbeat + Time 100.2: Hard Lockup on cpu1 + Time 110.0: hardlockup_check, cpu1 stores its state (misses lockup as state changed) + Time 120.0: hardlockup_check, cpu1 checks the state again, should be the same, declares lockup + + Time to detection: ~20 seconds + +Hardlockup Detector (Buddy) +--------------------------- + +On architectures or configurations where NMI perf events are not +available (or disabled), the kernel may use the "buddy" hardlockup +detector. 
This mechanism requires SMP (Symmetric Multi-Processing). + +In this mode, each CPU is assigned a "buddy" CPU to monitor. The +monitoring CPU runs its own hrtimer (the same one used for softlockup +detection) and checks if the buddy CPU's hrtimer interrupt count has +increased. + +To ensure timeliness and avoid false positives, the buddy system performs +checks at every hrtimer interval (2*watchdog_thresh/5, which is 4 seconds +by default). It uses a missed-interrupt threshold of 3. If the buddy's +interrupt count has not changed for 3 consecutive checks, it is assumed +that the buddy CPU is hardlocked (interrupts disabled). The monitoring +CPU will then trigger the hardlockup response (warning or panic). + +**Detection Overhead (Buddy):** + +With a default check interval of 4 seconds (watchdog_thresh = 10): + +* **Best case:** Lockup occurs just before a check. + Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd). +* **Worst case:** Lockup occurs just after a check. + Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd). + +**Limitations of the Buddy Detector:** + +1. **All-CPU Lockup:** If all CPUs lock up simultaneously, the buddy + detector cannot detect the condition because the monitoring CPUs + are also frozen. +2. **Stack Traces:** Unlike the NMI detector, the buddy detector + cannot directly interrupt the locked CPU to grab a stack trace. + It relies on architecture-specific mechanisms (like NMI backtrace + support) to try and retrieve the status of the locked CPU. If + such support is missing, the log may only show that a lockup + occurred without providing the locked CPU's stack. -As explained above, a kernel knob is provided that allows -administrators to configure the period of the hrtimer and the perf -event. The right value for a particular environment is a trade-off -between fast response to lockups and detection overhead. +Watchdog Core Exclusion +======================= By default, the watchdog runs on all online cores. 
However, on a kernel configured with NO_HZ_FULL, by default the watchdog runs only diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst index 4ff2cc291d18..dc7eab191caa 100644 --- a/Documentation/admin-guide/md.rst +++ b/Documentation/admin-guide/md.rst @@ -238,6 +238,16 @@ All md devices contain: the number of devices in a raid4/5/6, or to support external metadata formats which mandate such clipping. + logical_block_size + Configure the array's logical block size in bytes. This attribute + is only supported for 1.x meta. Write the value before starting + array. The final array LBS uses the maximum between this + configuration and LBS of all combined devices. Note that + LBS cannot exceed PAGE_SIZE before RAID supports folio. + WARNING: Arrays created on new kernel cannot be assembled at old + kernel due to padding check, Set module parameter 'check_new_feature' + to false to bypass, but data loss may occur. + reshape_position This is either ``none`` or a sector number within the devices of the array where ``reshape`` is up to. If this is set, the three @@ -347,6 +357,54 @@ All md devices contain: active-idle like active, but no writes have been seen for a while (safe_mode_delay). + consistency_policy + This indicates how the array maintains consistency in case of unexpected + shutdown. It can be: + + none + Array has no redundancy information, e.g. raid0, linear. + + resync + Full resync is performed and all redundancy is regenerated when the + array is started after unclean shutdown. + + bitmap + Resync assisted by a write-intent bitmap. + + journal + For raid4/5/6, journal device is used to log transactions and replay + after unclean shutdown. + + ppl + For raid5 only, Partial Parity Log is used to close the write hole and + eliminate resync. + + The accepted values when writing to this file are ``ppl`` and ``resync``, + used to enable and disable PPL. 
+ + uuid + This indicates the UUID of the array in the following format: + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + + bitmap_type + [RW] When read, this file will display the current and available + bitmap for this array. The currently active bitmap will be enclosed + in [] brackets. Writing an bitmap name or ID to this file will switch + control of this array to that new bitmap. Note that writing a new + bitmap for created array is forbidden. + + none + No bitmap + bitmap + The default internal bitmap + llbitmap + The lockless internal bitmap + +If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or +llbitmap/xxx will be created after md device KOBJ_CHANGE event. + +If bitmap_type is bitmap, then the md device will also contain: + bitmap/location This indicates where the write-intent bitmap for the array is stored. @@ -401,35 +459,23 @@ All md devices contain: once the array becomes non-degraded, and this fact has been recorded in the metadata. - consistency_policy - This indicates how the array maintains consistency in case of unexpected - shutdown. It can be: - - none - Array has no redundancy information, e.g. raid0, linear. - - resync - Full resync is performed and all redundancy is regenerated when the - array is started after unclean shutdown. - - bitmap - Resync assisted by a write-intent bitmap. +If bitmap_type is llbitmap, then the md device will also contain: - journal - For raid4/5/6, journal device is used to log transactions and replay - after unclean shutdown. + llbitmap/bits + This is read-only, show status of bitmap bits, the number of each + value. - ppl - For raid5 only, Partial Parity Log is used to close the write hole and - eliminate resync. - - The accepted values when writing to this file are ``ppl`` and ``resync``, - used to enable and disable PPL. + llbitmap/metadata + This is read-only, show bitmap metadata, include chunksize, chunkshift, + chunks, offset and daemon_sleep. 
- uuid - This indicates the UUID of the array in the following format: - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + llbitmap/daemon_sleep + This is read-write, time in seconds that daemon function will be + triggered to clear dirty bits. + llbitmap/barrier_idle + This is read-write, time in seconds that page barrier will be idled, + means dirty bits in the page will be cleared. As component devices are added to an md array, they appear in the ``md`` directory as new directories named:: @@ -758,7 +804,7 @@ These currently include: journal_mode (currently raid5 only) The cache mode for raid5. raid5 could include an extra disk for - caching. The mode can be "write-throuth" and "write-back". The + caching. The mode can be "write-through" or "write-back". The default is "write-through". ppl_write_hint diff --git a/Documentation/admin-guide/media/c3-isp.dot b/Documentation/admin-guide/media/c3-isp.dot new file mode 100644 index 000000000000..42dc931ee84a --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.dot @@ -0,0 +1,26 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0 | <port1> 1} | c3-isp-core\n/dev/v4l-subdev0 | {<port2> 2 | <port3> 3 | <port4> 4 | <port5> 5}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port3 -> n00000008:port0 + n00000001:port4 -> n0000000b:port0 + n00000001:port5 -> n0000000e:port0 + n00000001:port2 -> n00000027 + n00000008 [label="{{<port0> 0} | c3-isp-resizer0\n/dev/v4l-subdev1 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000008:port1 -> n00000016 [style=bold] + n0000000b [label="{{<port0> 0} | c3-isp-resizer1\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000b:port1 -> n0000001a [style=bold] + n0000000e [label="{{<port0> 0} | c3-isp-resizer2\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000e:port1 -> n00000023 [style=bold] + n00000011 [label="{{<port0> 0} | c3-mipi-adapter\n/dev/v4l-subdev4 | {<port1> 
1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000011:port1 -> n00000001:port0 [style=bold] + n00000016 [label="c3-isp-cap0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n0000001a [label="c3-isp-cap1\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n0000001e [label="{{<port0> 0} | c3-mipi-csi2\n/dev/v4l-subdev5 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000001e:port1 -> n00000011:port0 [style=bold] + n00000023 [label="c3-isp-cap2\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n00000027 [label="c3-isp-stats\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n0000002b [label="c3-isp-params\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n0000002b -> n00000001:port1 + n0000003f [label="{{} | imx290 2-001a\n/dev/v4l-subdev6 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n0000003f:port0 -> n0000001e:port0 [style=bold] +} diff --git a/Documentation/admin-guide/media/c3-isp.rst b/Documentation/admin-guide/media/c3-isp.rst new file mode 100644 index 000000000000..ac508b8c6831 --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR MIT) + +.. include:: <isonum.txt> + +================================================= +Amlogic C3 Image Signal Processing (C3ISP) driver +================================================= + +Introduction +============ + +This file documents the Amlogic C3ISP driver located under +drivers/media/platform/amlogic/c3/isp. + +The current version of the driver supports the C3ISP found on +Amlogic C308L processor. + +The driver implements V4L2, Media controller and V4L2 subdev interfaces. +Camera sensor using V4L2 subdev interface in the kernel is supported. + +The driver has been tested on AW419-C308L-Socket platform. 
+ +Amlogic C3 ISP +============== + +The Camera hardware found on C308L processors and supported by +the driver consists of: + +- 1 MIPI-CSI-2 module: handles the physical layer of the MIPI CSI-2 receiver and + receives data from the connected camera sensor. +- 1 MIPI-ADAPTER module: organizes MIPI data to meet ISP input requirements and + send MIPI data to ISP. +- 1 ISP (Image Signal Processing) module: contains a pipeline of image processing + hardware blocks. The ISP pipeline contains three resizers at the end each of + them connected to a DMA interface which writes the output data to memory. + +A high-level functional view of the C3 ISP is presented below.:: + + +----------+ +-------+ + | Resizer |--->| WRMIF | + +---------+ +------------+ +--------------+ +-------+ |----------+ +-------+ + | Sensor |--->| MIPI CSI-2 |--->| MIPI ADAPTER |--->| ISP |---|----------+ +-------+ + +---------+ +------------+ +--------------+ +-------+ | Resizer |--->| WRMIF | + +----------+ +-------+ + |----------+ +-------+ + | Resizer |--->| WRMIF | + +----------+ +-------+ + +Driver architecture and design +============================== + +With the goal to model the hardware links between the modules and to expose a +clean, logical and usable interface, the driver registers the following V4L2 +sub-devices: + +- 1 `c3-mipi-csi2` sub-device - the MIPI CSI-2 receiver +- 1 `c3-mipi-adapter` sub-device - the MIPI adapter +- 1 `c3-isp-core` sub-device - the ISP core +- 3 `c3-isp-resizer` sub-devices - the ISP resizers + +The `c3-isp-core` sub-device is linked to 2 video device nodes for statistics +capture and parameters programming: + +- the `c3-isp-stats` capture video device node for statistics capture +- the `c3-isp-params` output video device for parameters programming + +Each `c3-isp-resizer` sub-device is linked to a capture video device node where +frames are captured from: + +- `c3-isp-resizer0` is linked to the `c3-isp-cap0` capture video device +- `c3-isp-resizer1` is linked 
to the `c3-isp-cap1` capture video device +- `c3-isp-resizer2` is linked to the `c3-isp-cap2` capture video device + +The media controller pipeline graph is as follows (with a connected +IMX290 camera sensor): + +.. _isp_topology_graph: + +.. kernel-figure:: c3-isp.dot + :alt: c3-isp.dot + :align: center + + Media pipeline topology + +Implementation +============== + +Runtime configuration of the ISP hardware is performed on the `c3-isp-params` +video device node using the :ref:`V4L2_META_FMT_C3ISP_PARAMS +<v4l2-meta-fmt-c3isp-params>` as data format. The buffer structure is defined by +:c:type:`c3_isp_params_cfg`. + +Statistics are captured from the `c3-isp-stats` video device node using the +:ref:`V4L2_META_FMT_C3ISP_STATS <v4l2-meta-fmt-c3isp-stats>` data format. + +The final picture size and format is configured using the V4L2 video +capture interface on the `c3-isp-cap[0, 2]` video device nodes. + +The Amlogic C3 ISP is supported by `libcamera <https://libcamera.org>`_ with a +dedicated pipeline handler and algorithms that perform run-time image correction +and enhancement. diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst index 92690e1f2183..b2e7a300494a 100644 --- a/Documentation/admin-guide/media/cec.rst +++ b/Documentation/admin-guide/media/cec.rst @@ -451,7 +451,7 @@ configure the CEC devices for HDMI Input and the HDMI Outputs manually. --------------------- A three character manufacturer name that is used in the EDID for the HDMI -Input. If not set, then userspace is reponsible for configuring an EDID. +Input. If not set, then userspace is responsible for configuring an EDID. If set, then the driver will update the EDID automatically based on the resolutions supported by the connected displays, and it will not be possible anymore to manually set the EDID for the HDMI Input. 
diff --git a/Documentation/admin-guide/media/i2c-cardlist.rst b/Documentation/admin-guide/media/i2c-cardlist.rst index 1825a0bb47bd..fff962558cd5 100644 --- a/Documentation/admin-guide/media/i2c-cardlist.rst +++ b/Documentation/admin-guide/media/i2c-cardlist.rst @@ -91,7 +91,6 @@ ov5647 OmniVision OV5647 sensor ov5670 OmniVision OV5670 sensor ov5675 OmniVision OV5675 sensor ov5695 OmniVision OV5695 sensor -ov6650 OmniVision OV6650 sensor ov7251 OmniVision OV7251 sensor ov7640 OmniVision OV7640 sensor ov7670 OmniVision OV7670 sensor diff --git a/Documentation/admin-guide/media/imx.rst b/Documentation/admin-guide/media/imx.rst index b8fa70f854fd..bb68100d8acb 100644 --- a/Documentation/admin-guide/media/imx.rst +++ b/Documentation/admin-guide/media/imx.rst @@ -96,7 +96,7 @@ Some of the features of this driver include: motion compensation modes: low, medium, and high motion. Pipelines are defined that allow sending frames to the VDIC subdev directly from the CSI. There is also support in the future for sending frames to the - VDIC from memory buffers via a output/mem2mem devices. + VDIC from memory buffers via output/mem2mem devices. - Includes a Frame Interval Monitor (FIM) that can correct vertical sync problems with the ADV718x video decoders. diff --git a/Documentation/admin-guide/media/ivtv.rst b/Documentation/admin-guide/media/ivtv.rst index 101f16d0263e..8b65ac3f5321 100644 --- a/Documentation/admin-guide/media/ivtv.rst +++ b/Documentation/admin-guide/media/ivtv.rst @@ -3,7 +3,7 @@ The ivtv driver =============== -Author: Hans Verkuil <hverkuil@xs4all.nl> +Author: Hans Verkuil <hverkuil@kernel.org> This is a v4l2 device driver for the Conexant cx23415/6 MPEG encoder/decoder. 
The cx23415 can do both encoding and decoding, the cx23416 can only do MPEG diff --git a/Documentation/admin-guide/media/mali-c55-graph.dot b/Documentation/admin-guide/media/mali-c55-graph.dot new file mode 100644 index 000000000000..0775ba42bf4c --- /dev/null +++ b/Documentation/admin-guide/media/mali-c55-graph.dot @@ -0,0 +1,19 @@ +digraph board { + rankdir=TB + n00000001 [label="{{} | mali-c55 tpg\n/dev/v4l-subdev0 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port0 -> n00000003:port0 [style=dashed] + n00000003 [label="{{<port0> 0} | mali-c55 isp\n/dev/v4l-subdev1 | {<port1> 1 | <port2> 2}}", shape=Mrecord, style=filled, fillcolor=green] + n00000003:port1 -> n00000007:port0 [style=bold] + n00000003:port2 -> n00000007:port2 [style=bold] + n00000003:port1 -> n0000000b:port0 [style=bold] + n00000007 [label="{{<port0> 0 | <port2> 2} | mali-c55 resizer fr\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000007:port1 -> n0000000e [style=bold] + n0000000b [label="{{<port0> 0} | mali-c55 resizer ds\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000b:port1 -> n00000012 [style=bold] + n0000000e [label="mali-c55 fr\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n00000012 [label="mali-c55 ds\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n00000022 [label="{{<port0> 0} | csi2-rx\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000022:port1 -> n00000003:port0 + n00000027 [label="{{} | imx415 1-001a\n/dev/v4l-subdev5 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n00000027:port0 -> n00000022:port0 [style=bold] +}
\ No newline at end of file diff --git a/Documentation/admin-guide/media/mali-c55.rst b/Documentation/admin-guide/media/mali-c55.rst new file mode 100644 index 000000000000..315f982000c4 --- /dev/null +++ b/Documentation/admin-guide/media/mali-c55.rst @@ -0,0 +1,413 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +ARM Mali-C55 Image Signal Processor driver +========================================== + +Introduction +============ + +This file documents the driver for ARM's Mali-C55 Image Signal Processor. The +driver is located under drivers/media/platform/arm/mali-c55. + +The Mali-C55 ISP receives data in either raw Bayer format or RGB/YUV format from +sensors through either a parallel interface or a memory bus before processing it +and outputting it through an internal DMA engine. Two output pipelines are +possible (though one may not be fitted, depending on the implementation). These +are referred to as "Full resolution" and "Downscale", but the naming is historic +and both pipes are capable of cropping/scaling operations. The full resolution +pipe is also capable of outputting RAW data, bypassing much of the ISP's +processing. The downscale pipe cannot output RAW data. An integrated test +pattern generator can be used to drive the ISP and produce image data in the +absence of a connected camera sensor. The driver module is named mali_c55, and +is enabled through the CONFIG_VIDEO_MALI_C55 config option. + +The driver implements V4L2, Media Controller and V4L2 Subdevice interfaces and +expects camera sensors connected to the ISP to have V4L2 subdevice interfaces. + +Mali-C55 ISP hardware +===================== + +A high level functional view of the Mali-C55 ISP is presented below. 
The ISP +takes input from either a live source or through a DMA engine for memory input, +depending on the SoC integration.:: + + +---------+ +----------+ +--------+ + | Sensor |--->| CSI-2 Rx | "Full Resolution" | DMA | + +---------+ +----------+ |\ Output +--->| Writer | + | | \ | +--------+ + | | \ +----------+ +------+---> Streaming I/O + +------------+ +------->| | | | | + | | | |-->| Mali-C55 |--+ + | DMA Reader |--------------->| | | ISP | | + | | | / | | | +---> Streaming I/O + +------------+ | / +----------+ | | + |/ +------+ + | +--------+ + +--->| DMA | + "Downscaled" | Writer | + Output +--------+ + +Media Controller Topology +========================= + +An example of the ISP's topology (as implemented in a system with an IMX415 +camera sensor and generic CSI-2 receiver) is below: + + +.. kernel-figure:: mali-c55-graph.dot + :alt: mali-c55-graph.dot + :align: center + +The driver has 4 V4L2 subdevices: + +- `mali-c55 isp`: Responsible for configuring input crop and color space + conversion +- `mali-c55 tpg`: The test pattern generator, emulating a camera sensor. +- `mali-c55 resizer fr`: The Full-Resolution pipe resizer +- `mali-c55 resizer ds`: The Downscale pipe resizer + +The driver has 3 V4L2 video devices: + +- `mali-c55 fr`: The full-resolution pipe's capture device +- `mali-c55 ds`: The downscale pipe's capture device +- `mali-c55 3a stats`: The 3A statistics capture device + +Frame sequences are synchronised across the two capture devices, meaning if one +pipe is started later than the other the sequence numbers returned in its +buffers will match those of the other pipe rather than starting from zero. + +Idiosyncrasies +-------------- + +**mali-c55 isp** +The `mali-c55 isp` subdevice has a single sink pad to which all sources of data +should be connected. The active source is selected by enabling the appropriate +media link and disabling all others. 
The ISP has two source pads, reflecting the +different paths through which it can internally route data. Tap points within +the ISP allow users to divert data to avoid processing by some or all of the +hardware's processing steps. The diagram below is intended only to highlight how +the bypassing works and is not a true reflection of those processing steps; for +a high-level functional block diagram see ARM's developer page for the +ISP [3]_:: + + +--------------------------------------------------------------+ + | Possible Internal ISP Data Routes | + | +------------+ +----------+ +------------+ | + +---+ | | | | | Colour | +---+ + | 0 |--+-->| Processing |->| Demosaic |->| Space |--->| 1 | + +---+ | | | | | | Conversion | +---+ + | | +------------+ +----------+ +------------+ | + | | +---+ + | +---------------------------------------------------| 2 | + | +---+ + | | + +--------------------------------------------------------------+ + + +.. flat-table:: + :header-rows: 1 + + * - Pad + - Direction + - Purpose + + * - 0 + - sink + - Data input, connected to the TPG and camera sensors + + * - 1 + - source + - RGB/YUV data, connected to the FR and DS V4L2 subdevices + + * - 2 + - source + - RAW bayer data, connected to the FR V4L2 subdevices + +The ISP is limited to both input and output resolutions between 640x480 and +8192x8192, and this is reflected in the ISP and resizer subdevice's .set_fmt() +operations. + +**mali-c55 resizer fr** +The `mali-c55 resizer fr` subdevice has two _sink_ pads to reflect the different +insertion points in the hardware (either RAW or demosaiced data): + +.. flat-table:: + :header-rows: 1 + + * - Pad + - Direction + - Purpose + + * - 0 + - sink + - Data input connected to the ISP's demosaiced stream. 
+ + * - 1 + - source + - Data output connected to the capture video device + + * - 2 + - sink + - Data input connected to the ISP's raw data stream + +The data source in use is selected through the routing API; two routes each of a +single stream are available: + +.. flat-table:: + :header-rows: 1 + + * - Sink Pad + - Source Pad + - Purpose + + * - 0 + - 1 + - Demosaiced data route + + * - 2 + - 1 + - Raw data route + + +If the demosaiced route is active then the FR pipe is only capable of output +in RGB/YUV formats. If the raw route is active then the output reflects the +input (which may be either Bayer or RGB/YUV data). + +Using the driver to capture video +================================= + +Using the media controller APIs we can configure the input source and ISP to +capture images in a variety of formats. In the examples below, configuring the +media graph is done with the v4l-utils [1]_ package's media-ctl utility. +Capturing the images is done with yavta [2]_. + +Configuring the input source +---------------------------- + +The first step is to set the input source that we wish by enabling the correct +media link. Using the example topology above, we can select the TPG as follows: + +.. code-block:: none + + media-ctl -l "'lte-csi2-rx':1->'mali-c55 isp':0[0]" + media-ctl -l "'mali-c55 tpg':0->'mali-c55 isp':0[1]" + +Configuring which video devices will stream data +------------------------------------------------ + +The driver will wait for all video devices to have their VIDIOC_STREAMON ioctl +called before it tells the sensor to start streaming. To facilitate this we need +to enable links to the video devices that we want to use. In the example below +we enable the links to both of the image capture video devices + +.. 
code-block:: none + + media-ctl -l "'mali-c55 resizer fr':1->'mali-c55 fr':0[1]" + media-ctl -l "'mali-c55 resizer ds':1->'mali-c55 ds':0[1]" + +Capturing bayer data from the source and processing to RGB/YUV +-------------------------------------------------------------- + +To capture 1920x1080 bayer data from the source and push it through the ISP's +full processing pipeline, we configure the data formats appropriately on the +source, ISP and resizer subdevices and set the FR resizer's routing to select +processed data. The media bus format on the resizer's source pad will be either +RGB121212_1X36 or YUV10_1X30, depending on whether you want to capture RGB or +YUV. The ISP's debayering block outputs RGB data natively, setting the source +pad format to YUV10_1X30 enables the colour space conversion block. + +In this example we target RGB565 output, so select RGB121212_1X36 as the resizer +source pad's format: + +.. code-block:: none + + # Set formats on the TPG and ISP + media-ctl -V "'mali-c55 tpg':0[fmt:SRGGB20_1X20/1920x1080]" + media-ctl -V "'mali-c55 isp':0[fmt:SRGGB20_1X20/1920x1080]" + media-ctl -V "'mali-c55 isp':1[fmt:SRGGB20_1X20/1920x1080]" + + # Set routing on the FR resizer + media-ctl -R "'mali-c55 resizer fr'[0/0->1/0[1],2/0->1/0[0]]" + + # Set format on the resizer, must be done AFTER the routing. + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB121212_1X36/1920x1080]" + +The downscale output can also be used to stream data at the same time. In this +case since only processed data can be captured through the downscale output no +routing need be set: + +.. code-block:: none + + # Set format on the resizer + media-ctl -V "'mali-c55 resizer ds':1[fmt:RGB121212_1X36/1920x1080]" + +Following which images can be captured from both the FR and DS output's video +devices (simultaneously, if desired): + +.. 
code-block:: none + + yavta -f RGB565 -s 1920x1080 -c10 /dev/video0 + yavta -f RGB565 -s 1920x1080 -c10 /dev/video1 + +Cropping the image +~~~~~~~~~~~~~~~~~~ + +Both the full resolution and downscale pipes can crop to a minimum resolution of +640x480. To crop the image simply configure the resizer's sink pad's crop and +compose rectangles and set the format on the video device: + +.. code-block:: none + + media-ctl -V "'mali-c55 resizer fr':0[fmt:RGB121212_1X36/1920x1080 crop:(480,270)/640x480 compose:(0,0)/640x480]" + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB121212_1X36/640x480]" + yavta -f RGB565 -s 640x480 -c10 /dev/video0 + +Downscaling the image +~~~~~~~~~~~~~~~~~~~~~ + +Both the full resolution and downscale pipes can downscale the image by up to 8x +provided the minimum 640x480 output resolution is adhered to. For the best image +result the scaling ratio for each direction should be the same. To configure +scaling we use the compose rectangle on the resizer's sink pad: + +.. code-block:: none + + media-ctl -V "'mali-c55 resizer fr':0[fmt:RGB121212_1X36/1920x1080 crop:(0,0)/1920x1080 compose:(0,0)/640x480]" + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB121212_1X36/640x480]" + yavta -f RGB565 -s 640x480 -c10 /dev/video0 + +Capturing images in YUV formats +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If we need to output YUV data rather than RGB the color space conversion block +needs to be active, which is achieved by setting MEDIA_BUS_FMT_YUV10_1X30 on the +resizer's source pad. We can then configure a capture format like NV12 (here in +its multi-planar variant) + +.. code-block:: none + + media-ctl -V "'mali-c55 resizer fr':1[fmt:YUV10_1X30/1920x1080]" + yavta -f NV12M -s 1920x1080 -c10 /dev/video0 + +Capturing RGB data from the source and processing it with the resizers +---------------------------------------------------------------------- + +The Mali-C55 ISP can work with sensors capable of outputting RGB data. 
In this +case although none of the image quality blocks would be used it can still +crop/scale the data in the usual way. For this reason RGB data input to the ISP +still goes through the ISP subdevice's pad 1 to the resizer. + +To achieve this, the ISP's sink pad's format is set to +MEDIA_BUS_FMT_RGB202020_1X60 - this reflects the format that data must be in to +work with the ISP. Converting the camera sensor's output to that format is the +responsibility of external hardware. + +In this example we ask the test pattern generator to give us RGB data instead of +bayer. + +.. code-block:: none + + media-ctl -V "'mali-c55 tpg':0[fmt:RGB202020_1X60/1920x1080]" + media-ctl -V "'mali-c55 isp':0[fmt:RGB202020_1X60/1920x1080]" + +Cropping or scaling the data can be done in exactly the same way as outlined +earlier. + +Capturing raw data from the source and outputting it unmodified +----------------------------------------------------------------- + +The ISP can additionally capture raw data from the source and output it on the +full resolution pipe only, completely unmodified. In this case the downscale +pipe can still process the data normally and be used at the same time. + +To configure raw bypass the FR resizer's subdevice's routing table needs to be +configured, followed by formats in the appropriate places: + +.. code-block:: none + + media-ctl -R "'mali-c55 resizer fr'[0/0->1/0[0],2/0->1/0[1]]" + media-ctl -V "'mali-c55 isp':0[fmt:RGB202020_1X60/1920x1080]" + media-ctl -V "'mali-c55 resizer fr':2[fmt:RGB202020_1X60/1920x1080]" + media-ctl -V "'mali-c55 resizer fr':1[fmt:RGB202020_1X60/1920x1080]" + + # Set format on the video device and stream + yavta -f RGB565 -s 1920x1080 -c10 /dev/video0 + +.. _mali-c55-3a-stats: + +Capturing ISP Statistics +======================== + +The ISP is capable of producing statistics for consumption by image processing +algorithms running in userspace. 
These statistics can be captured by queueing +buffers to the `mali-c55 3a stats` V4L2 Device whilst the ISP is streaming. Only +the :ref:`V4L2_META_FMT_MALI_C55_STATS <v4l2-meta-fmt-mali-c55-stats>` +format is supported, so no format-setting need be done: + +.. code-block:: none + + # We assume the media graph has been configured to support RGB565 capture + # from the mali-c55 fr V4L2 Device, which is at /dev/video0. The statistics + # V4L2 device is at /dev/video3 + + yavta -f RGB565 -s 1920x1080 -c32 /dev/video0 && \ + yavta -c10 -F /dev/video3 + +The layout of the buffer is described by :c:type:`mali_c55_stats_buffer`, +but broadly statistics are generated to support three image processing +algorithms; AEXP (Auto-Exposure), AWB (Auto-White Balance) and AF (Auto-Focus). +These stats can be drawn from various places in the Mali C55 ISP pipeline, known +as "tap points". This high-level block diagram is intended to explain where in +the processing flow the statistics can be drawn from:: + + +--> AEXP-2 +----> AEXP-1 +--> AF-0 + | +----> AF-1 | + | | | + +---------+ | +--------------+ | +--------------+ | + | Input +-+-->+ Digital Gain +---+-->+ Black Level +---+---+ + +---------+ +--------------+ +--------------+ | + +-----------------------------------------------------------------+ + | + | +--------------+ +---------+ +----------------+ + +-->| Sinter Noise +-+ White +--+--->| Lens Shading +--+---------------+ + | Reduction | | Balance | | | | | | + +--------------+ +---------+ | +----------------+ | | + +---> AEXP-0 (A) +--> AEXP-0 (B) | + +--------------------------------------------------------------------------+ + | + | +----------------+ +--------------+ +----------------+ + +-->| Tone mapping +-+--->| Demosaicing +->+ Purple Fringe +-+-----------+ + | | | +--------------+ | Correction | | | + +----------------+ +-> AEXP-IRIDIX +----------------+ +---> AWB-0 | + +----------------------------------------------------------------------------+ + | +-------------+ 
+-------------+ + +------------------->| Colour +---+--->| Output | + | Correction | | | Pipelines | + +-------------+ | +-------------+ + +--> AWB-1 + +By default all statistics are drawn from the 0th tap point for each algorithm; +i.e. AEXP statistics from AEXP-0 (A), AWB statistics from AWB-0 and AF +statistics from AF-0. This is configurable for AEXP and AWB statistics through +programming the ISP's parameters. + +.. _mali-c55-3a-params: + +Programming ISP Parameters +========================== + +The ISP can be programmed with various parameters from userspace to apply to the +hardware before and during video stream. This allows userspace to dynamically +change values such as black level, white balance and lens shading gains and so +on. + +The buffer format and how to populate it are described by the +:ref:`V4L2_META_FMT_MALI_C55_PARAMS <v4l2-meta-fmt-mali-c55-params>` format, +which should be set as the data format for the `mali-c55 3a params` video node. + +References +========== +.. [1] https://git.linuxtv.org/v4l-utils.git/ +.. [2] https://git.ideasonboard.org/yavta.git +.. [3] https://developer.arm.com/Processors/Mali-C55 diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst index b9da127c074d..8e429fd77712 100644 --- a/Documentation/admin-guide/media/mgb4.rst +++ b/Documentation/admin-guide/media/mgb4.rst @@ -1,8 +1,17 @@ .. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + The mgb4 driver =============== +Copyright |copy| 2023 - 2025 Digiteq Automotive + author: Martin Tůma <martin.tuma@digiteqautomotive.com> + +This is a v4l2 device driver for the Digiteq Automotive FrameGrabber 4, a PCIe +card capable of capturing and generating FPD-Link III and GMSL2/3 video streams +as used in the automotive industry. 
+ sysfs interface --------------- @@ -22,7 +31,11 @@ Global (PCI card) parameters | 0 - No module present | 1 - FPDL3 - | 2 - GMSL + | 2 - GMSL3 (one serializer, two daisy chained deserializers) + | 3 - GMSL3 (one serializer, two deserializers) + | 4 - GMSL3 (two deserializers with two daisy chain outputs) + | 6 - GMSL1 + | 8 - GMSL3 coax **module_version** (R): Module version number. Zero in case of a missing module. @@ -31,7 +44,8 @@ Global (PCI card) parameters Firmware type. | 1 - FPDL3 - | 2 - GMSL + | 2 - GMSL3 + | 3 - GMSL1 **fw_version** (R): Firmware version number. @@ -60,6 +74,7 @@ Common FPDL3/GMSL input parameters | 0 - OLDI/JEIDA | 1 - SPWG/VESA (default) + | 2 - ZDML **link_status** (R): Video link status. If the link is locked, chips are properly connected and @@ -226,6 +241,13 @@ Common FPDL3/GMSL output parameters *Note: This parameter can not be changed while the output v4l2 device is open.* +**color_mapping** (RW): + Mapping of the outgoing bits in the signal to the colour bits of the pixels. + + | 0 - OLDI/JEIDA + | 1 - SPWG/VESA (default) + | 2 - ZDML + **frame_rate** (RW): Output video signal frame rate limit in frames per second. 
Due to the limited output pixel clock steps, the card can not always generate diff --git a/Documentation/admin-guide/media/pci-cardlist.rst b/Documentation/admin-guide/media/pci-cardlist.rst index 7d8e3c8987db..239879634ea5 100644 --- a/Documentation/admin-guide/media/pci-cardlist.rst +++ b/Documentation/admin-guide/media/pci-cardlist.rst @@ -86,7 +86,6 @@ saa7134 Philips SAA7134 saa7164 NXP SAA7164 smipcie SMI PCIe DVBSky cards solo6x10 Bluecherry / Softlogic 6x10 capture cards (MPEG-4/H.264) -sta2x11_vip STA2X11 VIP Video For Linux tw5864 Techwell TW5864 video/audio grabber and encoder tw686x Intersil/Techwell TW686x tw68 Techwell tw68x Video For Linux diff --git a/Documentation/admin-guide/media/platform-cardlist.rst b/Documentation/admin-guide/media/platform-cardlist.rst index 1230ae4037ad..63f4b19c3628 100644 --- a/Documentation/admin-guide/media/platform-cardlist.rst +++ b/Documentation/admin-guide/media/platform-cardlist.rst @@ -18,8 +18,6 @@ am437x-vpfe TI AM437x VPFE aspeed-video Aspeed AST2400 and AST2500 atmel-isc ATMEL Image Sensor Controller (ISC) atmel-isi ATMEL Image Sensor Interface (ISI) -c8sectpfe SDR platform devices -c8sectpfe SDR platform devices cafe_ccic Marvell 88ALP01 (Cafe) CMOS Camera Controller cdns-csi2rx Cadence MIPI-CSI2 RX Controller cdns-csi2tx Cadence MIPI-CSI2 TX Controller diff --git a/Documentation/admin-guide/media/radio-cardlist.rst b/Documentation/admin-guide/media/radio-cardlist.rst index a82a146bf912..cec724256812 100644 --- a/Documentation/admin-guide/media/radio-cardlist.rst +++ b/Documentation/admin-guide/media/radio-cardlist.rst @@ -30,7 +30,6 @@ radio-terratec TerraTec ActiveRadio ISA Standalone radio-timb Enable the Timberdale radio driver radio-trust Trust FM radio card radio-typhoon Typhoon Radio (a.k.a. 
EcoRadio) -radio-wl1273 Texas Instruments WL1273 I2C FM Radio fm_drv ISA radio devices fm_drv ISA radio devices radio-zoltrix Zoltrix Radio diff --git a/Documentation/admin-guide/media/rkcif-rk3568-vicap.dot b/Documentation/admin-guide/media/rkcif-rk3568-vicap.dot new file mode 100644 index 000000000000..3fac59335459 --- /dev/null +++ b/Documentation/admin-guide/media/rkcif-rk3568-vicap.dot @@ -0,0 +1,8 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0} | rkcif-dvp0\n/dev/v4l-subdev0 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port1 -> n00000004 + n00000004 [label="rkcif-dvp0-id0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n00000025 [label="{{} | it6801 2-0048\n/dev/v4l-subdev1 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n00000025:port0 -> n00000001:port0 +} diff --git a/Documentation/admin-guide/media/rkcif.rst b/Documentation/admin-guide/media/rkcif.rst new file mode 100644 index 000000000000..2558c121abc4 --- /dev/null +++ b/Documentation/admin-guide/media/rkcif.rst @@ -0,0 +1,79 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +Rockchip Camera Interface (CIF) +========================================= + +Introduction +============ + +The Rockchip Camera Interface (CIF) is featured in many Rockchip SoCs in +different variants. +The different variants are combinations of common building blocks, such as + +* INTERFACE blocks of different types, namely + + * the Digital Video Port (DVP, a parallel data interface) + * the interface block for the MIPI CSI-2 receiver + +* CROP units + +* MIPI CSI-2 receiver (not available on all variants): This unit is referred + to as MIPI CSI HOST in the Rockchip documentation. + Technically, it is a separate hardware block, but it is strongly coupled to + the CIF and therefore included here. 
+ +* MUX units (not available on all variants) that pass the video data to an + image signal processor (ISP) + +* SCALE units (not available on all variants) + +* DMA engines that transfer video data into system memory using a + double-buffering mechanism called ping-pong mode + +* Support for four streams per INTERFACE block (not available on all + variants), e.g., for MIPI CSI-2 Virtual Channels (VCs) + +This document describes the different variants of the CIF, their hardware +layout, as well as their representation in the media controller centric rkcif +device driver, which is located under drivers/media/platform/rockchip/rkcif. + +Variants +======== + +Rockchip PX30 Video Input Processor (VIP) +----------------------------------------- + +The PX30 Video Input Processor (VIP) features a digital video port that accepts +parallel video data or BT.656. +Since these protocols do not feature multiple streams, the VIP has one DMA +engine that transfers the input video data into system memory. + +The rkcif driver represents this hardware variant by exposing one V4L2 subdevice +(the DVP INTERFACE/CROP block) and one V4L2 device (the DVP DMA engine). + +Rockchip RK3568 Video Capture (VICAP) +------------------------------------- + +The RK3568 Video Capture (VICAP) unit features a digital video port and a MIPI +CSI-2 receiver that can receive video data independently. +The DVP accepts parallel video data, BT.656 and BT.1120. +Since the BT.1120 protocol may feature more than one stream, the RK3568 VICAP +DVP features four DMA engines that can capture different streams. +Similarly, the RK3568 VICAP MIPI CSI-2 receiver features four DMA engines to +handle different Virtual Channels (VCs). 
+ +The rkcif driver represents this hardware variant by exposing the following +V4L2 subdevices: + +* rkcif-dvp0: INTERFACE/CROP block for the DVP + +and the following video devices: + +* rkcif-dvp0-id0: The support for multiple streams on the DVP is not yet + implemented, as it is hard to find test hardware. Thus, this video device + represents the first DMA engine of the RK3568 DVP. + +.. kernel-figure:: rkcif-rk3568-vicap.dot + :alt: Topology of the RK3568 Video Capture (VICAP) unit + :align: center diff --git a/Documentation/admin-guide/media/si4713.rst b/Documentation/admin-guide/media/si4713.rst index be8e6b49b7b4..85dcf1cd2df8 100644 --- a/Documentation/admin-guide/media/si4713.rst +++ b/Documentation/admin-guide/media/si4713.rst @@ -13,7 +13,7 @@ Contact: Eduardo Valentin <eduardo.valentin@nokia.com> Information about the Device ---------------------------- -This chip is a Silicon Labs product. It is a I2C device, currently on 0x63 address. +This chip is a Silicon Labs product. It is an I2C device, currently on 0x63 address. Basically, it has transmission and signal noise level measurement features. The Si4713 integrates transmit functions for FM broadcast stereo transmission. @@ -28,7 +28,7 @@ Users must comply with local regulations on radio frequency (RF) transmission. Device driver description ------------------------- -There are two modules to handle this device. One is a I2C device driver +There are two modules to handle this device. One is an I2C device driver and the other is a platform driver. The I2C device driver exports a v4l2-subdev interface to the kernel. @@ -113,7 +113,7 @@ Here is a summary of them: - acomp_attack_time - Sets the attack time for audio dynamic range control. - acomp_release_time - Sets the release time for audio dynamic range control. -* Limiter setups audio deviation limiter feature. Once a over deviation occurs, +* Limiter sets up the audio deviation limiter feature.
Once an over deviation occurs, it is possible to adjust the front-end gain of the audio input and always prevent over deviation. diff --git a/Documentation/admin-guide/media/starfive_camss.rst b/Documentation/admin-guide/media/starfive_camss.rst deleted file mode 100644 index ca42e9447c47..000000000000 --- a/Documentation/admin-guide/media/starfive_camss.rst +++ /dev/null @@ -1,72 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -.. include:: <isonum.txt> - -================================ -Starfive Camera Subsystem driver -================================ - -Introduction ------------- - -This file documents the driver for the Starfive Camera Subsystem found on -Starfive JH7110 SoC. The driver is located under drivers/staging/media/starfive/ -camss. - -The driver implements V4L2, Media controller and v4l2_subdev interfaces. Camera -sensor using V4L2 subdev interface in the kernel is supported. - -The driver has been successfully used on the Gstreamer 1.18.5 with v4l2src -plugin. - - -Starfive Camera Subsystem hardware ----------------------------------- - -The Starfive Camera Subsystem hardware consists of:: - - |\ +---------------+ +-----------+ - +----------+ | \ | | | | - | | | | | | | | - | MIPI |----->| |----->| ISP |----->| | - | | | | | | | | - +----------+ | | | | | Memory | - |MUX| +---------------+ | Interface | - +----------+ | | | | - | | | |---------------------------->| | - | Parallel |----->| | | | - | | | | | | - +----------+ | / | | - |/ +-----------+ - -- MIPI: The MIPI interface, receiving data from a MIPI CSI-2 camera sensor. - -- Parallel: The parallel interface, receiving data from a parallel sensor. - -- ISP: The ISP, processing raw Bayer data from an image sensor and producing - YUV frames. - - -Topology --------- - -The media controller pipeline graph is as follows: - -.. _starfive_camss_graph: - -.. 
kernel-figure:: starfive_camss_graph.dot - :alt: starfive_camss_graph.dot - :align: center - -The driver has 2 video devices: - -- capture_raw: The capture device, capturing image data directly from a sensor. -- capture_yuv: The capture device, capturing YUV frame data processed by the - ISP module - -The driver has 3 subdevices: - -- stf_isp: is responsible for all the isp operations, outputs YUV frames. -- cdns_csi2rx: a CSI-2 bridge supporting up to 4 CSI lanes in input, and 4 - different pixel streams in output. -- imx219: an image sensor, image data is sent through MIPI CSI-2. diff --git a/Documentation/admin-guide/media/starfive_camss_graph.dot b/Documentation/admin-guide/media/starfive_camss_graph.dot deleted file mode 100644 index 8eff1f161ac7..000000000000 --- a/Documentation/admin-guide/media/starfive_camss_graph.dot +++ /dev/null @@ -1,12 +0,0 @@ -digraph board { - rankdir=TB - n00000001 [label="{{<port0> 0} | stf_isp\n/dev/v4l-subdev0 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] - n00000001:port1 -> n00000008 [style=dashed] - n00000004 [label="capture_raw\n/dev/video0", shape=box, style=filled, fillcolor=yellow] - n00000008 [label="capture_yuv\n/dev/video1", shape=box, style=filled, fillcolor=yellow] - n0000000e [label="{{<port0> 0} | cdns_csi2rx.19800000.csi-bridge\n | {<port1> 1 | <port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green] - n0000000e:port1 -> n00000001:port0 [style=dashed] - n0000000e:port1 -> n00000004 [style=dashed] - n00000018 [label="{{} | imx219 6-0010\n/dev/v4l-subdev1 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] - n00000018:port0 -> n0000000e:port0 [style=bold] -} diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst index e8761561b2fe..d31da8e0a54f 100644 --- a/Documentation/admin-guide/media/v4l-drivers.rst +++ b/Documentation/admin-guide/media/v4l-drivers.rst @@ -10,6 +10,7 @@ Video4Linux (V4L) 
driver-specific documentation :maxdepth: 2 bttv + c3-isp cafe_ccic cx88 fimc @@ -18,19 +19,20 @@ Video4Linux (V4L) driver-specific documentation ipu3 ipu6-isys ivtv + mali-c55 mgb4 omap3isp philips qcom_camss raspberrypi-pisp-be rcar-fdp1 + rkcif rkisp1 raspberrypi-rp1-cfe saa7134 si470x si4713 si476x - starfive_camss vimc visl vivid diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 7367e6294ef6..4120e9cb0cd5 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -12,10 +12,16 @@ its CMA name like below: The structure of the files created under that directory is as follows: - - [RO] base_pfn: The base PFN (Page Frame Number) of the zone. + - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area. + This is the same as ranges/0/base_pfn. - [RO] count: Amount of memory in the CMA area. - [RO] order_per_bit: Order of pages represented by one bit. - - [RO] bitmap: The bitmap of page states in the zone. + - [RO] bitmap: The bitmap of allocated pages in the area. + This is the same as ranges/0/bitmap. + - [RO] ranges/N/base_pfn: The base PFN of contiguous range N + in the CMA area. + - [RO] ranges/N/bitmap: The bitmap of allocated pages in + range N in the CMA area. - [WO] alloc: Allocate N pages from that CMA area. For example:: echo 5 > <debugfs>/cma/<cma_name>/alloc diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 33d37bb2fb4e..3ce3164480c7 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,12 +1,11 @@ ..
SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ -:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. -Using DAMON, users can analyze the memory access patterns of their systems and -optimize those. +:doc:`DAMON </mm/damon/index>` is a Linux kernel subsystem for efficient data +access monitoring and access-aware system operations. .. toctree:: :maxdepth: 2 @@ -15,3 +14,4 @@ optimize those. usage reclaim lru_sort + stat diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst index 7b0775d281b4..14cc6b2db897 100644 --- a/Documentation/admin-guide/mm/damon/lru_sort.rst +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -79,6 +79,47 @@ of parametrs except ``enabled`` again. Once the re-reading is done, this parameter is set as ``N``. If invalid parameters are found while the re-reading, DAMON_LRU_SORT will be disabled. +Once ``Y`` is written to this parameter, the user must not write to any +parameters until reading ``commit_inputs`` again returns ``N``. If users +violate this rule, the kernel may exhibit undefined behavior. + +active_mem_bp +------------- + +Desired active to [in]active memory ratio in bp (1/10,000). + +While keeping the caps that set by other quotas, DAMON_LRU_SORT automatically +increases and decreases the effective level of the quota aiming the LRU +[de]prioritizations of the hot and cold memory resulting in this active to +[in]active memory ratio. Value zero means disabling this auto-tuning feature. + +Disabled by default. + +autotune_monitoring_intervals +----------------------------- + +If this parameter is set as ``Y``, DAMON_LRU_SORT automatically tunes DAMON's +sampling and aggregation intervals. 
The auto-tuning aims to capture a meaningful +amount of access events in each DAMON-snapshot, while keeping the sampling +interval 5 milliseconds in minimum, and 10 seconds in maximum. Setting this as +``N`` disables the auto-tuning. + +Disabled by default. + +filter_young_pages +------------------ + +Filter [non-]young pages accordingly for LRU [de]prioritizations. + +If this is set, check page level access (youngness) once again before each +LRU [de]prioritization operation. LRU prioritization operation is skipped +if the page has not been accessed since the last check (not young). LRU +deprioritization operation is skipped if the page has been accessed since the +last check (young). The feature is enabled or disabled if this parameter is +set as ``Y`` or ``N``, respectively. + +Disabled by default. + hot_thres_access_freq --------------------- @@ -184,6 +225,10 @@ But, setting this too high could result in increased monitoring overhead. Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by default. +Note that this must be 3 or higher. Please refer to the :ref:`Monitoring +<damon_design_monitoring>` section of the design document for the rationale +behind this lower bound. + max_nr_regions -------------- @@ -211,6 +256,28 @@ End of target memory region in physical address. The end physical address of memory region that DAMON_LRU_SORT will do work against. By default, biggest System RAM is used as the region. +addr_unit +--------- + +A scale factor for memory addresses and bytes. + +This parameter is for setting and getting the :ref:`address unit +<damon_design_addr_unit>` parameter of the DAMON instance for DAMON_LRU_SORT. + +``monitor_region_start`` and ``monitor_region_end`` should be provided in this +unit. For example, let's suppose ``addr_unit``, ``monitor_region_start`` and +``monitor_region_end`` are set as ``1024``, ``0`` and ``10``, respectively.
+Then DAMON_LRU_SORT will work for 10 KiB length of physical address range that +starts from address zero (``[0 * 1024, 10 * 1024)`` in bytes). + +Stat parameters having ``bytes_`` prefix are also in this unit. For example, +let's suppose values of ``addr_unit``, ``bytes_lru_sort_tried_hot_regions`` and +``bytes_lru_sorted_hot_regions`` are ``1024``, ``42``, and ``32``, +respectively. Then it means DAMON_LRU_SORT tried to LRU-sort 42 KiB of hot +memory and successfully LRU-sorted 32 KiB of the memory in total. + +If unsure, use only the default value (``1``) and forget about this. + kdamond_pid ----------- @@ -292,3 +359,8 @@ the LRU-list based page granularity reclamation. :: # echo 400 > wmarks_mid # echo 200 > wmarks_low # echo Y > enabled + +Note that this module (damon_lru_sort) cannot run simultaneously with other +DAMON-based special-purpose modules. Refer to :ref:`DAMON design special +purpose modules exclusivity <damon_design_special_purpose_modules_exclusivity>` +for more details. diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index af05ae617018..d7a0225b4950 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -71,6 +71,10 @@ of parametrs except ``enabled`` again. Once the re-reading is done, this parameter is set as ``N``. If invalid parameters are found while the re-reading, DAMON_RECLAIM will be disabled. +Once ``Y`` is written to this parameter, the user must not write to any +parameters until reading ``commit_inputs`` again returns ``N``. If users +violate this rule, the kernel may exhibit undefined behavior. + min_age ------- @@ -204,6 +208,10 @@ monitoring. This can be used to set lower-bound of the monitoring quality. But, setting this too high could result in increased monitoring overhead. Please refer to the DAMON documentation (:doc:`usage`) for more detail. +Note that this must be 3 or higher. 
Please refer to the :ref:`Monitoring +<damon_design_monitoring>` section of the design document for the rationale +behind this lower bound. + max_nr_regions -------------- @@ -232,6 +240,28 @@ The end physical address of memory region that DAMON_RECLAIM will do work against. That is, DAMON_RECLAIM will find cold memory regions in this region and reclaims. By default, biggest System RAM is used as the region. +addr_unit +--------- + +A scale factor for memory addresses and bytes. + +This parameter is for setting and getting the :ref:`address unit +<damon_design_addr_unit>` parameter of the DAMON instance for DAMON_RECLAIM. + +``monitor_region_start`` and ``monitor_region_end`` should be provided in this +unit. For example, let's suppose ``addr_unit``, ``monitor_region_start`` and +``monitor_region_end`` are set as ``1024``, ``0`` and ``10``, respectively. +Then DAMON_RECLAIM will work for 10 KiB length of physical address range that +starts from address zero (``[0 * 1024, 10 * 1024)`` in bytes). + +``bytes_reclaim_tried_regions`` and ``bytes_reclaimed_regions`` are also in +this unit. For example, let's suppose values of ``addr_unit``, +``bytes_reclaim_tried_regions`` and ``bytes_reclaimed_regions`` are ``1024``, +``42``, and ``32``, respectively. Then it means DAMON_RECLAIM tried to reclaim +42 KiB memory and successfully reclaimed 32 KiB memory in total. + +If unsure, use only the default value (``1``) and forget about this. + skip_anon --------- @@ -296,6 +326,11 @@ granularity reclamation. :: # echo 200 > wmarks_low # echo Y > enabled +Note that this module (damon_reclaim) cannot run simultaneously with other +DAMON-based special-purpose modules. Refer to :ref:`DAMON design special +purpose modules exclusivity <damon_design_special_purpose_modules_exclusivity>` +for more details. + .. [1] https://research.google/pubs/pub48551/ .. [2] https://lwn.net/Articles/787611/ .. 
[3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index ede14b679d02..ec8c34b2d32f 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -175,4 +175,4 @@ Below command makes every memory region of size >=4K that has not accessed for $ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \ --damos_age 60s max --damos_action pageout \ - <pid of your workload> + --target_pid <pid of your workload> diff --git a/Documentation/admin-guide/mm/damon/stat.rst b/Documentation/admin-guide/mm/damon/stat.rst new file mode 100644 index 000000000000..c4b14daeb2dd --- /dev/null +++ b/Documentation/admin-guide/mm/damon/stat.rst @@ -0,0 +1,91 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Data Access Monitoring Results Stat +=================================== + +Data Access Monitoring Results Stat (DAMON_STAT) is a static kernel module that +is aimed to be used for simple access pattern monitoring. It monitors accesses +on the system's entire physical memory using DAMON, and provides simplified +access monitoring results statistics, namely idle time percentiles and +estimated memory bandwidth. + +.. _damon_stat_monitoring_accuracy_overhead: + +Monitoring Accuracy and Overhead +================================ + +DAMON_STAT uses monitoring intervals :ref:`auto-tuning +<damon_design_monitoring_intervals_autotuning>` to make its accuracy high and +overhead minimum. It auto-tunes the intervals aiming 4 % of observable access +events to be captured in each snapshot, while limiting the resulting sampling +interval to be 5 milliseconds in minimum and 10 seconds in maximum. On a few +production server systems, it resulted in consuming only 0.x % single CPU time, +while capturing reasonable quality of access patterns. 
The tuning-resulting +intervals can be retrieved via ``aggr_interval_us`` :ref:`parameter +<damon_stat_aggr_interval_us>`. + +Interface: Module Parameters +============================ + +To use this feature, you should first ensure your system is running on a kernel +that is built with ``CONFIG_DAMON_STAT=y``. The feature can be enabled by +default at build time, by setting ``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` true. + +To let sysadmins enable or disable it at boot and/or runtime, and read the +monitoring results, DAMON_STAT provides module parameters. Following +sections are descriptions of the parameters. + +enabled +------- + +Enable or disable DAMON_STAT. + +You can enable DAMON_STAT by setting the value of this parameter as ``Y``. +Setting it as ``N`` disables DAMON_STAT. The default value is set by +``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` build config option. + +Note that this module (damon_stat) cannot run simultaneously with other +DAMON-based special-purpose modules. Refer to :ref:`DAMON design special +purpose modules exclusivity <damon_design_special_purpose_modules_exclusivity>` +for more details. + +.. _damon_stat_aggr_interval_us: + +aggr_interval_us +---------------- + +Auto-tuned aggregation time interval in microseconds. + +Users can read the aggregation interval of DAMON that is being used by the +DAMON instance for DAMON_STAT. It is :ref:`auto-tuned +<damon_stat_monitoring_accuracy_overhead>` and therefore the value is +dynamically changed. + +estimated_memory_bandwidth +-------------------------- + +Estimated memory bandwidth consumption (bytes per second) of the system. + +DAMON_STAT reads observed access events on the current DAMON results snapshot +and converts it to memory bandwidth consumption estimation in bytes per second. +The resulting metric is exposed to user via this read-only parameter. Because +DAMON uses sampling, this is only an estimation of the access intensity rather +than accurate memory bandwidth. 
+ +memory_idle_ms_percentiles +-------------------------- + +Per-byte idle time (milliseconds) percentiles of the system. + +DAMON_STAT calculates how long each byte of the memory was not accessed until +now (idle time), based on the current DAMON results snapshot. For regions +having access frequency (nr_accesses) larger than zero, how long the current +access frequency level was kept multiplied by ``-1`` becomes the idle time of +every byte of the region. If a region has zero access frequency (nr_accesses), +how long the region was keeping the zero access frequency (age) becomes the +idle time of every byte of the region. Then, DAMON_STAT exposes the +percentiles of the idle time values via this read-only parameter. Reading the +parameter returns 101 idle time values in milliseconds, separated by commas. +Each value represents 0-th, 1st, 2nd, 3rd, ..., 99th and 100th percentile idle +times. diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 47a44bd348ab..534e1199cf09 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -6,6 +6,11 @@ Detailed Usages DAMON provides below interfaces for different users. +- *Special-purpose DAMON modules.* + :ref:`This <damon_modules_special_purpose>` is for people who are building, + distributing, and/or administrating the kernel with special-purpose DAMON + usages. Using this, users can use DAMON's major features for the given + purposes in build, boot, or runtime in simple ways. - *DAMON user space tool.* `This <https://github.com/damonitor/damo>`_ is for privileged people such as system administrators who want a just-working human-friendly interface. @@ -59,14 +64,15 @@ comma (",").
:ref:`/sys/kernel/mm/damon <sysfs_root>`/admin │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds - │ │ :ref:`0 <sysfs_kdamond>`/state,pid + │ │ :ref:`0 <sysfs_kdamond>`/state,pid,refresh_ms │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts - │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations + │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations,addr_unit │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us + │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us │ │ │ │ │ │ nr_regions/min,max │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets - │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target + │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target,obsolete_target │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions │ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end │ │ │ │ │ │ │ │ ... @@ -77,14 +83,16 @@ comma (","). │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max │ │ │ │ │ │ │ │ age/min,max - │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes + │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes,goal_tuner │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid,path │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low - │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters - │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx - │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds + │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters + │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max + │ │ │ │ │ │ │ :ref:`dests 
<damon_sysfs_dests>`/nr_dests + │ │ │ │ │ │ │ │ 0/id,weight + │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds,nr_snapshots,max_nr_snapshots │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed │ │ │ │ │ │ │ │ ... @@ -120,8 +128,8 @@ kdamond. kdamonds/<N>/ ------------- -In each kdamond directory, two files (``state`` and ``pid``) and one directory -(``contexts``) exist. +In each kdamond directory, three files (``state``, ``pid`` and ``refresh_ms``) +and one directory (``contexts``) exist. Reading ``state`` returns ``on`` if the kdamond is currently running, or ``off`` if it is not running. @@ -131,7 +139,13 @@ Users can write below commands for the kdamond to the ``state`` file. - ``on``: Start running. - ``off``: Stop running. - ``commit``: Read the user inputs in the sysfs files except ``state`` file - again. + again. Monitoring :ref:`target region <sysfs_regions>` inputs are also + ignored if no target region is specified. +- ``update_tuned_intervals``: Update the contents of ``sample_us`` and + ``aggr_us`` files of the kdamond with the auto-tuning applied ``sampling + interval`` and ``aggregation interval`` for the files. Please refer to + :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>` + for more details. - ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes' :ref:`quota goals <sysfs_schemes_quota_goals>`. - ``update_schemes_stats``: Update the contents of stats files for each @@ -153,6 +167,13 @@ Users can write below commands for the kdamond to the ``state`` file. If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread. +Users can ask the kernel to periodically update files showing auto-tuned +parameters and DAMOS stats instead of manually writing +``update_tuned_intervals`` like keywords to ``state`` file.
For this, users +should write the desired update time interval in milliseconds to ``refresh_ms`` +file. If the interval is zero, the periodic update is disabled. Reading the +file shows currently set time interval. + ``contexts`` directory contains files for controlling the monitoring contexts that this kdamond will execute. @@ -173,9 +194,9 @@ details). At the moment, only one context per kdamond is supported, so only contexts/<N>/ ------------- -In each context directory, two files (``avail_operations`` and ``operations``) -and three directories (``monitoring_attrs``, ``targets``, and ``schemes``) -exist. +In each context directory, three files (``avail_operations``, ``operations`` +and ``addr_unit``) and three directories (``monitoring_attrs``, ``targets``, +and ``schemes``) exist. DAMON supports multiple types of :ref:`monitoring operations <damon_design_configurable_operations_set>`, including those for virtual address @@ -190,6 +211,9 @@ You can set and get what type of monitoring operations DAMON will use for the context by writing one of the keywords listed in ``avail_operations`` file and reading from the ``operations`` file. +``addr_unit`` file is for setting and getting the :ref:`address unit +<damon_design_addr_unit>` parameter of the operations set. + .. _sysfs_monitoring_attrs: contexts/<N>/monitoring_attrs/ @@ -213,6 +237,25 @@ writing to and rading from the files. For more details about the intervals and monitoring regions range, please refer to the Design document (:doc:`/mm/damon/design`). +.. _damon_usage_sysfs_monitoring_intervals_goal: + +contexts/<N>/monitoring_attrs/intervals/intervals_goal/ +------------------------------------------------------- + +Under the ``intervals`` directory, one directory for automated tuning of +``sample_us`` and ``aggr_us``, namely ``intervals_goal`` directory also exists. +Under the directory, four files for the auto-tuning control, namely +``access_bp``, ``aggrs``, ``min_sample_us`` and ``max_sample_us`` exist. 
+Please refer to the :ref:`design document of the feature +<damon_design_monitoring_intervals_autotuning>` for the internals of the tuning +mechanism. Reading and writing the four files under ``intervals_goal`` +directory show and update the tuning parameters that are described in the +:ref:`design doc <damon_design_monitoring_intervals_autotuning>` with the same +names. The tuning starts with the user-set ``sample_us`` and ``aggr_us``. The +tuning-applied current values of the two intervals can be read from the +``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to +the ``state`` file. + .. _sysfs_targets: contexts/<N>/targets/ @@ -227,13 +270,20 @@ to ``N-1``. Each directory represents each monitoring target. targets/<N>/ ------------ -In each target directory, one file (``pid_target``) and one directory -(``regions``) exist. +In each target directory, two files (``pid_target`` and ``obsolete_target``) +and one directory (``regions``) exist. If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should be a process. You can specify the process to DAMON by writing the pid of the process to the ``pid_target`` file. +Users can selectively remove targets in the middle of the targets array by +writing a non-zero value to ``obsolete_target`` file and committing it (writing +``commit`` to ``state`` file). DAMON will remove the matching targets from its +internal targets array. Users are responsible for constructing target directories +again, so that those correctly represent the changed internal targets array. + + .. _sysfs_regions: targets/<N>/regions @@ -252,6 +302,11 @@ In the beginning, this directory has only one file, ``nr_regions``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each initial monitoring target region.
+If ``nr_regions`` is zero when committing new DAMON parameters online (writing +``commit`` to ``state`` file of :ref:`kdamond <sysfs_kdamond>`), the commit +logic ignores the target regions. In other words, the current monitoring +results for the target are preserved. + .. _sysfs_region: regions/<N>/ @@ -282,9 +337,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme. schemes/<N>/ ------------ -In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files -(``action``, ``target_nid`` and ``apply_interval``) exist. +In each scheme directory, nine directories (``access_pattern``, ``quotas``, +``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``dests``, +``stats``, and ``tried_regions``) and three files (``action``, ``target_nid`` +and ``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action <damon_design_damos_action>`. The keywords that can be written to and read @@ -321,9 +377,9 @@ schemes/<N>/quotas/ The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given DAMON-based operation scheme. -Under ``quotas`` directory, four files (``ms``, ``bytes``, -``reset_interval_ms``, ``effective_bytes``) and two directores (``weights`` and -``goals``) exist. +Under ``quotas`` directory, five files (``ms``, ``bytes``, +``reset_interval_ms``, ``effective_bytes`` and ``goal_tuner``) and two +directories (``weights`` and ``goals``) exist. You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and ``reset interval`` in milliseconds by writing the values to the three files, @@ -334,6 +390,14 @@ apply the action to only up to ``bytes`` bytes of memory regions within the quota limits unless at least one :ref:`goal <sysfs_schemes_quota_goals>` is set.
+You can set the goal-based effective quota auto-tuning algorithm to use, by +writing the algorithm name to ``goal_tuner`` file. Reading the file returns +the currently selected tuner algorithm. Refer to the design documentation of +:ref:`automatic quota tuning goals <damon_design_damos_quotas_auto_tuning>` for +the background design of the feature and the names of the selectable algorithms. +Refer to :ref:`goals directory <sysfs_schemes_quota_goals>` for the goals +setup. + The time quota is internally transformed to a size quota. Between the transformed size quota and user-specified size quota, smaller one is applied. Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the @@ -364,11 +428,11 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains three files, namely ``target_metric``, -``target_value`` and ``current_value``. Users can set and get the three -parameters for the quota auto-tuning goals that specified on the :ref:`design -doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each -of the files. Note that users should further write +Each goal directory contains five files, namely ``target_metric``, +``target_value``, ``current_value``, ``nid`` and ``path``. Users can set and +get the five parameters for the quota auto-tuning goals that are specified in the +:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and +reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond directory <sysfs_kdamond>` to pass the feedback to DAMON. @@ -395,33 +459,43 @@ The ``interval`` should written in microseconds unit. ..
_sysfs_filters: -schemes/<N>/filters/ --------------------- +schemes/<N>/{core\_,ops\_,}filters/ +----------------------------------- -The directory for the :ref:`filters <damon_design_damos_filters>` of the given +Directories for :ref:`filters <damon_design_damos_filters>` of the given DAMON-based operation scheme. -In the beginning, this directory has only one file, ``nr_filters``. Writing a +``core_filters`` and ``ops_filters`` directories are for the filters handled by +the DAMON core layer and operations set layer, respectively. ``filters`` +directory can be used for installing filters regardless of their handled +layers. Filters requested by ``core_filters`` and ``ops_filters`` will be +installed before those of ``filters``. All three directories have the same files. + +Using the ``filters`` directory can make the evaluation order of the given +filters a bit confusing to predict. Users are hence +recommended to use ``core_filters`` and ``ops_filters`` directories. The +``filters`` directory could be deprecated in the future. + +In the beginning, the directory has only one file, ``nr_filters``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each filter. The filters are evaluated in the numeric order. -Each filter directory contains seven files, namely ``type``, ``matching``, -``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. -To ``type`` file, you can write one of five special keywords: ``anon`` for -anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young -pages, ``addr`` for specific address range (an open-ended interval), or -``target`` for specific DAMON monitoring target filtering. Meaning of the -types are same to the description on the :ref:`design doc -<damon_design_damos_filters>`. 
- -In case of the memory cgroup filtering, you can specify the memory cgroup of -the interest by writing the path of the memory cgroup from the cgroups mount -point to ``memcg_path`` file. In case of the address range filtering, you can -specify the start and end address of the range to ``addr_start`` and -``addr_end`` files, respectively. For the DAMON monitoring target filtering, -you can specify the index of the target between the list of the DAMON context's -monitoring targets list to ``target_idx`` file. +Each filter directory contains nine files, namely ``type``, ``matching``, +``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max`` +and ``target_idx``. To ``type`` file, you can write the type of the filter. +Refer to :ref:`the design doc <damon_design_damos_filters>` for available type +names, their meaning and on what layer those are handled. + +For ``memcg`` type, you can specify the memory cgroup of the interest by +writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. For ``addr`` type, you can specify the start and end +address of the range (open-ended interval) to ``addr_start`` and ``addr_end`` +files, respectively. For ``hugepage_size`` type, you can specify the minimum +and maximum size of the range (closed interval) to ``min`` and ``max`` files, +respectively. For ``target`` type, you can specify the index of the target +between the list of the DAMON context's monitoring targets list to +``target_idx`` file. You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter is for memory that matches the ``type``. You can write ``Y`` or ``N`` to @@ -431,6 +505,7 @@ the ``type`` and ``matching`` should be allowed or not. 
For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: + # cd ops_filters/0/ # echo 2 > nr_filters # # disallow anonymous pages echo anon > 0/type @@ -447,6 +522,29 @@ Refer to the :ref:`DAMOS filters design documentation of different ``allow`` works, when each of the filters are supported, and differences on stats. +.. _damon_sysfs_dests: + +schemes/<N>/dests/ +------------------ + +Directory for specifying the destinations of given DAMON-based operation +scheme's action. This directory is ignored if the action of the given scheme +does not support multiple destinations. Only ``DAMOS_MIGRATE_{HOT,COLD}`` +actions support multiple destinations. + +In the beginning, the directory has only one file, ``nr_dests``. Writing a +number (``N``) to the file creates the number of child directories named ``0`` +to ``N-1``. Each directory represents each action destination. + +Each destination directory contains two files, namely ``id`` and ``weight``. +Users can set and get the identifier of the destination via the ``id`` file. +For ``DAMOS_MIGRATE_{HOT,COLD}`` actions, the migrate destination node's node +id should be written to ``id`` file. Users can set and get the weight of +the destination among the given destinations via the ``weight`` file. The +weight can be an arbitrary integer. When DAMOS applies the action to each entity +of the memory region, it will select the destination of the action based on the +relative weights of the destinations. + .. _sysfs_schemes_stats: schemes/<N>/stats/ @@ -458,10 +556,14 @@ online analysis or tuning of the schemes. Refer to :ref:`design doc The statistics can be retrieved by reading the files under ``stats`` directory (``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, -``sz_ops_filter_passed``, and ``qt_exceeds``), respectively. 
The files are not -updated in real time, so you should ask DAMON sysfs interface to update the -content of the files for the stats by writing a special keyword, -``update_schemes_stats`` to the relevant ``kdamonds/<N>/state`` file. +``sz_ops_filter_passed``, ``qt_exceeds``, ``nr_snapshots`` and +``max_nr_snapshots``), respectively. + +The files are not updated in real time by default. Users should ask DAMON +sysfs interface to periodically update those using ``refresh_ms``, or do a one +time update by writing a special keyword, ``update_schemes_stats`` to the +relevant ``kdamonds/<N>/state`` file. Refer to :ref:`kdamond directory +<sysfs_kdamond>` for more details. .. _sysfs_schemes_tried_regions: diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index f34a0d798d5b..67a941903fd2 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -145,7 +145,17 @@ hugepages It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1. If the node number is invalid, the parameter will be ignored. +hugepage_alloc_threads + Specify the number of threads that should be used to allocate hugepages + during boot. This parameter can be used to improve system bootup time + when allocating a large amount of huge pages. + The default value is 25% of the available hardware threads. + Example to use 8 allocation threads:: + + hugepage_alloc_threads=8 + + Note that this parameter only applies to non-gigantic huge pages. default_hugepagesz Specify the default huge page size. This parameter can only be specified once on the command line. default_hugepagesz can diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..bbb563cba5d2 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -37,8 +37,9 @@ the Linux memory management. 
numaperf pagemap shrinker_debugfs + slab soft-dirty - swap_numa transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 000000000000..2c26e560bd78 --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,123 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts <kho-concepts>`. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +To perform a KHO kexec, load the target payload and kexec into it. It +is important that you use the ``-s`` parameter to use the in-kernel +kexec file loader, as user space kexec tooling currently has no +support for KHO with the user space based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. 
+ +Kexec Metadata +============== + +KHO automatically tracks metadata about the kexec chain, passing information +about the previous kernel to the next kernel. This feature helps diagnose +bugs that only reproduce when kexecing from specific kernel versions. + +On each KHO kexec, the kernel logs the previous kernel's version and the +number of kexec reboots since the last cold boot:: + + [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1) + +The metadata includes: + +``previous_release`` + The kernel version string (from ``uname -r``) of the kernel that + initiated the kexec. + +``kexec_count`` + The number of kexec boots since the last cold boot. On cold boot, + this counter starts at 0 and increments with each kexec. This helps + identify issues that only manifest after multiple consecutive kexec + reboots. + +Use Cases +--------- + +This metadata is particularly useful for debugging kexec transition bugs, +where a buggy kernel kexecs into a new kernel and the bug manifests only +in the second kernel. Examples of such bugs include: + +- Memory corruption from the previous kernel affecting the new kernel +- Incorrect hardware state left by the previous kernel +- Firmware/ACPI state issues that only appear in kexec scenarios + +At scale, correlating crashes to the previous kernel version enables +faster root cause analysis when issues only occur in specific kernel +transition scenarios. + +debugfs Interfaces +================== + +These debugfs interfaces are available when the kernel is compiled with +``CONFIG_KEXEC_HANDOVER_DEBUGFS`` enabled. + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/fdt`` + The kernel exposes the flattened device tree blob that carries its + current KHO state in this file. Kexec user space tooling can use this + as input file for the KHO payload image. 
+ +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where they should place their payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction with ``scratch_len`` to determine where + they should place their payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + KHO producers can register their own FDT or another binary blob under + this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it have finished interpreting their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 33c886f3d198..0207f8725142 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -603,17 +603,18 @@ ZONE_MOVABLE, especially when fine-tuning zone ratios: memory for metadata and page tables in the direct map; having a lot of offline memory blocks is not a typical case, though. -- Memory ballooning without balloon compaction is incompatible with - ZONE_MOVABLE. Only some implementations, such as virtio-balloon and - pseries CMM, fully support balloon compaction. +- Memory ballooning without support for balloon memory migration is incompatible + with ZONE_MOVABLE. 
Only some implementations, such as virtio-balloon and + pseries CMM, fully support balloon memory migration. 
- Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be + Further, the CONFIG_BALLOON_MIGRATION kernel configuration option might be disabled. In that case, balloon inflation will only perform unmovable allocations and silently create a zone imbalance, usually triggered by inflation requests from the hypervisor. -- Gigantic pages are unmovable, resulting in user space consuming a - lot of unmovable memory. +- Gigantic pages are unmovable when an architecture does not support + huge page migration and/or the ``movable_gigantic_pages`` sysctl is false. + See Documentation/admin-guide/sysctl/vm.rst for more info on this sysctl. - Huge pages are unmovable when an architectures does not support huge page migration, resulting in a similar issue as with gigantic pages. @@ -672,6 +673,15 @@ block might fail: - Concurrent activity that operates on the same physical memory area, such as allocating gigantic pages, can result in temporary offlining failures. +- When an admin sets the ``movable_gigantic_pages`` sysctl to true, gigantic + pages are allowed in ZONE_MOVABLE. This only allows migratable gigantic + pages to be allocated; however, if there are no eligible destination gigantic + pages at offline, the offlining operation will fail. + + Users leveraging ``movable_gigantic_pages`` should weigh the value of + ZONE_MOVABLE for increasing the reliability of gigantic page allocation + against the potential loss of hot-unplug reliability. + - Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap Optimization (HVO) is enabled. diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..9cb54b4ff5d9 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``. 
``min_gen_nr`` should be less than ``max_gen_nr-1``, since ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to the active list) and therefore cannot be evicted. ``swappiness`` -overrides the default value in ``/proc/sys/vm/swappiness``. -``nr_to_reclaim`` limits the number of pages to evict. +overrides the default value in ``/proc/sys/vm/swappiness`` and the valid +range is [0-200, max], with max being exclusively used for the reclamation +of anonymous memory. ``nr_to_reclaim`` limits the number of pages to evict. A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough diff --git a/Documentation/admin-guide/mm/nommu-mmap.rst b/Documentation/admin-guide/mm/nommu-mmap.rst index 530fed08de2c..8a1949b3690f 100644 --- a/Documentation/admin-guide/mm/nommu-mmap.rst +++ b/Documentation/admin-guide/mm/nommu-mmap.rst @@ -38,7 +38,7 @@ and it's also much more restricted in the latter case: In the no-MMU case: - - If one exists, the kernel will re-use an existing mapping to the + - If one exists, the kernel will reuse an existing mapping to the same segment of the same file if that has compatible permissions, even if this was created by another process. diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index a70f20ce1ffb..90ab26e805a9 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -217,7 +217,7 @@ MPOL_PREFERRED the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described below. -MPOL_INTERLEAVED +MPOL_INTERLEAVE This mode specifies that page allocations be interleaved, on a page granularity, across the nodes specified in the policy. 
This mode also behaves slightly differently, based on the diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index caba0f52dd36..c57e61b5d8aa 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -21,7 +21,8 @@ There are four components to pagemap: * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see Documentation/admin-guide/mm/userfaultfd.rst) - * Bits 58-60 zero + * Bit 58 pte is a guard region (since 6.15) (see madvise(2) man page) + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped * Bit 63 page present @@ -37,12 +38,28 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. + Traditionally, bit 56 indicates that a page is mapped exactly once and bit + 56 is clear when a page is mapped multiple times, even when mapped in the + same process multiple times. In some kernel configurations, the semantics + for pages that are part of a larger allocation (e.g., THP) can differ: bit 56 is set + if all pages that are part of the corresponding large allocation are *certainly* + mapped in the same process, even if the page is mapped multiple times in that + process. Bit 56 is clear when any page of the larger allocation + is *maybe* mapped in a different process. In some cases, a large allocation + might be treated as "maybe mapped by multiple processes" even though this + is no longer the case. + Efficient users of this interface will use ``/proc/pid/maps`` to determine which areas of memory are actually mapped and llseek to skip over unmapped regions. * ``/proc/kpagecount``. This file contains a 64-bit count of the number of - times each page is mapped, indexed by PFN. 
Some kernel configurations do + not track the precise number of times a page part of a larger allocation + (e.g., THP) is mapped. In these configurations, the average number of + mappings per page in this larger allocation is returned instead. However, + if any page of the large allocation is mapped, the returned value will + be at least 1. The page-types tool in the tools/mm directory can be used to query the number of times a page is mapped. @@ -98,7 +115,8 @@ Short descriptions to the page flags A free memory block managed by the buddy system allocator. The buddy system organizes free memory in blocks of various orders. An order N block has 2^N physically contiguous pages, with the BUDDY flag - set for and _only_ for the first page. + set for all pages. + Before 4.6 only the first page of the block had the flag set. 15 - COMPOUND_HEAD A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its @@ -233,6 +251,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/slab.rst b/Documentation/admin-guide/mm/slab.rst new file mode 100644 index 000000000000..14429ab90611 --- /dev/null +++ b/Documentation/admin-guide/mm/slab.rst @@ -0,0 +1,469 @@ +======================================== +Short users guide for the slab allocator +======================================== + +The slab allocator includes full debugging support (when built with +CONFIG_SLUB_DEBUG=y) but it is off by default (unless built with +CONFIG_SLUB_DEBUG_ON=y). 
You can enable debugging only for selected +slabs in order to avoid an impact on overall system performance which +may make a bug more difficult to find. + +In order to switch debugging on one can add an option ``slab_debug`` +to the kernel command line. That will enable full debugging for +all slabs. + +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operations on the slabs. By default ``slabinfo`` only lists +slabs that have data in them. See "slabinfo -h" for more options when +running the command. ``slabinfo`` can be compiled with +:: + + gcc -o slabinfo tools/mm/slabinfo.c + +Some of the modes of operation of ``slabinfo`` require that slub debugging +be enabled on the command line. For example, no tracking information will be +available without debugging on and validation can only partially +be performed if debugging was not switched on. + +Some more sophisticated uses of slab_debug: +------------------------------------------- + +Parameters may be given to ``slab_debug``. If none is specified then full +debugging is enabled. Format: + +slab_debug=<Debug-Options> + Enable options for all slabs + +slab_debug=<Debug-Options>,<slab name1>,<slab name2>,... + Enable options only for select slabs (no spaces + after a comma) + +Multiple blocks of options for all slabs or selected slabs can be given, with +blocks of options delimited by ';'. The last of the "all slabs" blocks is applied +to all slabs except those that match one of the "select slabs" blocks. Options +of the first "select slabs" block that matches the slab's name are applied. 
+ +Possible debug options are:: + + F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS + Sorry SLAB legacy issues) + Z Red zoning + P Poisoning (object and padding) + U User tracking (free and alloc) + T Trace (please only use on single slabs) + A Enable failslab filter mark for the cache + O Switch debugging off for caches that would have + caused higher minimum slab orders + - Switch all debugging off (useful if the kernel is + configured with CONFIG_SLUB_DEBUG_ON) + +F.e. in order to boot just with sanity checks and red zoning one would specify:: + + slab_debug=FZ + +Trying to find an issue in the dentry cache? Try:: + + slab_debug=,dentry + +to only enable debugging on the dentry cache. You may use an asterisk at the +end of the slab name, in order to cover all slabs with the same prefix. For +example, here's how you can poison the dentry cache as well as all kmalloc +slabs:: + + slab_debug=P,kmalloc-*,dentry + +Red zoning and tracking may realign the slab. We can just apply sanity checks +to the dentry cache with:: + + slab_debug=F,dentry + +Debugging options may require the minimum possible slab order to increase as +a result of storing the metadata (for example, caches with PAGE_SIZE object +sizes). This has a higher likelihood of resulting in slab allocation errors +in low memory situations or if there's high fragmentation of memory. To +switch off debugging for such caches by default, use:: + + slab_debug=O + +You can apply different options to different list of slab names, using blocks +of options. This will enable red zoning for dentry and user tracking for +kmalloc. All other slabs will not get any debugging enabled:: + + slab_debug=Z,dentry;U,kmalloc-* + +You can also enable options (e.g. 
sanity checks and poisoning) for all caches +except some that are deemed too performance critical and don't need to be +debugged by specifying global debug options followed by a list of slab names +with "-" as options:: + + slab_debug=FZ;-,zs_handle,zspage + +The state of each debug option for a slab can be found in the respective files +under:: + + /sys/kernel/slab/<slab name>/ + +If the file contains 1, the option is enabled, 0 means disabled. The debug +options from the ``slab_debug`` parameter translate to the following files:: + + F sanity_checks + Z red_zone + P poison + U store_user + T trace + A failslab + +failslab file is writable, so writing 1 or 0 will enable or disable +the option at runtime. Write returns -EINVAL if cache is an alias. +Careful with tracing: It may spew out lots of information and never stop if +used on the wrong slab. + +Slab merging +============ + +If no debug options are specified then SLUB may merge similar slabs together +in order to reduce overhead and increase cache hotness of objects. +``slabinfo -a`` displays which slabs were merged together. + +Slab validation +=============== + +SLUB can validate all object if the kernel was booted with slab_debug. In +order to do so you must have the ``slabinfo`` tool. Then you can do +:: + + slabinfo -v + +which will test all objects. Output will be generated to the syslog. + +This also works in a more limited way if boot was without slab debug. +In that case ``slabinfo -v`` simply tests all reachable objects. Usually +these are in the cpu slabs and the partial slabs. Full slabs are not +tracked by SLUB in a non debug situation. + +Getting more performance +======================== + +To some degree SLUB's performance is limited by the need to take the +list_lock once in a while to deal with partial slabs. That overhead is +governed by the order of the allocation for each slab. The allocations +can be influenced by kernel parameters: + +.. 
slab_min_objects=x (default: automatically scaled by number of cpus) +.. slab_min_order=x (default 0) +.. slab_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) + +``slab_min_objects`` + specifies how many objects must at least fit into one + slab in order for the allocation order to be acceptable. In + general slub will be able to perform this number of + allocations on a slab without consulting centralized resources + (list_lock) where contention may occur. + +``slab_min_order`` + specifies a minimum order of slabs. This has a similar effect to + ``slab_min_objects``. + +``slab_max_order`` + specifies the order at which ``slab_min_objects`` should no + longer be checked. This is useful to avoid SLUB trying to + generate super large order pages to fit ``slab_min_objects`` + of a slab cache with large object sizes into one high order + page. Setting command line parameter + ``debug_guardpage_minorder=N`` (N > 0), forces setting + ``slab_max_order`` to 0, which causes the minimum possible order of + slab allocation. + +``slab_strict_numa`` + Enables the application of memory policies on each + allocation. This results in more accurate placement of + objects which may result in the reduction of accesses + to remote nodes. The default is to only apply memory + policies at the folio level when a new folio is acquired + or a folio is retrieved from the lists. Enabling this + option reduces the fastpath performance of the slab allocator. + +SLUB Debug output +================= + +Here is a sample of slub debug output:: + + ==================================================================== + BUG kmalloc-8: Right Redzone overwritten + -------------------------------------------------------------------- + + INFO: 0xc90f6d28-0xc90f6d2b. 
First byte 0x00 instead of 0xcc + INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 + INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 + INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 + + Bytes b4 (0xc90f6d10): 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ + Object (0xc90f6d20): 31 30 31 39 2e 30 30 35 1019.005 + Redzone (0xc90f6d28): 00 cc cc cc . + Padding (0xc90f6d50): 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + + [<c010523d>] dump_trace+0x63/0x1eb + [<c01053df>] show_trace_log_lvl+0x1a/0x2f + [<c010601d>] show_trace+0x12/0x14 + [<c0106035>] dump_stack+0x16/0x18 + [<c017e0fa>] object_err+0x143/0x14b + [<c017e2cc>] check_object+0x66/0x234 + [<c017eb43>] __slab_free+0x239/0x384 + [<c017f446>] kfree+0xa6/0xc6 + [<c02e2335>] get_modalias+0xb9/0xf5 + [<c02e23b7>] dmi_dev_uevent+0x27/0x3c + [<c027866a>] dev_uevent+0x1ad/0x1da + [<c0205024>] kobject_uevent_env+0x20a/0x45b + [<c020527f>] kobject_uevent+0xa/0xf + [<c02779f1>] store_uevent+0x4f/0x58 + [<c027758e>] dev_attr_store+0x29/0x2f + [<c01bec4f>] sysfs_write_file+0x16e/0x19c + [<c0183ba7>] vfs_write+0xd1/0x15a + [<c01841d7>] sys_write+0x3d/0x72 + [<c0104112>] sysenter_past_esp+0x5f/0x99 + [<b7f7b410>] 0xb7f7b410 + ======================= + + FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc + +If SLUB encounters a corrupted object (full detection requires the kernel +to be booted with slab_debug) then the following output will be dumped +into the syslog: + +1. 
Description of the problem encountered + + This will be a message in the system log starting with:: + + =============================================== + BUG <slab cache affected>: <What went wrong> + ----------------------------------------------- + + INFO: <corruption start>-<corruption_end> <more info> + INFO: Slab <address> <slab information> + INFO: Object <address> <object information> + INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by + cpu> pid=<pid of the process> + INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> + pid=<pid of the process> + + (Object allocation / free information is only available if SLAB_STORE_USER is + set for the slab. slab_debug sets that option) + +2. The object contents if an object was involved. + + Various types of lines can follow the BUG SLUB line: + + Bytes b4 <address> : <bytes> + Shows a few bytes before the object where the problem was detected. + Can be useful if the corruption does not stop with the start of the + object. + + Object <address> : <bytes> + The bytes of the object. If the object is inactive then the bytes + typically contain poison values. Any non-poison value shows a + corruption by a write after free. + + Redzone <address> : <bytes> + The Redzone following the object. The Redzone is used to detect + writes after the object. All bytes should always have the same + value. If there is any deviation then it is due to a write after + the object boundary. + + (Redzone information is only available if SLAB_RED_ZONE is set. + slab_debug sets that option) + + Padding <address> : <bytes> + Unused data to fill up the space in order to get the next object + properly aligned. In the debug case we make sure that there are + at least 4 bytes of padding. This allows the detection of writes + before the object. + +3. A stackdump + + The stackdump describes the location where the error was detected. 
The cause + of the corruption is more likely to be found by looking at the function that + allocated or freed the object. + +4. Report on how the problem was dealt with in order to ensure the continued + operation of the system. + + These are messages in the system log beginning with:: + + FIX <slab cache affected>: <corrective action taken> + + In the above sample SLUB found that the Redzone of an active object has + been overwritten. Here a string of 8 characters was written into a slab that + has the length of 8 characters. However, an 8 character string needs a + terminating 0. That zero has overwritten the first byte of the Redzone field. + After reporting the details of the issue encountered the FIX SLUB message + tells us that SLUB has restored the Redzone to its proper value and then + system operations continue. + +Emergency operations +==================== + +Minimal debugging (sanity checks alone) can be enabled by booting with:: + + slab_debug=F + +This will generally be enough to enable the resiliency features of slub +which will keep the system running even if a bad kernel component +keeps corrupting objects. This may be important for production systems. +Performance will be impacted by the sanity checks and there will be a +continual stream of error messages to the syslog but no additional memory +will be used (unlike full debugging). + +No guarantees. The kernel component still needs to be fixed. 
Performance +may be optimized further by locating the slab that experiences corruption +and enabling debugging only for that cache. + +I.e.:: + + slab_debug=F,dentry + +If the corruption occurs by writing after the end of the object then it +may be advisable to enable a Redzone to avoid corrupting the beginning +of other objects:: + + slab_debug=FZ,dentry + +Extended slabinfo mode and plotting +=================================== + +The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes: + - Slabcache Totals + - Slabs sorted by size (up to -N <num> slabs, default 1) + - Slabs sorted by loss (up to -N <num> slabs, default 1) + +Additionally, in this mode ``slabinfo`` does not dynamically scale +sizes (G/M/K) and reports everything in bytes (this functionality is +also available to other slabinfo modes via '-B' option) which makes +reporting more precise and accurate. Moreover, in some sense the '-X' +mode also simplifies the analysis of slabs' behaviour, because its +output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it +pushes the analysis from looking through the numbers (tons of numbers) +to something easier -- visual analysis. + +To generate plots: + +a) collect slabinfo extended records, for example:: + + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done + +b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script:: + + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + + The ``slabinfo-gnuplot.sh`` script pre-processes the collected records + and generates 3 png files (and 3 pre-processing cache files) per STATS + file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png + +Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you +need to compare slabs' behaviour "prior to" and "after" some code +modification.
To help you out there, ``slabinfo-gnuplot.sh`` script +can 'merge' the `Slabcache Totals` sections from different +measurements. To visually compare N plots: + +a) Collect as many STATS1, STATS2, .. STATSN files as you need:: + + while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done + +b) Pre-process those STATS files:: + + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the + generated pre-processed \*-totals:: + + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + + This will produce a single plot (png file). + + Plots, expectedly, can be large so some fluctuations or small spikes + can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two + options to 'zoom-in'/'zoom-out': + + a) ``-s %d,%d`` -- overwrites the default image width and height + b) ``-r %d,%d`` -- specifies a range of samples to use (for example, + in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r + 40,60`` range will plot only samples collected between 40th and + 60th seconds). + + +DebugFS files for SLUB +====================== + +For more information about current state of SLUB caches with the user tracking +debug option enabled, debugfs files are available, typically under +/sys/kernel/debug/slab/<cache>/ (created only for caches with enabled user +tracking). There are 2 types of these files with the following debug +information: + +1. alloc_traces:: + + Prints information about unique allocation traces of the currently + allocated objects. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, allocating function, possible memory wastage of + kmalloc objects(total/per-object), minimal/average/maximal jiffies + since alloc, pid range of the allocating processes, cpu mask of + allocating cpus, numa node mask of origins of memory, and stack trace. 
+ + Example::: + + 338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1 + __kmem_cache_alloc_node+0x11f/0x4e0 + kmalloc_trace+0x26/0xa0 + pci_alloc_dev+0x2c/0xa0 + pci_scan_single_device+0xd2/0x150 + pci_scan_slot+0xf7/0x2d0 + pci_scan_child_bus_extend+0x4e/0x360 + acpi_pci_root_create+0x32e/0x3b0 + pci_acpi_scan_root+0x2b9/0x2d0 + acpi_pci_root_add.cold.11+0x110/0xb0a + acpi_bus_attach+0x262/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + +2. free_traces:: + + Prints information about unique freeing traces of the currently allocated + objects. The freeing traces thus come from the previous life-cycle of the + objects and are reported as not available for objects allocated for the first + time. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, freeing function, minimal/average/maximal jiffies since free, + pid range of the freeing processes, cpu mask of freeing cpus, and stack trace. 
+ + Example::: + + 1980 <not-available> age=4294912290 pid=0 cpus=0 + 51 acpi_ut_update_ref_count+0x6a6/0x782 age=236886/237027/237772 pid=1 cpus=1 + kfree+0x2db/0x420 + acpi_ut_update_ref_count+0x6a6/0x782 + acpi_ut_update_object_reference+0x1ad/0x234 + acpi_ut_remove_reference+0x7d/0x84 + acpi_rs_get_prt_method_data+0x97/0xd6 + acpi_get_irq_routing_table+0x82/0xc4 + acpi_pci_irq_find_prt_entry+0x8e/0x2e0 + acpi_pci_irq_lookup+0x3a/0x1e0 + acpi_pci_irq_enable+0x77/0x240 + pcibios_enable_device+0x39/0x40 + do_pci_enable_device.part.0+0x5d/0xe0 + pci_enable_device_flags+0xfc/0x120 + pci_enable_device+0x13/0x20 + virtio_pci_probe+0x9e/0x170 + local_pci_probe+0x48/0x80 + pci_device_probe+0x105/0x1c0 + +Christoph Lameter, May 30, 2007 +Sergey Senozhatsky, October 23, 2015 diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst deleted file mode 100644 index 2e630627bcee..000000000000 --- a/Documentation/admin-guide/mm/swap_numa.rst +++ /dev/null @@ -1,78 +0,0 @@ -=========================================== -Automatically bind swap device to numa node -=========================================== - -If the system has more than one swap device and swap device has the node -information, we can make use of this information to decide which swap -device to use in get_swap_pages() to get better performance. - - -How to use this feature -======================= - -Swap device has priority and that decides the order of it to be used. To make -use of automatically binding, there is no need to manipulate priority settings -for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and -swapB, with swapA attached to node 0 and swapB attached to node 1, are going -to be swapped on. Simply swapping them on by doing:: - - # swapon /dev/swapA - # swapon /dev/swapB - -Then node 0 will use the two swap devices in the order of swapA then swapB and -node 1 will use the two swap devices in the order of swapB then swapA. 
Note -that the order of them being swapped on doesn't matter. - -A more complex example on a 4 node machine. Assume 6 swap devices are going to -be swapped on: swapA and swapB are attached to node 0, swapC is attached to -node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. -The way to swap them on is the same as above:: - - # swapon /dev/swapA - # swapon /dev/swapB - # swapon /dev/swapC - # swapon /dev/swapD - # swapon /dev/swapE - # swapon /dev/swapF - -Then node 0 will use them in the order of:: - - swapA/swapB -> swapC -> swapD -> swapE -> swapF - -swapA and swapB will be used in a round robin mode before any other swap device. - -node 1 will use them in the order of:: - - swapC -> swapA -> swapB -> swapD -> swapE -> swapF - -node 2 will use them in the order of:: - - swapD/swapE -> swapA -> swapB -> swapC -> swapF - -Similaly, swapD and swapE will be used in a round robin mode before any -other swap devices. - -node 3 will use them in the order of:: - - swapF -> swapA -> swapB -> swapC -> swapD -> swapE - - -Implementation details -====================== - -The current code uses a priority based list, swap_avail_list, to decide -which swap device to use and if multiple swap devices share the same -priority, they are used round robin. This change here replaces the single -global swap_avail_list with a per-numa-node list, i.e. for each numa node, -it sees its own priority based list of available swap devices. Swap -device's priority can be promoted on its matching node's swap_avail_list. - -The current swap device's priority is set as: user can set a >=0 value, -or the system will pick one starting from -1 then downwards. The priority -value in the swap_avail_list is the negated value of the swap device's -due to plist being sorted from low to high. 
The new policy doesn't change -the semantics for priority >=0 cases, the previous starting from -1 then -downwards now becomes starting from -2 then downwards and -1 is reserved -as the promoted value. So if multiple swap devices are attached to the same -node, they will all be promoted to priority -1 on that node's plist and will -be used round robin before any other swap devices. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index dff8d5985f0f..5fbc3d89bb07 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -107,7 +107,7 @@ sysfs Global THP controls ------------------- -Transparent Hugepage Support for anonymous memory can be entirely disabled +Transparent Hugepage Support for anonymous memory can be disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled system wide. This can be achieved per-supported-THP-size with one of:: @@ -119,6 +119,11 @@ system wide. This can be achieved per-supported-THP-size with one of:: where <size> is the hugepage size being addressed, the available sizes for which vary by system. +.. note:: Setting "never" in all sysfs THP controls does **not** disable + Transparent Huge Pages globally. This is because ``madvise(..., + MADV_COLLAPSE)`` ignores these settings and collapses ranges to + PMD-sized huge pages unconditionally. + For example:: echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled @@ -187,7 +192,9 @@ madvise behaviour. never - should be self-explanatory. + should be self-explanatory. Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be + obtained even if this mode is specified everywhere. By default kernel tries to use huge, PMD-mappable zero page on read page fault to anonymous mapping. 
It's possible to disable huge zero @@ -218,6 +225,42 @@ to "always" or "madvise"), and it'll be automatically shutdown when PMD-sized THP is disabled (when both the per-size anon control and the top-level control are "never") +process THP controls +-------------------- + +A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE`` +and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. The THP behaviour set using +``PR_SET_THP_DISABLE`` is inherited across fork(2) and execve(2). These calls +support the following arguments:: + + prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0): + This will disable THPs completely for the process, irrespective + of global THP controls or madvise(..., MADV_COLLAPSE) being used. + + prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0): + This will disable THPs for the process except when the usage of THPs is + advised. Consequently, THPs will only be used when: + - Global THP controls are set to "always" or "madvise" and + madvise(..., MADV_HUGEPAGE) or madvise(..., MADV_COLLAPSE) is used. + - Global THP controls are set to "never" and madvise(..., MADV_COLLAPSE) + is used. This is the same behavior as if THPs would not be disabled on + a process level. + Note that MADV_COLLAPSE is currently always rejected if + madvise(..., MADV_NOHUGEPAGE) is set on an area. + + prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0): + This will re-enable THPs for the process, as if they were never disabled. + Whether THPs will actually be used depends on global THP controls and + madvise() calls. + + prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0): + This returns a value whose bits indicate how THP-disable is configured: + Bits + 1 0 Value Description + |0|0| 0 No THP-disable behaviour specified. + |0|1| 1 THP is entirely disabled for this process. + |1|1| 3 THP-except-advised mode is set for this process. 
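The prctl(2) calls listed above can be exercised from C roughly as follows. This is a minimal sketch, not code taken from the kernel tree: the helper names (``thp_set_mode``, ``thp_describe``, ``thp_report``) and the ``enum thp_mode`` type are hypothetical, and the fallback value for ``PR_THP_DISABLE_EXCEPT_ADVISED`` is an assumption inferred from the return-value table (value 3 means bits 1 and 0 are set), for use only where the installed headers do not define it.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif
/* Assumption: the flag is bit 1, as implied by the documented value 3. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/* Hypothetical names for the three documented settings. */
enum thp_mode {
	THP_MODE_DEFAULT,        /* prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0) */
	THP_MODE_DISABLED,       /* prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) */
	THP_MODE_EXCEPT_ADVISED  /* prctl(PR_SET_THP_DISABLE, 1, flag, 0, 0) */
};

/* Apply one setting to the calling process; 0 on success, -1 with errno set. */
int thp_set_mode(enum thp_mode mode)
{
	switch (mode) {
	case THP_MODE_DEFAULT:
		return prctl(PR_SET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
	case THP_MODE_DISABLED:
		return prctl(PR_SET_THP_DISABLE, 1UL, 0UL, 0UL, 0UL);
	case THP_MODE_EXCEPT_ADVISED:
		return prctl(PR_SET_THP_DISABLE, 1UL,
			     (unsigned long)PR_THP_DISABLE_EXCEPT_ADVISED,
			     0UL, 0UL);
	}
	errno = EINVAL;
	return -1;
}

/* Map a PR_GET_THP_DISABLE return value to the table in the text above. */
const char *thp_describe(int value)
{
	switch (value) {
	case 0:
		return "no THP-disable behaviour specified";
	case 1:
		return "THP entirely disabled";
	case 3:
		return "THP-except-advised mode";
	default:
		return "unknown value";
	}
}

/* Query and print the current setting; returns the raw value or -1. */
int thp_report(void)
{
	int value = prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);

	if (value < 0) {
		perror("prctl(PR_GET_THP_DISABLE)");
		return -1;
	}
	printf("THP-disable state %d: %s\n", value, thp_describe(value));
	return value;
}
```

A caller would typically invoke ``thp_set_mode()`` early, before fork(2) or execve(2), since the setting is inherited, and then use ``thp_report()`` to confirm it. On kernels that predate the except-advised flag, the third mode is likely to fail with ``EINVAL``, so the return value should be checked.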
+ Khugepaged controls ------------------- @@ -338,6 +381,11 @@ hugepage allocation policy for the tmpfs mount by using the kernel parameter four valid policies for tmpfs (``always``, ``within_size``, ``advise``, ``never``). The tmpfs mount default policy is ``never``. +Additionally, Kconfig options are available to set the default hugepage +policies for shmem (``CONFIG_TRANSPARENT_HUGEPAGE_SHMEM_HUGE_*``) and tmpfs +(``CONFIG_TRANSPARENT_HUGEPAGE_TMPFS_HUGE_*``) at build time. Refer to the +Kconfig help for more details. + In the same manner as ``thp_anon`` controls each supported anonymous THP size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem`` has the same format as ``thp_anon``, but also supports the policy @@ -376,12 +424,18 @@ option: ``huge=``. It can have following values: always Attempt to allocate huge pages every time we need a new page; + Always try PMD-sized huge pages first, and fall back to smaller-sized + huge pages if the PMD-sized huge page allocation fails; never - Do not allocate huge pages; + Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)`` + can still cause transparent huge pages to be obtained even if this mode + is specified everywhere; within_size - Only allocate huge page if it will be fully within i_size. + Only allocate huge page if it will be fully within i_size; + Always try PMD-sized huge pages first, and fall back to smaller-sized + huge pages if the PMD-sized huge page allocation fails; Also respect madvise() hints; advise @@ -434,7 +488,9 @@ inherit have enabled="inherit" and all other hugepage sizes have enabled="never"; never - Do not allocate <size> huge pages; + Do not allocate <size> huge pages. Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained + even if this mode is specified everywhere; within_size Only allocate <size> huge page if it will be fully within i_size. 
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 3598dcd7dbe7..2464425c783d 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -53,28 +53,17 @@ Zswap receives pages for compression from the swap subsystem and is able to evict pages from its own compressed pool on an LRU basis and write them back to the backing swap device in the case that the compressed pool is full. -Zswap makes use of zpool for the managing the compressed memory pool. Each -allocation in zpool is not directly accessible by address. Rather, a handle is +Zswap makes use of zsmalloc for the managing the compressed memory pool. Each +allocation in zsmalloc is not directly accessible by address. Rather, a handle is returned by the allocation routine and that handle must be mapped before being accessed. The compressed memory pool grows on demand and shrinks as compressed -pages are freed. The pool is not preallocated. By default, a zpool -of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, -but it can be overridden at boot time by setting the ``zpool`` attribute, -e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs -``zpool`` attribute, e.g.:: +pages are freed. The pool is not preallocated. - echo zbud > /sys/module/zswap/parameters/zpool - -The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which -means the compression ratio will always be 2:1 or worse (because of half-full -zbud pages). The zsmalloc type zpool has a more complex compressed page -storage method, and it can achieve greater storage densities. 
- -When a swap page is passed from swapout to zswap, zswap maintains a mapping -of the swap entry, a combination of the swap type and swap offset, to the zpool +When a swap page is passed from swapout to zswap, zswap maintains a mapping of +the swap entry, a combination of the swap type and swap offset, to the zsmalloc handle that references that compressed swap page. This mapping is achieved -with a red-black tree per swap type. The swap offset is the search key for the -tree nodes. +with an xarray per swap type. The swap offset is the search key for the xarray +nodes. During a page fault on a PTE that is a swap entry, the swapin code calls the zswap load function to decompress the page into the page allocated by the page @@ -98,11 +87,11 @@ attribute, e.g.:: echo lzo > /sys/module/zswap/parameters/compressor -When the zpool and/or compressor parameter is changed at runtime, any existing -compressed pages are not modified; they are left in their own zpool. When a -request is made for a page in an old zpool, it is uncompressed using its -original compressor. Once all pages are removed from an old zpool, the zpool -and its compressor are freed. +When the compressor parameter is changed at runtime, any existing compressed +pages are not modified; they are left in their own pool. When a request is +made for a page in an old pool, it is uncompressed using its original +compressor. Once all pages are removed from an old pool, the pool and its +compressor are freed. Some of the pages in zswap are same-value filled pages (i.e. contents of the page have same value or repetitive pattern). These pages include zero-filled diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst index a8667a777490..7f2f127dc76f 100644 --- a/Documentation/admin-guide/module-signing.rst +++ b/Documentation/admin-guide/module-signing.rst @@ -28,10 +28,12 @@ trusted userspace bits. 
This facility uses X.509 ITU-T standard certificates to encode the public keys involved. The signatures are not themselves encoded in any industrial standard -type. The built-in facility currently only supports the RSA & NIST P-384 ECDSA -public key signing standard (though it is pluggable and permits others to be -used). The possible hash algorithms that can be used are SHA-2 and SHA-3 of -sizes 256, 384, and 512 (the algorithm is selected by data in the signature). +type. The built-in facility currently only supports the RSA, NIST P-384 ECDSA +and NIST FIPS-204 ML-DSA public key signing standards (though it is pluggable +and permits others to be used). For RSA and ECDSA, the possible hash +algorithms that can be used are SHA-2 and SHA-3 of sizes 256, 384, and 512 (the +algorithm is selected by data in the signature); ML-DSA does its own hashing, +but is allowed to be used with a SHA512 hash for signed attributes. ========================== @@ -146,9 +148,9 @@ into vmlinux) using parameters in the:: file (which is also generated if it does not already exist). -One can select between RSA (``MODULE_SIG_KEY_TYPE_RSA``) and ECDSA -(``MODULE_SIG_KEY_TYPE_ECDSA``) to generate either RSA 4k or NIST -P-384 keypair. +One can select between RSA (``MODULE_SIG_KEY_TYPE_RSA``), ECDSA +(``MODULE_SIG_KEY_TYPE_ECDSA``) and ML-DSA (``MODULE_SIG_KEY_TYPE_MLDSA_*``) to +generate an RSA 4k, a NIST P-384 keypair or an ML-DSA 44, 65 or 87 keypair. It is strongly recommended that you provide your own x509.genkey file. 
diff --git a/Documentation/admin-guide/namespaces/resource-control.rst b/Documentation/admin-guide/namespaces/resource-control.rst index 369556e00f0c..553a44803231 100644 --- a/Documentation/admin-guide/namespaces/resource-control.rst +++ b/Documentation/admin-guide/namespaces/resource-control.rst @@ -1,17 +1,17 @@ -=========================== -Namespaces research control -=========================== +==================================== +User namespaces and resource control +==================================== -There are a lot of kinds of objects in the kernel that don't have -individual limits or that have limits that are ineffective when a set -of processes is allowed to switch user ids. With user namespaces -enabled in a kernel for people who don't trust their users or their -users programs to play nice this problems becomes more acute. +The kernel contains many kinds of objects that either don't have +individual limits or that have limits which are ineffective when +a set of processes is allowed to switch their UID. On a system +where the admins don't trust their users or their users' programs, +user namespaces expose the system to potential misuse of resources. -Therefore it is recommended that memory control groups be enabled in -kernels that enable user namespaces, and it is further recommended -that userspace configure memory control groups to limit how much -memory user's they don't trust to play nice can use. +In order to mitigate this, we recommend that admins enable memory +control groups on any system that enables user namespaces. +Furthermore, we recommend that admins configure the memory control +groups to limit the maximum memory usable by any untrusted user. 
Memory control groups can be configured by installing the libcgroup package present on most distros editing /etc/cgrules.conf, diff --git a/Documentation/admin-guide/nfs/nfsroot.rst b/Documentation/admin-guide/nfs/nfsroot.rst index 135218f33394..06990309c6ff 100644 --- a/Documentation/admin-guide/nfs/nfsroot.rst +++ b/Documentation/admin-guide/nfs/nfsroot.rst @@ -342,7 +342,7 @@ They depend on various facilities being available: When using pxelinux, the kernel image is specified using "kernel <relative-path-below /tftpboot>". The nfsroot parameters are passed to the kernel by adding them to the "append" line. - It is common to use serial console in conjunction with pxeliunx, + It is common to use serial console in conjunction with pxelinux, see Documentation/admin-guide/serial-console.rst for more information. For more information on isolinux, including how to create bootdisks diff --git a/Documentation/admin-guide/nfs/pnfs-block-server.rst b/Documentation/admin-guide/nfs/pnfs-block-server.rst index 20fe9f5117fe..7667dd2e17f1 100644 --- a/Documentation/admin-guide/nfs/pnfs-block-server.rst +++ b/Documentation/admin-guide/nfs/pnfs-block-server.rst @@ -40,3 +40,33 @@ how to translate the device into a serial number from SCSI EVPD 0x80:: echo "fencing client ${CLIENT} serial ${EVPD}" >> /var/log/pnfsd-fence.log EOF + +If the nfsd server needs to fence a non-responding client and the +fencing operation fails, the server logs a warning message in the +system log with the following format: + + FENCE failed client[IP_address] clid[#n] device[dev_name] + + where: + + - IP_address: refers to the IP address of the affected client. + - #n: indicates the unique client identifier. + - dev_name: specifies the name of the block device related + to the fencing attempt. + +The server will repeatedly retry the operation indefinitely. During +this time, access to the affected file is restricted for all other +clients. 
This is to prevent potential data corruption if multiple +clients access the same file simultaneously. + +To restore access to the affected file for other clients, the admin +needs to take the following actions: + + - shutdown or power off the client being fenced. + - manually expire the client to release all its state on the server:: + + echo 'expire' > /proc/fs/nfsd/clients/clid/ctl + + where: + + - clid: is the unique client identifier displayed in the system log. diff --git a/Documentation/admin-guide/nfs/pnfs-scsi-server.rst b/Documentation/admin-guide/nfs/pnfs-scsi-server.rst index b2eec2288329..b202508d281d 100644 --- a/Documentation/admin-guide/nfs/pnfs-scsi-server.rst +++ b/Documentation/admin-guide/nfs/pnfs-scsi-server.rst @@ -22,3 +22,34 @@ option and the underlying SCSI device support persistent reservations. On the client make sure the kernel has the CONFIG_PNFS_BLOCK option enabled, and the file system is mounted using the NFSv4.1 protocol version (mount -o vers=4.1). + +If the nfsd server needs to fence a non-responding client and the +fencing operation fails, the server logs a warning message in the +system log with the following format: + + FENCE failed client[IP_address] clid[#n] device[dev_name] + + where: + + - IP_address: refers to the IP address of the affected client. + - #n: indicates the unique client identifier. + - dev_name: specifies the name of the block device related + to the fencing attempt. + +The server will repeatedly retry the operation indefinitely. During +this time, access to the affected file is restricted for all other +clients. This is to prevent potential data corruption if multiple +clients access the same file simultaneously. + +To restore access to the affected file for other clients, the admin +needs to take the following actions: + + - shutdown or power off the client being fenced. 
+ - manually expire the client to release all its state on the server:: + + echo 'expire' > /proc/fs/nfsd/clients/clid/ctl + + where: + + - clid: is the unique client identifier displayed in the system log. + diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index cb376f335f40..167f9281fbf5 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -16,8 +16,8 @@ provides the following two features: - one 64-bit counter for Time Based Analysis (RX/TX data throughput and time spent in each low-power LTSSM state) and -- one 32-bit counter for Event Counting (error and non-error events for - a specified lane) +- one 32-bit counter per event for Event Counting (error and non-error + events for a specified lane) Note: There is no interrupt for counter overflow. diff --git a/Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst b/Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst new file mode 100644 index 000000000000..2ec0249e37b6 --- /dev/null +++ b/Documentation/admin-guide/perf/fujitsu_uncore_pmu.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +================================================ +Fujitsu Uncore Performance Monitoring Unit (PMU) +================================================ + +This driver supports the Uncore MAC PMUs and the Uncore PCI PMUs found +in Fujitsu chips. +Each MAC PMU on these chips is exposed as an uncore perf PMU with device name +mac_iod<iod>_mac<mac>_ch<ch>. +And each PCI PMU on these chips is exposed as an uncore perf PMU with device name +pci_iod<iod>_pci<pci>. + +The driver provides a description of its available events and configuration +options in sysfs, see /sys/bus/event_source/devices/mac_iod<iod>_mac<mac>_ch<ch>/ +and /sys/bus/event_source/devices/pci_iod<iod>_pci<pci>/.
+This driver exports: + +- formats, used by perf user space and other tools to configure events +- events, used by perf user space and other tools to create events + symbolically, e.g.:: + + perf stat -a -e mac_iod0_mac0_ch0/event=0x21/ ls + perf stat -a -e pci_iod0_pci0/event=0x24/ ls + +- cpumask, used by perf user space and other tools to know on which CPUs + to open the events + +This driver supports the following events for MAC: + +- cycles + This event counts MAC cycles at MAC frequency. +- read-count + This event counts the number of read requests to MAC. +- read-count-request + This event counts the number of read requests including retry to MAC. +- read-count-return + This event counts the number of responses to read requests to MAC. +- read-count-request-pftgt + This event counts the number of read requests including retry with PFTGT + flag. +- read-count-request-normal + This event counts the number of read requests including retry without PFTGT + flag. +- read-count-return-pftgt-hit + This event counts the number of responses to read requests which hit the + PFTGT buffer. +- read-count-return-pftgt-miss + This event counts the number of responses to read requests which miss the + PFTGT buffer. +- read-wait + This event counts outstanding read requests issued by DDR memory controller + per cycle. +- write-count + This event counts the number of write requests to MAC (including zero write, + full write, partial write, write cancel). +- write-count-write + This event counts the number of full write requests to MAC (not including + zero write). +- write-count-pwrite + This event counts the number of partial write requests to MAC. +- memory-read-count + This event counts the number of read requests from MAC to memory. +- memory-write-count + This event counts the number of full write requests from MAC to memory. +- memory-pwrite-count + This event counts the number of partial write requests from MAC to memory. 
+- ea-mac + This event counts energy consumption of MAC. +- ea-memory + This event counts energy consumption of memory. +- ea-memory-mac-write + This event counts the number of write requests from MAC to memory. +- ea-ha + This event counts energy consumption of HA. + + 'ea' is the abbreviation for 'Energy Analyzer'. + +Examples for use with perf:: + + perf stat -e mac_iod0_mac0_ch0/ea-mac/ ls + +And, this driver supports the following events for PCI: + +- pci-port0-cycles + This event counts PCI cycles at PCI frequency in port0. +- pci-port0-read-count + This event counts read transactions for data transfer in port0. +- pci-port0-read-count-bus + This event counts read transactions for bus usage in port0. +- pci-port0-write-count + This event counts write transactions for data transfer in port0. +- pci-port0-write-count-bus + This event counts write transactions for bus usage in port0. +- pci-port1-cycles + This event counts PCI cycles at PCI frequency in port1. +- pci-port1-read-count + This event counts read transactions for data transfer in port1. +- pci-port1-read-count-bus + This event counts read transactions for bus usage in port1. +- pci-port1-write-count + This event counts write transactions for data transfer in port1. +- pci-port1-write-count-bus + This event counts write transactions for bus usage in port1. +- ea-pci + This event counts energy consumption of PCI. + + 'ea' is the abbreviation for 'Energy Analyzer'. + +Examples for use with perf:: + + perf stat -e pci_iod0_pci0/ea-pci/ ls + +Given that these are uncore PMUs the driver does not support sampling, therefore +"perf record" will not work. Per-task perf sessions are not supported. 
diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst index 48992a0b8e94..d56b2d690709 100644 --- a/Documentation/admin-guide/perf/hisi-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pmu.rst @@ -18,9 +18,10 @@ HiSilicon SoC uncore PMU driver Each device PMU has separate registers for event counting, control and interrupt, and the PMU driver shall register perf PMU drivers like L3C, HHA and DDRC etc. The available events and configuration options shall -be described in the sysfs, see: +be described in the sysfs, see:: + +/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}> -/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>. The "perf list" command shall list the available events from sysfs. Each L3C, HHA and DDRC is registered as a separate PMU with perf. The PMU @@ -65,6 +66,10 @@ specified as a bitmap:: This will only count the operations from core/thread 0 and 1 in this cluster. +User should not use tt_core_deprecated to specify the core/thread filtering. +This option is provided for backward compatibility and only supports 8 bits, +which may not cover all the core/thread sharing L3C. + 2. Tracetag allows the user to choose to count only read, write or atomic operations via the tt_req parameter in perf. The default value counts all operations. tt_req is 3bits, 3'b100 represents read operations, 3'b101 @@ -109,8 +114,52 @@ uring channel. It is 2 bits. Some important codes are as follows: - 2'b11: count the events which sent to the uring_ext (MATA) channel; - 2'b01: is the same as 2'b11; - 2'b10: count the events which sent to the uring (non-MATA) channel; -- 2'b00: default value, count the events which sent to the both uring and - uring_ext channel; +- 2'b00: default value, count the events which sent to both uring and + uring_ext channels; + +6. ch: NoC PMU supports filtering the event counts of certain transaction +channel with this option.
The currently supported channels are as follows: + +- 3'b010: Request channel +- 3'b100: Snoop channel +- 3'b110: Response channel +- 3'b111: Data channel + +7. tt_en: NoC PMU supports counting only transactions that have tracetag set +if this option is set. See the 2nd list for more information about tracetag. + +For HiSilicon uncore PMU v3 whose identifier is 0x40, some uncore PMUs are +further divided into parts for finer granularity of tracing. Each part has its +own dedicated PMU, and all such PMUs together cover the monitoring job of events +on a particular uncore device. Such PMUs are described in sysfs with a slightly +changed name format:: + +/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}_{Z}/ddrc{Y}_{Z}/noc{Y}_{Z}> + +Z is the sub-id, indicating different PMUs for parts of the hardware device. + +Usage of most PMUs with different sub-ids is identical. As a special case, the L3C PMU +provides the ``ext`` option to allow exploration of even finer-grained statistics +of the L3C PMU. The L3C PMU driver uses that as a hint of termination when delivering +the perf command to hardware: + +- ext=0: Default, can be used with event names. +- ext=1 and ext=2: Must be used with event codes; event names are not supported. + +An example of a perf command could be:: + + $# perf stat -a -e hisi_sccl0_l3c1_0/rd_spipe/ sleep 5 + +or:: + + $# perf stat -a -e hisi_sccl0_l3c1_0/event=0x1,ext=1/ sleep 5 + +As above, ``hisi_sccl0_l3c1_0`` locates the PMU of Super CPU Cluster 0, L3 cache 1 +pipe0. + +The first command locates the first part of L3C since ``ext=0`` is implied by +default. The second command issues the counting on another part of L3C with the +event ``0x1``. 
Users could configure IDs to count data come from specific CCL/ICL, by setting srcid_cmd & srcid_msk, and data desitined for specific CCL/ICL by setting diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 072b510385c4..aa12708ddb96 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -24,8 +24,10 @@ Performance monitor support thunderx2-pmu alibaba_pmu dwc_pcie_pmu - nvidia-pmu + nvidia-tegra241-pmu + nvidia-tegra410-pmu meson-ddr-pmu cxl ampere_cspmu mrvl-pem-pmu + fujitsu_uncore_pmu diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst index f538ef67e0e8..fad5bc4cee6c 100644 --- a/Documentation/admin-guide/perf/nvidia-pmu.rst +++ b/Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst @@ -1,8 +1,8 @@ -========================================================= -NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU) -========================================================= +============================================================ +NVIDIA Tegra241 SoC Uncore Performance Monitoring Unit (PMU) +============================================================ -The NVIDIA Tegra SoC includes various system PMUs to measure key performance +The NVIDIA Tegra241 SoC includes various system PMUs to measure key performance metrics like memory bandwidth, latency, and utilization: * Scalable Coherency Fabric (SCF) diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst new file mode 100644 index 000000000000..0656223b61d4 --- /dev/null +++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst @@ -0,0 +1,522 @@ +===================================================================== +NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU) +===================================================================== + +The NVIDIA Tegra410 SoC 
includes various system PMUs to measure key performance +metrics like memory bandwidth, latency, and utilization: + +* Unified Coherence Fabric (UCF) +* PCIE +* PCIE-TGT +* CPU Memory (CMEM) Latency +* NVLink-C2C +* NV-CLink +* NV-DLink + +PMU Driver +---------- + +The PMU driver describes the available events and configuration of each PMU in +sysfs. Please see the sections below to get the sysfs path of each PMU. Like +other uncore PMU drivers, the driver provides "cpumask" sysfs attribute to show +the CPU id used to handle the PMU event. There is also "associated_cpus" +sysfs attribute, which contains a list of CPUs associated with the PMU instance. + +UCF PMU +------- + +The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a +distributed cache, last level for CPU Memory and CXL Memory, and cache coherent +interconnect that supports hardware coherence across multiple coherently caching +agents, including: + + * CPU clusters + * GPU + * PCIe Ordering Controller Unit (OCU) + * Other IO-coherent requesters + +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>. + +Some of the events available in this PMU can be used to measure bandwidth and +utilization: + + * slc_access_rd: count the number of read requests to SLC. + * slc_access_wr: count the number of write requests to SLC. + * slc_bytes_rd: count the number of bytes transferred by slc_access_rd. + * slc_bytes_wr: count the number of bytes transferred by slc_access_wr. + * mem_access_rd: count the number of read requests to local or remote memory. + * mem_access_wr: count the number of write requests to local or remote memory. + * mem_bytes_rd: count the number of bytes transferred by mem_access_rd. + * mem_bytes_wr: count the number of bytes transferred by mem_access_wr. + * cycles: counts the UCF cycles. 
+ +The average bandwidth is calculated as:: + + AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS + AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS + AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS + AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS + +The average request rate is calculated as:: + + AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES + AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES + AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES + AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES + +More details about what other events are available can be found in Tegra410 SoC +technical reference manual. + +The events can be filtered based on source or destination. The source filter +indicates the traffic initiator to the SLC, e.g local CPU, non-CPU device, or +remote socket. The destination filter specifies the destination memory type, +e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The +local/remote classification of the destination filter is based on the home +socket of the address, not where the data actually resides. The available +filters are described in +/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/. + +The list of UCF PMU event filters: + +* Source filter: + + * src_loc_cpu: if set, count events from local CPU + * src_loc_noncpu: if set, count events from local non-CPU device + * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket + +* Destination filter: + + * dst_loc_cmem: if set, count events to local system memory (CMEM) address + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address + * dst_loc_other: if set, count events to local CXL memory address + * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket + +If the source is not specified, the PMU will count events from all sources. 
If +the destination is not specified, the PMU will count events to all destinations. + +Example usage: + +* Count event id 0x0 in socket 0 from all sources and to all destinations:: + + perf stat -a -e nvidia_ucf_pmu_0/event=0x0/ + +* Count event id 0x0 in socket 0 with source filter = local CPU and destination + filter = local system memory (CMEM):: + + perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/ + +* Count event id 0x0 in socket 1 with source filter = local non-CPU device and + destination filter = remote memory:: + + perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/ + +PCIE PMU +-------- + +This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and +the memory subsystem. It monitors all read/write traffic from the root port(s) +or a particular BDF in a PCIE RC to local or remote memory. There is one PMU per +PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into +up to 8 root ports. The traffic from each root port can be filtered using RP or +BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU counter will +capture traffic from all RPs. Please see below for more details. + +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>. + +The events in this PMU can be used to measure bandwidth, utilization, and +latency: + + * rd_req: count the number of read requests by PCIE device. + * wr_req: count the number of write requests by PCIE device. + * rd_bytes: count the number of bytes transferred by rd_req. + * wr_bytes: count the number of bytes transferred by wr_req. + * rd_cum_outs: count outstanding rd_req each cycle. + * cycles: count the clock cycles of SOC fabric connected to the PCIE interface. 
+ +The average bandwidth is calculated as:: + + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS + +The average request rate is calculated as:: + + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES + + +The average latency is calculated as:: + + FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS + AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ + AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ + +The PMU events can be filtered based on the traffic source and destination. +The source filter indicates the PCIE devices that will be monitored. The +destination filter specifies the destination memory type, e.g. local system +memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote +classification of the destination filter is based on the home socket of the +address, not where the data actually resides. These filters can be found in +/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/. + +The list of event filters: + +* Source filter: + + * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this + bitmask represents the RP index in the RC. If the bit is set, all devices under + the associated RP will be monitored. E.g "src_rp_mask=0xF" will monitor + devices in root port 0 to 3. + * src_bdf: the BDF that will be monitored. This is a 16-bit value that + follows formula: (bus << 8) + (device << 3) + (function). For example, the + value of BDF 27:01.1 is 0x2781. + * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in + "src_bdf" is used to filter the traffic. + + Note that Root-Port and BDF filters are mutually exclusive and the PMU in + each RC can only have one BDF filter for the whole counters. If BDF filter + is enabled, the BDF filter value will be applied to all events. 
+ +* Destination filter: + + * dst_loc_cmem: if set, count events to local system memory (CMEM) address + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address + * dst_loc_pcie_p2p: if set, count events to local PCIE peer address + * dst_loc_pcie_cxl: if set, count events to local CXL memory address + * dst_rem: if set, count events to remote memory address + +If the source filter is not specified, the PMU will count events from all root +ports. If the destination filter is not specified, the PMU will count events +to all destinations. + +Example usage: + +* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all + destinations:: + + perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/ + +* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and + targeting just local CMEM of socket 0:: + + perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/ + +* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all + destinations:: + + perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/ + +* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and + targeting just local CMEM of socket 1:: + + perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/ + +* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all + destinations:: + + perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0180,src_bdf_en=0x1/ + +.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section: + +Mapping the RC# to lspci segment number +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA +Designated Vendor Specific Capability (DVSEC) register is added into the PCIE config space +for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4". 
The DVSEC register +contains the following information to map PCIE devices under the RP back to its RC# : + + - Bus# (byte 0xc) : bus number as reported by the lspci output + - Segment# (byte 0xd) : segment number as reported by the lspci output + - RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability + - RC# (byte 0xf): root complex number associated with the RP + - Socket# (byte 0x10): socket number associated with the RP + +Example script for mapping lspci BDF to RC# and socket#:: + + #!/bin/bash + while read bdf rest; do + dvsec4_reg=$(lspci -vv -s $bdf | awk ' + /Designated Vendor-Specific: Vendor=10de ID=0004/ { + match($0, /\[([0-9a-fA-F]+)/, arr); + print "0x" arr[1]; + exit + } + ') + if [ -n "$dvsec4_reg" ]; then + bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b) + segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b) + rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b) + rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b) + socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b) + echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket" + fi + done < <(lspci -d 10de:) + +Example output:: + + 0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00 + 0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00 + 0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00 + 0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00 + 0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00 + 0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00 + 0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00 + 0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00 + 0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00 + 0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00 + 0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00 + 0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01 + 
000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01 + 000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01 + 000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01 + 000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01 + 000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01 + 000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01 + 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01 + 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01 + 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01 + +PCIE-TGT PMU +------------ + +This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and +the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges. +There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in Tegra410 SoC can +have up to 16 lanes that can be bifurcated into up to 8 root ports (RP). The PMU +provides RP filter to count PCIE BAR traffic to each RP and address filter to +count access to PCIE BAR or CXL HDM ranges. The details of the filters are +described in the following sections. + +Mapping the RC# to lspci segment number is similar to the PCIE PMU. Please see +:ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info. + +The events and configuration options of this PMU device are available in sysfs, +see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>. + +The events in this PMU can be used to measure bandwidth and utilization: + + * rd_req: count the number of read requests to PCIE. + * wr_req: count the number of write requests to PCIE. + * rd_bytes: count the number of bytes transferred by rd_req. + * wr_bytes: count the number of bytes transferred by wr_req. + * cycles: count the clock cycles of SOC fabric connected to the PCIE interface. 
+ +The average bandwidth is calculated as:: + + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS + +The average request rate is calculated as:: + + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES + +The PMU events can be filtered based on the destination root port or target +address range. Filtering based on RP is only available for PCIE BAR traffic. +Address filter works for both PCIE BAR and CXL HDM ranges. These filters can be +found in sysfs, see +/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/. + +Destination filter settings: + +* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF" + corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is + only available for PCIE BAR traffic. +* dst_addr_base: BAR or CXL HDM filter base address. +* dst_addr_mask: BAR or CXL HDM filter address mask. +* dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the + address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter + the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison + to determine if the traffic destination address falls within the filter range:: + + (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask) + + If the comparison succeeds, then the event will be counted. + +If the destination filter is not specified, the RP filter will be configured by default +to count PCIE BAR traffic to all root ports. 
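The address-range comparison described above can be sketched in Python. This is only an illustration of the masking arithmetic; the function name ``addr_matches`` and the sample values are assumptions made for the example and are not part of the driver or the perf tool.

```python
def addr_matches(txn_addr: int, dst_addr_base: int, dst_addr_mask: int) -> bool:
    """Mirror the documented comparison:
    (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)
    """
    return (txn_addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)


# Illustrative values: with base 0x10000 and mask 0xFFF00 the low 8 bits
# are ignored, so addresses 0x10000..0x100FF pass the comparison.
BASE = 0x10000
MASK = 0xFFF00

inside = addr_matches(0x100FF, BASE, MASK)    # last address of the window
outside = addr_matches(0x10100, BASE, MASK)   # first address past the window
```

Note that, by construction of the comparison, address bits that are cleared in ``dst_addr_mask`` do not take part in the match, so the mask should cover every bit that must be compared.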
+ +Example usage: + +* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0:: + + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/ + +* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range + 0x10000 to 0x100FF on socket 0's PCIE RC-1:: + + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/ + +CPU Memory (CMEM) Latency PMU +----------------------------- + +This PMU monitors latency events of memory read requests from the edge of the +Unified Coherence Fabric (UCF) to local CPU DRAM: + + * RD_REQ counters: count read requests (32B per request). + * RD_CUM_OUTS counters: accumulated outstanding request counter, which track + how many cycles the read requests are in flight. + * CYCLES counter: counts the number of elapsed cycles. + +The average latency is calculated as:: + + FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS + AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ + AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ + +The events and configuration options of this PMU device are described in sysfs, +see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>. + +Example usage:: + + perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}' + +NVLink-C2C PMU +-------------- + +This PMU monitors latency events of memory read/write requests that pass through +the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available +in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC). + +The events and configuration options of this PMU device are available in sysfs, +see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>. + +The list of events: + + * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests. + * IN_RD_REQ: the number of incoming read requests. 
+ * IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests. + * IN_WR_REQ: the number of incoming write requests. + * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests. + * OUT_RD_REQ: the number of outgoing read requests. + * OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests. + * OUT_WR_REQ: the number of outgoing write requests. + * CYCLES: NVLink-C2C interface cycle counts. + +The incoming events count the reads/writes from remote device to the SoC. +The outgoing events count the reads/writes from the SoC to remote device. + +The sysfs /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer +contains the information about the connected device. + +When the C2C interface is connected to GPU(s), the user can use the +"gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit represents the GPU +index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" is for GPU 0 and 1. +The PMU will monitor all GPUs by default if not specified. + +When connected to another SoC, only the read events are available. 
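As a minimal sketch of how the "gpu_mask" bitmask described above is formed, the following Python helper sets one bit per GPU index and builds an event string in the same shape as the perf examples in this section. The helper names ``gpu_mask`` and ``c2c_event`` are hypothetical and exist only for this illustration.

```python
def gpu_mask(gpu_indices):
    """Build the bitmask: bit N set means GPU index N is selected."""
    mask = 0
    for idx in gpu_indices:
        if idx < 0:
            raise ValueError("GPU index must be non-negative")
        mask |= 1 << idx
    return mask


def c2c_event(socket_id, event, gpu_indices):
    """Format a perf event string for the NVLink-C2C PMU of one socket."""
    return (f"nvidia_nvlink_c2c_pmu_{socket_id}/"
            f"{event},gpu_mask={gpu_mask(gpu_indices):#x}/")


# GPU 0 only -> 0x1, GPU 0 and 1 -> 0x3, as stated in the text above.
spec = c2c_event(0, "in_rd_cum_outs", [1])
```

The resulting string can be passed to ``perf stat -a -e``.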
+ +The events can be used to calculate the average latency of the read/write requests:: + + C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS + + IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ + IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ + + IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ + IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ + + OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ + OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ + + OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ + OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ + +Example usage: + + * Count incoming traffic from all GPUs connected via NVLink-C2C:: + + perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/ + + * Count incoming traffic from GPU 0 connected via NVLink-C2C:: + + perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/ + + * Count incoming traffic from GPU 1 connected via NVLink-C2C:: + + perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/ + + * Count outgoing traffic to all GPUs connected via NVLink-C2C:: + + perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/ + + * Count outgoing traffic to GPU 0 connected via NVLink-C2C:: + + perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/ + + * Count outgoing traffic to GPU 1 connected via NVLink-C2C:: + + perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/ + +NV-CLink PMU +------------ + +This PMU monitors latency events of memory read requests that pass through +the NV-CLINK interface. Bandwidth events are not available in this PMU. +In Tegra410 SoC, the NV-CLink interface is used to connect to another Tegra410 +SoC and this PMU only counts read traffic. + +The events and configuration options of this PMU device are available in sysfs, +see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>. 
+ +The list of events: + + * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests. + * IN_RD_REQ: the number of incoming read requests. + * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests. + * OUT_RD_REQ: the number of outgoing read requests. + * CYCLES: NV-CLINK interface cycle counts. + +The incoming events count the reads from remote device to the SoC. +The outgoing events count the reads from the SoC to remote device. + +The events can be used to calculate the average latency of the read requests:: + + CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS + + IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ + IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ + + OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ + OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ + +Example usage: + + * Count incoming read traffic from remote SoC connected via NV-CLINK:: + + perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/ + + * Count outgoing read traffic to remote SoC connected via NV-CLINK:: + + perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/ + +NV-DLink PMU +------------ + +This PMU monitors latency events of memory read requests that pass through +the NV-DLINK interface. Bandwidth events are not available in this PMU. +In Tegra410 SoC, this PMU only counts CXL memory read traffic. + +The events and configuration options of this PMU device are available in sysfs, +see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>. + +The list of events: + + * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory. + * IN_RD_REQ: the number of read requests to CXL memory. + * CYCLES: NV-DLINK interface cycle counts. 
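The latency arithmetic shared by the latency PMUs in this document (cumulative outstanding requests divided by request count, then converted to nanoseconds using the measured frequency) can be sketched in Python as follows. The function name and the sample counter values are assumptions for illustration; real values would come from ``perf stat`` output collected over a known interval.

```python
def avg_latency_ns(cum_outs, req, cycles, elapsed_ns):
    """Return the average latency in ns, or None if it cannot be computed.

    FREQ_IN_GHZ           = CYCLES / ELAPSED_TIME_IN_NS
    AVG_LATENCY_IN_CYCLES = CUM_OUTS / REQ
    AVG_LATENCY_IN_NS     = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
    """
    if req == 0 or cycles == 0 or elapsed_ns <= 0:
        return None  # no requests or no cycles observed in the interval
    freq_in_ghz = cycles / elapsed_ns
    latency_in_cycles = cum_outs / req
    return latency_in_cycles / freq_in_ghz


# Hypothetical sample: 10 reads accumulating 2000 outstanding cycles,
# with 2,000,000 interface cycles counted in 1 ms (that is, 2 GHz).
sample = avg_latency_ns(cum_outs=2000, req=10, cycles=2_000_000,
                        elapsed_ns=1_000_000)
```

Returning ``None`` for an idle interval avoids a division by zero when no requests were observed.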
+ +The events can be used to calculate the average latency of the read requests:: + + DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS + + IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ + IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ + +Example usage: + + * Count read events to CXL memory:: + + perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}' diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index 412423c54f25..a95e2ebce005 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -72,7 +72,7 @@ to manage each performance update behavior. :: Lowest non- | | | | linear perf ------>+-----------------------+ +-----------------------+ | | | | - | | Lowest perf ---->| | + | | Min perf ---->| | | | | | Lowest perf ------>+-----------------------+ +-----------------------+ | | | | @@ -239,8 +239,12 @@ control its functionality at the system level. They are located in the root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd* /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf + /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_hw_prefcore /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq + /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_floor_freq + /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_floor_count + /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_prefcore_ranking ``amd_pstate_highest_perf / amd_pstate_max_freq`` @@ -264,14 +268,46 @@ This attribute is read-only. ``amd_pstate_hw_prefcore`` -Whether the platform supports the preferred core feature and it has been -enabled. This attribute is read-only. +Whether the platform supports the preferred core feature and it has +been enabled. This attribute is read-only. 
This file is only visible +on platforms which support the preferred core feature. ``amd_pstate_prefcore_ranking`` The performance ranking of the core. This number doesn't have any unit, but larger numbers are preferred at the time of reading. This can change at -runtime based on platform conditions. This attribute is read-only. +runtime based on platform conditions. This attribute is read-only. This file +is only visible on platforms which support the preferred core feature. + +``amd_pstate_floor_freq`` + +The floor frequency associated with each CPU. Userspace can write any +value between ``cpuinfo_min_freq`` and ``scaling_max_freq`` into this +file. When the system is under power or thermal constraints, the +platform firmware will attempt to throttle the CPU frequency to the +value specified in ``amd_pstate_floor_freq`` before throttling it +further. This allows userspace to specify different floor frequencies +to different CPUs. For optimal results, threads of the same core +should have the same floor frequency value. This file is only visible +on platforms that support the CPPC Performance Priority feature. + + +``amd_pstate_floor_count`` + +The number of distinct Floor Performance levels supported by the +platform. For example, if this value is 2, then the number of unique +values obtained from the command ``cat +/sys/devices/system/cpu/cpufreq/policy*/amd_pstate_floor_freq | +sort -n | uniq`` should be at most this number for the behavior +described in ``amd_pstate_floor_freq`` to take effect. A zero value +implies that the platform supports unlimited floor performance levels. +This file is only visible on platforms that support the CPPC +Performance Priority feature. + +**Note**: When ``amd_pstate_floor_count`` is non-zero, the frequency to +which the CPU is throttled under power or thermal constraints is +undefined when the number of unique values of ``amd_pstate_floor_freq`` +across all CPUs in the system exceeds ``amd_pstate_floor_count``. 
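The constraint in the note above can be checked from user space. The following Python sketch assumes the sysfs layout described in this section (``amd_pstate_floor_freq`` under each ``policy*`` directory and ``amd_pstate_floor_count`` under ``policy0``). The helper names are hypothetical, and the files only exist on platforms that support the CPPC Performance Priority feature.

```python
import glob

FLOOR_FREQ_GLOB = "/sys/devices/system/cpu/cpufreq/policy*/amd_pstate_floor_freq"
FLOOR_COUNT_PATH = "/sys/devices/system/cpu/cpufreq/policy0/amd_pstate_floor_count"


def read_int(path):
    """Read a single integer from a sysfs attribute."""
    with open(path) as f:
        return int(f.read().strip())


def read_floor_freqs(pattern=FLOOR_FREQ_GLOB):
    """Collect the floor frequency of every cpufreq policy."""
    return [read_int(p) for p in sorted(glob.glob(pattern))]


def floor_levels_ok(floor_freqs, floor_count):
    """True if the number of distinct floor frequencies is within the limit.

    A floor_count of zero means the platform supports unlimited levels.
    """
    return floor_count == 0 or len(set(floor_freqs)) <= floor_count


# Usage on a supporting system (not executed here):
#   ok = floor_levels_ok(read_floor_freqs(), read_int(FLOOR_COUNT_PATH))
```

This is the same check as the ``sort -n | uniq`` pipeline mentioned above, with the zero-means-unlimited case handled explicitly.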
``energy_performance_available_preferences`` @@ -280,16 +316,22 @@ A list of all the supported EPP preferences that could be used for These profiles represent different hints that are provided to the low-level firmware about the user's desired energy vs efficiency tradeoff. ``default`` represents the epp value is set by platform -firmware. This attribute is read-only. +firmware. ``custom`` designates that integer values 0-255 may be written +as well. This attribute is read-only. ``energy_performance_preference`` The current energy performance preference can be read from this attribute. and user can change current preference according to energy or performance needs -Please get all support profiles list from -``energy_performance_available_preferences`` attribute, all the profiles are -integer values defined between 0 to 255 when EPP feature is enabled by platform -firmware, if EPP feature is disabled, driver will ignore the written value +Coarse named profiles are available in the attribute +``energy_performance_available_preferences``. +Users can also write individual integer values between 0 and 255. +When dynamic EPP is enabled, writes to ``energy_performance_preference`` are blocked +even when the EPP feature is enabled by platform firmware. Lower EPP values shift the bias +towards improved performance while higher EPP values shift the bias towards +power savings. The exact impact can change from one platform to the other. +If a valid integer was last written, then a number will be returned on future reads. +If a valid string was last written, then a string will be returned on future reads. This attribute is read-write. ``boost`` @@ -311,6 +353,24 @@ boost or `1` to enable it, for the respective CPU using the sysfs path Other performance and frequency values can be read back from ``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`. 
+Dynamic energy performance profile +================================== +The amd-pstate driver supports dynamically selecting the energy performance +profile based on whether the machine is running on AC or DC power. + +Whether this behavior is enabled by default depends on whether the kernel command line option +``amd_dynamic_epp`` is set. This behavior can also be overridden +at runtime by the sysfs file ``/sys/devices/system/cpu/amd_pstate/dynamic_epp``. + +When set to enabled, the driver will select a different energy performance +profile depending on whether the machine is running on battery or AC power. The driver will +also register with the platform profile handler to receive notifications of +the user-desired power state and react to those. +When set to disabled, the driver will not change the energy performance profile +based on the power source and will not react to the user-desired power state. + +Attempting to manually write to the ``energy_performance_preference`` sysfs +file will fail when ``dynamic_epp`` is enabled. ``amd-pstate`` vs ``acpi-cpufreq`` ====================================== @@ -422,6 +482,12 @@ For systems that support ``amd-pstate`` preferred core, the core rankings will always be advertised by the platform. But OS can choose to ignore that via the kernel parameter ``amd_prefcore=disable``. +``amd_dynamic_epp`` + +When AMD pstate is in auto mode, dynamic EPP will control whether the kernel +autonomously changes the EPP mode. The default is disabled. It can be enabled +with the kernel parameter ``amd_dynamic_epp=enable``. + User Space Interface in ``sysfs`` - General =========================================== @@ -790,13 +856,13 @@ Reference =========== .. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming, - https://www.amd.com/system/files/TechDocs/24593.pdf + https://docs.amd.com/v/u/en-US/24593_3.44_APM_Vol2 .. 
[2] Advanced Configuration and Power Interface Specification, https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf .. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors - https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip + https://docs.amd.com/v/u/en-US/56569-A1-PUB_3.03 .. [4] Linux Kernel Selftests, https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index a21369eba034..dbe6d23a5d67 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -231,7 +231,7 @@ are the following: present). The existence of the limit may be a result of some (often unintentional) - BIOS settings, restrictions coming from a service processor or another + BIOS settings, restrictions coming from a service processor or other BIOS/HW-based mechanisms. This does not cover ACPI thermal limitations which can be discovered @@ -248,6 +248,20 @@ are the following: If that frequency cannot be determined, this attribute should not be present. +``cpuinfo_avg_freq`` + An average frequency (in kHz) of all CPUs belonging to a given policy, + derived from hardware-provided feedback and reported on a time frame + spanning at most a few milliseconds. + + This is expected to be based on the frequency the hardware actually runs + at and, as such, might require specialised hardware support (such as the AMU + extension on ARM). If one cannot be determined, this attribute should + not be present. + + Note that a failed attempt to retrieve the current frequency for the given + CPU(s) will result in an appropriate error, e.g. EAGAIN for a CPU that + remains idle (raised on ARM). + ``cpuinfo_max_freq`` Maximum possible operating frequency the CPUs belonging to this policy can run at (in kHz). 
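The error behavior described for ``cpuinfo_avg_freq`` in the hunk above can be handled from user space roughly as follows. This is a minimal sketch: the helper name ``read_avg_freq_khz`` is hypothetical, and the caller is assumed to pass the path of the attribute inside the relevant policy directory.

```python
import errno


def read_avg_freq_khz(path):
    """Read an average frequency (in kHz) from a cpufreq attribute file.

    Returns None when the attribute is not present, or when the read
    fails with EAGAIN (described above for a CPU that remains idle).
    Any other error is propagated to the caller.
    """
    try:
        with open(path) as attr:
            return int(attr.read().strip())
    except FileNotFoundError:
        # The attribute is absent when no average can be determined.
        return None
    except OSError as err:
        if err.errno == errno.EAGAIN:
            return None
        raise
```

Returning ``None`` for both cases keeps the caller simple. A tool that needs to tell "not supported" from "idle right now" would have to report the two cases separately.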
@@ -260,10 +274,6 @@ are the following: The time it takes to switch the CPUs belonging to this policy from one P-state to another, in nanoseconds. - If unknown or if known to be so high that the scaling driver does not - work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`) - will be returned by reads from this attribute. - ``related_cpus`` List of all (online and offline) CPUs belonging to this policy. @@ -293,7 +303,8 @@ are the following: Some architectures (e.g. ``x86``) may attempt to provide information more precisely reflecting the current CPU frequency through this attribute, but that still may not be the exact current CPU frequency as - seen by the hardware at the moment. + seen by the hardware at the moment. This behavior, though, is only + available via the :c:macro:`CPUFREQ_ARCH_CUR_FREQ` option. ``scaling_driver`` The scaling driver currently in use. @@ -383,7 +394,9 @@ policy limits change after that. This governor does not do anything by itself. Instead, it allows user space to set the CPU frequency for the policy it is attached to by writing to the -``scaling_setspeed`` attribute of that policy. +``scaling_setspeed`` attribute of that policy. Though the intention may be to +set an exact frequency for the policy, the actual frequency may vary depending +on hardware coordination, thermal and power limits, and other factors. ``schedutil`` ------------- @@ -426,7 +439,7 @@ This governor exposes only one tunable: ``rate_limit_us`` Minimum time (in microseconds) that has to pass between two consecutive runs of governor computations (default: 1.5 times the scaling driver's - transition latency or 1ms if the driver does not provide a latency value). + transition latency or 1ms if the driver does not provide a latency value). The purpose of this tunable is to reduce the scheduler context overhead of the governor which might be excessive without it. 
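The default for ``rate_limit_us`` stated in the hunk above can be worked through with a small sketch. It assumes that the transition latency is given in nanoseconds (as for ``cpuinfo_transition_latency``), that a missing latency is represented by a value below one microsecond, and that integer arithmetic is used. The function name is hypothetical and this is not the kernel's implementation.

```python
def default_rate_limit_us(transition_latency_ns):
    """Default schedutil rate limit in microseconds, per the text above."""
    latency_us = transition_latency_ns // 1000
    if latency_us <= 0:
        # Assumption: no usable latency value, fall back to 1 ms.
        return 1000
    # 1.5 times the transition latency, in integer arithmetic.
    return latency_us + latency_us // 2
```

For example, a transition latency of 20000 ns (20 us) gives a default rate limit of 30 us under these assumptions.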
@@ -484,7 +497,7 @@ This governor exposes the following tunables: represented by it to be 1.5 times as high as the transition latency (the default):: - # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate + # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2))` > ondemand/sampling_rate ``up_threshold`` If the estimated CPU load is above this value (in percent), the governor diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index eb58d7a5affd..be4c1120e3f0 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -275,20 +275,25 @@ values and, when predicting the idle duration next time, it computes the average and variance of them. If the variance is small (smaller than 400 square milliseconds) or it is small relative to the average (the average is greater that 6 times the standard deviation), the average is regarded as the "typical -interval" value. Otherwise, the longest of the saved observed idle duration +interval" value. Otherwise, either the longest or the shortest (depending on +which one is farther from the average) of the saved observed idle duration values is discarded and the computation is repeated for the remaining ones. + Again, if the variance of them is small (in the above sense), the average is taken as the "typical interval" value and so on, until either the "typical -interval" is determined or too many data points are disregarded, in which case -the "typical interval" is assumed to equal "infinity" (the maximum unsigned -integer value). - -If the "typical interval" computed this way is long enough, the governor obtains -the time until the closest timer event with the assumption that the scheduler -tick will be stopped. That time, referred to as the *sleep length* in what follows, -is the upper bound on the time before the next CPU wakeup. 
It is used to determine -the sleep length range, which in turn is needed to get the sleep length correction -factor. +interval" is determined or too many data points are disregarded. In the latter +case, if the size of the set of data points still under consideration is +sufficiently large, the next idle duration is not likely to be above the largest +idle duration value still in that set, so that value is taken as the predicted +next idle duration. Finally, if the set of data points still under +consideration is too small, no prediction is made. + +If the preliminary prediction of the next idle duration computed this way is +long enough, the governor obtains the time until the closest timer event with +the assumption that the scheduler tick will be stopped. That time, referred to +as the *sleep length* in what follows, is the upper bound on the time before the +next CPU wakeup. It is used to determine the sleep length range, which in turn +is needed to get the sleep length correction factor. The ``menu`` governor maintains an array containing several correction factor values that correspond to different sleep length ranges organized so that each @@ -302,7 +307,7 @@ to 1 the correction factor becomes (it must fall between 0 and 1 inclusive). The sleep length is multiplied by the correction factor for the range that it falls into to obtain an approximation of the predicted idle duration that is compared to the "typical interval" determined previously and the minimum of -the two is taken as the idle duration prediction. +the two is taken as the final idle duration prediction. If the "typical interval" value is small, which means that the CPU is likely to be woken up soon enough, the sleep length computation is skipped as it may @@ -575,6 +580,15 @@ the given CPU as the upper limit for the exit latency of the idle states that they are allowed to select for that CPU. They should never select any idle states with exit latency beyond that limit. 
+While the above CPU QoS constraints apply to CPU idle time management, user +space may also request a CPU system wakeup latency QoS limit, via the +`cpu_wakeup_latency` file. This QoS constraint is respected when selecting a +suitable idle state for the CPUs, while entering the system-wide suspend-to-idle +sleep state, but also to the regular CPU idle time management. + +Note that, the management of the `cpu_wakeup_latency` file works according to +the 'cpu_dma_latency' file from user space point of view. Moreover, the unit +is also microseconds. Idle States Control Via Kernel Command Line =========================================== diff --git a/Documentation/admin-guide/pm/intel-speed-select.rst b/Documentation/admin-guide/pm/intel-speed-select.rst index a2bfb971654f..dec2a25f10bc 100644 --- a/Documentation/admin-guide/pm/intel-speed-select.rst +++ b/Documentation/admin-guide/pm/intel-speed-select.rst @@ -287,7 +287,7 @@ level. Check presence of other Intel(R) SST features --------------------------------------------- -Each of the performance profiles also specifies weather there is support of +Each of the performance profiles also specifies whether there is support of other two Intel(R) SST features (Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF) and Intel(R) Speed Select Technology - Turbo Frequency (Intel SST-TF)). diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst index 39bd6ecce7de..188d52cd26e8 100644 --- a/Documentation/admin-guide/pm/intel_idle.rst +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -38,6 +38,27 @@ instruction at all. only way to pass early-configuration-time parameters to it is via the kernel command line. +Sysfs Interface +=============== + +The ``intel_idle`` driver exposes the following ``sysfs`` attributes in +``/sys/devices/system/cpu/cpuidle/``: + +``intel_c1_demotion`` + Enable or disable C1 demotion for all CPUs in the system. 
This file is + only exposed on platforms that support the C1 demotion feature and where + it was tested. Value 0 means that C1 demotion is disabled, value 1 means + that it is enabled. Write 0 or 1 to disable or enable C1 demotion for + all CPUs. + + The C1 demotion feature involves the platform firmware demoting deep + C-state requests from the OS (e.g., C6 requests) to C1. The idea is that + firmware monitors CPU wake-up rate, and if it is higher than a + platform-specific threshold, the firmware demotes deep C-state requests + to C1. For example, Linux requests C6, but firmware noticed too many + wake-ups per second, and it keeps the CPU in C1. When the CPU stays in + C1 long enough, the platform promotes it back to C6. This may improve + some workloads' performance, but it may also increase power consumption. .. _intel-idle-enumeration-of-states: @@ -192,11 +213,19 @@ even if they have been enumerated (see :ref:`cpu-pm-qos` in Documentation/admin-guide/pm/cpuidle.rst). Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail. -The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle`` -if the kernel has been configured with ACPI support) can be set to make the -driver ignore the system's ACPI tables entirely or use them for all of the -recognized processor models, respectively (they both are unset by default and -``use_acpi`` has no effect if ``no_acpi`` is set). +The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are +recognized by ``intel_idle`` if the kernel has been configured with ACPI +support. In the case that ACPI is not configured these flags have no impact +on functionality. + +``no_acpi`` - Do not use ACPI at all. Only native mode is available, no +ACPI mode. + +``use_acpi`` - No-op in ACPI mode, the driver will consult ACPI tables for +C-states on/off status in native mode. + +``no_native`` - Work only in ACPI mode, no native mode available (ignore +all custom tables). 
The value of the ``states_off`` module parameter (0 by default) represents a list of idle states to be disabled by default in the form of a bitmask. @@ -231,6 +260,17 @@ mode to off when the CPU is in any one of the available idle states. This may help performance of a sibling CPU at the expense of a slightly higher wakeup latency for the idle CPU. +The ``table`` argument allows customization of idle state latency and target +residency. The syntax is a comma-separated list of ``name:latency:residency`` +entries, where ``name`` is the idle state name, ``latency`` is the exit latency +in microseconds, and ``residency`` is the target residency in microseconds. It +is not necessary to specify all idle states; only those to be customized. For +example, ``C1:1:3,C6:50:100`` sets the exit latency and target residency for +C1 and C6 to 1/3 and 50/100 microseconds, respectively. Remaining idle states +keep their default values. The driver verifies that deeper idle states have +higher latency and target residency than shallower ones. Also, target +residency cannot be smaller than exit latency. If any of these conditions is +not met, the driver ignores the entire ``table`` parameter. .. _intel-idle-core-and-package-idle-states: diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index bf13ad25a32f..25fe5d88fea6 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -48,8 +48,9 @@ only way to pass early-configuration-time parameters to it is via the kernel command line. However, its configuration can be adjusted via ``sysfs`` to a great extent. In some configurations it even is possible to unregister it via ``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and -registered (see `below <status_attr_>`_). +registered (see :ref:`below <status_attr>`). +.. 
_operation_modes: Operation Modes =============== @@ -62,6 +63,8 @@ a certain performance scaling algorithm. Which of them will be in effect depends on what kernel command line options are used and on the capabilities of the processor. +.. _active_mode: + Active Mode ----------- @@ -94,6 +97,8 @@ Which of the P-state selection algorithms is used by default depends on the Namely, if that option is set, the ``performance`` algorithm will be used by default, and the other one will be used by default if it is not set. +.. _active_mode_hwp: + Active Mode With HWP ~~~~~~~~~~~~~~~~~~~~ @@ -123,7 +128,7 @@ Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's internal P-state selection logic is expected to focus entirely on performance. This will override the EPP/EPB setting coming from the ``sysfs`` interface -(see `Energy vs Performance Hints`_ below). Moreover, any attempts to change +(see :ref:`energy_performance_hints` below). Moreover, any attempts to change the EPP/EPB to a value different from 0 ("performance") via ``sysfs`` in this configuration will be rejected. @@ -192,6 +197,8 @@ This is the default P-state selection algorithm if the :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option is not set. +.. _passive_mode: + Passive Mode ------------ @@ -289,12 +296,12 @@ Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes the entire range of available P-states, including the whole turbo range, to the ``CPUFreq`` core and (in the passive mode) to generic scaling governors. This generally causes turbo P-states to be set more often when ``intel_pstate`` is -used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_ -for more information). +used relative to ACPI-based CPU performance scaling (see +:ref:`below <acpi-cpufreq>` for more information). 
Moreover, since ``intel_pstate`` always knows what the real turbo threshold is (even if the Configurable TDP feature is enabled in the processor), its -``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should +``no_turbo`` attribute in ``sysfs`` (described :ref:`below <no_turbo_attr>`) should work as expected in all cases (that is, if set to disable turbo P-states, it always should prevent ``intel_pstate`` from using them). @@ -307,12 +314,12 @@ pieces of information on it to be known, including: * The minimum supported P-state. - * The maximum supported `non-turbo P-state <turbo_>`_. + * The maximum supported :ref:`non-turbo P-state <turbo>`. * Whether or not turbo P-states are supported at all. - * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states - are supported). + * The maximum supported :ref:`one-core turbo P-state <turbo>` (if turbo + P-states are supported). * The scaling formula to translate the driver's internal representation of P-states into frequencies and the other way around. @@ -329,9 +336,112 @@ information listed above is the same for all of the processors supporting the HWP feature, which is why ``intel_pstate`` works with all of them.] +Support for Hybrid Processors +============================= + +Some processors supported by ``intel_pstate`` contain two or more types of CPU +cores differing by the maximum turbo P-state, performance vs power characteristics, +cache sizes, and possibly other properties. They are commonly referred to as +hybrid processors. To support them, ``intel_pstate`` requires HWP to be enabled +and it assumes the HWP performance units to be the same for all CPUs in the +system, so a given HWP performance level always represents approximately the +same physical performance regardless of the core (CPU) type. 
+ +Hybrid Processors with SMT +-------------------------- + +On systems where SMT (Simultaneous Multithreading), also referred to as +HyperThreading (HT) in the context of Intel processors, is enabled on at least +one core, ``intel_pstate`` assigns performance-based priorities to CPUs. Namely, +the priority of a given CPU reflects its highest HWP performance level which +causes the CPU scheduler to generally prefer more performant CPUs, so the less +performant CPUs are used when the other ones are fully loaded. SMT siblings +(that is, logical CPUs sharing one physical core) are given the same priority. +The scheduler can pull tasks from lower-priority cores and place them on any +sibling. Since the scheduler spreads tasks among physical cores, tasks will be +placed on the SMT siblings of physical cores only after all physical cores are +busy. + +This approach maximizes performance in the majority of cases, but unfortunately +it also leads to excessive energy usage in some important scenarios, like video +playback, which is not generally desirable. While there is no other viable +choice with SMT enabled because the effective capacity and utilization of SMT +siblings are hard to determine, hybrid processors without SMT can be handled in +more energy-efficient ways. + +.. _CAS: + +Capacity-Aware Scheduling Support +--------------------------------- + +The capacity-aware scheduling (CAS) support in the CPU scheduler is enabled by +``intel_pstate`` by default on hybrid processors without SMT. CAS generally +causes the scheduler to put tasks on a CPU so long as there is a sufficient +amount of spare capacity on it, and if the utilization of a given task is too +high for it, the task will need to go somewhere else. + +Since CAS takes CPU capacities into account, it does not require CPU +prioritization and it allows tasks to be distributed more symmetrically among +the more performant and less performant CPUs. 
Once placed on a CPU with enough +capacity to accommodate it, a task may just continue to run there regardless of +whether or not the other CPUs are fully loaded, so on average CAS reduces the +utilization of the more performant CPUs which causes the energy usage to be more +balanced because the more performant CPUs are generally less energy-efficient +than the less performant ones. + +In order to use CAS, the scheduler needs to know the capacity of each CPU in +the system and it needs to be able to compute scale-invariant utilization of +CPUs, so ``intel_pstate`` provides it with the requisite information. + +First of all, the capacity of each CPU is represented by the ratio of its highest +HWP performance level, multiplied by 1024, to the highest HWP performance level +of the most performant CPU in the system, which works because the HWP performance +units are the same for all CPUs. Second, the frequency-invariance computations, +carried out by the scheduler to always express CPU utilization in the same units +regardless of the frequency it is currently running at, are adjusted to take the +CPU capacity into account. All of this happens when ``intel_pstate`` has +registered itself with the ``CPUFreq`` core and it has figured out that it is +running on a hybrid processor without SMT. + +Energy-Aware Scheduling Support +------------------------------- + +If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and +``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling +:ref:`CAS` it registers an Energy Model for the processor. This allows the +Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if +``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate`` +to operate in the :ref:`passive mode <passive_mode>`. 
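The capacity rule given in the Capacity-Aware Scheduling section above (highest HWP performance level, multiplied by 1024, divided by the highest level of the most performant CPU) can be sketched as follows. The function and constant names are hypothetical, and rounding down via integer division is an assumption of this sketch.

```python
CAPACITY_SCALE = 1024  # the multiplier named in the text above


def cpu_capacities(highest_hwp_perf):
    """Map each CPU id to its capacity.

    `highest_hwp_perf` maps a CPU id to that CPU's highest HWP
    performance level; all levels are assumed to use the same units
    and to be positive.
    """
    top = max(highest_hwp_perf.values())
    return {
        cpu: perf * CAPACITY_SCALE // top
        for cpu, perf in highest_hwp_perf.items()
    }
```

With this rule the most performant CPU always ends up with a capacity of 1024, and every other CPU is expressed as a fraction of that value.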
+ +The Energy Model registered by ``intel_pstate`` is artificial (that is, it is +based on abstract cost values and it does not include any real power numbers) +and it is relatively simple to avoid unnecessary computations in the scheduler. +There is a performance domain in it for every CPU in the system and the cost +values for these performance domains have been chosen so that running a task on +a less performant (small) CPU appears to be always cheaper than running that +task on a more performant (big) CPU. However, for two CPUs of the same type, +the cost difference depends on their current utilization, and the CPU whose +current utilization is higher generally appears to be a more expensive +destination for a given task. This helps to balance the load among CPUs of the +same type. + +Since EAS works on top of CAS, high-utilization tasks are always migrated to +CPUs with enough capacity to accommodate them, but thanks to EAS, low-utilization +tasks tend to be placed on the CPUs that look less expensive to the scheduler. +Effectively, this causes the less performant and less loaded CPUs to be +preferred as long as they have enough spare capacity to run the given task +which generally leads to reduced energy usage. + +The Energy Model created by ``intel_pstate`` can be inspected by looking at +the ``energy_model`` directory in ``debugfs`` (typically mounted on +``/sys/kernel/debug/``). + + User Space Interface in ``sysfs`` ================================= +.. _global_attributes: + Global Attributes ----------------- @@ -344,8 +454,8 @@ argument is passed to the kernel in the command line. ``max_perf_pct`` Maximum P-state the driver is allowed to set in percent of the - maximum supported performance level (the highest supported `turbo - P-state <turbo_>`_). + maximum supported performance level (the highest supported :ref:`turbo + P-state <turbo>`). 
This attribute will not be exposed if the ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel @@ -353,8 +463,8 @@ argument is passed to the kernel in the command line. ``min_perf_pct`` Minimum P-state the driver is allowed to set in percent of the - maximum supported performance level (the highest supported `turbo - P-state <turbo_>`_). + maximum supported performance level (the highest supported :ref:`turbo + P-state <turbo>`). This attribute will not be exposed if the ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel @@ -363,18 +473,18 @@ argument is passed to the kernel in the command line. ``num_pstates`` Number of P-states supported by the processor (between 0 and 255 inclusive) including both turbo and non-turbo P-states (see - `Turbo P-states Support`_). + :ref:`turbo`). This attribute is present only if the value exposed by it is the same for all of the CPUs in the system. The value of this attribute is not affected by the ``no_turbo`` - setting described `below <no_turbo_attr_>`_. + setting described :ref:`below <no_turbo_attr>`. This attribute is read-only. ``turbo_pct`` - Ratio of the `turbo range <turbo_>`_ size to the size of the entire + Ratio of the :ref:`turbo range <turbo>` size to the size of the entire range of supported P-states, in percent. This attribute is present only if the value exposed by it is the same @@ -386,7 +496,7 @@ argument is passed to the kernel in the command line. ``no_turbo`` If set (equal to 1), the driver is not allowed to set any turbo P-states - (see `Turbo P-states Support`_). If unset (equal to 0, which is the + (see :ref:`turbo`). If unset (equal to 0, which is the default), turbo P-states can be set by the driver. [Note that ``intel_pstate`` does not support the general ``boost`` attribute (supported by some other scaling drivers) which is replaced @@ -395,11 +505,11 @@ argument is passed to the kernel in the command line. 
This attribute does not affect the maximum supported frequency value supplied to the ``CPUFreq`` core and exposed via the policy interface, but it affects the maximum possible value of per-policy P-state limits - (see `Interpretation of Policy Attributes`_ below for details). + (see :ref:`policy_attributes_interpretation` below for details). ``hwp_dynamic_boost`` This attribute is only present if ``intel_pstate`` works in the - `active mode with the HWP feature enabled <Active Mode With HWP_>`_ in + :ref:`active mode with the HWP feature enabled <active_mode_hwp>` in the processor. If set (equal to 1), it causes the minimum P-state limit to be increased dynamically for a short time whenever a task previously waiting on I/O is selected to run on a given logical CPU (the purpose @@ -414,12 +524,12 @@ argument is passed to the kernel in the command line. Operation mode of the driver: "active", "passive" or "off". "active" - The driver is functional and in the `active mode - <Active Mode_>`_. + The driver is functional and in the :ref:`active mode + <active_mode>`. "passive" - The driver is functional and in the `passive mode - <Passive Mode_>`_. + The driver is functional and in the :ref:`passive mode + <passive_mode>`. "off" The driver is not functional (it is not registered as a scaling @@ -447,13 +557,15 @@ argument is passed to the kernel in the command line. attribute to "1" enables the energy-efficiency optimizations and setting to "0" disables them. +.. _policy_attributes_interpretation: + Interpretation of Policy Attributes ----------------------------------- The interpretation of some ``CPUFreq`` policy attributes described in Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate`` as the current scaling driver and it generally depends on the driver's -`operation mode <Operation Modes_>`_. +:ref:`operation mode <operation_modes>`. 
First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and ``scaling_cur_freq`` attributes are produced by applying a processor-specific @@ -462,9 +574,10 @@ Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq`` attributes are capped by the frequency corresponding to the maximum P-state that the driver is allowed to set. -If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is -not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq`` -and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency. +If the ``no_turbo`` :ref:`global attribute <no_turbo_attr>` is set, the driver +is not allowed to use turbo P-states, so the maximum value of +``scaling_max_freq`` and ``scaling_min_freq`` is limited to the maximum +non-turbo P-state frequency. Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and ``scaling_min_freq`` to go down to that value if they were above it before. However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be @@ -476,7 +589,7 @@ and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state, which also is the value of ``cpuinfo_max_freq`` in either case. Next, the following policy attributes have special meaning if -``intel_pstate`` works in the `active mode <Active Mode_>`_: +``intel_pstate`` works in the :ref:`active mode <active_mode>`: ``scaling_available_governors`` List of P-state selection algorithms provided by ``intel_pstate``. @@ -497,20 +610,22 @@ processor: Shows the base frequency of the CPU. Any frequency above this will be in the turbo frequency range. -The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the +The meaning of these attributes in the :ref:`passive mode <passive_mode>` is the same as for other scaling drivers. Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate`` depends on the operation mode of the driver. 
Namely, it is either -"intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the -`passive mode <Passive Mode_>`_). +"intel_pstate" (in the :ref:`active mode <active_mode>`) or "intel_cpufreq" +(in the :ref:`passive mode <passive_mode>`). + +.. _pstate_limits_coordination: Coordination of P-State Limits ------------------------------ ``intel_pstate`` allows P-state limits to be set in two ways: with the help of -the ``max_perf_pct`` and ``min_perf_pct`` `global attributes -<Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq`` +the ``max_perf_pct`` and ``min_perf_pct`` :ref:`global attributes +<global_attributes>` or via the ``scaling_max_freq`` and ``scaling_min_freq`` ``CPUFreq`` policy attributes. The coordination between those limits is based on the following rules, regardless of the current operation mode of the driver: @@ -532,17 +647,18 @@ on the following rules, regardless of the current operation mode of the driver: 3. The global and per-policy limits can be set independently. -In the `active mode with the HWP feature enabled <Active Mode With HWP_>`_, the +In the :ref:`active mode with the HWP feature enabled <active_mode_hwp>`, the resulting effective values are written into hardware registers whenever the limits change in order to request its internal P-state selection logic to always set P-states within these limits. Otherwise, the limits are taken into account -by scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver -every time before setting a new P-state for a CPU. +by scaling governors (in the :ref:`passive mode <passive_mode>`) and by the +driver every time before setting a new P-state for a CPU. Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed at all and the only way to set the limits is by using the policy attributes. +.. 
_energy_performance_hints: Energy vs Performance Hints --------------------------- @@ -602,9 +718,9 @@ output. On those systems each ``_PSS`` object returns a list of P-states supported by the corresponding CPU which basically is a subset of the P-states range that can be used by ``intel_pstate`` on the same system, with one exception: the whole -`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By -convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz -than the frequency of the highest non-turbo P-state listed by it, but the +:ref:`turbo range <turbo>` is represented by one item in it (the topmost one). +By convention, the frequency returned by ``_PSS`` for that item is greater by +1 MHz than the frequency of the highest non-turbo P-state listed by it, but the corresponding P-state representation (following the hardware specification) returned for it matches the maximum supported turbo P-state (or is the special value 255 meaning essentially "go as high as you can get"). @@ -630,18 +746,18 @@ benefit from running at turbo frequencies will be given non-turbo P-states instead. One more issue related to that may appear on systems supporting the -`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the -turbo threshold. Namely, if that is not coordinated with the lists of P-states -returned by ``_PSS`` properly, there may be more than one item corresponding to -a turbo P-state in those lists and there may be a problem with avoiding the -turbo range (if desirable or necessary). Usually, to avoid using turbo -P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed -by ``_PSS``, but that is not sufficient when there are other turbo P-states in -the list returned by it. +:ref:`Configurable TDP feature <turbo>` allowing the platform firmware to set +the turbo threshold. 
Namely, if that is not coordinated with the lists of +P-states returned by ``_PSS`` properly, there may be more than one item +corresponding to a turbo P-state in those lists and there may be a problem with +avoiding the turbo range (if desirable or necessary). Usually, to avoid using +turbo P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state +listed by ``_PSS``, but that is not sufficient when there are other turbo +P-states in the list returned by it. Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the -`passive mode <Passive Mode_>`_, except that the number of P-states it can set -is limited to the ones listed by the ACPI ``_PSS`` objects. +:ref:`passive mode <passive_mode>`, except that the number of P-states it can +set is limited to the ones listed by the ACPI ``_PSS`` objects. Kernel Command Line Options for ``intel_pstate`` @@ -656,11 +772,11 @@ of them have to be prepended with the ``intel_pstate=`` prefix. processor is supported by it. ``active`` - Register ``intel_pstate`` in the `active mode <Active Mode_>`_ to start - with. + Register ``intel_pstate`` in the :ref:`active mode <active_mode>` to + start with. ``passive`` - Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to + Register ``intel_pstate`` in the :ref:`passive mode <passive_mode>` to start with. ``force`` @@ -693,9 +809,12 @@ of them have to be prepended with the ``intel_pstate=`` prefix. and this option has no effect. ``per_cpu_perf_limits`` - Use per-logical-CPU P-State limits (see `Coordination of P-state - Limits`_ for details). + Use per-logical-CPU P-State limits (see + :ref:`pstate_limits_coordination` for details). +``no_cas`` + Do not enable :ref:`capacity-aware scheduling <CAS>` which is enabled + by default on hybrid systems without SMT. Diagnostics and Tuning ====================== @@ -707,7 +826,7 @@ There are two static trace events that can be used for ``intel_pstate`` diagnostics. 
One of them is the ``cpu_frequency`` trace event generally used by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if -it works in the `active mode <Active Mode_>`_. +it works in the :ref:`active mode <active_mode>`. The following sequence of shell commands can be used to enable them and see their output (if the kernel is generally configured to support event tracing):: @@ -719,7 +838,7 @@ their output (if the kernel is generally configured to support event tracing):: gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476 cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 -If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the +If ``intel_pstate`` works in the :ref:`passive mode <passive_mode>`, the ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the policies with other scaling governors). diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst index 5151ec312dc0..d367ba4d744a 100644 --- a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst +++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst @@ -91,12 +91,22 @@ Attributes in each directory: ``domain_id`` This attribute is used to get the power domain id of this instance. +``die_id`` + This attribute is used to get the Linux die id of this instance. + This attribute is only present for domains with core agents and + when the CPUID leaf 0x1f presents die ID. + ``fabric_cluster_id`` This attribute is used to get the fabric cluster id of this instance. ``package_id`` This attribute is used to get the package id of this instance. 
+``agent_types`` + This attribute displays all the hardware agents present within the + domain. Each agent has the capability to control one or more hardware + subsystems, which include: core, cache, memory, and I/O. + The other attributes are same as presented at package_*_die_* level. In most of current use cases, the "max_freq_khz" and "min_freq_khz" diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst index 3eda08191d13..24d80e3eb309 100644 --- a/Documentation/admin-guide/pnp.rst +++ b/Documentation/admin-guide/pnp.rst @@ -129,9 +129,6 @@ pnp_put_protocol pnp_register_protocol use this to register a new PnP protocol -pnp_unregister_protocol - use this function to remove a PnP protocol from the Plug and Play Layer - pnp_register_driver adds a PnP driver to the Plug and Play Layer diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst index 07cfd8863b46..cb178e0a6208 100644 --- a/Documentation/admin-guide/quickly-build-trimmed-linux.rst +++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst @@ -273,7 +273,7 @@ again. does nothing at all; in that case you have to manually install your kernel, as outlined in the reference section. - If you are running a immutable Linux distribution, check its documentation + If you are running an immutable Linux distribution, check its documentation and the web to find out how to install your own kernel there. [:ref:`details<install>`] @@ -347,14 +347,16 @@ again. [:ref:`details<uninstall>`] -.. _submit_improvements: +.. _submit_improvements_qbtl: -Did you run into trouble following any of the above steps that is not cleared up -by the reference section below? Or do you have ideas how to improve the text? -Then please take a moment of your time and let the maintainer of this document -know by email (Thorsten Leemhuis <linux@leemhuis.info>), ideally while CCing the -Linux docs mailing list (linux-doc@vger.kernel.org). 
Such feedback is vital to -improve this document further, which is in everybody's interest, as it will +Did you run into trouble following the step-by-step guide not cleared up by the +reference section below? Did you spot errors? Or do you have ideas on how to +improve the guide? + +If any of that applies, please let the developers know by sending a short note +or a patch to Thorsten Leemhuis <linux@leemhuis.info> while ideally CCing the +public Linux docs mailing list <linux-doc@vger.kernel.org>. Such feedback is +vital to improve this text further, which is in everybody's interest, as it will enable more people to master the task described here. Reference section for the step-by-step guide @@ -884,7 +886,7 @@ When a build error occurs, it might be caused by some aspect of your machine's setup that often can be fixed quickly; other times though the problem lies in the code and can only be fixed by a developer. A close examination of the failure messages coupled with some research on the internet will often tell you -which of the two it is. To perform such a investigation, restart the build +which of the two it is. To perform such an investigation, restart the build process like this:: make V=1 @@ -1070,7 +1072,7 @@ complicated, and harder to follow. That being said: this of course is a balancing act. Hence, if you think an additional use-case is worth describing, suggest it to the maintainers of this -document, as :ref:`described above <submit_improvements>`. +document, as :ref:`described above <submit_improvements_qbtl>`. .. diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst index 2fd5a030235a..16a66a1f1975 100644 --- a/Documentation/admin-guide/reporting-issues.rst +++ b/Documentation/admin-guide/reporting-issues.rst @@ -41,13 +41,23 @@ If you are facing multiple issues with the Linux kernel at once, report each separately. 
While writing your report, include all information relevant to the issue, like the kernel and the distro used. In case of a regression, CC the regressions mailing list (regressions@lists.linux.dev) to your report. Also try -to pin-point the culprit with a bisection; if you succeed, include its +to pinpoint the culprit with a bisection; if you succeed, include its commit-id and CC everyone in the sign-off-by chain. Once the report is out, answer any questions that come up and help where you can. That includes keeping the ball rolling by occasionally retesting with newer releases and sending a status update afterwards. +.. + Note: If you see this note, you are reading the text's source file. You + might want to switch to a rendered version: It makes it a lot easier to + read and navigate this document -- especially when you want to look something + up in the reference section, then jump back to where you left off. +.. + Find the latest rendered version of this text here: + https://docs.kernel.org/admin-guide/reporting-issues.html + + Step-by-step guide how to report issues to the kernel maintainers ================================================================= @@ -206,7 +216,7 @@ Reporting issues only occurring in older kernel version lines This subsection is for you, if you tried the latest mainline kernel as outlined above, but failed to reproduce your issue there; at the same time you want to see the issue fixed in a still supported stable or longterm series or vendor -kernels regularly rebased on those. If that the case, follow these steps: +kernels regularly rebased on those. If that is the case, follow these steps: * Prepare yourself for the possibility that going through the next few steps might not get the issue solved in older releases: the fix might be too big @@ -231,45 +241,54 @@ kernels regularly rebased on those. If that the case, follow these steps: The reference section below explains each of these steps in more detail. 
+Conclusion of the step-by-step guide +------------------------------------ + +Did you run into trouble following the step-by-step guide not cleared up by the +reference section below? Did you spot errors? Or do you have ideas on how to +improve the guide? + +If any of that applies, please let the developers know by sending a short note +or a patch to Thorsten Leemhuis <linux@leemhuis.info> while ideally CCing the +public Linux docs mailing list <linux-doc@vger.kernel.org>. Such feedback is +vital to improve this text further, which is in everybody's interest, as it will +enable more people to master the task described here. + + Reference section: Reporting issues to the kernel maintainers ============================================================= -The detailed guides above outline all the major steps in brief fashion, which -should be enough for most people. But sometimes there are situations where even -experienced users might wonder how to actually do one of those steps. That's -what this section is for, as it will provide a lot more details on each of the -above steps. Consider this as reference documentation: it's possible to read it -from top to bottom. But it's mainly meant to skim over and a place to look up -details how to actually perform those steps. - -A few words of general advice before digging into the details: - - * The Linux kernel developers are well aware this process is complicated and - demands more than other FLOSS projects. We'd love to make it simpler. But - that would require work in various places as well as some infrastructure, - which would need constant maintenance; nobody has stepped up to do that - work, so that's just how things are for now. - - * A warranty or support contract with some vendor doesn't entitle you to - request fixes from developers in the upstream Linux kernel community: such - contracts are completely outside the scope of the Linux kernel, its - development community, and this document. 
That's why you can't demand - anything such a contract guarantees in this context, not even if the - developer handling the issue works for the vendor in question. If you want - to claim your rights, use the vendor's support channel instead. When doing - so, you might want to mention you'd like to see the issue fixed in the - upstream Linux kernel; motivate them by saying it's the only way to ensure - the fix in the end will get incorporated in all Linux distributions. - - * If you never reported an issue to a FLOSS project before you should consider - reading `How to Report Bugs Effectively - <https://www.chiark.greenend.org.uk/~sgtatham/bugs.html>`_, `How To Ask - Questions The Smart Way - <http://www.catb.org/esr/faqs/smart-questions.html>`_, and `How to ask good - questions <https://jvns.ca/blog/good-questions/>`_. - -With that off the table, find below the details on how to properly report -issues to the Linux kernel developers. +The step-by-step guide above outlines all the major steps in brief fashion, +which usually covers everything required. But even experienced users will +sometimes wonder how to actually realize some of those steps or why they are +needed; there are also corner cases the guide ignores for readability. That is +what the entries in this reference section are for, which provide additional +information for each of the steps in the guide. + +A few words of general advice: + +* The Linux developers are well aware that reporting bugs to them is more + complicated and demanding than in other FLOSS projects. Some of it is because + the kernel is different, among other things due to its mail-driven development + process and because it consists mostly of drivers. Some of it is because + improving things would require work in several technical areas and people + triaging bugs -- and nobody has stepped up to do or fund that work. 
+ +* A warranty or support contract with some vendor doesn't entitle you to + request fixes from the upstream Linux developers: Such contracts are + completely outside the scope of the upstream Linux kernel, its development + community, and this document -- even if those handling the issue work for the + vendor who issued the contract. If you want to claim your rights, use the + vendor's support channel. + +* If you have never reported an issue to a FLOSS project before, consider skimming + guides like `How to ask good questions + <https://jvns.ca/blog/good-questions/>`_, `How To Ask Questions The Smart Way + <http://www.catb.org/esr/faqs/smart-questions.html>`_, and `How to Report + Bugs Effectively <https://www.chiark.greenend.org.uk/~sgtatham/bugs.html>`_. + +With that off the table, find below details for the steps from the detailed +guide on reporting issues to the Linux kernel developers. Make sure you're using the upstream Linux kernel @@ -312,7 +331,7 @@ small modifications to a kernel based on a recent Linux version; that for example often holds true for the mainline kernels shipped by Debian GNU/Linux Sid or Fedora Rawhide. Some developers will also accept reports about issues with kernels from distributions shipping the latest stable kernel, as long as -its only slightly modified; that for example is often the case for Arch Linux, +it's only slightly modified; that for example is often the case for Arch Linux, regular Fedora releases, and openSUSE Tumbleweed. But keep in mind, you better want to use a mainline Linux and avoid using a stable kernel for this process, as outlined in the section 'Install a fresh kernel for testing' in more @@ -611,7 +630,7 @@ better place. 
How to read the MAINTAINERS file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -To illustrate how to use the :ref:`MAINTAINERS <maintainers>` file, lets assume +To illustrate how to use the :ref:`MAINTAINERS <maintainers>` file, let's assume the WiFi in your Laptop suddenly misbehaves after updating the kernel. In that case it's likely an issue in the WiFi driver. Obviously it could also be some code it builds upon, but unless you suspect something like that stick to the @@ -1543,7 +1562,7 @@ as well, because that will speed things up. And note, it helps developers a great deal if you can specify the exact version that introduced the problem. Hence if possible within a reasonable time frame, -try to find that version using vanilla kernels. Lets assume something broke when +try to find that version using vanilla kernels. Let's assume something broke when your distributor released a update from Linux kernel 5.10.5 to 5.10.8. Then as instructed above go and check the latest kernel from that version line, say 5.10.9. If it shows the problem, try a vanilla 5.10.5 to ensure that no patches @@ -1674,72 +1693,59 @@ for the subsystem where the issue seems to have its roots; CC the mailing list for the subsystem as well as the stable mailing list (stable@vger.kernel.org). -Why some issues won't get any reaction or remain unfixed after being reported -============================================================================= - -When reporting a problem to the Linux developers, be aware only 'issues of high -priority' (regressions, security issues, severe problems) are definitely going -to get resolved. The maintainers or if all else fails Linus Torvalds himself -will make sure of that. They and the other kernel developers will fix a lot of -other issues as well. But be aware that sometimes they can't or won't help; and -sometimes there isn't even anyone to send a report to. - -This is best explained with kernel developers that contribute to the Linux -kernel in their spare time. 
Quite a few of the drivers in the kernel were -written by such programmers, often because they simply wanted to make their -hardware usable on their favorite operating system. - -These programmers most of the time will happily fix problems other people -report. But nobody can force them to do, as they are contributing voluntarily. - -Then there are situations where such developers really want to fix an issue, -but can't: sometimes they lack hardware programming documentation to do so. -This often happens when the publicly available docs are superficial or the -driver was written with the help of reverse engineering. - -Sooner or later spare time developers will also stop caring for the driver. -Maybe their test hardware broke, got replaced by something more fancy, or is so -old that it's something you don't find much outside of computer museums -anymore. Sometimes developer stops caring for their code and Linux at all, as -something different in their life became way more important. In some cases -nobody is willing to take over the job as maintainer – and nobody can be forced -to, as contributing to the Linux kernel is done on a voluntary basis. Abandoned -drivers nevertheless remain in the kernel: they are still useful for people and -removing would be a regression. - -The situation is not that different with developers that are paid for their -work on the Linux kernel. Those contribute most changes these days. But their -employers sooner or later also stop caring for their code or make its -programmer focus on other things. Hardware vendors for example earn their money -mainly by selling new hardware; quite a few of them hence are not investing -much time and energy in maintaining a Linux kernel driver for something they -stopped selling years ago. Enterprise Linux distributors often care for a -longer time period, but in new versions often leave support for old and rare -hardware aside to limit the scope. 
Often spare time contributors take over once -a company orphans some code, but as mentioned above: sooner or later they will -leave the code behind, too. - -Priorities are another reason why some issues are not fixed, as maintainers -quite often are forced to set those, as time to work on Linux is limited. -That's true for spare time or the time employers grant their developers to -spend on maintenance work on the upstream kernel. Sometimes maintainers also -get overwhelmed with reports, even if a driver is working nearly perfectly. To -not get completely stuck, the programmer thus might have no other choice than -to prioritize issue reports and reject some of them. - -But don't worry too much about all of this, a lot of drivers have active -maintainers who are quite interested in fixing as many issues as possible. - - -Closing words -============= - -Compared with other Free/Libre & Open Source Software it's hard to report -issues to the Linux kernel developers: the length and complexity of this -document and the implications between the lines illustrate that. But that's how -it is for now. The main author of this text hopes documenting the state of the -art will lay some groundwork to improve the situation over time. - +Appendix: Why it is somewhat hard to report kernel bugs +======================================================= + +The Linux kernel developers are well aware that reporting bugs to them is harder +than in other Free/Libre Open Source Projects. Many reasons for that lie in the +nature of kernels, Linux' development model, and how the world uses the kernel: + +* *Most kernels of Linux distributions are totally unsuitable for reporting bugs + upstream.* The reference section above already explained this in detail: + outdated codebases as well as modifications and add-ons lead to kernel bugs + that were fixed upstream a long time ago or never happened there in the first + place. 
Developers of other Open Source software face these problems as well, + but the situation is a lot worse when it comes to the kernel, as the changes + and their impact are much more severe -- which is why many kernel developers + expect reports with kernels built from fresh and nearly unmodified sources. + +* *Bugs often only occur in a special environment.* That is because Linux is + mostly drivers and can be used in a multitude of ways. Developers often do not + have a matching setup at hand -- and therefore frequently must rely on bug + reporters for isolating a problem's cause and testing proposed fixes. + +* *The kernel has hundreds of maintainers, but all-rounders are very rare.* That + again is an effect caused by the multitude of features and drivers, due to + which many kernel developers know little about lower or higher layers related + to their code and even less about other areas. + +* *It is hard to find where to report issues, among other things due to the lack + of a central bug tracker.* This is something even some kernel developers + dislike, but that's the situation everyone has to deal with currently. + +* *Stable and longterm kernels are primarily maintained by a dedicated 'stable + team', which only handles regressions introduced within stable and longterm + series.* When someone reports a bug, say, using Linux 6.1.2, the team will, + therefore, always ask if mainline is affected: if the bug already happened + in 6.1 or occurs with latest mainline (say, 6.2-rc3), they in everybody's + interest shove it to the regular developers, as those know the code best. + +* *Linux developers are free to focus on latest mainline.* Some, thus, react + coldly to reports about bugs in, say, Linux 6.0 when 6.1 is already out; + even the latter might not be enough once 6.2-rc1 is out. Some will also not + be very welcoming to reports with 6.1.5 or 6.1.6, as the problem might be a + series-specific regression the stable team (see above) caused and must fix. 
+ +* *Sometimes there is nobody to help.* Sometimes this is due to the lack of + hardware documentation -- for example, when a driver was built using reverse + engineering or was taken over by spare-time developers when the hardware + manufacturer left it behind. Other times there is nobody to even report bugs + to: when maintainers move on without a replacement, their code often remains + in the kernel as long as it's useful. + +Some of these aspects could be improved to facilitate bug reporting -- many +Linux kernel developers are well aware of this and would be glad if a few +individuals or an entity would make this their mission. .. end-of-content diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst index a3dfc2c66e01..1609e7479249 100644 --- a/Documentation/admin-guide/serial-console.rst +++ b/Documentation/admin-guide/serial-console.rst @@ -78,7 +78,9 @@ If no console device is specified, the first device found capable of acting as a system console will be used. At this time, the system first looks for a VGA card and then for a serial port. So if you don't have a VGA card in your system the first serial port will automatically -become the console. +become the console, unless the kernel is configured with the +CONFIG_NULL_TTY_DEFAULT_CONSOLE option, in which case it will default to using +the ttynull device. You will need to create a new device to use ``/dev/console``. The official ``/dev/console`` is now character device 5,1. 
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst index e3cfffef5a63..c1768d9e80fa 100644 --- a/Documentation/admin-guide/syscall-user-dispatch.rst +++ b/Documentation/admin-guide/syscall-user-dispatch.rst @@ -53,20 +53,25 @@ following prctl: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector]) -<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and -disable the mechanism globally for that thread. When -PR_SYS_DISPATCH_OFF is used, the other fields must be zero. - -[<offset>, <offset>+<length>) delimit a memory region interval -from which syscalls are always executed directly, regardless of the -userspace selector. This provides a fast path for the C library, which -includes the most common syscall dispatchers in the native code -applications, and also provides a way for the signal handler to return +<op> is either PR_SYS_DISPATCH_EXCLUSIVE_ON/PR_SYS_DISPATCH_INCLUSIVE_ON +or PR_SYS_DISPATCH_OFF, to enable and disable the mechanism globally for +that thread. When PR_SYS_DISPATCH_OFF is used, the other fields must be zero. + +For PR_SYS_DISPATCH_EXCLUSIVE_ON [<offset>, <offset>+<length>) delimit +a memory region interval from which syscalls are always executed directly, +regardless of the userspace selector. This provides a fast path for the +C library, which includes the most common syscall dispatchers in the native +code applications, and also provides a way for the signal handler to return without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this interface should make sure that at least the signal trampoline code is included in this region. In addition, for syscalls that implement the trampoline code on the vDSO, that trampoline is never intercepted. +For PR_SYS_DISPATCH_INCLUSIVE_ON [<offset>, <offset>+<length>) delimit +a memory region interval from which syscalls are dispatched based on +the userspace selector. 
Syscalls from outside of the range are always +executed directly. + [selector] is a pointer to a char-sized region in the process memory region, that provides a quick way to enable disable syscall redirection thread-wide, without the need to invoke the kernel directly. selector diff --git a/Documentation/admin-guide/sysctl/crypto.rst b/Documentation/admin-guide/sysctl/crypto.rst new file mode 100644 index 000000000000..b707bd314a64 --- /dev/null +++ b/Documentation/admin-guide/sysctl/crypto.rst @@ -0,0 +1,47 @@ +================= +/proc/sys/crypto/ +================= + +These files show up in ``/proc/sys/crypto/``, depending on the +kernel configuration: + +.. contents:: :local: + +fips_enabled +============ + +Read-only flag that indicates whether FIPS mode is enabled. + +- ``0``: FIPS mode is disabled (default). +- ``1``: FIPS mode is enabled. + +This value is set at boot time via the ``fips=1`` kernel command line +parameter. When enabled, the cryptographic API will restrict the use +of certain algorithms and perform self-tests to ensure compliance with +FIPS (Federal Information Processing Standards) requirements, such as +FIPS 140-2 and the newer FIPS 140-3, depending on the kernel +configuration and the module in use. + +fips_name +========= + +Read-only file that contains the name of the FIPS module currently in use. +The value is typically configured via the ``CONFIG_CRYPTO_FIPS_NAME`` +kernel configuration option. + +fips_version +============ + +Read-only file that contains the version string of the FIPS module. +If ``CONFIG_CRYPTO_FIPS_CUSTOM_VERSION`` is set, it uses the value from +``CONFIG_CRYPTO_FIPS_VERSION``. Otherwise, it defaults to the kernel +release version (``UTS_RELEASE``). + +Copyright (c) 2026, Shubham Chakraborty <chakrabortyshubham66@gmail.com> + +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. + +.. See scripts/check-sysctl-docs to keep this up to date: +.. 
scripts/check-sysctl-docs -vtable="crypto" \ +.. $(git grep -l register_sysctl_) diff --git a/Documentation/admin-guide/sysctl/debug.rst b/Documentation/admin-guide/sysctl/debug.rst new file mode 100644 index 000000000000..506bd5e48594 --- /dev/null +++ b/Documentation/admin-guide/sysctl/debug.rst @@ -0,0 +1,52 @@ +================ +/proc/sys/debug/ +================ + +These files show up in ``/proc/sys/debug/``, depending on the +kernel configuration: + +.. contents:: :local: + +exception-trace +=============== + +This flag controls whether the kernel prints information about unhandled +signals (like segmentation faults) to the kernel log (``dmesg``). + +- ``0``: Unhandled signals are not traced. +- ``1``: Information about unhandled signals is printed. + +The default value is ``1`` on most architectures (like x86, MIPS, RISC-V), +but it is ``0`` on **arm64**. + +The actual information printed and the context provided varies +significantly depending on the CPU architecture. For example: + +- On **x86**, it typically prints the instruction pointer (IP), error + code, and address that caused a page fault. +- On **PowerPC**, it may print the next instruction pointer (NIP), + link register (LR), and other relevant registers. + +When enabled, this feature is often rate-limited to prevent the kernel +log from being flooded during a crash loop. + +kprobes-optimization +==================== + +This flag enables or disables the optimization of Kprobes on certain +architectures (like x86). + +- ``0``: Kprobes optimization is turned off. +- ``1``: Kprobes optimization is turned on (default). + +For more details on Kprobes and its optimization, please refer to +Documentation/trace/kprobes.rst. + +Copyright (c) 2026, Shubham Chakraborty <chakrabortyshubham66@gmail.com> + +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. + +.. See scripts/check-sysctl-docs to keep this up to date: +.. scripts/check-sysctl-docs -vtable="debug" \ +.. 
$(git grep -l register_sysctl_) diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst index 08e89e031714..9b7f65c3efd8 100644 --- a/Documentation/admin-guide/sysctl/fs.rst +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -164,8 +164,8 @@ pipe-user-pages-soft -------------------- Maximum total number of pages a non-privileged user may allocate for pipes -before the pipe size gets limited to a single page. Once this limit is reached, -new pipes will be limited to a single page in size for this user in order to +before the pipe size gets limited to two pages. Once this limit is reached, +new pipes will be limited to two pages in size for this user in order to limit total memory usage, and trying to increase them using ``fcntl()`` will be denied until usage goes below the limit again. The default value allows to allocate up to 1024 pipes at their default size. When set to 0, no limit is @@ -347,3 +347,28 @@ filesystems: ``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for setting/getting the maximum number of pages that can be used for servicing requests in FUSE. + +``/proc/sys/fs/fuse/default_request_timeout`` is a read/write file for +setting/getting the default timeout (in seconds) for a fuse server to +reply to a kernel-issued request in the event where the server did not +specify a timeout at mount. If the server set a timeout, +then default_request_timeout will be ignored. The default +"default_request_timeout" is set to 0. 0 indicates no default timeout. +The maximum value that can be set is 65535. + +``/proc/sys/fs/fuse/max_request_timeout`` is a read/write file for +setting/getting the maximum timeout (in seconds) for a fuse server to +reply to a kernel-issued request. A value greater than 0 automatically opts +the server into a timeout that will be set to at most "max_request_timeout", +even if the server did not specify a timeout and default_request_timeout is +set to 0. 
If max_request_timeout is greater than 0 and the server set a timeout +greater than max_request_timeout or default_request_timeout is set to a value +greater than max_request_timeout, the system will use max_request_timeout as the +timeout. 0 indicates no max request timeout. The maximum value that can be set +is 65535. + +For timeouts, if the server does not respond to the request by the time +the set timeout elapses, then the connection to the fuse server will be aborted. +Please note that the timeouts are not 100% precise (eg you may set 60 seconds but +the timeout may kick in after 70 seconds). The upper margin of error for the +timeout is roughly FUSE_TIMEOUT_TIMER_FREQ seconds. diff --git a/Documentation/admin-guide/sysctl/index.rst b/Documentation/admin-guide/sysctl/index.rst index 03346f98c7b9..50f00514f0ff 100644 --- a/Documentation/admin-guide/sysctl/index.rst +++ b/Documentation/admin-guide/sysctl/index.rst @@ -66,33 +66,42 @@ This documentation is about: =============== =============================================================== abi/ execution domains & personalities -debug/ <empty> -dev/ device specific information (eg dev/cdrom/info) +<$ARCH> tuning controls for various CPU architecture (e.g. csky, s390) +crypto/ cryptographic subsystem +debug/ debugging features +dev/ device specific information (e.g. 
dev/cdrom/info) fs/ specific filesystems filehandle, inode, dentry and quota tuning binfmt_misc <Documentation/admin-guide/binfmt-misc.rst> kernel/ global kernel info / tuning miscellaneous stuff + some architecture-specific controls + security (LSM) stuff net/ networking stuff, for documentation look in: <Documentation/networking/> proc/ <empty> sunrpc/ SUN Remote Procedure Call (NFS) +user/ Per user namespace limits vm/ memory management tuning buffer and cache management -user/ Per user per user namespace limits +xen/ Xen hypervisor controls =============== =============================================================== -These are the subdirs I have on my system. There might be more -or other subdirs in another setup. If you see another dir, I'd -really like to hear about it :-) +These are the subdirs I have on my system or have been discovered by +searching through the source code. There might be more or other subdirs +in another setup. If you see another dir, I'd really like to hear about +it :-) .. toctree:: :maxdepth: 1 abi + crypto + debug fs kernel net sunrpc user vm + xen diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index a43b78b4b646..c6994e55d141 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -177,6 +177,7 @@ core_pattern %E executable path %c maximum size of core file by resource limit RLIMIT_CORE %C CPU the task ran on + %F pidfd number %<OTHER> both are dropped ======== ========================================== @@ -212,6 +213,17 @@ pid>/``). This value defaults to 0. +core_sort_vma +============= + +The default coredump writes VMAs in address order. By setting +``core_sort_vma`` to 1, VMAs will be written from smallest size +to largest size. This is known to break at least elfutils, but +can be handy when dealing with very large (and truncated) +coredumps where the more useful debugging details are included +in the smaller VMAs. 
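The effect of ``core_sort_vma`` on a truncated dump can be sketched as follows. This is only a model of the ordering described above, not the kernel's coredump writer: the ``Vma`` type, the mapping names, the sizes, and the 1 GiB truncation point are all hypothetical, and the sketch assumes a mapping is useful only if it fits completely before the cut-off.

```python
from typing import NamedTuple


class Vma(NamedTuple):
    """Hypothetical stand-in for a memory mapping; not a kernel structure."""
    name: str
    start: int
    size: int


def dump_order(vmas, core_sort_vma):
    """Return the VMAs in the order they would be written to the dump."""
    if core_sort_vma:
        # core_sort_vma=1: smallest mapping first.
        return sorted(vmas, key=lambda v: v.size)
    # Default: address order.
    return sorted(vmas, key=lambda v: v.start)


def written_before_truncation(vmas, core_sort_vma, limit):
    """Names of the VMAs that fit completely before the dump is cut at `limit` bytes."""
    written, used = [], 0
    for vma in dump_order(vmas, core_sort_vma):
        if used + vma.size > limit:
            break
        written.append(vma.name)
        used += vma.size
    return written


# Illustrative process layout (made-up addresses and sizes).
vmas = [
    Vma("heap", 0x5555_0000_0000, 8 << 30),    # 8 GiB, lowest address
    Vma("stack", 0x7FFD_0000_0000, 132 << 10),  # 132 KiB
    Vma("vdso", 0x7FFE_0000_0000, 8 << 10),     # 8 KiB
]
limit = 1 << 30  # assume the dump is truncated after 1 GiB

print(written_before_truncation(vmas, 0, limit))
print(written_before_truncation(vmas, 1, limit))
```

In this made-up layout the large heap comes first in address order and exhausts the budget, while size order lets the small mappings through before the heap is reached.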
+ + core_uses_pid ============= @@ -385,13 +397,14 @@ a hung task is detected. hung_task_panic =============== -Controls the kernel's behavior when a hung task is detected. +When set to a non-zero value, a kernel panic will be triggered if the +number of hung tasks found during a single scan reaches this value. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -= ================================================= += ======================================================= 0 Continue operation. This is the default behavior. -1 Panic immediately. -= ================================================= +N Panic when N hung tasks are found during a single scan. += ======================================================= hung_task_check_count @@ -405,10 +418,16 @@ hung_task_detect_count ====================== Indicates the total number of tasks that have been detected as hung since -the system boot. +the system boot or since the counter was reset. The counter is zeroed when +a value of 0 is written. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. +hung_task_sys_info +================== +A comma separated list of extra system information to be dumped when +hung task is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. hung_task_timeout_secs ====================== @@ -503,6 +522,15 @@ default), only processes with the CAP_SYS_ADMIN capability may create io_uring instances. +kernel_sys_info +=============== +A comma separated list of extra system information to be dumped when +soft/hard lockup is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. + +It serves as the default kernel control knob, which will take effect +when a kernel module calls sys_info() with parameter==0. + kexec_load_disabled =================== @@ -564,6 +592,14 @@ if leaking kernel pointer values to unprivileged users is a concern. 
When ``kptr_restrict`` is set to 2, kernel pointers printed using %pK will be replaced with 0s regardless of privileges. +For disabling these security restrictions early at boot time (and once +for all), use the ``hash_pointers`` boot parameter instead. + +softlockup_sys_info & hardlockup_sys_info +========================================= +A comma separated list of extra system information to be dumped when +soft/hard lockup is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. modprobe ======== @@ -878,7 +914,7 @@ bit 1 print system memory info bit 2 print timer info bit 3 print locks info if ``CONFIG_LOCKDEP`` is on bit 4 print ftrace buffer -bit 5 print all printk messages in buffer +bit 5 replay all kernel messages on consoles at the end of panic bit 6 print all CPUs backtrace (if available in the arch) bit 7 print only tasks in uninterruptible (blocked) state ===== ============================================ @@ -888,6 +924,24 @@ So for example to print tasks and memory info on panic, user can:: echo 3 > /proc/sys/kernel/panic_print +panic_sys_info +============== + +A comma separated list of extra information to be dumped on panic, +for example, "tasks,mem,timers,...". It is a human readable alternative +to 'panic_print'. Possible values are: + +============= =================================================== +tasks print all tasks info +mem print system memory info +timers print timers info +locks print locks info if CONFIG_LOCKDEP is on +ftrace print ftrace buffer +all_bt print all CPUs backtrace (if available in the arch) +blocked_tasks print only tasks in uninterruptible (blocked) state +============= =================================================== + + panic_on_rcu_stall ================== @@ -1003,30 +1057,26 @@ perf_user_access (arm64 and riscv only) Controls user space access for reading perf event counters. -arm64 -===== - -The default value is 0 (access disabled). 
+* for arm64 + The default value is 0 (access disabled). -When set to 1, user space can read performance monitor counter registers -directly. + When set to 1, user space can read performance monitor counter registers + directly. -See Documentation/arch/arm64/perf.rst for more information. + See Documentation/arch/arm64/perf.rst for more information. -riscv -===== +* for riscv + When set to 0, user space access is disabled. -When set to 0, user space access is disabled. + The default value is 1, user space can read performance monitor counter + registers through perf, any direct access without perf intervention will trigger + an illegal instruction. -The default value is 1, user space can read performance monitor counter -registers through perf, any direct access without perf intervention will trigger -an illegal instruction. + When set to 2, which enables legacy mode (user space has direct access to cycle + and insret CSRs only). Note that this legacy value is deprecated and will be + removed once all user space applications are fixed. -When set to 2, which enables legacy mode (user space has direct access to cycle -and insret CSRs only). Note that this legacy value is deprecated and will be -removed once all user space applications are fixed. - -Note that the time CSR is always directly accessible to all modes. + Note that the time CSR is always directly accessible to all modes. pid_max ======= @@ -1099,7 +1149,8 @@ printk_ratelimit_burst While long term we enforce one message per `printk_ratelimit`_ seconds, we do allow a burst of messages to pass through. ``printk_ratelimit_burst`` specifies the number of messages we can -send before ratelimiting kicks in. +send before ratelimiting kicks in. After `printk_ratelimit`_ seconds +have elapsed, another burst of messages may be sent. The default value is 10 messages. @@ -1188,12 +1239,6 @@ that support this feature. 
== =========================================================================== -real-root-dev -============= - -See Documentation/admin-guide/initrd.rst. - - reboot-cmd (SPARC only) ======================= @@ -1454,7 +1499,7 @@ stack_erasing ============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. +of syscalls for kernels built with ``CONFIG_KSTACK_ERASE``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. @@ -1462,7 +1507,7 @@ The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. = ==================================================================== -0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +0 Kernel stack erasing is disabled, KSTACK_ERASE_METRICS are not updated. 1 Kernel stack erasing is enabled (default), it is performed before returning to the userspace at the end of syscalls. = ==================================================================== diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 7b0c4291c686..0724a793798f 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -40,8 +40,8 @@ Table : Subdirectories in /proc/sys/net bridge Bridging rose X.25 PLP layer core General parameter tipc TIPC ethernet Ethernet protocol unix Unix domain sockets - ipv4 IP version 4 x25 X.25 protocol - ipv6 IP version 6 + ipv4 IP version 4 vsock VSOCK sockets + ipv6 IP version 6 x25 X.25 protocol ========= =================== = ========== =================== 1. /proc/sys/net/core - Network core options @@ -212,6 +212,14 @@ mem_pcpu_rsv Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU. 
+bypass_prot_mem +--------------- + +Skip charging socket buffers to the global per-protocol memory +accounting controlled by net.ipv4.tcp_mem, net.ipv4.udp_mem, etc. + +Default: 0 (off) + rmem_default ------------ @@ -222,6 +230,8 @@ rmem_max The maximum receive socket buffer size in bytes. +Default: 4194304 + rps_default_mask ---------------- @@ -247,6 +257,8 @@ wmem_max The maximum send socket buffer size in bytes. +Default: 4194304 + message_burst and message_cost ------------------------------ @@ -291,24 +303,33 @@ netdev_max_backlog Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them. +qdisc_max_burst +------------------ + +Maximum number of packets that can be temporarily stored before +reaching qdisc. + +Default: 1000 + netdev_rss_key -------------- -RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is -randomly generated. +RSS (Receive Side Scaling) enabled drivers use a host key that +is randomly generated. Some user space might need to gather its content even if drivers do not provide ethtool -x support yet. :: myhost:~# cat /proc/sys/net/core/netdev_rss_key - 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (256 bytes total) -File contains nul bytes if no driver ever called netdev_rss_key_fill() function. +File contains all nul bytes if no driver ever called netdev_rss_key_fill() +function. Note: - /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, - but most drivers only use 40 bytes of it. + /proc/sys/net/core/netdev_rss_key contains 256 bytes of key, + but many drivers only use 40 or 52 bytes of it. :: @@ -343,9 +364,9 @@ skb_defer_max ------------- Max size (in skbs) of the per-cpu list of skbs being freed -by the cpu which allocated them. Used by TCP stack so far. +by the cpu which allocated them. 
-Default: 64 +Default: 128 optmem_max ---------- @@ -402,6 +423,23 @@ to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). If set to 1 (default), hash rethink is performed on listening socket. If set to 0, hash rethink is not performed. +txq_reselection_ms +------------------ + +Controls how often (in ms) a busy connected flow can select another tx queue. + +A reselection is desirable when/if the user thread has migrated and XPS +would select a different queue. The same can occur without XPS +if the flow hash has changed. + +But switching txq can introduce reorders, especially if the +old queue is under high pressure. Modern TCP stacks deal +well with reorders if they do not happen too often. + +To disable this feature, set the value to 0. + +Default: 1000 + gro_normal_batch ---------------- @@ -513,3 +551,82 @@ originally may have been issued in the correct sequential order. If named_timeout is nonzero, failed topology updates will be placed on a defer queue until another event arrives that clears the error, or until the timeout expires. Value is in milliseconds. + +6. /proc/sys/net/vsock - VSOCK sockets +-------------------------------------- + +VSOCK sockets (AF_VSOCK) provide communication between virtual machines and +their hosts. The behavior of VSOCK sockets in a network namespace is determined +by the namespace's mode (``global`` or ``local``), which controls how CIDs +(Context IDs) are allocated and how sockets interact across namespaces. + +ns_mode +------- + +Read-only. Reports the current namespace's mode, set at namespace creation +and immutable thereafter. + +Values: + + - ``global`` - the namespace shares system-wide CID allocation and + its sockets can reach any VM or socket in any global namespace. + Sockets in this namespace cannot reach sockets in local + namespaces. + - ``local`` - the namespace has private CID allocation and its + sockets can only connect to VMs or sockets within the same + namespace. + +The init_net mode is always ``global``. 
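The reachability rules for the two ``ns_mode`` values can be summarized in a small model. This is a sketch derived only from the descriptions above: ``can_reach`` and the namespace identifiers are hypothetical names used for illustration, not a kernel or libc interface.

```python
VALID_MODES = ("global", "local")


def can_reach(src_ns, src_mode, dst_ns, dst_mode):
    """Model of the ns_mode reachability rules described above.

    src_ns/dst_ns are hypothetical namespace identifiers (any comparable
    value); this illustrates the documented semantics only.
    """
    if src_mode not in VALID_MODES or dst_mode not in VALID_MODES:
        raise ValueError("mode must be 'global' or 'local'")
    if src_mode == "local" or dst_mode == "local":
        # A local namespace is private: traffic stays inside it.
        return src_ns == dst_ns
    # Both ends are global: any global namespace can reach any other.
    return True


print(can_reach("init_net", "global", "ns_a", "global"))
print(can_reach("init_net", "global", "ns_b", "local"))
print(can_reach("ns_b", "local", "ns_b", "local"))
```

The model treats the rule as symmetric: a connection crosses a namespace boundary only when both ends are ``global``.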
+ +child_ns_mode +------------- + +Controls what mode newly created child namespaces will inherit. At namespace +creation, ``ns_mode`` is inherited from the parent's ``child_ns_mode``. The +initial value matches the namespace's own ``ns_mode``. + +Values: + + - ``global`` - child namespaces will share system-wide CID allocation + and their sockets will be able to reach any VM or socket in any + global namespace. + - ``local`` - child namespaces will have private CID allocation and + their sockets will only be able to connect within their own + namespace. + +The first write to ``child_ns_mode`` locks its value. Subsequent writes of the +same value succeed, but writing a different value returns ``-EBUSY``. + +Changing ``child_ns_mode`` only affects namespaces created after the change; +it does not modify the current namespace or any existing children. + +A namespace with ``ns_mode`` set to ``local`` cannot change +``child_ns_mode`` to ``global`` (returns ``-EPERM``). + +g2h_fallback +------------ + +Controls whether connections to CIDs not owned by the host-to-guest (H2G) +transport automatically fall back to the guest-to-host (G2H) transport. + +When enabled, if a connect targets a CID that the H2G transport (e.g. +vhost-vsock) does not serve, or if no H2G transport is loaded at all, the +connection is routed via the G2H transport (e.g. virtio-vsock) instead. This +allows a host running both nested VMs (via vhost-vsock) and sibling VMs +reachable through the hypervisor (e.g. Nitro Enclaves) to address both using +a single CID space, without requiring applications to set +``VMADDR_FLAG_TO_HOST``. + +When the fallback is taken, ``VMADDR_FLAG_TO_HOST`` is automatically set on +the remote address so that userspace can determine the path via +``getpeername()``. + +Note: With this sysctl enabled, user space that attempts to talk to a guest +CID which is not implemented by the H2G transport will create host vsock +traffic. 
Environments that rely on H2G-only isolation should set it to 0. + +Values: + + - 0 - Connections to CIDs <= 2 or with VMADDR_FLAG_TO_HOST use G2H; + all others use H2G (or fail with ENODEV if H2G is not loaded). + - 1 - Connections to CIDs not owned by H2G fall back to G2H. (default) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index f48eaa98d22d..97e12359775c 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm: - compact_memory - compaction_proactiveness - compact_unevictable_allowed +- defrag_mode - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -40,7 +41,6 @@ Currently, these files are in /proc/sys/vm: - extfrag_threshold - highmem_is_dirtyable - hugetlb_shm_group -- laptop_mode - legacy_va_layout - lowmem_reserve_ratio - max_map_count @@ -53,6 +53,7 @@ Currently, these files are in /proc/sys/vm: - mmap_min_addr - mmap_rnd_bits - mmap_rnd_compat_bits +- movable_gigantic_pages - nr_hugepages - nr_hugepages_mempolicy - nr_overcommit_hugepages @@ -74,6 +75,7 @@ Currently, these files are in /proc/sys/vm: - unprivileged_userfaultfd - user_reserve_kbytes - vfs_cache_pressure +- vfs_cache_pressure_denom - watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -130,6 +132,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. 
@@ -145,6 +153,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due to compaction, which would block the task from becoming active until the fault is resolved. +defrag_mode +=========== + +When set to 1, the page allocator tries harder to avoid fragmentation +and maintain the ability to produce huge pages / higher-order pages. + +It is recommended to enable this right after boot, as fragmentation, +once it occurred, can be long-lasting or even permanent. dirty_background_bytes ====================== @@ -215,6 +231,8 @@ eventually gets pushed out to disk. This tunable is used to define when dirty inode is old enough to be eligible for writeback by the kernel flusher threads. And, it is also used as the interval to wakeup dirtytime_writeback thread. +Setting this to zero disables periodic dirtytime writeback. + dirty_writeback_centisecs ========================= @@ -347,13 +365,6 @@ hugetlb_shm_group contains group id that is allowed to create SysV shared memory segment using hugetlb page. -laptop_mode -=========== - -laptop_mode is a knob that controls "laptop mode". All the things that are -controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst. - - legacy_va_layout ================ @@ -449,8 +460,8 @@ The minimum value is 1 (1/1 -> 100%). The value less than 1 completely disables protection of the pages. -max_map_count: -============== +max_map_count +============= This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling @@ -478,9 +489,13 @@ memory allocations. The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. +When CONFIG_MEM_ALLOC_PROFILING_DEBUG=y, this control is read-only to avoid +warnings produced by allocations made while profiling is disabled and freed +when it's enabled. 
+ -memory_failure_early_kill: -========================== +memory_failure_early_kill +========================= Control how to kill processes when uncorrected memory error (typically a 2bit error in a memory module) is detected in the background by hardware @@ -608,6 +623,33 @@ This value can be changed after boot using the /proc/sys/vm/mmap_rnd_compat_bits tunable +movable_gigantic_pages +====================== + +This parameter controls whether gigantic pages may be allocated from +ZONE_MOVABLE. If set to non-zero, gigantic pages can be allocated +from ZONE_MOVABLE. ZONE_MOVABLE memory may be created via the kernel +boot parameter `kernelcore` or via memory hotplug as discussed in +Documentation/admin-guide/mm/memory-hotplug.rst. + +Support may depend on the specific architecture. + +Note that using ZONE_MOVABLE gigantic pages makes memory hot-remove unreliable. + +Memory hot-remove operations will block indefinitely until the admin reserves +sufficient gigantic pages to service migration requests associated with the +memory offlining process. As HugeTLB gigantic page reservation is a manual +process (via `nodeN/hugepages/.../nr_hugepages` interfaces) this may not be +obvious when just attempting to offline a block of memory. + +Additionally, as multiple gigantic pages may be reserved on a single block, +it may appear that gigantic pages are available for migration when in reality +they are in the process of being removed. For example, if `memoryN` contains +two gigantic pages, one reserved and one allocated, and an admin attempts to +offline that block, this operation may hang indefinitely unless another +reserved gigantic page is available on another block `memoryM`. + + nr_hugepages ============ @@ -1008,19 +1050,28 @@ vfs_cache_pressure This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. 
-At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will -never reclaim dentries and inodes due to memory pressure and this can easily -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel +will attempt to reclaim dentries and inodes at a "fair" rate with respect to +pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, +the kernel will never reclaim dentries and inodes due to memory pressure and +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries +and inodes. -Increasing vfs_cache_pressure significantly beyond 100 may have negative -performance impact. Reclaim code needs to take various locks to find freeable -directory and inode objects. With vfs_cache_pressure=1000, it will look for -ten times more freeable objects than there are. +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may +have negative performance impact. Reclaim code needs to take various locks to +find freeable directory and inode objects. When vfs_cache_pressure equals +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable +objects than there are. + +Note: This setting should always be used together with vfs_cache_pressure_denom. + +vfs_cache_pressure_denom +======================== +Defaults to 100 (minimum allowed value). Requires corresponding +vfs_cache_pressure setting to take effect. 
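The relationship between the two knobs can be illustrated with a little arithmetic. The sketch below assumes, as the text above implies, that ``vfs_cache_pressure`` and ``vfs_cache_pressure_denom`` act as the numerator and denominator of a ratio applied to the number of freeable objects. The function name and the sample counts are hypothetical, and the kernel's real reclaim path involves more than this one calculation.

```python
def reclaim_scan_target(freeable, vfs_cache_pressure=100,
                        vfs_cache_pressure_denom=100):
    """Number of dentry/inode objects reclaim would look for (illustrative)."""
    if vfs_cache_pressure < 0:
        raise ValueError("vfs_cache_pressure must not be negative")
    if vfs_cache_pressure_denom < 100:
        # The text above gives 100 as the minimum allowed denominator.
        raise ValueError("vfs_cache_pressure_denom must be at least 100")
    return freeable * vfs_cache_pressure // vfs_cache_pressure_denom


freeable = 1000  # made-up count of freeable objects

print(reclaim_scan_target(freeable))            # pressure == denom: the "fair" rate
print(reclaim_scan_target(freeable, 0))         # pressure 0: nothing is reclaimed
print(reclaim_scan_target(freeable, 1000))      # 10 * denom: ten times the freeable count
print(reclaim_scan_target(freeable, 50, 1000))  # larger denom allows finer steps
```

With the default denominator of 100 the pressure value behaves as a plain percentage; raising the denominator makes sub-percent settings expressible, which is why the two settings should be chosen together.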
watermark_boost_factor ====================== diff --git a/Documentation/admin-guide/sysctl/xen.rst b/Documentation/admin-guide/sysctl/xen.rst new file mode 100644 index 000000000000..6c5edc3e5e4c --- /dev/null +++ b/Documentation/admin-guide/sysctl/xen.rst @@ -0,0 +1,31 @@ +=============== +/proc/sys/xen/ +=============== + +Copyright (c) 2026, Shubham Chakraborty <chakrabortyshubham66@gmail.com> + +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. + +------------------------------------------------------------------------------ + +These files show up in ``/proc/sys/xen/``, depending on the +kernel configuration: + +.. contents:: :local: + +balloon/hotplug_unpopulated +=========================== + +This flag controls whether unpopulated memory ranges are automatically +hotplugged as system RAM. + +- ``0``: Unpopulated ranges are not hotplugged (default). +- ``1``: Unpopulated ranges are automatically hotplugged. + +When enabled, the Xen balloon driver will add memory regions that are +marked as unpopulated in the Xen memory map to the system as usable RAM. +This allows for dynamic memory expansion in Xen guest domains. + +This option is only available when the kernel is built with +``CONFIG_XEN_BALLOON_MEMORY_HOTPLUG`` enabled. diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index 700aa72eecb1..9ead927a37c0 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -74,7 +74,7 @@ a particular type of taint. 
It's best to leave that to the aforementioned script, but if you need something quick you can use this shell command to check which bits are set:: - $ for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done + $ for i in $(seq 20); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done Table for decoding tainted state ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted 16 _/X 65536 auxiliary taint, defined for and used by distros 17 _/T 131072 kernel was built with the struct randomization plugin 18 _/N 262144 an in-kernel test has been run + 19 _/J 524288 userspace used a mutating debug operation in fwctl === === ====== ======================================================== Note: The character ``_`` is representing a blank in this table to make reading @@ -184,3 +185,7 @@ More detailed explanation for tainting build time. 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. + + 19) ``J`` if userspace opened /dev/fwctl/* and performed a FWCTL_RPC_DEBUG_WRITE + to use the device's debugging features. Device debugging features could + cause the device to malfunction in undefined ways. diff --git a/Documentation/admin-guide/thermal/index.rst b/Documentation/admin-guide/thermal/index.rst index 193b7b01a87d..e48bc0a1951b 100644 --- a/Documentation/admin-guide/thermal/index.rst +++ b/Documentation/admin-guide/thermal/index.rst @@ -6,3 +6,4 @@ Thermal Subsystem :maxdepth: 1 intel_powerclamp + intel_thermal_throttle diff --git a/Documentation/admin-guide/thermal/intel_thermal_throttle.rst b/Documentation/admin-guide/thermal/intel_thermal_throttle.rst new file mode 100644 index 000000000000..f4fbf9d5a4ec --- /dev/null +++ b/Documentation/admin-guide/thermal/intel_thermal_throttle.rst @@ -0,0 +1,91 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. 
include:: <isonum.txt> + +======================================= +Intel thermal throttle events reporting +======================================= + +:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> + +Introduction +------------ + +Intel processors have built in automatic and adaptive thermal monitoring +mechanisms that force the processor to reduce its power consumption in order +to operate within predetermined temperature limits. + +Refer to section "THERMAL MONITORING AND PROTECTION" in the "Intel® 64 and +IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C, & 3D): +System Programming Guide" for more details. + +In general, there are two mechanisms to control the core temperature of the +processor. They are called "Thermal Monitor 1 (TM1) and Thermal Monitor 2 (TM2)". + +The status of the temperature sensor that triggers the thermal monitor (TM1/TM2) +is indicated through the "thermal status flag" and "thermal status log flag" in +MSR_IA32_THERM_STATUS for core level and MSR_IA32_PACKAGE_THERM_STATUS for +package level. + +Thermal Status flag, bit 0 — When set, indicates that the processor core +temperature is currently at the trip temperature of the thermal monitor and that +the processor power consumption is being reduced via either TM1 or TM2, depending +on which is enabled. When clear, the flag indicates that the core temperature is +below the thermal monitor trip temperature. This flag is read only. + +Thermal Status Log flag, bit 1 — When set, indicates that the thermal sensor has +tripped since the last power-up or reset or since the last time that software +cleared this flag. This flag is a sticky bit; once set it remains set until +cleared by software or until a power-up or reset of the processor. The default +state is clear. + +It is possible that when user reads MSR_IA32_THERM_STATUS or +MSR_IA32_PACKAGE_THERM_STATUS, TM1/TM2 is not active. 
In this case, +"Thermal Status flag" will read "0" and the "Thermal Status Log flag" will be set +to show any previous "TM1/TM2" activation. But since it needs to be cleared by +the software, it can't show the number of occurrences of "TM1/TM2" activations. + +Hence, Linux provides counters of how many times the "Thermal Status flag" was +set. Also presents how long the "Thermal Status flag" was active in milliseconds. +Using these counters, users can check if the performance was limited because of +thermal events. It is recommended to read from sysfs instead of directly reading +MSRs as the "Thermal Status Log flag" is reset by the driver to implement rate +control. + +Sysfs Interface +--------------- + +Thermal throttling events are presented for each CPU under +"/sys/devices/system/cpu/cpuX/thermal_throttle/", where "X" is the CPU number. + +All these counters are read-only. They can't be reset to 0. So, they can potentially +overflow after reaching the maximum 64 bit unsigned integer. + +``core_throttle_count`` + Shows the number of times "Thermal Status flag" changed from 0 to 1 for this + CPU since OS boot and thermal vector is initialized. This is a 64 bit counter. + +``package_throttle_count`` + Shows the number of times "Thermal Status flag" changed from 0 to 1 for the + package containing this CPU since OS boot and thermal vector is initialized. + Package status is broadcast to all CPUs; all CPUs in the package increment + this count. This is a 64-bit counter. + +``core_throttle_max_time_ms`` + Shows the maximum amount of time for which "Thermal Status flag" has been + set to 1 for this CPU at the core level since OS boot and thermal vector + is initialized. + +``package_throttle_max_time_ms`` + Shows the maximum amount of time for which "Thermal Status flag" has been + set to 1 for the package containing this CPU since OS boot and thermal + vector is initialized. 
+ +``core_throttle_total_time_ms`` + Shows the cumulative time for which "Thermal Status flag" has been + set to 1 for this CPU for core level since OS boot and thermal vector + is initialized. + +``package_throttle_total_time_ms`` + Shows the cumulative time for which "Thermal Status flag" has been set + to 1 for the package containing this CPU since OS boot and thermal vector + is initialized. diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst index 2ed79f41a411..89df26553aa0 100644 --- a/Documentation/admin-guide/thunderbolt.rst +++ b/Documentation/admin-guide/thunderbolt.rst @@ -28,7 +28,7 @@ should be a userspace tool that handles all the low-level details, keeps a database of the authorized devices and prompts users for new connections. More details about the sysfs interface for Thunderbolt devices can be -found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``. +found in Documentation/ABI/testing/sysfs-bus-thunderbolt. Those users who just want to connect any device without any sort of manual work can add following line to @@ -203,10 +203,10 @@ host controller or a device, it is important that the firmware can be upgraded to the latest where possible bugs in it have been fixed. Typically OEMs provide this firmware from their support site. -There is also a central site which has links where to download firmware -for some machines: - - `Thunderbolt Updates <https://thunderbolttechnology.net/updates>`_ +Currently, recommended method of updating firmware is through "fwupd" tool. +It uses LVFS (Linux Vendor Firmware Service) portal by default to get the +latest firmware from hardware vendors and updates connected devices if found +compatible. For details refer to: https://github.com/fwupd/fwupd. Before you upgrade firmware on a device, host or retimer, please make sure it is a suitable upgrade. Failing to do that may render the device @@ -215,18 +215,40 @@ tools! Host NVM upgrade on Apple Macs is not supported. 
-Once the NVM image has been downloaded, you need to plug in a -Thunderbolt device so that the host controller appears. It does not -matter which device is connected (unless you are upgrading NVM on a -device - then you need to connect that particular device). +Fwupd is installed by default. If you don't have it on your system, simply +use your distro package manager to get it. + +To see possible updates through fwupd, you need to plug in a Thunderbolt +device so that the host controller appears. It does not matter which +device is connected (unless you are upgrading NVM on a device - then you +need to connect that particular device). Note an OEM-specific method to power the controller up ("force power") may be available for your system in which case there is no need to plug in a Thunderbolt device. -After that we can write the firmware to the non-active parts of the NVM -of the host or device. As an example here is how Intel NUC6i7KYK (Skull -Canyon) Thunderbolt controller NVM is upgraded:: +Updating firmware using fwupd is straightforward - refer to official +readme on fwupd github. + +If firmware image is written successfully, the device shortly disappears. +Once it comes back, the driver notices it and initiates a full power +cycle. After a while device appears again and this time it should be +fully functional. + +Device of interest should display new version under "Current version" +and "Update State: Success" in fwupd's interface. + +Upgrading firmware manually +--------------------------------------------------------------- +If possible, use fwupd to update the firmware. However, if your device OEM +has not uploaded the firmware to LVFS, but it is available for download +from their side, you can use method below to directly upgrade the +firmware. + +Manual firmware update can be done with 'dd' tool. To update firmware +using this method, you need to write it to the non-active parts of NVM +of the host or device. 
Example on how to update Intel NUC6i7KYK +(Skull Canyon) Thunderbolt controller NVM:: # dd if=KYK_TBT_FW_0018.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem @@ -235,10 +257,8 @@ upgrade process as follows:: # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate -If no errors are returned, the host controller shortly disappears. Once -it comes back the driver notices it and initiates a full power cycle. -After a while the host controller appears again and this time it should -be fully functional. +If no errors are returned, device should behave as described in previous +section. We can verify that the new NVM firmware is active by running the following commands:: @@ -296,6 +316,39 @@ information is missing. To recover from this mode, one needs to flash a valid NVM image to the host controller in the same way it is done in the previous chapter. +Tunneling events +---------------- +The driver sends ``KOBJ_CHANGE`` events to userspace when there is a +tunneling change in the ``thunderbolt_domain``. The notification carries +following environment variables:: + + TUNNEL_EVENT=<EVENT> + TUNNEL_DETAILS=0:12 <-> 1:20 (USB3) + +Possible values for ``<EVENT>`` are: + + activated + The tunnel was activated (created). + + changed + There is a change in this tunnel. For example bandwidth allocation was + changed. + + deactivated + The tunnel was torn down. + + low bandwidth + The tunnel is not getting optimal bandwidth. + + insufficient bandwidth + There is not enough bandwidth for the current tunnel requirements. + +The ``TUNNEL_DETAILS`` is only provided if the tunnel is known. For +example, in case of Firmware Connection Manager this is missing or does +not provide full tunnel information. In case of Software Connection Manager +this includes full tunnel details. The format currently matches what the +driver uses when logging. This may change over time. 
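The tunneling event variables described above could be consumed roughly as in the following sketch. It is not part of the patch. It assumes the uevent environment has already been collected into a mapping (how that happens, for example through a udev rule or monitor, is left out), and ``parse_tunnel_event`` and ``TunnelEvent`` are hypothetical names. Because the text says the ``TUNNEL_DETAILS`` format may change over time, the sketch keeps it as an opaque string instead of parsing it.

```python
from typing import Mapping, NamedTuple, Optional

# Event names copied from the list in the text above.
KNOWN_EVENTS = frozenset({
    "activated",
    "changed",
    "deactivated",
    "low bandwidth",
    "insufficient bandwidth",
})


class TunnelEvent(NamedTuple):
    """Hypothetical container for one tunneling notification."""
    event: str
    details: Optional[str]  # opaque string; the format may change over time
    known: bool             # True if the event name is one of KNOWN_EVENTS


def parse_tunnel_event(env: Mapping[str, str]) -> Optional[TunnelEvent]:
    """Extract tunneling information from a uevent environment mapping.

    Returns None when the mapping carries no TUNNEL_EVENT variable,
    i.e. the change event is not a tunneling notification.
    """
    event = env.get("TUNNEL_EVENT")
    if event is None:
        return None
    # TUNNEL_DETAILS is only provided if the tunnel is known, so a missing
    # or empty value is reported as None.
    details = env.get("TUNNEL_DETAILS") or None
    return TunnelEvent(event, details, event in KNOWN_EVENTS)
```

Unknown event names are reported with ``known`` set to false rather than rejected, so that a consumer written this way would keep working if further events are added later.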
+ Networking over Thunderbolt cable --------------------------------- Thunderbolt technology allows software communication between two hosts @@ -317,7 +370,7 @@ is built-in to the kernel image, there is no need to do anything. The driver will create one virtual ethernet interface per Thunderbolt port which are named like ``thunderbolt0`` and so on. From this point -you can either use standard userspace tools like ``ifconfig`` to +you can either use standard userspace tools like ``ip`` to configure the interface or let your GUI handle it automatically. Forcing power @@ -325,12 +378,7 @@ Forcing power Many OEMs include a method that can be used to force the power of a Thunderbolt controller to an "On" state even if nothing is connected. If supported by your machine this will be exposed by the WMI bus with -a sysfs attribute called "force_power". - -For example the intel-wmi-thunderbolt driver exposes this attribute in: - /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power - - To force the power to on, write 1 to this attribute file. - To disable force power, write 0 to this attribute file. +a sysfs attribute called "force_power", see +Documentation/ABI/testing/sysfs-platform-intel-wmi-thunderbolt for details. Note: it's currently not possible to query the force power state of a platform. diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst index 03c55151346c..7d38393f31fb 100644 --- a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@ -267,7 +267,7 @@ culprit might be known already. For further details on what actually qualifies as a regression check out Documentation/admin-guide/reporting-regressions.rst. If you run into any problems while following this guide or have ideas how to -improve it, :ref:`please let the kernel developers know <submit_improvements>`. 
+improve it, :ref:`please let the kernel developers know <submit_improvements_vbbr>`. .. _introprep_bissbs: @@ -1055,23 +1055,22 @@ follow these instructions. [:ref:`details <introoptional_bisref>`] -.. _submit_improvements: +.. _submit_improvements_vbbr: Conclusion ---------- You have reached the end of the step-by-step guide. -Did you run into trouble following any of the above steps not cleared up by the -reference section below? Did you spot errors? Or do you have ideas how to +Did you run into trouble following the step-by-step guide not cleared up by the +reference section below? Did you spot errors? Or do you have ideas on how to improve the guide? -If any of that applies, please take a moment and let the maintainer of this -document know by email (Thorsten Leemhuis <linux@leemhuis.info>), ideally while -CCing the Linux docs mailing list (linux-doc@vger.kernel.org). Such feedback is -vital to improve this text further, which is in everybody's interest, as it -will enable more people to master the task described here -- and hopefully also -improve similar guides inspired by this one. +If any of that applies, please let the developers know by sending a short note +or a patch to Thorsten Leemhuis <linux@leemhuis.info> while ideally CCing the +public Linux docs mailing list <linux-doc@vger.kernel.org>. Such feedback is +vital to improve this text further, which is in everybody's interest, as it will +enable more people to master the task described here. Reference section for the step-by-step guide @@ -1757,7 +1756,7 @@ or all of these tasks: to your bootloader's configuration. You have to take care of some or all of the tasks yourself, if your -distribution lacks a installkernel script or does only handle part of them. +distribution lacks an installkernel script or does only handle part of them. Consult the distribution's documentation for details. 
If in doubt, install the kernel manually:: diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst index 6be38c1b9c5b..35963491b9f1 100644 --- a/Documentation/admin-guide/workload-tracing.rst +++ b/Documentation/admin-guide/workload-tracing.rst @@ -82,7 +82,7 @@ Install tools to build Linux kernel and tools in kernel repository. scripts/ver_linux is a good way to check if your system already has the necessary tools:: - sudo apt-get build-essentials flex bison yacc + sudo apt-get install build-essential flex bison yacc sudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev cscope is a good tool to browse kernel sources. Let's install it now:: @@ -196,11 +196,11 @@ Let’s checkout the latest Linux repository and build cscope database:: cscope -R -p10 # builds cscope.out database before starting browse session cscope -d -p10 # starts browse session on cscope.out database -Note: Run "cscope -R -p10" to build the database and c"scope -d -p10" to -enter into the browsing session. cscope by default cscope.out database. -To get out of this mode press ctrl+d. -p option is used to specify the -number of file path components to display. -p10 is optimal for browsing -kernel sources. +Note: Run "cscope -R -p10" to build the database and "cscope -d -p10" to +enter into the browsing session. cscope by default uses the cscope.out +database. To get out of this mode press ctrl+d. -p option is used to +specify the number of file path components to display. -p10 is optimal +for browsing kernel sources. What is perf and how do we use it? ================================== diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index b67772cf36d6..acdd4b65964c 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -34,22 +34,6 @@ When mounting an XFS filesystem, the following options are accepted. to the file. 
Specifying a fixed ``allocsize`` value turns off the dynamic behaviour. - attr2 or noattr2 - The options enable/disable an "opportunistic" improvement to - be made in the way inline extended attributes are stored - on-disk. When the new form is used for the first time when - ``attr2`` is selected (either when setting or removing extended - attributes) the on-disk superblock feature bit field will be - updated to reflect this format being in use. - - The default behaviour is determined by the on-disk feature - bit indicating that ``attr2`` behaviour is active. If either - mount option is set, then that becomes the new default used - by the filesystem. - - CRC enabled filesystems always use the ``attr2`` format, and so - will reject the ``noattr2`` mount option if it is set. - discard or nodiscard (default) Enable/disable the issuing of commands to let the block device reclaim space freed by the filesystem. This is @@ -75,12 +59,6 @@ When mounting an XFS filesystem, the following options are accepted. across the entire filesystem rather than just on directories configured to use it. - ikeep or noikeep (default) - When ``ikeep`` is specified, XFS does not delete empty inode - clusters and keeps them around on disk. When ``noikeep`` is - specified, empty inode clusters are returned to the free - space pool. - inode32 or inode64 (default) When ``inode32`` is specified, it indicates that XFS limits inode creation to locations which will not result in inode @@ -124,6 +102,14 @@ When mounting an XFS filesystem, the following options are accepted. controls the size of each buffer and so is also relevant to this case. + lifetime (default) or nolifetime + Enable data placement based on write life time hints provided + by the user. This turns on co-allocation of data of similar + life times when statistically favorable to reduce garbage + collection cost. + + These options are only available for zoned rt file systems. + logbsize=value Set the size of each in-memory log buffer. 
The size may be specified in bytes, or in kilobytes with a "k" suffix. @@ -143,6 +129,25 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. + max_atomic_write=value + Set the maximum size of an atomic write. The size may be + specified in bytes, in kilobytes with a "k" suffix, in megabytes + with a "m" suffix, or in gigabytes with a "g" suffix. The size + cannot be larger than the maximum write size, larger than the + size of any allocation group, or larger than the size of a + remapping operation that the log can complete atomically. + + The default value is to set the maximum I/O completion size + to allow each CPU to handle one at a time. + + max_open_zones=value + Specify the max number of zones to keep open for writing on a + zoned rt device. Many open zones aids file data separation + but may impact performance on HDDs. + + If ``max_open_zones`` is not specified, the value is determined + by the capabilities and the size of the zoned rt device. + noalign Data allocations will not be aligned at stripe unit boundaries. This is only relevant to filesystems created @@ -210,6 +215,14 @@ When mounting an XFS filesystem, the following options are accepted. inconsistent namespace presentation during or after a failover event. + errortag=tagname + When specified, enables the error inject tag named "tagname" with the + default frequency. Can be specified multiple times to enable multiple + errortags. Specifying this option on remount will reset the error tag + to the default value if it was set to any other value before. + This option is only supported when CONFIG_XFS_DEBUG is enabled, and + will not be reflected in /proc/self/mounts. + Deprecation of V4 Format ======================== @@ -226,9 +239,8 @@ latest version and try again. The deprecation will take place in two parts. 
Support for mounting V4 filesystems can now be disabled at kernel build time via Kconfig option. -The option will default to yes until September 2025, at which time it -will be changed to default to no. In September 2030, support will be -removed from the codebase entirely. +These options were changed to default to no in September 2025. In +September 2030, support will be removed from the codebase entirely. Note: Distributors may choose to withdraw V4 format support earlier than the dates listed above. @@ -241,8 +253,6 @@ Deprecated Mount Options ============================ ================ Mounting with V4 filesystem September 2030 Mounting ascii-ci filesystem September 2030 -ikeep/noikeep September 2025 -attr2/noattr2 September 2025 ============================ ================ @@ -258,6 +268,8 @@ Removed Mount Options osyncisdsync/osyncisosync v4.0 barrier v4.19 nobarrier v4.19 + ikeep/noikeep v6.18 + attr2/noattr2 v6.18 =========================== ======= sysctls @@ -285,9 +297,6 @@ The following sysctls are available for the XFS filesystem: removes unused preallocation from clean inodes and releases the unused space back to the free pool. - fs.xfs.speculative_cow_prealloc_lifetime - This is an alias for speculative_prealloc_lifetime. - fs.xfs.error_level (Min: 0 Default: 3 Max: 11) A volume knob for error reporting when internal errors occur. This will generate detailed messages & backtraces for filesystem @@ -314,17 +323,6 @@ The following sysctls are available for the XFS filesystem: This option is intended for debugging only. - fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) - Controls whether symlinks are created with mode 0777 (default) - or whether their mode is affected by the umask (irix mode). - - fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) - Controls files created in SGID directories. 
- If the group ID of the new file does not match the effective group - ID or one of the supplementary group IDs of the parent dir, the - ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl - is set. - fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "sync" flag set by the **xfs_io(8)** chattr command on a directory to be @@ -360,24 +358,20 @@ The following sysctls are available for the XFS filesystem: Deprecated Sysctls ================== -=========================================== ================ - Name Removal Schedule -=========================================== ================ -fs.xfs.irix_sgid_inherit September 2025 -fs.xfs.irix_symlink_mode September 2025 -fs.xfs.speculative_cow_prealloc_lifetime September 2025 -=========================================== ================ - +None currently. Removed Sysctls =============== -============================= ======= - Name Removed -============================= ======= - fs.xfs.xfsbufd_centisec v4.0 - fs.xfs.age_buffer_centisecs v4.0 -============================= ======= +========================================== ======= + Name Removed +========================================== ======= + fs.xfs.xfsbufd_centisec v4.0 + fs.xfs.age_buffer_centisecs v4.0 + fs.xfs.irix_symlink_mode v6.18 + fs.xfs.irix_sgid_inherit v6.18 + fs.xfs.speculative_cow_prealloc_lifetime v6.18 +========================================== ======= Error handling ============== @@ -542,3 +536,28 @@ The interesting knobs for XFS workqueues are as follows: nice Relative priority of scheduling the threads. These are the same nice levels that can be applied to userspace processes. ============ =========== + +Zoned Filesystems +================= + +For zoned file systems, the following attributes are exposed in: + + /sys/fs/xfs/<dev>/zoned/ + + max_open_zones (Min: 1 Default: Varies Max: UINTMAX) + This read-only attribute exposes the maximum number of open zones + available for data placement. 
The value is determined at mount time and + is limited by the capabilities of the backing zoned device, file system + size and the max_open_zones mount option. + + nr_open_zones (Min: 0 Default: Varies Max: UINTMAX) + This read-only attribute exposes the current number of open zones + used by the file system. + + zonegc_low_space (Min: 0 Default: 0 Max: 100) + Define a percentage for how much of the unused space that GC should keep + available for writing. A high value will reclaim more of the space + occupied by unused blocks, creating a larger buffer against write + bursts at the cost of increased write amplification. Regardless + of this value, garbage collection will always aim to free a minimum + amount of blocks to keep max_open_zones open for data placement purposes.
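As an illustration of the zoned attributes listed above, the following sketch reads them for one device. It is not part of the patch. It assumes each attribute file holds a single decimal integer, and the helper name ``read_zoned_attrs`` and the ``sysfs_root`` parameter are hypothetical; the latter exists only so that the function can be tried against a scratch directory.

```python
from pathlib import Path
from typing import Dict, Optional

# Attribute names as listed under /sys/fs/xfs/<dev>/zoned/ in the text above.
ZONED_ATTRS = ("max_open_zones", "nr_open_zones", "zonegc_low_space")


def read_zoned_attrs(dev: str,
                     sysfs_root: str = "/sys/fs/xfs") -> Dict[str, Optional[int]]:
    """Return the zoned attributes of an XFS device as integers.

    An attribute that is missing, unreadable, or not a plain integer is
    reported as None (assumption: each file contains one decimal value).
    """
    zoned_dir = Path(sysfs_root) / dev / "zoned"
    values: Dict[str, Optional[int]] = {}
    for name in ZONED_ATTRS:
        try:
            values[name] = int((zoned_dir / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values
```

A caller might compare ``nr_open_zones`` with ``max_open_zones`` to see how close the file system is to its open zone limit. On a file system that is not zoned, every value would come back as ``None``.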
