diff options
Diffstat (limited to 'Documentation/admin-guide/sysctl')
| -rw-r--r-- | Documentation/admin-guide/sysctl/crypto.rst | 47 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/debug.rst | 52 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/fs.rst | 29 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/index.rst | 21 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/kernel.rst | 100 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/net.rst | 137 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/vm.rst | 97 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/xen.rst | 31 |
8 files changed, 440 insertions, 74 deletions
diff --git a/Documentation/admin-guide/sysctl/crypto.rst b/Documentation/admin-guide/sysctl/crypto.rst new file mode 100644 index 000000000000..b707bd314a64 --- /dev/null +++ b/Documentation/admin-guide/sysctl/crypto.rst @@ -0,0 +1,47 @@ +================= +/proc/sys/crypto/ +================= + +These files show up in ``/proc/sys/crypto/``, depending on the +kernel configuration: + +.. contents:: :local: + +fips_enabled +============ + +Read-only flag that indicates whether FIPS mode is enabled. + +- ``0``: FIPS mode is disabled (default). +- ``1``: FIPS mode is enabled. + +This value is set at boot time via the ``fips=1`` kernel command line +parameter. When enabled, the cryptographic API will restrict the use +of certain algorithms and perform self-tests to ensure compliance with +FIPS (Federal Information Processing Standards) requirements, such as +FIPS 140-2 and the newer FIPS 140-3, depending on the kernel +configuration and the module in use. + +fips_name +========= + +Read-only file that contains the name of the FIPS module currently in use. +The value is typically configured via the ``CONFIG_CRYPTO_FIPS_NAME`` +kernel configuration option. + +fips_version +============ + +Read-only file that contains the version string of the FIPS module. +If ``CONFIG_CRYPTO_FIPS_CUSTOM_VERSION`` is set, it uses the value from +``CONFIG_CRYPTO_FIPS_VERSION``. Otherwise, it defaults to the kernel +release version (``UTS_RELEASE``). + +Copyright (c) 2026, Shubham Chakraborty <chakrabortyshubham66@gmail.com> + +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. + +.. See scripts/check-sysctl-docs to keep this up to date: +.. scripts/check-sysctl-docs -vtable="crypto" \ +.. $(git grep -l register_sysctl_) diff --git a/Documentation/admin-guide/sysctl/debug.rst b/Documentation/admin-guide/sysctl/debug.rst new file mode 100644 index 000000000000..506bd5e48594 --- /dev/null +++ b/Documentation/admin-guide/sysctl/debug.rst @@ -0,0 +1,52 @@ +================ +/proc/sys/debug/ +================ + +These files show up in ``/proc/sys/debug/``, depending on the +kernel configuration: + +.. contents:: :local: + +exception-trace +=============== + +This flag controls whether the kernel prints information about unhandled +signals (like segmentation faults) to the kernel log (``dmesg``). + +- ``0``: Unhandled signals are not traced. +- ``1``: Information about unhandled signals is printed. + +The default value is ``1`` on most architectures (like x86, MIPS, RISC-V), +but it is ``0`` on **arm64**. + +The actual information printed and the context provided varies +significantly depending on the CPU architecture. For example: + +- On **x86**, it typically prints the instruction pointer (IP), error + code, and address that caused a page fault. +- On **PowerPC**, it may print the next instruction pointer (NIP), + link register (LR), and other relevant registers. + +When enabled, this feature is often rate-limited to prevent the kernel +log from being flooded during a crash loop. + +kprobes-optimization +==================== + +This flag enables or disables the optimization of Kprobes on certain +architectures (like x86). + +- ``0``: Kprobes optimization is turned off. +- ``1``: Kprobes optimization is turned on (default). + +For more details on Kprobes and its optimization, please refer to +Documentation/trace/kprobes.rst. + +Copyright (c) 2026, Shubham Chakraborty <chakrabortyshubham66@gmail.com> + +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. + +.. See scripts/check-sysctl-docs to keep this up to date: +.. scripts/check-sysctl-docs -vtable="debug" \ +.. $(git grep -l register_sysctl_) diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst index 08e89e031714..9b7f65c3efd8 100644 --- a/Documentation/admin-guide/sysctl/fs.rst +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -164,8 +164,8 @@ pipe-user-pages-soft -------------------- Maximum total number of pages a non-privileged user may allocate for pipes -before the pipe size gets limited to a single page. Once this limit is reached, -new pipes will be limited to a single page in size for this user in order to +before the pipe size gets limited to two pages. Once this limit is reached, +new pipes will be limited to two pages in size for this user in order to limit total memory usage, and trying to increase them using ``fcntl()`` will be denied until usage goes below the limit again. The default value allows to allocate up to 1024 pipes at their default size. When set to 0, no limit is @@ -347,3 +347,28 @@ filesystems: ``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for setting/getting the maximum number of pages that can be used for servicing requests in FUSE. + +``/proc/sys/fs/fuse/default_request_timeout`` is a read/write file for +setting/getting the default timeout (in seconds) for a fuse server to +reply to a kernel-issued request in the event where the server did not +specify a timeout at mount. If the server set a timeout, +then default_request_timeout will be ignored. The default +"default_request_timeout" is set to 0. 0 indicates no default timeout. +The maximum value that can be set is 65535. + +``/proc/sys/fs/fuse/max_request_timeout`` is a read/write file for +setting/getting the maximum timeout (in seconds) for a fuse server to +reply to a kernel-issued request. A value greater than 0 automatically opts +the server into a timeout that will be set to at most "max_request_timeout", +even if the server did not specify a timeout and default_request_timeout is +set to 0. If max_request_timeout is greater than 0 and the server set a timeout +greater than max_request_timeout or default_request_timeout is set to a value +greater than max_request_timeout, the system will use max_request_timeout as the +timeout. 0 indicates no max request timeout. The maximum value that can be set +is 65535. + +For timeouts, if the server does not respond to the request by the time +the set timeout elapses, then the connection to the fuse server will be aborted. +Please note that the timeouts are not 100% precise (eg you may set 60 seconds but +the timeout may kick in after 70 seconds). The upper margin of error for the +timeout is roughly FUSE_TIMEOUT_TIMER_FREQ seconds. diff --git a/Documentation/admin-guide/sysctl/index.rst b/Documentation/admin-guide/sysctl/index.rst index 03346f98c7b9..50f00514f0ff 100644 --- a/Documentation/admin-guide/sysctl/index.rst +++ b/Documentation/admin-guide/sysctl/index.rst @@ -66,33 +66,42 @@ This documentation is about: =============== =============================================================== abi/ execution domains & personalities -debug/ <empty> -dev/ device specific information (eg dev/cdrom/info) +<$ARCH> tuning controls for various CPU architecture (e.g. csky, s390) +crypto/ cryptographic subsystem +debug/ debugging features +dev/ device specific information (e.g. dev/cdrom/info) fs/ specific filesystems filehandle, inode, dentry and quota tuning binfmt_misc <Documentation/admin-guide/binfmt-misc.rst> kernel/ global kernel info / tuning miscellaneous stuff + some architecture-specific controls + security (LSM) stuff net/ networking stuff, for documentation look in: <Documentation/networking/> proc/ <empty> sunrpc/ SUN Remote Procedure Call (NFS) +user/ Per user namespace limits vm/ memory management tuning buffer and cache management -user/ Per user per user namespace limits +xen/ Xen hypervisor controls =============== =============================================================== -These are the subdirs I have on my system. There might be more -or other subdirs in another setup. If you see another dir, I'd -really like to hear about it :-) +These are the subdirs I have on my system or have been discovered by +searching through the source code. There might be more or other subdirs +in another setup. If you see another dir, I'd really like to hear about +it :-) .. toctree:: :maxdepth: 1 abi + crypto + debug fs kernel net sunrpc user vm + xen diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index dd49a89a62d3..c6994e55d141 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -177,6 +177,7 @@ core_pattern %E executable path %c maximum size of core file by resource limit RLIMIT_CORE %C CPU the task ran on + %F pidfd number %<OTHER> both are dropped ======== ========================================== @@ -396,13 +397,14 @@ a hung task is detected. hung_task_panic =============== -Controls the kernel's behavior when a hung task is detected. +When set to a non-zero value, a kernel panic will be triggered if the +number of hung tasks found during a single scan reaches this value. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -= ================================================= += ======================================================= 0 Continue operation. This is the default behavior. -1 Panic immediately. -= ================================================= +N Panic when N hung tasks are found during a single scan. += ======================================================= hung_task_check_count @@ -416,10 +418,16 @@ hung_task_detect_count ====================== Indicates the total number of tasks that have been detected as hung since -the system boot. +the system boot or since the counter was reset. The counter is zeroed when +a value of 0 is written. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. +hung_task_sys_info +================== +A comma separated list of extra system information to be dumped when +hung task is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. hung_task_timeout_secs ====================== @@ -514,6 +522,15 @@ default), only processes with the CAP_SYS_ADMIN capability may create io_uring instances. +kernel_sys_info +=============== +A comma separated list of extra system information to be dumped when +soft/hard lockup is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. + +It serves as the default kernel control knob, which will take effect +when a kernel module calls sys_info() with parameter==0. + kexec_load_disabled =================== @@ -575,6 +592,14 @@ if leaking kernel pointer values to unprivileged users is a concern. When ``kptr_restrict`` is set to 2, kernel pointers printed using %pK will be replaced with 0s regardless of privileges. +For disabling these security restrictions early at boot time (and once +for all), use the ``hash_pointers`` boot parameter instead. + +softlockup_sys_info & hardlockup_sys_info +========================================= +A comma separated list of extra system information to be dumped when +soft/hard lockup is detected, for example, "tasks,mem,timers,locks,...". +Refer 'panic_sys_info' section below for more details. modprobe ======== @@ -889,7 +914,7 @@ bit 1 print system memory info bit 2 print timer info bit 3 print locks info if ``CONFIG_LOCKDEP`` is on bit 4 print ftrace buffer -bit 5 print all printk messages in buffer +bit 5 replay all kernel messages on consoles at the end of panic bit 6 print all CPUs backtrace (if available in the arch) bit 7 print only tasks in uninterruptible (blocked) state ===== ============================================ @@ -899,6 +924,24 @@ So for example to print tasks and memory info on panic, user can:: echo 3 > /proc/sys/kernel/panic_print +panic_sys_info +============== + +A comma separated list of extra information to be dumped on panic, +for example, "tasks,mem,timers,...". It is a human readable alternative +to 'panic_print'. Possible values are: + +============= =================================================== +tasks print all tasks info +mem print system memory info +timers print timers info +locks print locks info if CONFIG_LOCKDEP is on +ftrace print ftrace buffer +all_bt print all CPUs backtrace (if available in the arch) +blocked_tasks print only tasks in uninterruptible (blocked) state +============= =================================================== + + panic_on_rcu_stall ================== @@ -1014,30 +1057,26 @@ perf_user_access (arm64 and riscv only) Controls user space access for reading perf event counters. -arm64 -===== - -The default value is 0 (access disabled). +* for arm64 + The default value is 0 (access disabled). -When set to 1, user space can read performance monitor counter registers -directly. + When set to 1, user space can read performance monitor counter registers + directly. -See Documentation/arch/arm64/perf.rst for more information. - -riscv -===== + See Documentation/arch/arm64/perf.rst for more information. -When set to 0, user space access is disabled. +* for riscv + When set to 0, user space access is disabled. -The default value is 1, user space can read performance monitor counter -registers through perf, any direct access without perf intervention will trigger -an illegal instruction. + The default value is 1, user space can read performance monitor counter + registers through perf, any direct access without perf intervention will trigger + an illegal instruction. -When set to 2, which enables legacy mode (user space has direct access to cycle -and insret CSRs only). Note that this legacy value is deprecated and will be -removed once all user space applications are fixed. + When set to 2, which enables legacy mode (user space has direct access to cycle + and insret CSRs only). Note that this legacy value is deprecated and will be + removed once all user space applications are fixed. -Note that the time CSR is always directly accessible to all modes. + Note that the time CSR is always directly accessible to all modes. pid_max ======= @@ -1110,7 +1149,8 @@ printk_ratelimit_burst While long term we enforce one message per `printk_ratelimit`_ seconds, we do allow a burst of messages to pass through. ``printk_ratelimit_burst`` specifies the number of messages we can -send before ratelimiting kicks in. +send before ratelimiting kicks in. After `printk_ratelimit`_ seconds +have elapsed, another burst of messages may be sent. The default value is 10 messages. @@ -1199,12 +1239,6 @@ that support this feature. == =========================================================================== -real-root-dev -============= - -See Documentation/admin-guide/initrd.rst. - - reboot-cmd (SPARC only) ======================= @@ -1465,7 +1499,7 @@ stack_erasing ============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. +of syscalls for kernels built with ``CONFIG_KSTACK_ERASE``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. @@ -1473,7 +1507,7 @@ The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. = ==================================================================== -0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +0 Kernel stack erasing is disabled, KSTACK_ERASE_METRICS are not updated. 1 Kernel stack erasing is enabled (default), it is performed before returning to the userspace at the end of syscalls. = ==================================================================== diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 7b0c4291c686..0724a793798f 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -40,8 +40,8 @@ Table : Subdirectories in /proc/sys/net bridge Bridging rose X.25 PLP layer core General parameter tipc TIPC ethernet Ethernet protocol unix Unix domain sockets - ipv4 IP version 4 x25 X.25 protocol - ipv6 IP version 6 + ipv4 IP version 4 vsock VSOCK sockets + ipv6 IP version 6 x25 X.25 protocol ========= =================== = ========== =================== 1. /proc/sys/net/core - Network core options @@ -212,6 +212,14 @@ mem_pcpu_rsv Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU. +bypass_prot_mem +--------------- + +Skip charging socket buffers to the global per-protocol memory +accounting controlled by net.ipv4.tcp_mem, net.ipv4.udp_mem, etc. + +Default: 0 (off) + rmem_default ------------ @@ -222,6 +230,8 @@ rmem_max The maximum receive socket buffer size in bytes. +Default: 4194304 + rps_default_mask ---------------- @@ -247,6 +257,8 @@ wmem_max The maximum send socket buffer size in bytes. +Default: 4194304 + message_burst and message_cost ------------------------------ @@ -291,24 +303,33 @@ netdev_max_backlog Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them. +qdisc_max_burst +------------------ + +Maximum number of packets that can be temporarily stored before +reaching qdisc. + +Default: 1000 + netdev_rss_key -------------- -RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is -randomly generated. +RSS (Receive Side Scaling) enabled drivers use a host key that +is randomly generated. Some user space might need to gather its content even if drivers do not provide ethtool -x support yet. :: myhost:~# cat /proc/sys/net/core/netdev_rss_key - 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (256 bytes total) -File contains nul bytes if no driver ever called netdev_rss_key_fill() function. +File contains all nul bytes if no driver ever called netdev_rss_key_fill() +function. Note: - /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, - but most drivers only use 40 bytes of it. + /proc/sys/net/core/netdev_rss_key contains 256 bytes of key, + but many drivers only use 40 or 52 bytes of it. :: @@ -343,9 +364,9 @@ skb_defer_max ------------- Max size (in skbs) of the per-cpu list of skbs being freed -by the cpu which allocated them. Used by TCP stack so far. +by the cpu which allocated them. -Default: 64 +Default: 128 optmem_max ---------- @@ -402,6 +423,23 @@ to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). If set to 1 (default), hash rethink is performed on listening socket. If set to 0, hash rethink is not performed. +txq_reselection_ms +------------------ + +Controls how often (in ms) a busy connected flow can select another tx queue. + +A resection is desirable when/if user thread has migrated and XPS +would select a different queue. Same can occur without XPS +if the flow hash has changed. + +But switching txq can introduce reorders, especially if the +old queue is under high pressure. Modern TCP stacks deal +well with reorders if they happen not too often. + +To disable this feature, set the value to 0. + +Default : 1000 + gro_normal_batch ---------------- @@ -513,3 +551,82 @@ originally may have been issued in the correct sequential order. If named_timeout is nonzero, failed topology updates will be placed on a defer queue until another event arrives that clears the error, or until the timeout expires. Value is in milliseconds. + +6. /proc/sys/net/vsock - VSOCK sockets +-------------------------------------- + +VSOCK sockets (AF_VSOCK) provide communication between virtual machines and +their hosts. The behavior of VSOCK sockets in a network namespace is determined +by the namespace's mode (``global`` or ``local``), which controls how CIDs +(Context IDs) are allocated and how sockets interact across namespaces. + +ns_mode +------- + +Read-only. Reports the current namespace's mode, set at namespace creation +and immutable thereafter. + +Values: + + - ``global`` - the namespace shares system-wide CID allocation and + its sockets can reach any VM or socket in any global namespace. + Sockets in this namespace cannot reach sockets in local + namespaces. + - ``local`` - the namespace has private CID allocation and its + sockets can only connect to VMs or sockets within the same + namespace. + +The init_net mode is always ``global``. + +child_ns_mode +------------- + +Controls what mode newly created child namespaces will inherit. At namespace +creation, ``ns_mode`` is inherited from the parent's ``child_ns_mode``. The +initial value matches the namespace's own ``ns_mode``. + +Values: + + - ``global`` - child namespaces will share system-wide CID allocation + and their sockets will be able to reach any VM or socket in any + global namespace. + - ``local`` - child namespaces will have private CID allocation and + their sockets will only be able to connect within their own + namespace. + +The first write to ``child_ns_mode`` locks its value. Subsequent writes of the +same value succeed, but writing a different value returns ``-EBUSY``. + +Changing ``child_ns_mode`` only affects namespaces created after the change; +it does not modify the current namespace or any existing children. + +A namespace with ``ns_mode`` set to ``local`` cannot change +``child_ns_mode`` to ``global`` (returns ``-EPERM``). + +g2h_fallback +------------ + +Controls whether connections to CIDs not owned by the host-to-guest (H2G) +transport automatically fall back to the guest-to-host (G2H) transport. + +When enabled, if a connect targets a CID that the H2G transport (e.g. +vhost-vsock) does not serve, or if no H2G transport is loaded at all, the +connection is routed via the G2H transport (e.g. virtio-vsock) instead. This +allows a host running both nested VMs (via vhost-vsock) and sibling VMs +reachable through the hypervisor (e.g. Nitro Enclaves) to address both using +a single CID space, without requiring applications to set +``VMADDR_FLAG_TO_HOST``. + +When the fallback is taken, ``VMADDR_FLAG_TO_HOST`` is automatically set on +the remote address so that userspace can determine the path via +``getpeername()``. + +Note: With this sysctl enabled, user space that attempts to talk to a guest +CID which is not implemented by the H2G transport will create host vsock +traffic. Environments that rely on H2G-only isolation should set it to 0. + +Values: + + - 0 - Connections to CIDs <= 2 or with VMADDR_FLAG_TO_HOST use G2H; + all others use H2G (or fail with ENODEV if H2G is not loaded). + - 1 - Connections to CIDs not owned by H2G fall back to G2H. (default) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index f48eaa98d22d..97e12359775c 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm: - compact_memory - compaction_proactiveness - compact_unevictable_allowed +- defrag_mode - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -40,7 +41,6 @@ Currently, these files are in /proc/sys/vm: - extfrag_threshold - highmem_is_dirtyable - hugetlb_shm_group -- laptop_mode - legacy_va_layout - lowmem_reserve_ratio - max_map_count @@ -53,6 +53,7 @@ Currently, these files are in /proc/sys/vm: - mmap_min_addr - mmap_rnd_bits - mmap_rnd_compat_bits +- movable_gigantic_pages - nr_hugepages - nr_hugepages_mempolicy - nr_overcommit_hugepages @@ -74,6 +75,7 @@ Currently, these files are in /proc/sys/vm: - unprivileged_userfaultfd - user_reserve_kbytes - vfs_cache_pressure +- vfs_cache_pressure_denom - watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -130,6 +132,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. @@ -145,6 +153,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due to compaction, which would block the task from becoming active until the fault is resolved. +defrag_mode +=========== + +When set to 1, the page allocator tries harder to avoid fragmentation +and maintain the ability to produce huge pages / higher-order pages. + +It is recommended to enable this right after boot, as fragmentation, +once it occurred, can be long-lasting or even permanent. dirty_background_bytes ====================== @@ -215,6 +231,8 @@ eventually gets pushed out to disk. This tunable is used to define when dirty inode is old enough to be eligible for writeback by the kernel flusher threads. And, it is also used as the interval to wakeup dirtytime_writeback thread. +Setting this to zero disables periodic dirtytime writeback. + dirty_writeback_centisecs ========================= @@ -347,13 +365,6 @@ hugetlb_shm_group contains group id that is allowed to create SysV shared memory segment using hugetlb page. -laptop_mode -=========== - -laptop_mode is a knob that controls "laptop mode". All the things that are -controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst. - - legacy_va_layout ================ @@ -449,8 +460,8 @@ The minimum value is 1 (1/1 -> 100%). The value less than 1 completely disables protection of the pages. -max_map_count: -============== +max_map_count +============= This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling @@ -478,9 +489,13 @@ memory allocations. The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. +When CONFIG_MEM_ALLOC_PROFILING_DEBUG=y, this control is read-only to avoid +warnings produced by allocations made while profiling is disabled and freed +when it's enabled. + -memory_failure_early_kill: -========================== +memory_failure_early_kill +========================= Control how to kill processes when uncorrected memory error (typically a 2bit error in a memory module) is detected in the background by hardware @@ -608,6 +623,33 @@ This value can be changed after boot using the /proc/sys/vm/mmap_rnd_compat_bits tunable +movable_gigantic_pages +====================== + +This parameter controls whether gigantic pages may be allocated from +ZONE_MOVABLE. If set to non-zero, gigantic pages can be allocated +from ZONE_MOVABLE. ZONE_MOVABLE memory may be created via the kernel +boot parameter `kernelcore` or via memory hotplug as discussed in +Documentation/admin-guide/mm/memory-hotplug.rst. + +Support may depend on specific architecture. + +Note that using ZONE_MOVABLE gigantic pages make memory hotremove unreliable. + +Memory hot-remove operations will block indefinitely until the admin reserves +sufficient gigantic pages to service migration requests associated with the +memory offlining process. As HugeTLB gigantic page reservation is a manual +process (via `nodeN/hugepages/.../nr_hugepages` interfaces) this may not be +obvious when just attempting to offline a block of memory. + +Additionally, as multiple gigantic pages may be reserved on a single block, +it may appear that gigantic pages are available for migration when in reality +they are in the process of being removed. For example if `memoryN` contains +two gigantic pages, one reserved and one allocated, and an admin attempts to +offline that block, this operations may hang indefinitely unless another +reserved gigantic page is available on another block `memoryM`. + + nr_hugepages ============ @@ -1008,19 +1050,28 @@ vfs_cache_pressure This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will -never reclaim dentries and inodes due to memory pressure and this can easily -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel +will attempt to reclaim dentries and inodes at a "fair" rate with respect to +pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, +the kernel will never reclaim dentries and inodes due to memory pressure and +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries +and inodes. -Increasing vfs_cache_pressure significantly beyond 100 may have negative -performance impact. Reclaim code needs to take various locks to find freeable -directory and inode objects. With vfs_cache_pressure=1000, it will look for -ten times more freeable objects than there are. +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may +have negative performance impact. Reclaim code needs to take various locks to +find freeable directory and inode objects. When vfs_cache_pressure equals +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable +objects than there are. + +Note: This setting should always be used together with vfs_cache_pressure_denom. + +vfs_cache_pressure_denom +======================== +Defaults to 100 (minimum allowed value). Requires corresponding +vfs_cache_pressure setting to take effect. watermark_boost_factor ====================== diff --git a/Documentation/admin-guide/sysctl/xen.rst b/Documentation/admin-guide/sysctl/xen.rst new file mode 100644 index 000000000000..6c5edc3e5e4c --- /dev/null +++ b/Documentation/admin-guide/sysctl/xen.rst @@ -0,0 +1,31 @@ +=============== +/proc/sys/xen/ +=============== + +Copyright (c) 2026, Shubham Chakraborty <chakrabortyshubham66@gmail.com> + +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. + +------------------------------------------------------------------------------ + +These files show up in ``/proc/sys/xen/``, depending on the +kernel configuration: + +.. contents:: :local: + +balloon/hotplug_unpopulated +=========================== + +This flag controls whether unpopulated memory ranges are automatically +hotplugged as system RAM. + +- ``0``: Unpopulated ranges are not hotplugged (default). +- ``1``: Unpopulated ranges are automatically hotplugged. + +When enabled, the Xen balloon driver will add memory regions that are +marked as unpopulated in the Xen memory map to the system as usable RAM. +This allows for dynamic memory expansion in Xen guest domains. + +This option is only available when the kernel is built with +``CONFIG_XEN_BALLOON_MEMORY_HOTPLUG`` enabled. |
