From 49f0ce5f92321cdcf741e35f385669a421013cb7 Mon Sep 17 00:00:00 2001 From: Jerome Marchand Date: Tue, 21 Jan 2014 15:49:14 -0800 Subject: mm: add overcommit_kbytes sysctl variable Some applications that run on HPC clusters are designed around the availability of RAM and the overcommit ratio is fine tuned to get the maximum usage of memory without swapping. With growing memory, the 1%-of-all-RAM grain provided by overcommit_ratio has become too coarse for these workload (on a 2TB machine it represents no less than 20GB). This patch adds the new overcommit_kbytes sysctl variable that allow a much finer grain. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Jerome Marchand Cc: Dave Hansen Cc: Alan Cox Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 1fbd4eb7b64a..9f5481bdc5a4 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -47,6 +47,7 @@ Currently, these files are in /proc/sys/vm: - numa_zonelist_order - oom_dump_tasks - oom_kill_allocating_task +- overcommit_kbytes - overcommit_memory - overcommit_ratio - page-cluster @@ -574,6 +575,17 @@ The default value is 0. ============================================================== +overcommit_kbytes: + +When overcommit_memory is set to 2, the committed address space is not +permitted to exceed swap plus this amount of physical RAM. See below. + +Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one +of them may be specified at a time. Setting one disables the other (which +then appears as 0 when read). + +============================================================== + overcommit_memory: This value contains a flag that enables memory overcommitment. -- cgit v1.2.3 From 7984754b99b6c89054edc405e9d9d35810a91d36 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Thu, 23 Jan 2014 15:55:59 -0800 Subject: kexec: add sysctl to disable kexec_load For general-purpose (i.e. distro) kernel builds it makes sense to build with CONFIG_KEXEC to allow end users to choose what kind of things they want to do with kexec. However, in the face of trying to lock down a system with such a kernel, there needs to be a way to disable kexec_load (much like module loading can be disabled). Without this, it is too easy for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM and modules_disabled are set. With this change, it is still possible to load an image for use later, then disable kexec_load so the image (or lack of image) can't be altered. The intention is for using this in environments where "perfect" enforcement is hard. Without a verified boot, along with verified modules, and along with verified kexec, this is trying to give a system a better chance to defend itself (or at least grow the window of discoverability) against attack in the face of a privilege escalation. In my mind, I consider several boot scenarios: 1) Verified boot of read-only verified root fs loading fd-based verification of kexec images. 2) Secure boot of writable root fs loading signed kexec images. 3) Regular boot loading kexec (e.g. kcrash) image early and locking it. 4) Regular boot with no control of kexec image at all. 1 and 2 don't exist yet, but will soon once the verified kexec series has landed. 4 is the state of things now. The gap between 2 and 4 is too large, so this change creates scenario 3, a middle-ground above 4 when 2 and 1 are not possible for a system. Signed-off-by: Kees Cook Acked-by: Rik van Riel Cc: Vivek Goyal Cc: Eric Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/kernel.txt | 15 ++++++++++++++- include/linux/kexec.h | 1 + kernel/kexec.c | 3 ++- kernel/sysctl.c | 13 +++++++++++++ 4 files changed, 30 insertions(+), 2 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 6d486404200e..ee9a2f983b99 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -33,6 +33,7 @@ show up in /proc/sys/kernel: - domainname - hostname - hotplug +- kexec_load_disabled - kptr_restrict - kstack_depth_to_print [ X86 only ] - l2cr [ PPC only ] @@ -287,6 +288,18 @@ Default value is "/sbin/hotplug". ============================================================== +kexec_load_disabled: + +A toggle indicating if the kexec_load syscall has been disabled. This +value defaults to 0 (false: kexec_load enabled), but can be set to 1 +(true: kexec_load disabled). Once true, kexec can no longer be used, and +the toggle cannot be set back to false. This allows a kexec image to be +loaded before disabling the syscall, allowing a system to set up (and +later use) an image without it being altered. Generally used together +with the "modules_disabled" sysctl. + +============================================================== + kptr_restrict: This toggle indicates whether restrictions are placed on @@ -331,7 +344,7 @@ A toggle value indicating if modules are allowed to be loaded in an otherwise modular kernel. This toggle defaults to off (0), but can be set true (1). Once true, modules can be neither loaded nor unloaded, and the toggle cannot be set back -to false. +to false. Generally used with the "kexec_load_disabled" toggle. ============================================================== diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 5fd33dc1fe3a..6d4066cdb5b5 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -170,6 +170,7 @@ unsigned long paddr_vmcoreinfo_note(void); extern struct kimage *kexec_image; extern struct kimage *kexec_crash_image; +extern int kexec_load_disabled; #ifndef kexec_flush_icache_page #define kexec_flush_icache_page(page) diff --git a/kernel/kexec.c b/kernel/kexec.c index 9c970167e402..ac738781d356 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -932,6 +932,7 @@ static int kimage_load_segment(struct kimage *image, */ struct kimage *kexec_image; struct kimage *kexec_crash_image; +int kexec_load_disabled; static DEFINE_MUTEX(kexec_mutex); @@ -942,7 +943,7 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments, int result; /* We only trust the superuser with rebooting the system. */ - if (!capable(CAP_SYS_BOOT)) + if (!capable(CAP_SYS_BOOT) || kexec_load_disabled) return -EPERM; /* diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 693eac39c202..096db7452cbd 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -62,6 +62,7 @@ #include #include #include +#include #include #include @@ -614,6 +615,18 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, #endif +#ifdef CONFIG_KEXEC + { + .procname = "kexec_load_disabled", + .data = &kexec_load_disabled, + .maxlen = sizeof(int), + .mode = 0644, + /* only handle a transition from default "0" to "1" */ + .proc_handler = proc_dointvec_minmax, + .extra1 = &one, + .extra2 = &one, + }, +#endif #ifdef CONFIG_MODULES { .procname = "modprobe", -- cgit v1.2.3 From 270750dbc18a71b23d660df110e433ff9616a2d4 Mon Sep 17 00:00:00 2001 From: Aaron Tomlin Date: Mon, 20 Jan 2014 17:34:13 +0000 Subject: hung_task: Display every hung task warning When khungtaskd detects hung tasks, it prints out backtraces from a number of those tasks. Limiting the number of backtraces being printed out can result in the user not seeing the information necessary to debug the issue. The hung_task_warnings sysctl controls this feature. This patch makes it possible for hung_task_warnings to accept a special value to print an unlimited number of backtraces when khungtaskd detects hung tasks. The special value is -1. To use this value it is necessary to change types from ulong to int. Signed-off-by: Aaron Tomlin Reviewed-by: Rik van Riel Acked-by: David Rientjes Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com [ Build warning fix. ] Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 42 +++++++++++++++++++++++++++++++++++++++++ include/linux/sched/sysctl.h | 2 +- kernel/hung_task.c | 6 ++++-- kernel/sysctl.c | 8 +++++--- 4 files changed, 52 insertions(+), 6 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 6d486404200e..4205f3c05cbe 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -33,6 +33,10 @@ show up in /proc/sys/kernel: - domainname - hostname - hotplug +- hung_task_panic +- hung_task_check_count +- hung_task_timeout_secs +- hung_task_warnings - kptr_restrict - kstack_depth_to_print [ X86 only ] - l2cr [ PPC only ] @@ -287,6 +291,44 @@ Default value is "/sbin/hotplug". ============================================================== +hung_task_panic: + +Controls the kernel's behavior when a hung task is detected. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: continue operation. This is the default behavior. + +1: panic immediately. + +============================================================== + +hung_task_check_count: + +The upper bound on the number of tasks that are checked. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +============================================================== + +hung_task_timeout_secs: + +Check interval. When a task in D state did not get scheduled +for more than this value report a warning. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: means infinite timeout - no checking done. + +============================================================== + +hung_task_warning: + +The maximum number of warnings to report. During a check interval +When this value is reached, no more the warnings will be reported. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +-1: report an infinite number of warnings. + +============================================================== + kptr_restrict: This toggle indicates whether restrictions are placed on diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 31e0193cb0c5..3a93f842306a 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -5,7 +5,7 @@ extern int sysctl_hung_task_check_count; extern unsigned int sysctl_hung_task_panic; extern unsigned long sysctl_hung_task_timeout_secs; -extern unsigned long sysctl_hung_task_warnings; +extern int sysctl_hung_task_warnings; extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 9328b80eaf14..0b9c169d577f 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -37,7 +37,7 @@ int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT; */ unsigned long __read_mostly sysctl_hung_task_timeout_secs = CONFIG_DEFAULT_HUNG_TASK_TIMEOUT; -unsigned long __read_mostly sysctl_hung_task_warnings = 10; +int __read_mostly sysctl_hung_task_warnings = 10; static int __read_mostly did_panic; @@ -98,7 +98,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) if (!sysctl_hung_task_warnings) return; - sysctl_hung_task_warnings--; + + if (sysctl_hung_task_warnings > 0) + sysctl_hung_task_warnings--; /* * Ok, the task did not get scheduled for more than 2 minutes, diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c398a58673a7..dd5b4496637e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -122,7 +122,8 @@ extern int blk_iopoll_enabled; static int sixty = 60; #endif -static int neg_one = -1; +static int __maybe_unused neg_one = -1; + static int zero; static int __maybe_unused one = 1; static int __maybe_unused two = 2; @@ -978,9 +979,10 @@ static struct ctl_table kern_table[] = { { .procname = "hung_task_warnings", .data = &sysctl_hung_task_warnings, - .maxlen = sizeof(unsigned long), + .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_doulongvec_minmax, + .proc_handler = proc_dointvec_minmax, + .extra1 = &neg_one, }, #endif #ifdef CONFIG_COMPAT -- cgit v1.2.3 From 8582cb96b0bfd6891766d8c30d759bf21aad3b4d Mon Sep 17 00:00:00 2001 From: Aaron Tomlin Date: Wed, 29 Jan 2014 14:05:38 -0800 Subject: mm: document improved handling of swappiness==0 Prior to commit fe35004fbf9e ("mm: avoid swapping out with swappiness==0") setting swappiness to 0, reclaim code could still evict recently used user anonymous memory to swap even though there is a significant amount of RAM used for page cache. The behaviour of setting swappiness to 0 has since changed. When set, the reclaim code does not initiate swap until the amount of free pages and file-backed pages, is less than the high water mark in a zone. Let's update the documentation to reflect this. [akpm@linux-foundation.org: remove comma, per Randy] Signed-off-by: Aaron Tomlin Acked-by: Rik van Riel Acked-by: Bryn M. Reeves Cc: Satoru Moriya Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 9f5481bdc5a4..d614a9b6a280 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -696,7 +696,9 @@ swappiness This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase agressiveness, lower values -decrease the amount of swap. +decrease the amount of swap. A value of 0 instructs the kernel not to +initiate swap until the amount of free and file-backed pages is less +than the high water mark in a zone. The default value is 60. -- cgit v1.2.3