diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2018-12-29 10:57:20 -0800 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2018-12-29 10:57:20 -0800 |
commit | 6f9d71c9c759b1e7d31189a4de228983192c7dc7 (patch) | |
tree | b8372cc73cbac4701c85ca1d441c4ba8a53b4c1f /Documentation | |
parent | 55db91fbaad9ea769d516e6867195808b4399894 (diff) | |
parent | 3fc9c12d27b4ded4f1f761a800558dab2e6bbac5 (diff) | |
download | lwn-6f9d71c9c759b1e7d31189a4de228983192c7dc7.tar.gz lwn-6f9d71c9c759b1e7d31189a4de228983192c7dc7.zip |
Merge branch 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Waiman's cgroup2 cpuset support has been finally merged closing one
of the last remaining feature gaps.
- cgroup.procs could show non-leader threads when cgroup2 threaded mode
was used in certain ways. I forgot to push the fix during the last
cycle.
- A patch to fix mount option parsing when all mount options have been
consumed by someone else (LSM).
- cgroup_no_v1 boot param can now block named cgroup1 hierarchies too.
* 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param
cgroup: fix parsing empty mount option string
cpuset: Remove set but not used variable 'cs'
cgroup: fix CSS_TASK_ITER_PROCS
cgroup: Add .__DEBUG__. prefix to debug file names
cpuset: Minor cgroup2 interface updates
cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug
cpuset: Add documentation about the new "cpuset.sched.partition" flag
cpuset: Use descriptive text when reading/writing cpuset.sched.partition
cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
cpuset: Make generate_sched_domains() work with partition
cpuset: Make CPU hotplug work with partition
cpuset: Track cpusets that use parent's effective_cpus
cpuset: Add an error state to cpuset.sched.partition
cpuset: Add new v2 cpuset.sched.partition flag
cpuset: Simply allocation and freeing of cpumasks
cpuset: Define data structures to support scheduling partition
cpuset: Enable cpuset controller in default hierarchy
cgroup: remove unnecessary unlikely()
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 182 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 8 |
2 files changed, 183 insertions, 7 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index baf19bf28385..7bf3f129c68b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/. 5-3-3-2. IO Latency Interface Files 5-4. PID 5-4-1. PID Interface Files - 5-5. Device - 5-6. RDMA - 5-6-1. RDMA Interface Files - 5-7. Misc - 5-7-1. perf_event + 5-5. Cpuset + 5.5-1. Cpuset Interface Files + 5-6. Device + 5-7. RDMA + 5-7-1. RDMA Interface Files + 5-8. Misc + 5-8-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -1610,6 +1612,176 @@ through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated. +Cpuset +------ + +The "cpuset" controller provides a mechanism for constraining +the CPU and memory node placement of tasks to only the resources +specified in the cpuset interface files in a task's current cgroup. +This is especially valuable on large NUMA systems where placing jobs +on properly sized subsets of the systems with careful processor and +memory placement to reduce cross-node memory access and contention +can improve overall system performance. + +The "cpuset" controller is hierarchical. That means the controller +cannot use CPUs or memory nodes not allowed in its parent. + + +Cpuset Interface Files +~~~~~~~~~~~~~~~~~~~~~~ + + cpuset.cpus + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested CPUs to be used by tasks within this + cgroup. The actual list of CPUs to be granted, however, is + subjected to constraints imposed by its parent and can differ + from the requested CPUs. + + The CPU numbers are comma-separated numbers or ranges. + For example: + + # cat cpuset.cpus + 0-4,6,8-10 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.cpus" or all the available CPUs if none is found. + + The value of "cpuset.cpus" stays constant until the next update + and won't be affected by any CPU hotplug events. + + cpuset.cpus.effective + A read-only multiple values file which exists on all + cpuset-enabled cgroups. + + It lists the onlined CPUs that are actually granted to this + cgroup by its parent. These CPUs are allowed to be used by + tasks within the current cgroup. + + If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows + all the CPUs from the parent cgroup that can be available to + be used by this cgroup. Otherwise, it should be a subset of + "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" + can be granted. In this case, it will be treated just like an + empty "cpuset.cpus". + + Its value will be affected by CPU hotplug events. + + cpuset.mems + A read-write multiple values file which exists on non-root + cpuset-enabled cgroups. + + It lists the requested memory nodes to be used by tasks within + this cgroup. The actual list of memory nodes granted, however, + is subjected to constraints imposed by its parent and can differ + from the requested memory nodes. + + The memory node numbers are comma-separated numbers or ranges. + For example: + + # cat cpuset.mems + 0-1,3 + + An empty value indicates that the cgroup is using the same + setting as the nearest cgroup ancestor with a non-empty + "cpuset.mems" or all the available memory nodes if none + is found. + + The value of "cpuset.mems" stays constant until the next update + and won't be affected by any memory nodes hotplug events. + + cpuset.mems.effective + A read-only multiple values file which exists on all + cpuset-enabled cgroups. + + It lists the onlined memory nodes that are actually granted to + this cgroup by its parent. These memory nodes are allowed to + be used by tasks within the current cgroup. + + If "cpuset.mems" is empty, it shows all the memory nodes from the + parent cgroup that will be available to be used by this cgroup. + Otherwise, it should be a subset of "cpuset.mems" unless none of + the memory nodes listed in "cpuset.mems" can be granted. In this + case, it will be treated just like an empty "cpuset.mems". + + Its value will be affected by memory nodes hotplug events. + + cpuset.cpus.partition + A read-write single value file which exists on non-root + cpuset-enabled cgroups. This flag is owned by the parent cgroup + and is not delegatable. + + It accepts only the following input values when written to. + + "root" - a paritition root + "member" - a non-root member of a partition + + When set to be a partition root, the current cgroup is the + root of a new partition or scheduling domain that comprises + itself and all its descendants except those that are separate + partition roots themselves and their descendants. The root + cgroup is always a partition root. + + There are constraints on where a partition root can be set. + It can only be set in a cgroup if all the following conditions + are true. + + 1) The "cpuset.cpus" is not empty and the list of CPUs are + exclusive, i.e. they are not shared by any of its siblings. + 2) The parent cgroup is a partition root. + 3) The "cpuset.cpus" is also a proper subset of the parent's + "cpuset.cpus.effective". + 4) There is no child cgroups with cpuset enabled. This is for + eliminating corner cases that have to be handled if such a + condition is allowed. + + Setting it to partition root will take the CPUs away from the + effective CPUs of the parent cgroup. Once it is set, this + file cannot be reverted back to "member" if there are any child + cgroups with cpuset enabled. + + A parent partition cannot distribute all its CPUs to its + child partitions. There must be at least one cpu left in the + parent partition. + + Once becoming a partition root, changes to "cpuset.cpus" is + generally allowed as long as the first condition above is true, + the change will not take away all the CPUs from the parent + partition and the new "cpuset.cpus" value is a superset of its + children's "cpuset.cpus" values. + + Sometimes, external factors like changes to ancestors' + "cpuset.cpus" or cpu hotplug can cause the state of the partition + root to change. On read, the "cpuset.sched.partition" file + can show the following values. + + "member" Non-root member of a partition + "root" Partition root + "root invalid" Invalid partition root + + It is a partition root if the first 2 partition root conditions + above are true and at least one CPU from "cpuset.cpus" is + granted by the parent cgroup. + + A partition root can become invalid if none of CPUs requested + in "cpuset.cpus" can be granted by the parent cgroup or the + parent cgroup is no longer a partition root itself. In this + case, it is not a real partition even though the restriction + of the first partition root condition above will still apply. + The cpu affinity of all the tasks in the cgroup will then be + associated with CPUs in the nearest ancestor partition. + + An invalid partition root can be transitioned back to a + real partition root if at least one of the requested CPUs + can now be granted by its parent. In this case, the cpu + affinity of all the tasks in the formerly invalid partition + will be associated to the CPUs of the newly formed partition. + Changing the partition state of an invalid partition root to + "member" is always allowed even if child cpusets are present. + + Device controller ----------------- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ff4daa780ae8..b7c9040f547e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -486,10 +486,14 @@ cut the overhead, others just disable the usage. So only cgroup_disable=memory is actually worthy} - cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1 - Format: { controller[,controller...] | "all" } + cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1 + Format: { { controller | "all" | "named" } + [,{ controller | "all" | "named" }...] } Like cgroup_disable, but only applies to cgroup v1; the blacklisted controllers remain available in cgroup2. + "all" blacklists all controllers and "named" disables + named mounts. Specifying both "all" and "named" disables + all v1 hierarchies. cgroup.memory= [KNL] Pass options to the cgroup memory controller. Format: <string> |