path: root/include/uapi/linux/sched.h
author: Christian Brauner <brauner@kernel.org> 2026-02-26 14:51:01 +0100
committer: Christian Brauner <brauner@kernel.org> 2026-03-11 23:15:40 +0100
commit: c8134b5f13ae959de2b3c8cc278e2602b0857345
tree: 72b4939340f359060f9389eb36e49cc352bea31f /include/uapi/linux/sched.h
parent: 24baca56fafc33d4fb77cd9858a48c734183cb22
pidfd: add CLONE_PIDFD_AUTOKILL
Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's lifetime to the pidfd returned from clone3(). When the last reference to the struct file created by clone3() is closed the kernel sends SIGKILL to the child. A pidfd obtained via pidfd_open() for the same process does not keep the child alive and does not trigger autokill - only the specific struct file from clone3() has this property.

This is useful for container runtimes, service managers, and sandboxed subprocess execution - any scenario where the child must die if the parent crashes or abandons the pidfd.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no one to reap it would become a zombie). CLONE_THREAD is rejected because autokill targets a process not a thread.

The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on the struct file at clone3() time. The pidfs .release handler checks this flag and sends SIGKILL via do_send_sig_info(SIGKILL, SEND_SIG_PRIV, ...) only when it is set. Files from pidfd_open() or open_by_handle_at() are distinct struct files that do not carry this flag. dup()/fork() share the same struct file so they extend the child's lifetime until the last reference drops.

CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without CLONE_NNP the child could escalate privileges via setuid/setgid exec after being spawned, so the caller must have CAP_SYS_ADMIN in its user namespace. With CLONE_NNP the child can never gain new privileges so unprivileged usage is allowed. This is a deliberate departure from the pdeath_signal model which is reset during secureexec and commit_creds() rendering it useless for container runtimes that need to deprivilege themselves.

Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-3-d148b984a989@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
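The flag rules described in the commit message can be sketched as a small userspace pre-check. This is an illustration only, not the kernel's validation code: the function name autokill_check_flags(), the enum autokill_check result codes, and the has_cap_sys_admin parameter are hypothetical, and the errno values the kernel actually returns are not stated in the message. The three new flag values are taken from the diff in this patch series; CLONE_PIDFD and CLONE_THREAD are the existing UAPI values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Existing UAPI clone flags. */
#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000ULL
#endif
#ifndef CLONE_THREAD
#define CLONE_THREAD 0x00010000ULL
#endif

/* Values from this patch series; older headers will not define them. */
#ifndef CLONE_AUTOREAP
#define CLONE_AUTOREAP (1ULL << 34)
#endif
#ifndef CLONE_NNP
#define CLONE_NNP (1ULL << 35)
#endif
#ifndef CLONE_PIDFD_AUTOKILL
#define CLONE_PIDFD_AUTOKILL (1ULL << 36)
#endif

/* Hypothetical result codes, used only by this sketch. */
enum autokill_check {
	AUTOKILL_OK,          /* combination is acceptable */
	AUTOKILL_MISSING_DEP, /* CLONE_PIDFD or CLONE_AUTOREAP missing */
	AUTOKILL_THREAD,      /* CLONE_THREAD is not allowed */
	AUTOKILL_NEEDS_PRIV,  /* no CLONE_NNP and no CAP_SYS_ADMIN */
};

/*
 * Mirror the rules from the commit message:
 *  - autokill needs both CLONE_PIDFD and CLONE_AUTOREAP,
 *  - autokill cannot be combined with CLONE_THREAD,
 *  - without CLONE_NNP the caller needs CAP_SYS_ADMIN in its user
 *    namespace (modelled here as a plain boolean).
 */
static enum autokill_check autokill_check_flags(uint64_t flags,
						bool has_cap_sys_admin)
{
	if (!(flags & CLONE_PIDFD_AUTOKILL))
		return AUTOKILL_OK; /* nothing to validate */
	if (!(flags & CLONE_PIDFD) || !(flags & CLONE_AUTOREAP))
		return AUTOKILL_MISSING_DEP;
	if (flags & CLONE_THREAD)
		return AUTOKILL_THREAD;
	if (!(flags & CLONE_NNP) && !has_cap_sys_admin)
		return AUTOKILL_NEEDS_PRIV;
	return AUTOKILL_OK;
}
```

An unprivileged caller would therefore be expected to pass CLONE_PIDFD | CLONE_AUTOREAP | CLONE_NNP | CLONE_PIDFD_AUTOKILL in the flags field of struct clone_args, then hold the returned pidfd open for as long as the child should live.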
Diffstat (limited to 'include/uapi/linux/sched.h')
-rw-r--r-- include/uapi/linux/sched.h | 1
1 files changed, 1 insertions, 0 deletions
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 386c8d7e89cb..149dbc64923b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -38,6 +38,7 @@
 #define CLONE_INTO_CGROUP (1ULL << 33) /* Clone into a specific cgroup given the right permissions. */
 #define CLONE_AUTOREAP (1ULL << 34) /* Auto-reap child on exit. */
 #define CLONE_NNP (1ULL << 35) /* Set no_new_privs on child. */
+#define CLONE_PIDFD_AUTOKILL (1ULL << 36) /* Kill child when clone pidfd closes. */
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3