diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-08-04 15:12:02 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-08-04 15:12:02 -0700 |
commit | 4f30a60aa78410496e5ffe632a371c00f0d83a8d (patch) | |
tree | b2f74fcd6b286c961f310548de9b2cc855787849 /include | |
parent | 74858abbb1032222f922487fd1a24513bbed80f9 (diff) | |
parent | a5161eeef97cb0cdc4de966005926db2f5894af4 (diff) | |
download | lwn-4f30a60aa78410496e5ffe632a371c00f0d83a8d.tar.gz lwn-4f30a60aa78410496e5ffe632a371c00f0d83a8d.zip |
Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull close_range() implementation from Christian Brauner:
"This adds the close_range() syscall. It allows to efficiently close a
range of file descriptors up to all file descriptors of a calling
task.
This is coordinated with the FreeBSD folks which have copied our
version of this syscall and in the meantime have already merged it in
April 2019:
https://reviews.freebsd.org/D21627
https://svnweb.freebsd.org/base?view=revision&revision=359836
The syscall originally came up in a discussion around the new mount
API and making new file descriptor types cloexec by default. During
this discussion, Al suggested the close_range() syscall.
First, it helps to close all file descriptors of an exec()ing task.
This can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that
file descriptors are not cloexec by default. This is aggravated by the
fact that we can't just switch them over without massively regressing
userspace. For a whole class of programs having an in-kernel method of
closing all file descriptors is very helpful (e.g. demons, service
managers, programming language standard libraries, container managers
etc.).
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on
each file descriptor and other hacks. From looking at various
large(ish) userspace code bases this or similar patterns are very
common in service managers, container runtimes, and programming
language runtimes/standard libraries such as Python or Rust.
In addition, the syscall will also work for tasks that do not have
procfs mounted and on kernels that do not have procfs support compiled
in. In such situations the only way to make sure that all file
descriptors are closed is to call close() on each file descriptor up
to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.
Based on Linus' suggestion close_range() also comes with a new flag
CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
right before exec. This would usually be expressed in the sequence:
unshare(CLONE_FILES);
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part
of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
gets especially handy when we're closing all file descriptors above a
certain threshold.
Test-suite as always included"
* tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add CLOSE_RANGE_UNSHARE tests
close_range: add CLOSE_RANGE_UNSHARE
tests: add close_range() tests
arch: wire-up close_range()
open: add close_range()
Diffstat (limited to 'include')
-rw-r--r-- | include/linux/fdtable.h | 6 | ||||
-rw-r--r-- | include/linux/syscalls.h | 2 | ||||
-rw-r--r-- | include/uapi/asm-generic/unistd.h | 2 | ||||
-rw-r--r-- | include/uapi/linux/close_range.h | 9 |
4 files changed, 18 insertions, 1 deletions
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index f07c55ea0c22..a32bf47c593e 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -22,6 +22,7 @@ * as this is the granularity returned by copy_fdset(). */ #define NR_OPEN_DEFAULT BITS_PER_LONG +#define NR_OPEN_MAX ~0U struct fdtable { unsigned int max_fds; @@ -109,7 +110,7 @@ struct files_struct *get_files_struct(struct task_struct *); void put_files_struct(struct files_struct *fs); void reset_files_struct(struct files_struct *); int unshare_files(struct files_struct **); -struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy; +struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy; void do_close_on_exec(struct files_struct *); int iterate_fd(struct files_struct *, unsigned, int (*)(const void *, struct file *, unsigned), @@ -121,7 +122,10 @@ extern void __fd_install(struct files_struct *files, unsigned int fd, struct file *file); extern int __close_fd(struct files_struct *files, unsigned int fd); +extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags); extern int __close_fd_get_file(unsigned int fd, struct file **res); +extern int unshare_fd(unsigned long unshare_flags, unsigned int max_fds, + struct files_struct **new_fdp); extern struct kmem_cache *files_cachep; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index b951a87da987..86d973509591 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -444,6 +444,8 @@ asmlinkage long sys_openat(int dfd, const char __user *filename, int flags, asmlinkage long sys_openat2(int dfd, const char __user *filename, struct open_how *how, size_t size); asmlinkage long sys_close(unsigned int fd); +asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd, + unsigned int flags); asmlinkage long sys_vhangup(void); /* fs/pipe.c */ diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index f4a01305d9a6..31001e3323bc 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -850,6 +850,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open) #define __NR_clone3 435 __SYSCALL(__NR_clone3, sys_clone3) #endif +#define __NR_close_range 436 +__SYSCALL(__NR_close_range, sys_close_range) #define __NR_openat2 437 __SYSCALL(__NR_openat2, sys_openat2) diff --git a/include/uapi/linux/close_range.h b/include/uapi/linux/close_range.h new file mode 100644 index 000000000000..6928a9fdee3c --- /dev/null +++ b/include/uapi/linux/close_range.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_CLOSE_RANGE_H +#define _UAPI_LINUX_CLOSE_RANGE_H + +/* Unshare the file descriptor table before closing file descriptors. */ +#define CLOSE_RANGE_UNSHARE (1U << 1) + +#endif /* _UAPI_LINUX_CLOSE_RANGE_H */ + |