author | Jakub Kicinski <kuba@kernel.org> | 2023-01-04 20:21:25 -0800 |
---|---|---|
committer | Jakub Kicinski <kuba@kernel.org> | 2023-01-04 20:21:25 -0800 |
commit | d75858ef108c3b41f0f3215fe37505bb63e3795d (patch) | |
tree | d063793a087dbe32047cf32fa52681f3bb3b67b4 | |
parent | 1f47510ed50a511e7085a61d1a52fbe21f097a7c (diff) | |
parent | acd3b7768048fe338248cdf43ccfbf8c084a6bc1 (diff) | |
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next 2023-01-04
We've added 45 non-merge commits during the last 21 day(s) which contain
a total of 50 files changed, 1454 insertions(+), 375 deletions(-).
The main changes are:
1) Fixes, improvements and refactoring of parts of BPF verifier's
state equivalence checks, from Andrii Nakryiko.
2) Fix a few corner cases in libbpf's BTF-to-C converter in particular
around padding handling and enums, also from Andrii Nakryiko.
3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better
support decap on GRE tunnel devices not operating in collect metadata,
from Christian Ehrig.
4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks,
from Dave Marchevsky.
5) Remove the need for trace_printk_lock for bpf_trace_printk
and bpf_trace_vprintk helpers, from Jiri Olsa.
6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps,
from Maryam Tahhan.
7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du.
8) Bigger batch of improvements to BPF tracing code samples,
from Daniel T. Lee.
9) Add LoongArch support to libbpf's bpf_tracing helper header,
from Hengqi Chen.
10) Fix a libbpf compiler warning in perf_event_open_probe on arm32,
from Khem Raj.
11) Optimize bpf_local_storage_elem by removing 56 bytes of padding,
from Martin KaFai Lau.
12) Use pkg-config to locate libelf for resolve_btfids build,
from Shen Jiamin.
13) Various libbpf improvements around API documentation and errno
handling, from Xin Liu.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
libbpf: Return -ENODATA for missing btf section
libbpf: Add LoongArch support to bpf_tracing.h
libbpf: Restore errno after pr_warn.
libbpf: Added the description of some API functions
libbpf: Fix invalid return address register in s390
samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs
samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro
samples/bpf: Change _kern suffix to .bpf with syscall tracing program
samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program
samples/bpf: Use kyscall instead of kprobe in syscall tracing program
bpf: rename list_head -> graph_root in field info types
libbpf: fix errno is overwritten after being closed.
bpf: fix regs_exact() logic in regsafe() to remap IDs correctly
bpf: perform byte-by-byte comparison only when necessary in regsafe()
bpf: reject non-exact register type matches in regsafe()
bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule
bpf: reorganize struct bpf_reg_state fields
bpf: teach refsafe() to take into account ID remapping
bpf: Remove unused field initialization in bpf's ctl_table
selftests/bpf: Add jit probe_mem corner case tests to s390x denylist
...
====================
Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
50 files changed, 1454 insertions, 375 deletions
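Item 11 of the summary (the `bpf_local_storage_elem` padding change) can be sketched in user-space C. This is a minimal illustration, not the kernel's code: `struct demo_elem`, `struct demo_data`, `elem_size_padded()`, `elem_size_packed()` and `padding_saved()` are hypothetical names, and the layout only assumes a header followed by a 64-byte-aligned member ending in a flexible array, built with GCC or Clang. The patch indexes the array directly (`sdata.data[attr->value_size]`); the sketch uses the equivalent `offsetof(...) + value_size` form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: a small header plus a flexible payload. */
struct demo_data {
	void *smap;
	unsigned char data[] __attribute__((aligned(8)));
};

/* Hypothetical stand-in: the payload-carrying member is aligned to 64
 * bytes, so sizeof() of the outer struct includes tail padding after
 * the start of data[].
 */
struct demo_elem {
	void *node[4];
	struct demo_data sdata __attribute__((aligned(64)));
};

static_assert(offsetof(struct demo_elem, sdata.data) < sizeof(struct demo_elem),
	      "tail padding expected after the start of data[]");

/* Old-style sizing: the payload is placed after the tail padding. */
static size_t elem_size_padded(size_t value_size)
{
	return sizeof(struct demo_elem) + value_size;
}

/* New-style sizing: the payload starts where data[] starts. */
static size_t elem_size_packed(size_t value_size)
{
	return offsetof(struct demo_elem, sdata.data) + value_size;
}

/* Bytes saved per element by the offsetof-based computation. */
static size_t padding_saved(void)
{
	return sizeof(struct demo_elem) - offsetof(struct demo_elem, sdata.data);
}
```

With this stand-in layout on a typical 64-bit ABI, `padding_saved()` evaluates to 56, which happens to match the figure quoted in the summary; the real kernel structure has different members, so treat the number as illustrative only.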
diff --git a/Documentation/bpf/map_sockmap.rst b/Documentation/bpf/map_sockmap.rst
new file mode 100644
index 000000000000..cc92047c6630
--- /dev/null
+++ b/Documentation/bpf/map_sockmap.rst
@@ -0,0 +1,498 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright Red Hat
+
+==============================================
+BPF_MAP_TYPE_SOCKMAP and BPF_MAP_TYPE_SOCKHASH
+==============================================
+
+.. note::
+    - ``BPF_MAP_TYPE_SOCKMAP`` was introduced in kernel version 4.14
+    - ``BPF_MAP_TYPE_SOCKHASH`` was introduced in kernel version 4.18
+
+``BPF_MAP_TYPE_SOCKMAP`` and ``BPF_MAP_TYPE_SOCKHASH`` maps can be used to
+redirect skbs between sockets or to apply policy at the socket level based on
+the result of a BPF (verdict) program with the help of the BPF helpers
+``bpf_sk_redirect_map()``, ``bpf_sk_redirect_hash()``,
+``bpf_msg_redirect_map()`` and ``bpf_msg_redirect_hash()``.
+
+``BPF_MAP_TYPE_SOCKMAP`` is backed by an array that uses an integer key as the
+index to look up a reference to a ``struct sock``. The map values are socket
+descriptors. Similarly, ``BPF_MAP_TYPE_SOCKHASH`` is a hash backed BPF map that
+holds references to sockets via their socket descriptors.
+
+.. note::
+    The value type is either __u32 or __u64; the latter (__u64) is to support
+    returning socket cookies to userspace. Returning the ``struct sock *`` that
+    the map holds to user-space is neither safe nor useful.
+
+These maps may have BPF programs attached to them, specifically a parser program
+and a verdict program. The parser program determines how much data has been
+parsed and therefore how much data needs to be queued to come to a verdict. The
+verdict program is essentially the redirect program and can return a verdict
+of ``__SK_DROP``, ``__SK_PASS``, or ``__SK_REDIRECT``.
+
+When a socket is inserted into one of these maps, its socket callbacks are
+replaced and a ``struct sk_psock`` is attached to it. Additionally, this
+``sk_psock`` inherits the programs that are attached to the map.
+
+A sock object may be in multiple maps, but can only inherit a single
+parse or verdict program. If adding a sock object to a map would result
+in having multiple parser programs the update will return an EBUSY error.
+
+The supported programs to attach to these maps are:
+
+.. code-block:: c
+
+    struct sk_psock_progs {
+        struct bpf_prog *msg_parser;
+        struct bpf_prog *stream_parser;
+        struct bpf_prog *stream_verdict;
+        struct bpf_prog *skb_verdict;
+    };
+
+.. note::
+    Users are not allowed to attach ``stream_verdict`` and ``skb_verdict``
+    programs to the same map.
+
+The attach types for the map programs are:
+
+- ``msg_parser`` program - ``BPF_SK_MSG_VERDICT``.
+- ``stream_parser`` program - ``BPF_SK_SKB_STREAM_PARSER``.
+- ``stream_verdict`` program - ``BPF_SK_SKB_STREAM_VERDICT``.
+- ``skb_verdict`` program - ``BPF_SK_SKB_VERDICT``.
+
+There are additional helpers available to use with the parser and verdict
+programs: ``bpf_msg_apply_bytes()`` and ``bpf_msg_cork_bytes()``. With
+``bpf_msg_apply_bytes()`` BPF programs can tell the infrastructure how many
+bytes the given verdict should apply to. The helper ``bpf_msg_cork_bytes()``
+handles a different case where a BPF program cannot reach a verdict on a msg
+until it receives more bytes AND the program doesn't want to forward the packet
+until it is known to be good.
+
+Finally, the helpers ``bpf_msg_pull_data()`` and ``bpf_msg_push_data()`` are
+available to ``BPF_PROG_TYPE_SK_MSG`` BPF programs to pull in data and set the
+start and end pointers to given values or to add metadata to the ``struct
+sk_msg_buff *msg``.
+
+All these helpers will be described in more detail below.
+
+Usage
+=====
+Kernel BPF
+----------
+bpf_msg_redirect_map()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
+
+This helper is used in programs implementing policies at the socket level. If
+the message ``msg`` is allowed to pass (i.e., if the verdict BPF program
+returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
+can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
+to select the ingress path otherwise the egress path is selected. This is the
+only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_sk_redirect_map()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key u64 flags)
+
+Redirect the packet to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
+can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
+to select the ingress path otherwise the egress path is selected. This is the
+only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+socket entries of type ``struct sock *`` can be retrieved using the
+``bpf_map_lookup_elem()`` helper.
+
+bpf_sock_map_update()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+
+Add an entry to, or update a ``map`` referencing sockets. The ``skops`` is used
+as a new value for the entry associated to ``key``. The ``flags`` argument can
+be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+If the ``map`` has BPF programs (parser and verdict), those will be inherited
+by the socket being added. If the socket is already attached to BPF programs,
+this results in an error.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_sock_hash_update()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+
+Add an entry to, or update a sockhash ``map`` referencing sockets. The ``skops``
+is used as a new value for the entry associated to ``key``.
+
+The ``flags`` argument can be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+If the ``map`` has BPF programs (parser and verdict), those will be inherited
+by the socket being added. If the socket is already attached to BPF programs,
+this results in an error.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_msg_redirect_hash()
+^^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+
+This helper is used in programs implementing policies at the socket level. If
+the message ``msg`` is allowed to pass (i.e., if the verdict BPF program returns
+``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
+interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
+``flags`` is used to select the ingress path otherwise the egress path is
+selected. This is the only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_sk_redirect_hash()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+
+This helper is used in programs implementing policies at the skb socket level.
+If the sk_buff ``skb`` is allowed to pass (i.e., if the verdict BPF program
+returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
+interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
+``flags`` is used to select the ingress path otherwise the egress path is
+selected. This is the only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_msg_apply_bytes()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
+
+For socket policies, apply the verdict of the BPF program to the next (number
+of ``bytes``) of message ``msg``. For example, this helper can be used in the
+following cases:
+
+- A single ``sendmsg()`` or ``sendfile()`` system call contains multiple
+  logical messages that the BPF program is supposed to read and for which it
+  should apply a verdict.
+- A BPF program only cares to read the first ``bytes`` of a ``msg``. If the
+  message has a large payload, then setting up and calling the BPF program
+  repeatedly for all bytes, even though the verdict is already known, would
+  create unnecessary overhead.
+
+Returns 0
+
+bpf_msg_cork_bytes()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
+
+For socket policies, prevent the execution of the verdict BPF program for
+message ``msg`` until the number of ``bytes`` have been accumulated.
+
+This can be used when one needs a specific number of bytes before a verdict can
+be assigned, even if the data spans multiple ``sendmsg()`` or ``sendfile()``
+calls.
+
+Returns 0
+
+bpf_msg_pull_data()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
+
+For socket policies, pull in non-linear data from user space for ``msg`` and set
+pointers ``msg->data`` and ``msg->data_end`` to ``start`` and ``end`` bytes
+offsets into ``msg``, respectively.
+
+If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only
+parse data that the (``data``, ``data_end``) pointers have already consumed.
+For ``sendmsg()`` hooks this is likely the first scatterlist element. But for
+calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be
+the range (**0**, **0**) because the data is shared with user space and by
+default the objective is to avoid allowing user space to modify data while (or
+after) BPF verdict is being decided. This helper can be used to pull in data
+and to set the start and end pointers to given values. Data will be copied if
+necessary (i.e., if data was not linear and if start and end pointers do not
+point to the same chunk).
+
+A call to this helper is susceptible to change the underlying packet buffer.
+Therefore, at load time, all checks on pointers previously done by the verifier
+are invalidated and must be performed again, if the helper is used in
+combination with direct packet access.
+
+All values for ``flags`` are reserved for future usage, and must be left at
+zero.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+    void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Look up a socket entry in the sockmap or sockhash map.
+
+Returns the socket entry associated to ``key``, or NULL if no entry was found.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Add or update a socket entry in a sockmap or sockhash.
+
+The flags argument can be one of the following:
+
+- BPF_ANY: Create a new element or update an existing element.
+- BPF_NOEXIST: Create a new element only if it did not exist.
+- BPF_EXIST: Update an existing element.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Delete a socket entry from a sockmap or a sockhash.
+
+Returns 0 on success, or a negative error in case of failure.
+
+User space
+----------
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
+
+Sockmap entries can be added or updated using the ``bpf_map_update_elem()``
+function. The ``key`` parameter is the index value of the sockmap array. And the
+``value`` parameter is the FD value of that socket.
+
+Under the hood, the sockmap update function uses the socket FD value to
+retrieve the associated socket and its attached psock.
+
+The flags argument can be one of the following:
+
+- BPF_ANY: Create a new element or update an existing element.
+- BPF_NOEXIST: Create a new element only if it did not exist.
+- BPF_EXIST: Update an existing element.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    int bpf_map_lookup_elem(int fd, const void *key, void *value)
+
+Sockmap entries can be retrieved using the ``bpf_map_lookup_elem()`` function.
+
+.. note::
+    The entry returned is a socket cookie rather than a socket itself.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+    int bpf_map_delete_elem(int fd, const void *key)
+
+Sockmap entries can be deleted using the ``bpf_map_delete_elem()``
+function.
+
+Returns 0 on success, or negative error in case of failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+Several examples of the use of sockmap APIs can be found in:
+
+- `tools/testing/selftests/bpf/progs/test_sockmap_kern.h`_
+- `tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`_
+- `tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`_
+- `tools/testing/selftests/bpf/progs/test_sockmap_listen.c`_
+- `tools/testing/selftests/bpf/progs/test_sockmap_update.c`_
+
+The following code snippet shows how to declare a sockmap.
+
+.. code-block:: c
+
+    struct {
+        __uint(type, BPF_MAP_TYPE_SOCKMAP);
+        __uint(max_entries, 1);
+        __type(key, __u32);
+        __type(value, __u64);
+    } sock_map_rx SEC(".maps");
+
+The following code snippet shows a sample parser program.
+
+.. code-block:: c
+
+    SEC("sk_skb/stream_parser")
+    int bpf_prog_parser(struct __sk_buff *skb)
+    {
+        return skb->len;
+    }
+
+The following code snippet shows a simple verdict program that interacts with a
+sockmap to redirect traffic to another socket based on the local port.
+
+.. code-block:: c
+
+    SEC("sk_skb/stream_verdict")
+    int bpf_prog_verdict(struct __sk_buff *skb)
+    {
+        __u32 lport = skb->local_port;
+        __u32 idx = 0;
+
+        if (lport == 10000)
+            return bpf_sk_redirect_map(skb, &sock_map_rx, idx, 0);
+
+        return SK_PASS;
+    }
+
+The following code snippet shows how to declare a sockhash map.
+
+.. code-block:: c
+
+    struct socket_key {
+        __u32 src_ip;
+        __u32 dst_ip;
+        __u32 src_port;
+        __u32 dst_port;
+    };
+
+    struct {
+        __uint(type, BPF_MAP_TYPE_SOCKHASH);
+        __uint(max_entries, 1);
+        __type(key, struct socket_key);
+        __type(value, __u64);
+    } sock_hash_rx SEC(".maps");
+
+The following code snippet shows a simple verdict program that interacts with a
+sockhash to redirect traffic to another socket based on a hash of some of the
+skb parameters.
+
+.. code-block:: c
+
+    static inline
+    void extract_socket_key(struct __sk_buff *skb, struct socket_key *key)
+    {
+        key->src_ip = skb->remote_ip4;
+        key->dst_ip = skb->local_ip4;
+        key->src_port = skb->remote_port >> 16;
+        key->dst_port = (bpf_htonl(skb->local_port)) >> 16;
+    }
+
+    SEC("sk_skb/stream_verdict")
+    int bpf_prog_verdict(struct __sk_buff *skb)
+    {
+        struct socket_key key;
+
+        extract_socket_key(skb, &key);
+
+        return bpf_sk_redirect_hash(skb, &sock_hash_rx, &key, 0);
+    }
+
+User space
+----------
+Several examples of the use of sockmap APIs can be found in:
+
+- `tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`_
+- `tools/testing/selftests/bpf/test_sockmap.c`_
+- `tools/testing/selftests/bpf/test_maps.c`_
+
+The following code sample shows how to create a sockmap, attach a parser and
+verdict program, as well as add a socket entry.
+
+.. code-block:: c
+
+    int create_sample_sockmap(int sock, int parse_prog_fd, int verdict_prog_fd)
+    {
+        int index = 0;
+        int map, err;
+
+        map = bpf_map_create(BPF_MAP_TYPE_SOCKMAP, NULL, sizeof(int), sizeof(int), 1, NULL);
+        if (map < 0) {
+            fprintf(stderr, "Failed to create sockmap: %s\n", strerror(errno));
+            return -1;
+        }
+
+        err = bpf_prog_attach(parse_prog_fd, map, BPF_SK_SKB_STREAM_PARSER, 0);
+        if (err){
+            fprintf(stderr, "Failed to attach_parser_prog_to_map: %s\n", strerror(errno));
+            goto out;
+        }
+
+        err = bpf_prog_attach(verdict_prog_fd, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+        if (err){
+            fprintf(stderr, "Failed to attach_verdict_prog_to_map: %s\n", strerror(errno));
+            goto out;
+        }
+
+        err = bpf_map_update_elem(map, &index, &sock, BPF_NOEXIST);
+        if (err) {
+            fprintf(stderr, "Failed to update sockmap: %s\n", strerror(errno));
+            goto out;
+        }
+
+    out:
+        close(map);
+        return err;
+    }
+
+References
+===========
+
+- https://github.com/jrfastab/linux-kernel-xdp/commit/c89fd73cb9d2d7f3c716c3e00836f07b1aeb261f
+- https://lwn.net/Articles/731133/
+- http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf
+- https://lwn.net/Articles/748628/
+- https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com/
+
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_kern.h`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
+.. _`tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
+.. _`tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c
+.. _`tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+.. _`tools/testing/selftests/bpf/test_sockmap.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_sockmap.c
+.. _`tools/testing/selftests/bpf/test_maps.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_maps.c
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_listen.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_update.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_update.c
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index b808be77635e..8db6077febdd 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1003,6 +1003,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		u8 b2 = 0, b3 = 0;
 		u8 *start_of_ldx;
 		s64 jmp_offset;
+		s16 insn_off;
 		u8 jmp_cond;
 		u8 *func;
 		int nops;
@@ -1369,57 +1370,52 @@ st: if (is_imm8(insn->off))
 		case BPF_LDX | BPF_PROBE_MEM | BPF_W:
 		case BPF_LDX | BPF_MEM | BPF_DW:
 		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
+			insn_off = insn->off;
+
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
-				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
-				 * add abs(insn->off) to the limit to make sure that negative
-				 * offset won't be an issue.
-				 * insn->off is s16, so it won't affect valid pointers.
+				/* Conservatively check that src_reg + insn->off is a kernel address:
+				 *   src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
+				 * src_reg is used as scratch for src_reg += insn->off and restored
+				 * after emit_ldx if necessary
 				 */
-				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
-				u8 *end_of_jmp1, *end_of_jmp2;
 
-				/* Conservatively check that src_reg + insn->off is a kernel address:
-				 * 1. src_reg + insn->off >= limit
-				 * 2. src_reg + insn->off doesn't become small positive.
-				 * Cannot do src_reg + insn->off >= limit in one branch,
-				 * since it needs two spare registers, but JIT has only one.
+				u64 limit = TASK_SIZE_MAX + PAGE_SIZE;
+				u8 *end_of_jmp;
+
+				/* At end of these emitted checks, insn->off will have been added
+				 * to src_reg, so no need to do relative load with insn->off offset
 				 */
+				insn_off = 0;
 
 				/* movabsq r11, limit */
 				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
 				EMIT((u32)limit, 4);
 				EMIT(limit >> 32, 4);
+
+				if (insn->off) {
+					/* add src_reg, insn->off */
+					maybe_emit_1mod(&prog, src_reg, true);
+					EMIT2_off32(0x81, add_1reg(0xC0, src_reg), insn->off);
+				}
+
 				/* cmp src_reg, r11 */
 				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
 				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
-				/* if unsigned '<' goto end_of_jmp2 */
-				EMIT2(X86_JB, 0);
-				end_of_jmp1 = prog;
-
-				/* mov r11, src_reg */
-				emit_mov_reg(&prog, true, AUX_REG, src_reg);
-				/* add r11, insn->off */
-				maybe_emit_1mod(&prog, AUX_REG, true);
-				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
-				/* jmp if not carry to start_of_ldx
-				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
-				 * that has to be rejected.
-				 */
-				EMIT2(0x73 /* JNC */, 0);
-				end_of_jmp2 = prog;
+
+				/* if unsigned '>=', goto load */
+				EMIT2(X86_JAE, 0);
+				end_of_jmp = prog;
 
 				/* xor dst_reg, dst_reg */
 				emit_mov_imm32(&prog, false, dst_reg, 0);
 				/* jmp byte_after_ldx */
 				EMIT2(0xEB, 0);
 
-				/* populate jmp_offset for JB above to jump to xor dst_reg */
-				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
-				/* populate jmp_offset for JNC above to jump to start_of_ldx */
+				/* populate jmp_offset for JAE above to jump to start_of_ldx */
 				start_of_ldx = prog;
-				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+				end_of_jmp[-1] = start_of_ldx - end_of_jmp;
 			}
-			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
+			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn_off);
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
 				struct exception_table_entry *ex;
 				u8 *_insn = image + proglen + (start_of_ldx - temp);
@@ -1428,6 +1424,18 @@ st: if (is_imm8(insn->off))
 				/* populate jmp_offset for JMP above */
 				start_of_ldx[-1] = prog - start_of_ldx;
 
+				if (insn->off && src_reg != dst_reg) {
+					/* sub src_reg, insn->off
+					 * Restore src_reg after "add src_reg, insn->off" in prev
+					 * if statement. But if src_reg == dst_reg, emit_ldx
+					 * above already clobbered src_reg, so no need to restore.
+					 * If add src_reg, insn->off was unnecessary, no need to
+					 * restore either.
+					 */
+					maybe_emit_1mod(&prog, src_reg, true);
+					EMIT2_off32(0x81, add_1reg(0xE8, src_reg), insn->off);
+				}
+
 				if (!bpf_prog->aux->extable)
 					break;
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3de24cfb7a3d..1697bd87fc06 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -189,7 +189,7 @@ struct btf_field_kptr {
 	u32 btf_id;
 };
 
-struct btf_field_list_head {
+struct btf_field_graph_root {
 	struct btf *btf;
 	u32 value_btf_id;
 	u32 node_offset;
@@ -201,7 +201,7 @@ struct btf_field {
 	enum btf_field_type type;
 	union {
 		struct btf_field_kptr kptr;
-		struct btf_field_list_head list_head;
+		struct btf_field_graph_root graph_root;
 	};
 };
 
@@ -2795,10 +2795,18 @@ struct btf_id_set;
 bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
 
 #define MAX_BPRINTF_VARARGS	12
+#define MAX_BPRINTF_BUF	1024
+
+struct bpf_bprintf_data {
+	u32 *bin_args;
+	char *buf;
+	bool get_bin_args;
+	bool get_buf;
+};
 
 int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
-			u32 **bin_buf, u32 num_args);
-void bpf_bprintf_cleanup(void);
+			u32 num_args, struct bpf_bprintf_data *data);
+void bpf_bprintf_cleanup(struct bpf_bprintf_data *data);
 
 /* the implementation of the opaque uapi struct bpf_dynptr */
 struct bpf_dynptr_kern {
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 53d175cbaa02..127058cfec47 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -92,6 +92,26 @@ struct bpf_reg_state {
 
 		u32 subprogno; /* for PTR_TO_FUNC */
 	};
+	/* For scalar types (SCALAR_VALUE), this represents our knowledge of
+	 * the actual value.
+	 * For pointer types, this represents the variable part of the offset
+	 * from the pointed-to object, and is shared with all bpf_reg_states
+	 * with the same id as us.
+	 */
+	struct tnum var_off;
+	/* Used to determine if any memory access using this register will
+	 * result in a bad access.
+	 * These refer to the same value as var_off, not necessarily the actual
+	 * contents of the register.
+	 */
+	s64 smin_value; /* minimum possible (s64)value */
+	s64 smax_value; /* maximum possible (s64)value */
+	u64 umin_value; /* minimum possible (u64)value */
+	u64 umax_value; /* maximum possible (u64)value */
+	s32 s32_min_value; /* minimum possible (s32)value */
+	s32 s32_max_value; /* maximum possible (s32)value */
+	u32 u32_min_value; /* minimum possible (u32)value */
+	u32 u32_max_value; /* maximum possible (u32)value */
 	/* For PTR_TO_PACKET, used to find other pointers with the same variable
 	 * offset, so they can share range knowledge.
 	 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
@@ -144,26 +164,6 @@ struct bpf_reg_state {
 	 * allowed and has the same effect as bpf_sk_release(sk).
 	 */
 	u32 ref_obj_id;
-	/* For scalar types (SCALAR_VALUE), this represents our knowledge of
-	 * the actual value.
-	 * For pointer types, this represents the variable part of the offset
-	 * from the pointed-to object, and is shared with all bpf_reg_states
-	 * with the same id as us.
-	 */
-	struct tnum var_off;
-	/* Used to determine if any memory access using this register will
-	 * result in a bad access.
-	 * These refer to the same value as var_off, not necessarily the actual
-	 * contents of the register.
-	 */
-	s64 smin_value; /* minimum possible (s64)value */
-	s64 smax_value; /* maximum possible (s64)value */
-	u64 umin_value; /* minimum possible (u64)value */
-	u64 umax_value; /* maximum possible (u64)value */
-	s32 s32_min_value; /* minimum possible (s32)value */
-	s32 s32_max_value; /* maximum possible (s32)value */
-	u32 u32_min_value; /* minimum possible (u32)value */
-	u32 u32_max_value; /* maximum possible (u32)value */
 	/* parentage chain for liveness checking */
 	struct bpf_reg_state *parent;
 	/* Inside the callee two registers can be both PTR_TO_STACK like
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 464ca3f01fe7..bc1a3d232ae4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2001,6 +2001,9 @@ union bpf_attr {
  * 			sending the packet. This flag was added for GRE
  * 			encapsulation, but might be used with other protocols
  * 			as well in the future.
+ * 		**BPF_F_NO_TUNNEL_KEY**
+ * 			Add a flag to tunnel metadata indicating that no tunnel
+ * 			key should be set in the resulting tunnel header.
  *
  * 		Here is a typical usage on the transmit path:
  *
@@ -5764,6 +5767,7 @@ enum {
 	BPF_F_ZERO_CSUM_TX = (1ULL << 1),
 	BPF_F_DONT_FRAGMENT = (1ULL << 2),
 	BPF_F_SEQ_NUMBER = (1ULL << 3),
+	BPF_F_NO_TUNNEL_KEY = (1ULL << 4),
 };
 
 /* BPF_FUNC_skb_get_tunnel_key flags. */
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index b39a46e8fb08..373c3c2c75bc 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -580,8 +580,8 @@ static struct bpf_local_storage_map *__bpf_local_storage_map_alloc(union bpf_att
 		raw_spin_lock_init(&smap->buckets[i].lock);
 	}
 
-	smap->elem_size =
-		sizeof(struct bpf_local_storage_elem) + attr->value_size;
+	smap->elem_size = offsetof(struct bpf_local_storage_elem,
+				   sdata.data[attr->value_size]);
 
 	return smap;
 }
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index f7dd8af06413..578cee398550 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3228,7 +3228,7 @@ struct btf_field_info {
 		struct {
 			const char *node_name;
 			u32 value_btf_id;
-		} list_head;
+		} graph_root;
 	};
 };
 
@@ -3335,8 +3335,8 @@ static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
 		return -EINVAL;
 	info->type = BPF_LIST_HEAD;
 	info->off = off;
-	info->list_head.value_btf_id = id;
-	info->list_head.node_name = list_node;
+	info->graph_root.value_btf_id = id;
+	info->graph_root.node_name = list_node;
 	return BTF_FIELD_FOUND;
 }
 
@@ -3604,13 +3604,14 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
 	u32 offset;
 	int i;
 
-	t = btf_type_by_id(btf, info->list_head.value_btf_id);
+	t = btf_type_by_id(btf, info->graph_root.value_btf_id);
 	/* We've already checked that value_btf_id is a struct type. We
 	 * just need to figure out the offset of the list_node, and
 	 * verify its type.
 	 */
 	for_each_member(i, t, member) {
-		if (strcmp(info->list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
+		if (strcmp(info->graph_root.node_name,
+			   __btf_name_by_offset(btf, member->name_off)))
 			continue;
 		/* Invalid BTF, two members with same name */
 		if (n)
@@ -3627,9 +3628,9 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
 		if (offset % __alignof__(struct bpf_list_node))
 			return -EINVAL;
 
-		field->list_head.btf = (struct btf *)btf;
-		field->list_head.value_btf_id = info->list_head.value_btf_id;
-		field->list_head.node_offset = offset;
+		field->graph_root.btf = (struct btf *)btf;
+		field->graph_root.value_btf_id = info->graph_root.value_btf_id;
+		field->graph_root.node_offset = offset;
 	}
 	if (!n)
 		return -ENOENT;
@@ -3736,11 +3737,11 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
 
 		if (!(rec->fields[i].type & BPF_LIST_HEAD))
 			continue;
-		btf_id = rec->fields[i].list_head.value_btf_id;
+		btf_id = rec->fields[i].graph_root.value_btf_id;
 		meta = btf_find_struct_meta(btf, btf_id);
 		if (!meta)
 			return -EFAULT;
-		rec->fields[i].list_head.value_rec = meta->record;
+		rec->fields[i].graph_root.value_rec = meta->record;
 
 		if (!(rec->field_mask & BPF_LIST_NODE))
 			continue;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index af30c6cbd65d..458db2db2f81 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -756,19 +756,20 @@ static int bpf_trace_copy_string(char *buf, void *unsafe_ptr, char fmt_ptype,
 /* Per-cpu temp buffers used by printf-like helpers to store the bprintf binary
  * arguments representation.
  */
-#define MAX_BPRINTF_BUF_LEN	512
+#define MAX_BPRINTF_BIN_ARGS	512
 
 /* Support executing three nested bprintf helper calls on a given CPU */
 #define MAX_BPRINTF_NEST_LEVEL	3
 struct bpf_bprintf_buffers {
-	char tmp_bufs[MAX_BPRINTF_NEST_LEVEL][MAX_BPRINTF_BUF_LEN];
+	char bin_args[MAX_BPRINTF_BIN_ARGS];
+	char buf[MAX_BPRINTF_BUF];
 };
-static DEFINE_PER_CPU(struct bpf_bprintf_buffers, bpf_bprintf_bufs);
+
+static DEFINE_PER_CPU(struct bpf_bprintf_buffers[MAX_BPRINTF_NEST_LEVEL], bpf_bprintf_bufs);
 static DEFINE_PER_CPU(int, bpf_bprintf_nest_level);
 
-static int try_get_fmt_tmp_buf(char **tmp_buf)
+static int try_get_buffers(struct bpf_bprintf_buffers **bufs)
 {
-	struct bpf_bprintf_buffers *bufs;
 	int nest_level;
 
 	preempt_disable();
@@ -778,18 +779,19 @@ static int try_get_fmt_tmp_buf(char **tmp_buf)
 		preempt_enable();
 		return -EBUSY;
 	}
-	bufs = this_cpu_ptr(&bpf_bprintf_bufs);
-	*tmp_buf = bufs->tmp_bufs[nest_level - 1];
+	*bufs = this_cpu_ptr(&bpf_bprintf_bufs[nest_level - 1]);
 
 	return 0;
 }
 
-void bpf_bprintf_cleanup(void)
+void bpf_bprintf_cleanup(struct bpf_bprintf_data *data)
 {
-	if (this_cpu_read(bpf_bprintf_nest_level)) {
-		this_cpu_dec(bpf_bprintf_nest_level);
-		preempt_enable();
-	}
+	if (!data->bin_args && !data->buf)
+		return;
+	if (WARN_ON_ONCE(this_cpu_read(bpf_bprintf_nest_level) == 0))
+		return;
+	this_cpu_dec(bpf_bprintf_nest_level);
+	preempt_enable();
 }
 
 /*
@@ -798,18 +800,20 @@ void bpf_bprintf_cleanup(void)
  * Returns a negative value if fmt is an invalid format string or 0 otherwise.
  *
  * This can be used in two ways:
- * - Format string verification only: when bin_args is NULL
+ * - Format string verification only: when data->get_bin_args is false
  * - Arguments preparation: in addition to the above verification, it writes in
- *   bin_args a binary representation of arguments usable by bstr_printf where
- *   pointers from BPF have been sanitized.
+ * data->bin_args a binary representation of arguments usable by bstr_printf + * where pointers from BPF have been sanitized. * * In argument preparation mode, if 0 is returned, safe temporary buffers are * allocated and bpf_bprintf_cleanup should be called to free them after use. */ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args, - u32 **bin_args, u32 num_args) + u32 num_args, struct bpf_bprintf_data *data) { + bool get_buffers = (data->get_bin_args && num_args) || data->get_buf; char *unsafe_ptr = NULL, *tmp_buf = NULL, *tmp_buf_end, *fmt_end; + struct bpf_bprintf_buffers *buffers = NULL; size_t sizeof_cur_arg, sizeof_cur_ip; int err, i, num_spec = 0; u64 cur_arg; @@ -820,14 +824,19 @@ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args, return -EINVAL; fmt_size = fmt_end - fmt; - if (bin_args) { - if (num_args && try_get_fmt_tmp_buf(&tmp_buf)) - return -EBUSY; + if (get_buffers && try_get_buffers(&buffers)) + return -EBUSY; - tmp_buf_end = tmp_buf + MAX_BPRINTF_BUF_LEN; - *bin_args = (u32 *)tmp_buf; + if (data->get_bin_args) { + if (num_args) + tmp_buf = buffers->bin_args; + tmp_buf_end = tmp_buf + MAX_BPRINTF_BIN_ARGS; + data->bin_args = (u32 *)tmp_buf; } + if (data->get_buf) + data->buf = buffers->buf; + for (i = 0; i < fmt_size; i++) { if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i])) { err = -EINVAL; @@ -1021,31 +1030,33 @@ nocopy_fmt: err = 0; out: if (err) - bpf_bprintf_cleanup(); + bpf_bprintf_cleanup(data); return err; } BPF_CALL_5(bpf_snprintf, char *, str, u32, str_size, char *, fmt, - const void *, data, u32, data_len) + const void *, args, u32, data_len) { + struct bpf_bprintf_data data = { + .get_bin_args = true, + }; int err, num_args; - u32 *bin_args; if (data_len % 8 || data_len > MAX_BPRINTF_VARARGS * 8 || - (data_len && !data)) + (data_len && !args)) return -EINVAL; num_args = data_len / 8; /* ARG_PTR_TO_CONST_STR guarantees that fmt is zero-terminated so we * can safely give an unbounded 
size. */ - err = bpf_bprintf_prepare(fmt, UINT_MAX, data, &bin_args, num_args); + err = bpf_bprintf_prepare(fmt, UINT_MAX, args, num_args, &data); if (err < 0) return err; - err = bstr_printf(str, str_size, fmt, bin_args); + err = bstr_printf(str, str_size, fmt, data.bin_args); - bpf_bprintf_cleanup(); + bpf_bprintf_cleanup(&data); return err + 1; } @@ -1745,12 +1756,12 @@ unlock: while (head != orig_head) { void *obj = head; - obj -= field->list_head.node_offset; + obj -= field->graph_root.node_offset; head = head->next; /* The contained type can also have resources, including a * bpf_list_head which needs to be freed. */ - bpf_obj_free_fields(field->list_head.value_rec, obj); + bpf_obj_free_fields(field->graph_root.value_rec, obj); /* bpf_mem_free requires migrate_disable(), since we can be * called from map free path as well apart from BPF program (as * part of map ops doing bpf_obj_free_fields). diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 64131f88c553..35ffd808f281 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -5319,7 +5319,6 @@ static struct ctl_table bpf_syscall_table[] = { { .procname = "bpf_stats_enabled", .data = &bpf_stats_enabled_key.key, - .maxlen = sizeof(bpf_stats_enabled_key), .mode = 0644, .proc_handler = bpf_stats_handler, }, diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index a5255a0dcbb6..4a25375ebb0d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1402,9 +1402,11 @@ static void ___mark_reg_known(struct bpf_reg_state *reg, u64 imm) */ static void __mark_reg_known(struct bpf_reg_state *reg, u64 imm) { - /* Clear id, off, and union(map_ptr, range) */ + /* Clear off and union(map_ptr, range) */ memset(((u8 *)reg) + sizeof(reg->type), 0, offsetof(struct bpf_reg_state, var_off) - sizeof(reg->type)); + reg->id = 0; + reg->ref_obj_id = 0; ___mark_reg_known(reg, imm); } @@ -1750,11 +1752,13 @@ static void __mark_reg_unknown(const struct bpf_verifier_env *env, struct bpf_reg_state *reg) 
{ /* - * Clear type, id, off, and union(map_ptr, range) and + * Clear type, off, and union(map_ptr, range) and * padding between 'type' and union */ memset(reg, 0, offsetof(struct bpf_reg_state, var_off)); reg->type = SCALAR_VALUE; + reg->id = 0; + reg->ref_obj_id = 0; reg->var_off = tnum_unknown; reg->frameno = 0; reg->precise = !env->bpf_capable; @@ -7612,6 +7616,7 @@ static int check_bpf_snprintf_call(struct bpf_verifier_env *env, struct bpf_reg_state *fmt_reg = &regs[BPF_REG_3]; struct bpf_reg_state *data_len_reg = &regs[BPF_REG_5]; struct bpf_map *fmt_map = fmt_reg->map_ptr; + struct bpf_bprintf_data data = {}; int err, fmt_map_off, num_args; u64 fmt_addr; char *fmt; @@ -7636,7 +7641,7 @@ static int check_bpf_snprintf_call(struct bpf_verifier_env *env, /* We are also guaranteed that fmt+fmt_map_off is NULL terminated, we * can focus on validating the format specifiers. */ - err = bpf_bprintf_prepare(fmt, UINT_MAX, NULL, NULL, num_args); + err = bpf_bprintf_prepare(fmt, UINT_MAX, NULL, num_args, &data); if (err < 0) verbose(env, "Invalid format string\n"); @@ -8771,21 +8776,22 @@ static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env, field = meta->arg_list_head.field; - et = btf_type_by_id(field->list_head.btf, field->list_head.value_btf_id); + et = btf_type_by_id(field->graph_root.btf, field->graph_root.value_btf_id); t = btf_type_by_id(reg->btf, reg->btf_id); - if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->list_head.btf, - field->list_head.value_btf_id, true)) { + if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->graph_root.btf, + field->graph_root.value_btf_id, true)) { verbose(env, "operation on bpf_list_head expects arg#1 bpf_list_node at offset=%d " "in struct %s, but arg is at offset=%d in struct %s\n", - field->list_head.node_offset, btf_name_by_offset(field->list_head.btf, et->name_off), + field->graph_root.node_offset, + btf_name_by_offset(field->graph_root.btf, et->name_off), list_node_off,
btf_name_by_offset(reg->btf, t->name_off)); return -EINVAL; } - if (list_node_off != field->list_head.node_offset) { + if (list_node_off != field->graph_root.node_offset) { verbose(env, "arg#1 offset=%d, but expected bpf_list_node at offset=%d in struct %s\n", - list_node_off, field->list_head.node_offset, - btf_name_by_offset(field->list_head.btf, et->name_off)); + list_node_off, field->graph_root.node_offset, + btf_name_by_offset(field->graph_root.btf, et->name_off)); return -EINVAL; } /* Set arg#1 for expiration after unlock */ @@ -9227,9 +9233,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, mark_reg_known_zero(env, regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC; - regs[BPF_REG_0].btf = field->list_head.btf; - regs[BPF_REG_0].btf_id = field->list_head.value_btf_id; - regs[BPF_REG_0].off = field->list_head.node_offset; + regs[BPF_REG_0].btf = field->graph_root.btf; + regs[BPF_REG_0].btf_id = field->graph_root.value_btf_id; + regs[BPF_REG_0].off = field->graph_root.node_offset; } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) { mark_reg_known_zero(env, regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED; @@ -12941,6 +12947,13 @@ static bool check_ids(u32 old_id, u32 cur_id, struct bpf_id_pair *idmap) { unsigned int i; + /* either both IDs should be set or both should be zero */ + if (!!old_id != !!cur_id) + return false; + + if (old_id == 0) /* cur_id == 0 as well */ + return true; + for (i = 0; i < BPF_ID_MAP_SIZE; i++) { if (!idmap[i].old) { /* Reached an empty slot; haven't seen this id before */ @@ -13052,79 +13065,74 @@ next: } } +static bool regs_exact(const struct bpf_reg_state *rold, + const struct bpf_reg_state *rcur, + struct bpf_id_pair *idmap) +{ + return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 && + check_ids(rold->id, rcur->id, idmap) && + check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap); +} + /* Returns true if (rold safe implies 
rcur safe) */ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold, struct bpf_reg_state *rcur, struct bpf_id_pair *idmap) { - bool equal; - if (!(rold->live & REG_LIVE_READ)) /* explored state didn't use this */ return true; - - equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0; - if (rold->type == NOT_INIT) /* explored state can't have used this */ return true; if (rcur->type == NOT_INIT) return false; + + /* Enforce that register types have to match exactly, including their + * modifiers (like PTR_MAYBE_NULL, MEM_RDONLY, etc), as a general + * rule. + * + * One can make a point that using a pointer register as unbounded + * SCALAR would be technically acceptable, but this could lead to + * pointer leaks because scalars are allowed to leak while pointers + * are not. We could make this safe in special cases if root is + * calling us, but it's probably not worth the hassle. + * + * Also, register types that are *not* MAYBE_NULL could technically be + * safe to use as their MAYBE_NULL variants (e.g., PTR_TO_MAP_VALUE + * is safe to be used as PTR_TO_MAP_VALUE_OR_NULL, provided both point + * to the same map). + * However, if the old MAYBE_NULL register then got NULL checked, + * doing so could have affected others with the same id, and we can't + * check for that because we lost the id when we converted to + * a non-MAYBE_NULL variant. + * So, as a general rule we don't allow mixing MAYBE_NULL and + * non-MAYBE_NULL registers as well. + */ + if (rold->type != rcur->type) + return false; + switch (base_type(rold->type)) { case SCALAR_VALUE: - if (equal) + if (regs_exact(rold, rcur, idmap)) return true; if (env->explore_alu_limits) return false; - if (rcur->type == SCALAR_VALUE) { - if (!rold->precise) - return true; - /* new val must satisfy old val knowledge */ - return range_within(rold, rcur) && - tnum_in(rold->var_off, rcur->var_off); - } else { - /* We're trying to use a pointer in place of a scalar. 
- * Even if the scalar was unbounded, this could lead to - * pointer leaks because scalars are allowed to leak - * while pointers are not. We could make this safe in - * special cases if root is calling us, but it's - * probably not worth the hassle. - */ - return false; - } + if (!rold->precise) + return true; + /* new val must satisfy old val knowledge */ + return range_within(rold, rcur) && + tnum_in(rold->var_off, rcur->var_off); case PTR_TO_MAP_KEY: case PTR_TO_MAP_VALUE: - /* a PTR_TO_MAP_VALUE could be safe to use as a - * PTR_TO_MAP_VALUE_OR_NULL into the same map. - * However, if the old PTR_TO_MAP_VALUE_OR_NULL then got NULL- - * checked, doing so could have affected others with the same - * id, and we can't check for that because we lost the id when - * we converted to a PTR_TO_MAP_VALUE. - */ - if (type_may_be_null(rold->type)) { - if (!type_may_be_null(rcur->type)) - return false; - if (memcmp(rold, rcur, offsetof(struct bpf_reg_state, id))) - return false; - /* Check our ids match any regs they're supposed to */ - return check_ids(rold->id, rcur->id, idmap); - } - /* If the new min/max/var_off satisfy the old ones and * everything else matches, we are OK. - * 'id' is not compared, since it's only used for maps with - * bpf_spin_lock inside map element and in such cases if - * the rest of the prog is valid for one map element then - * it's valid for all map elements regardless of the key - * used in bpf_map_lookup() */ - return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 && + return memcmp(rold, rcur, offsetof(struct bpf_reg_state, var_off)) == 0 && range_within(rold, rcur) && tnum_in(rold->var_off, rcur->var_off) && check_ids(rold->id, rcur->id, idmap); case PTR_TO_PACKET_META: case PTR_TO_PACKET: - if (rcur->type != rold->type) - return false; /* We must have at least as much range as the old ptr * did, so that any accesses which were safe before are * still safe. 
This is true even if old range < old off, @@ -13139,7 +13147,7 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold, if (rold->off != rcur->off) return false; /* id relations must be preserved */ - if (rold->id && !check_ids(rold->id, rcur->id, idmap)) + if (!check_ids(rold->id, rcur->id, idmap)) return false; /* new val must satisfy old val knowledge */ return range_within(rold, rcur) && @@ -13148,15 +13156,10 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold, /* two stack pointers are equal only if they're pointing to * the same stack frame, since fp-8 in foo != fp-8 in bar */ - return equal && rold->frameno == rcur->frameno; + return regs_exact(rold, rcur, idmap) && rold->frameno == rcur->frameno; default: - /* Only valid matches are exact, which memcmp() */ - return equal; + return regs_exact(rold, rcur, idmap); } - - /* Shouldn't get here; if we do, say it's not safe */ - WARN_ON_ONCE(1); - return false; } static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, @@ -13222,12 +13225,20 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, return true; } -static bool refsafe(struct bpf_func_state *old, struct bpf_func_state *cur) +static bool refsafe(struct bpf_func_state *old, struct bpf_func_state *cur, + struct bpf_id_pair *idmap) { + int i; + if (old->acquired_refs != cur->acquired_refs) return false; - return !memcmp(old->refs, cur->refs, - sizeof(*old->refs) * old->acquired_refs); + + for (i = 0; i < old->acquired_refs; i++) { + if (!check_ids(old->refs[i].id, cur->refs[i].id, idmap)) + return false; + } + + return true; } /* compare two verifier states @@ -13269,7 +13280,7 @@ static bool func_states_equal(struct bpf_verifier_env *env, struct bpf_func_stat if (!stacksafe(env, old, cur, env->idmap_scratch)) return false; - if (!refsafe(old, cur)) + if (!refsafe(old, cur, env->idmap_scratch)) return false; return true; diff --git 
a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 3bbd3f0c810c..23ce498bca97 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -369,8 +369,6 @@ static const struct bpf_func_proto *bpf_get_probe_write_proto(void) return &bpf_probe_write_user_proto; } -static DEFINE_RAW_SPINLOCK(trace_printk_lock); - #define MAX_TRACE_PRINTK_VARARGS 3 #define BPF_TRACE_PRINTK_SIZE 1024 @@ -378,23 +376,22 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, u64, arg2, u64, arg3) { u64 args[MAX_TRACE_PRINTK_VARARGS] = { arg1, arg2, arg3 }; - u32 *bin_args; - static char buf[BPF_TRACE_PRINTK_SIZE]; - unsigned long flags; + struct bpf_bprintf_data data = { + .get_bin_args = true, + .get_buf = true, + }; int ret; - ret = bpf_bprintf_prepare(fmt, fmt_size, args, &bin_args, - MAX_TRACE_PRINTK_VARARGS); + ret = bpf_bprintf_prepare(fmt, fmt_size, args, + MAX_TRACE_PRINTK_VARARGS, &data); if (ret < 0) return ret; - raw_spin_lock_irqsave(&trace_printk_lock, flags); - ret = bstr_printf(buf, sizeof(buf), fmt, bin_args); + ret = bstr_printf(data.buf, MAX_BPRINTF_BUF, fmt, data.bin_args); - trace_bpf_trace_printk(buf); - raw_spin_unlock_irqrestore(&trace_printk_lock, flags); + trace_bpf_trace_printk(data.buf); - bpf_bprintf_cleanup(); + bpf_bprintf_cleanup(&data); return ret; } @@ -427,30 +424,29 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void) return &bpf_trace_printk_proto; } -BPF_CALL_4(bpf_trace_vprintk, char *, fmt, u32, fmt_size, const void *, data, +BPF_CALL_4(bpf_trace_vprintk, char *, fmt, u32, fmt_size, const void *, args, u32, data_len) { - static char buf[BPF_TRACE_PRINTK_SIZE]; - unsigned long flags; + struct bpf_bprintf_data data = { + .get_bin_args = true, + .get_buf = true, + }; int ret, num_args; - u32 *bin_args; if (data_len & 7 || data_len > MAX_BPRINTF_VARARGS * 8 || - (data_len && !data)) + (data_len && !args)) return -EINVAL; num_args = data_len / 8; - ret = bpf_bprintf_prepare(fmt, fmt_size, data, &bin_args, 
num_args); + ret = bpf_bprintf_prepare(fmt, fmt_size, args, num_args, &data); if (ret < 0) return ret; - raw_spin_lock_irqsave(&trace_printk_lock, flags); - ret = bstr_printf(buf, sizeof(buf), fmt, bin_args); + ret = bstr_printf(data.buf, MAX_BPRINTF_BUF, fmt, data.bin_args); - trace_bpf_trace_printk(buf); - raw_spin_unlock_irqrestore(&trace_printk_lock, flags); + trace_bpf_trace_printk(data.buf); - bpf_bprintf_cleanup(); + bpf_bprintf_cleanup(&data); return ret; } @@ -472,23 +468,25 @@ const struct bpf_func_proto *bpf_get_trace_vprintk_proto(void) } BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size, - const void *, data, u32, data_len) + const void *, args, u32, data_len) { + struct bpf_bprintf_data data = { + .get_bin_args = true, + }; int err, num_args; - u32 *bin_args; if (data_len & 7 || data_len > MAX_BPRINTF_VARARGS * 8 || - (data_len && !data)) + (data_len && !args)) return -EINVAL; num_args = data_len / 8; - err = bpf_bprintf_prepare(fmt, fmt_size, data, &bin_args, num_args); + err = bpf_bprintf_prepare(fmt, fmt_size, args, num_args, &data); if (err < 0) return err; - seq_bprintf(m, fmt, bin_args); + seq_bprintf(m, fmt, data.bin_args); - bpf_bprintf_cleanup(); + bpf_bprintf_cleanup(&data); return seq_has_overflowed(m) ? 
-EOVERFLOW : 0; } diff --git a/net/core/filter.c b/net/core/filter.c index 929358677183..c746e4d77214 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4615,7 +4615,8 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb, struct ip_tunnel_info *info; if (unlikely(flags & ~(BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX | - BPF_F_DONT_FRAGMENT | BPF_F_SEQ_NUMBER))) + BPF_F_DONT_FRAGMENT | BPF_F_SEQ_NUMBER | + BPF_F_NO_TUNNEL_KEY))) return -EINVAL; if (unlikely(size != sizeof(struct bpf_tunnel_key))) { switch (size) { @@ -4653,6 +4654,8 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb, info->key.tun_flags &= ~TUNNEL_CSUM; if (flags & BPF_F_SEQ_NUMBER) info->key.tun_flags |= TUNNEL_SEQ; + if (flags & BPF_F_NO_TUNNEL_KEY) + info->key.tun_flags &= ~TUNNEL_KEY; info->key.tun_id = cpu_to_be64(from->tunnel_id); info->key.tos = from->tunnel_tos; diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 727da3c5879b..22039a0a5b35 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -125,21 +125,21 @@ always-y += sockex1_kern.o always-y += sockex2_kern.o always-y += sockex3_kern.o always-y += tracex1_kern.o -always-y += tracex2_kern.o +always-y += tracex2.bpf.o always-y += tracex3_kern.o always-y += tracex4_kern.o always-y += tracex5_kern.o always-y += tracex6_kern.o always-y += tracex7_kern.o always-y += sock_flags_kern.o -always-y += test_probe_write_user_kern.o -always-y += trace_output_kern.o +always-y += test_probe_write_user.bpf.o +always-y += trace_output.bpf.o always-y += tcbpf1_kern.o always-y += tc_l2_redirect_kern.o always-y += lathist_kern.o always-y += offwaketime_kern.o always-y += spintest_kern.o -always-y += map_perf_test_kern.o +always-y += map_perf_test.bpf.o always-y += test_overhead_tp_kern.o always-y += test_overhead_raw_tp_kern.o always-y += test_overhead_kprobe_kern.o @@ -147,7 +147,7 @@ always-y += parse_varlen.o parse_simple.o parse_ldabs.o always-y += test_cgrp2_tc_kern.o always-y += xdp1_kern.o always-y += 
xdp2_kern.o -always-y += test_current_task_under_cgroup_kern.o +always-y += test_current_task_under_cgroup.bpf.o always-y += trace_event_kern.o always-y += sampleip_kern.o always-y += lwt_len_hist_kern.o diff --git a/samples/bpf/gnu/stubs.h b/samples/bpf/gnu/stubs.h new file mode 100644 index 000000000000..719225b16626 --- /dev/null +++ b/samples/bpf/gnu/stubs.h @@ -0,0 +1 @@ +/* dummy .h to trick /usr/include/features.h to work with 'clang -target bpf' */ diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test.bpf.c index 7342c5b2f278..3cdeba2afe12 100644 --- a/samples/bpf/map_perf_test_kern.c +++ b/samples/bpf/map_perf_test.bpf.c @@ -4,14 +4,12 @@ * modify it under the terms of version 2 of the GNU General Public * License as published by the Free Software Foundation. */ -#include <linux/skbuff.h> -#include <linux/netdevice.h> +#include "vmlinux.h" +#include <errno.h> #include <linux/version.h> -#include <uapi/linux/bpf.h> #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> #include <bpf/bpf_core_read.h> -#include "trace_common.h" #define MAX_ENTRIES 1000 #define MAX_NR_CPUS 1024 @@ -102,8 +100,8 @@ struct { __uint(max_entries, MAX_ENTRIES); } lru_hash_lookup_map SEC(".maps"); -SEC("kprobe/" SYSCALL(sys_getuid)) -int stress_hmap(struct pt_regs *ctx) +SEC("ksyscall/getuid") +int BPF_KSYSCALL(stress_hmap) { u32 key = bpf_get_current_pid_tgid(); long init_val = 1; @@ -120,8 +118,8 @@ int stress_hmap(struct pt_regs *ctx) return 0; } -SEC("kprobe/" SYSCALL(sys_geteuid)) -int stress_percpu_hmap(struct pt_regs *ctx) +SEC("ksyscall/geteuid") +int BPF_KSYSCALL(stress_percpu_hmap) { u32 key = bpf_get_current_pid_tgid(); long init_val = 1; @@ -137,8 +135,8 @@ int stress_percpu_hmap(struct pt_regs *ctx) return 0; } -SEC("kprobe/" SYSCALL(sys_getgid)) -int stress_hmap_alloc(struct pt_regs *ctx) +SEC("ksyscall/getgid") +int BPF_KSYSCALL(stress_hmap_alloc) { u32 key = bpf_get_current_pid_tgid(); long init_val = 1; @@ -154,8 +152,8 @@ int 
stress_hmap_alloc(struct pt_regs *ctx) return 0; } -SEC("kprobe/" SYSCALL(sys_getegid)) -int stress_percpu_hmap_alloc(struct pt_regs *ctx) +SEC("ksyscall/getegid") +int BPF_KSYSCALL(stress_percpu_hmap_alloc) { u32 key = bpf_get_current_pid_tgid(); long init_val = 1; @@ -170,11 +168,10 @@ int stress_percpu_hmap_alloc(struct pt_regs *ctx) } return 0; } - -SEC("kprobe/" SYSCALL(sys_connect)) -int stress_lru_hmap_alloc(struct pt_regs *ctx) +SEC("ksyscall/connect") +int BPF_KSYSCALL(stress_lru_hmap_alloc, int fd, struct sockaddr_in *uservaddr, + int addrlen) { - struct pt_regs *real_regs = (struct pt_regs *)PT_REGS_PARM1_CORE(ctx); char fmt[] = "Failed at stress_lru_hmap_alloc. ret:%dn"; union { u16 dst6[8]; @@ -187,14 +184,11 @@ int stress_lru_hmap_alloc(struct pt_regs *ctx) u32 key; }; } test_params; - struct sockaddr_in6 *in6; + struct sockaddr_in6 *in6 = (struct sockaddr_in6 *)uservaddr; u16 test_case; - int addrlen, ret; long val = 1; u32 key = 0; - - in6 = (struct sockaddr_in6 *)PT_REGS_PARM2_CORE(real_regs); - addrlen = (int)PT_REGS_PARM3_CORE(real_regs); + int ret; if (addrlen != sizeof(*in6)) return 0; @@ -251,8 +245,8 @@ done: return 0; } -SEC("kprobe/" SYSCALL(sys_gettid)) -int stress_lpm_trie_map_alloc(struct pt_regs *ctx) +SEC("ksyscall/gettid") +int BPF_KSYSCALL(stress_lpm_trie_map_alloc) { union { u32 b32[2]; @@ -273,8 +267,8 @@ int stress_lpm_trie_map_alloc(struct pt_regs *ctx) return 0; } -SEC("kprobe/" SYSCALL(sys_getpgid)) -int stress_hash_map_lookup(struct pt_regs *ctx) +SEC("ksyscall/getpgid") +int BPF_KSYSCALL(stress_hash_map_lookup) { u32 key = 1, i; long *value; @@ -286,8 +280,8 @@ int stress_hash_map_lookup(struct pt_regs *ctx) return 0; } -SEC("kprobe/" SYSCALL(sys_getppid)) -int stress_array_map_lookup(struct pt_regs *ctx) +SEC("ksyscall/getppid") +int BPF_KSYSCALL(stress_array_map_lookup) { u32 key = 1, i; long *value; diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c index 1bb53f4b29e1..d2fbcf963cdf 100644 --- 
a/samples/bpf/map_perf_test_user.c +++ b/samples/bpf/map_perf_test_user.c @@ -443,7 +443,7 @@ int main(int argc, char **argv) if (argc > 4) max_cnt = atoi(argv[4]); - snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + snprintf(filename, sizeof(filename), "%s.bpf.o", argv[0]); obj = bpf_object__open_file(filename, NULL); if (libbpf_get_error(obj)) { fprintf(stderr, "ERROR: opening BPF object file failed\n"); diff --git a/samples/bpf/test_current_task_under_cgroup_kern.c b/samples/bpf/test_current_task_under_cgroup.bpf.c index fbd43e2bb4d3..58b9cf7ed659 100644 --- a/samples/bpf/test_current_task_under_cgroup_kern.c +++ b/samples/bpf/test_current_task_under_cgroup.bpf.c @@ -5,12 +5,11 @@ * License as published by the Free Software Foundation. */ -#include <linux/ptrace.h> -#include <uapi/linux/bpf.h> +#include "vmlinux.h" #include <linux/version.h> #include <bpf/bpf_helpers.h> -#include <uapi/linux/utsname.h> -#include "trace_common.h" +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h> struct { __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY); @@ -27,8 +26,8 @@ struct { } perf_map SEC(".maps"); /* Writes the last PID that called sync to a map at index 0 */ -SEC("kprobe/" SYSCALL(sys_sync)) -int bpf_prog1(struct pt_regs *ctx) +SEC("ksyscall/sync") +int BPF_KSYSCALL(bpf_prog1) { u64 pid = bpf_get_current_pid_tgid(); int idx = 0; diff --git a/samples/bpf/test_current_task_under_cgroup_user.c b/samples/bpf/test_current_task_under_cgroup_user.c index ac251a417f45..9726ed2a8a8b 100644 --- a/samples/bpf/test_current_task_under_cgroup_user.c +++ b/samples/bpf/test_current_task_under_cgroup_user.c @@ -14,14 +14,14 @@ int main(int argc, char **argv) { pid_t remote_pid, local_pid = getpid(); + int cg2 = -1, idx = 0, rc = 1; struct bpf_link *link = NULL; struct bpf_program *prog; - int cg2, idx = 0, rc = 1; struct bpf_object *obj; char filename[256]; int map_fd[2]; - snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + snprintf(filename, sizeof(filename), 
"%s.bpf.o", argv[0]); obj = bpf_object__open_file(filename, NULL); if (libbpf_get_error(obj)) { fprintf(stderr, "ERROR: opening BPF object file failed\n"); @@ -103,7 +103,9 @@ int main(int argc, char **argv) rc = 0; err: - close(cg2); + if (cg2 != -1) + close(cg2); + cleanup_cgroup_environment(); cleanup: diff --git a/samples/bpf/test_lru_dist.c b/samples/bpf/test_lru_dist.c index 5efb91763d65..1c161276d57b 100644 --- a/samples/bpf/test_lru_dist.c +++ b/samples/bpf/test_lru_dist.c @@ -42,11 +42,6 @@ static inline void INIT_LIST_HEAD(struct list_head *list) list->prev = list; } -static inline int list_empty(const struct list_head *head) -{ - return head->next == head; -} - static inline void __list_add(struct list_head *new, struct list_head *prev, struct list_head *next) diff --git a/samples/bpf/test_map_in_map_kern.c b/samples/bpf/test_map_in_map_kern.c index b0200c8eac09..0e17f9ade5c5 100644 --- a/samples/bpf/test_map_in_map_kern.c +++ b/samples/bpf/test_map_in_map_kern.c @@ -13,7 +13,6 @@ #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> #include <bpf/bpf_core_read.h> -#include "trace_common.h" #define MAX_NR_PORTS 65536 diff --git a/samples/bpf/test_probe_write_user_kern.c b/samples/bpf/test_probe_write_user.bpf.c index 220a96438d75..a4f3798b7fb0 100644 --- a/samples/bpf/test_probe_write_user_kern.c +++ b/samples/bpf/test_probe_write_user.bpf.c @@ -4,14 +4,12 @@ * modify it under the terms of version 2 of the GNU General Public * License as published by the Free Software Foundation. 
*/ -#include <linux/skbuff.h> -#include <linux/netdevice.h> -#include <uapi/linux/bpf.h> +#include "vmlinux.h" +#include <string.h> #include <linux/version.h> #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> #include <bpf/bpf_core_read.h> -#include "trace_common.h" struct { __uint(type, BPF_MAP_TYPE_HASH); @@ -28,25 +26,23 @@ struct { * This example sits on a syscall, and the syscall ABI is relatively stable * of course, across platforms, and over time, the ABI may change. */ -SEC("kprobe/" SYSCALL(sys_connect)) -int bpf_prog1(struct pt_regs *ctx) +SEC("ksyscall/connect") +int BPF_KSYSCALL(bpf_prog1, int fd, struct sockaddr_in *uservaddr, + int addrlen) { - struct pt_regs *real_regs = (struct pt_regs *)PT_REGS_PARM1_CORE(ctx); - void *sockaddr_arg = (void *)PT_REGS_PARM2_CORE(real_regs); - int sockaddr_len = (int)PT_REGS_PARM3_CORE(real_regs); struct sockaddr_in new_addr, orig_addr = {}; struct sockaddr_in *mapped_addr; - if (sockaddr_len > sizeof(orig_addr)) + if (addrlen > sizeof(orig_addr)) return 0; - if (bpf_probe_read_user(&orig_addr, sizeof(orig_addr), sockaddr_arg) != 0) + if (bpf_probe_read_user(&orig_addr, sizeof(orig_addr), uservaddr) != 0) return 0; mapped_addr = bpf_map_lookup_elem(&dnat_map, &orig_addr); if (mapped_addr != NULL) { memcpy(&new_addr, mapped_addr, sizeof(new_addr)); - bpf_probe_write_user(sockaddr_arg, &new_addr, + bpf_probe_write_user(uservaddr, &new_addr, sizeof(new_addr)); } return 0; diff --git a/samples/bpf/test_probe_write_user_user.c b/samples/bpf/test_probe_write_user_user.c index 00ccfb834e45..2a539aec4116 100644 --- a/samples/bpf/test_probe_write_user_user.c +++ b/samples/bpf/test_probe_write_user_user.c @@ -24,7 +24,7 @@ int main(int ac, char **argv) mapped_addr_in = (struct sockaddr_in *)&mapped_addr; tmp_addr_in = (struct sockaddr_in *)&tmp_addr; - snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + snprintf(filename, sizeof(filename), "%s.bpf.o", argv[0]); obj = bpf_object__open_file(filename, NULL); if 
(libbpf_get_error(obj)) { fprintf(stderr, "ERROR: opening BPF object file failed\n"); diff --git a/samples/bpf/trace_common.h b/samples/bpf/trace_common.h deleted file mode 100644 index 8cb5400aed1f..000000000000 --- a/samples/bpf/trace_common.h +++ /dev/null @@ -1,13 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -#ifndef __TRACE_COMMON_H -#define __TRACE_COMMON_H - -#ifdef __x86_64__ -#define SYSCALL(SYS) "__x64_" __stringify(SYS) -#elif defined(__s390x__) -#define SYSCALL(SYS) "__s390x_" __stringify(SYS) -#else -#define SYSCALL(SYS) __stringify(SYS) -#endif - -#endif diff --git a/samples/bpf/trace_output_kern.c b/samples/bpf/trace_output.bpf.c index b64815af0943..565a73b51b04 100644 --- a/samples/bpf/trace_output_kern.c +++ b/samples/bpf/trace_output.bpf.c @@ -1,8 +1,6 @@ -#include <linux/ptrace.h> +#include "vmlinux.h" #include <linux/version.h> -#include <uapi/linux/bpf.h> #include <bpf/bpf_helpers.h> -#include "trace_common.h" struct { __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); @@ -11,7 +9,7 @@ struct { __uint(max_entries, 2); } my_map SEC(".maps"); -SEC("kprobe/" SYSCALL(sys_write)) +SEC("ksyscall/write") int bpf_prog1(struct pt_regs *ctx) { struct S { diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c index 371732f9cf8e..d316fd2c8e24 100644 --- a/samples/bpf/trace_output_user.c +++ b/samples/bpf/trace_output_user.c @@ -51,7 +51,7 @@ int main(int argc, char **argv) char filename[256]; FILE *f; - snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + snprintf(filename, sizeof(filename), "%s.bpf.o", argv[0]); obj = bpf_object__open_file(filename, NULL); if (libbpf_get_error(obj)) { fprintf(stderr, "ERROR: opening BPF object file failed\n"); diff --git a/samples/bpf/tracex2_kern.c b/samples/bpf/tracex2.bpf.c index 93e0b7680b4f..0a5c75b367be 100644 --- a/samples/bpf/tracex2_kern.c +++ b/samples/bpf/tracex2.bpf.c @@ -4,13 +4,11 @@ * modify it under the terms of version 2 of the GNU General Public * License as published by the 
Free Software Foundation. */ -#include <linux/skbuff.h> -#include <linux/netdevice.h> +#include "vmlinux.h" #include <linux/version.h> -#include <uapi/linux/bpf.h> #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -#include "trace_common.h" +#include <bpf/bpf_core_read.h> struct { __uint(type, BPF_MAP_TYPE_HASH); @@ -78,15 +76,14 @@ struct { __uint(max_entries, 1024); } my_hist_map SEC(".maps"); -SEC("kprobe/" SYSCALL(sys_write)) -int bpf_prog3(struct pt_regs *ctx) +SEC("ksyscall/write") +int BPF_KSYSCALL(bpf_prog3, unsigned int fd, const char *buf, size_t count) { - long write_size = PT_REGS_PARM3(ctx); long init_val = 1; long *value; struct hist_key key; - key.index = log2l(write_size); + key.index = log2l(count); key.pid_tgid = bpf_get_current_pid_tgid(); key.uid_gid = bpf_get_current_uid_gid(); bpf_get_current_comm(&key.comm, sizeof(key.comm)); diff --git a/samples/bpf/tracex2_user.c b/samples/bpf/tracex2_user.c index 089e408abd7a..2131f1648cf1 100644 --- a/samples/bpf/tracex2_user.c +++ b/samples/bpf/tracex2_user.c @@ -123,7 +123,7 @@ int main(int ac, char **argv) int i, j = 0; FILE *f; - snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + snprintf(filename, sizeof(filename), "%s.bpf.o", argv[0]); obj = bpf_object__open_file(filename, NULL); if (libbpf_get_error(obj)) { fprintf(stderr, "ERROR: opening BPF object file failed\n"); diff --git a/samples/bpf/tracex4_user.c b/samples/bpf/tracex4_user.c index 227b05a0bc88..dee8f0a091ba 100644 --- a/samples/bpf/tracex4_user.c +++ b/samples/bpf/tracex4_user.c @@ -51,7 +51,7 @@ int main(int ac, char **argv) struct bpf_program *prog; struct bpf_object *obj; char filename[256]; - int map_fd, i, j = 0; + int map_fd, j = 0; snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); obj = bpf_object__open_file(filename, NULL); @@ -82,7 +82,7 @@ int main(int ac, char **argv) j++; } - for (i = 0; ; i++) { + while (1) { print_old_objects(map_fd); sleep(1); } diff --git a/tools/bpf/bpftool/Makefile 
b/tools/bpf/bpftool/Makefile
index 787b857d3fb5..313fd1b09189 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -289,3 +289,6 @@ FORCE:
 .PHONY: all FORCE bootstrap clean install-bin install uninstall
 .PHONY: doc doc-clean doc-install doc-uninstall
 .DEFAULT_GOAL := all
+
+# Delete partially updated (corrupted) files on error
+.DELETE_ON_ERROR:
diff --git a/tools/bpf/resolve_btfids/Makefile b/tools/bpf/resolve_btfids/Makefile
index 19a3112e271a..f7375a119f54 100644
--- a/tools/bpf/resolve_btfids/Makefile
+++ b/tools/bpf/resolve_btfids/Makefile
@@ -56,13 +56,17 @@ $(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(LIBBPF_OU
 		    DESTDIR=$(LIBBPF_DESTDIR) prefix= EXTRA_CFLAGS="$(CFLAGS)" \
 		    $(abspath $@) install_headers
 
+LIBELF_FLAGS := $(shell $(HOSTPKG_CONFIG) libelf --cflags 2>/dev/null)
+LIBELF_LIBS  := $(shell $(HOSTPKG_CONFIG) libelf --libs 2>/dev/null || echo -lelf)
+
 CFLAGS += -g \
           -I$(srctree)/tools/include \
           -I$(srctree)/tools/include/uapi \
           -I$(LIBBPF_INCLUDE) \
-          -I$(SUBCMD_SRC)
+          -I$(SUBCMD_SRC) \
+          $(LIBELF_FLAGS)
 
-LIBS = -lelf -lz
+LIBS = $(LIBELF_LIBS) -lz
 
 export srctree OUTPUT CFLAGS Q
 include $(srctree)/tools/build/Makefile.include
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 464ca3f01fe7..bc1a3d232ae4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2001,6 +2001,9 @@ union bpf_attr {
  *			sending the packet. This flag was added for GRE
  *			encapsulation, but might be used with other protocols
  *			as well in the future.
+ *		**BPF_F_NO_TUNNEL_KEY**
+ *			Add a flag to tunnel metadata indicating that no tunnel
+ *			key should be set in the resulting tunnel header.
  *
  * 		Here is a typical usage on the transmit path:
  *
@@ -5764,6 +5767,7 @@ enum {
 	BPF_F_ZERO_CSUM_TX		= (1ULL << 1),
 	BPF_F_DONT_FRAGMENT		= (1ULL << 2),
 	BPF_F_SEQ_NUMBER		= (1ULL << 3),
+	BPF_F_NO_TUNNEL_KEY		= (1ULL << 4),
 };
 
 /* BPF_FUNC_skb_get_tunnel_key flags. */
diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h
index 2972dc25ff72..bdb0f6b5be84 100644
--- a/tools/lib/bpf/bpf_tracing.h
+++ b/tools/lib/bpf/bpf_tracing.h
@@ -32,6 +32,9 @@
 #elif defined(__TARGET_ARCH_arc)
 	#define bpf_target_arc
 	#define bpf_target_defined
+#elif defined(__TARGET_ARCH_loongarch)
+	#define bpf_target_loongarch
+	#define bpf_target_defined
 #else
 
 /* Fall back to what the compiler says */
@@ -62,6 +65,9 @@
 #elif defined(__arc__)
 	#define bpf_target_arc
 	#define bpf_target_defined
+#elif defined(__loongarch__)
+	#define bpf_target_loongarch
+	#define bpf_target_defined
 #endif /* no compiler target */
 
 #endif
@@ -137,7 +143,7 @@ struct pt_regs___s390 {
 #define __PT_PARM3_REG gprs[4]
 #define __PT_PARM4_REG gprs[5]
 #define __PT_PARM5_REG gprs[6]
-#define __PT_RET_REG grps[14]
+#define __PT_RET_REG gprs[14]
 #define __PT_FP_REG gprs[11]	/* Works only with CONFIG_FRAME_POINTER */
 #define __PT_RC_REG gprs[2]
 #define __PT_SP_REG gprs[15]
@@ -258,6 +264,23 @@ struct pt_regs___arm64 {
 /* arc does not select ARCH_HAS_SYSCALL_WRAPPER. */
 #define PT_REGS_SYSCALL_REGS(ctx) ctx
 
+#elif defined(bpf_target_loongarch)
+
+/* https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html */
+
+#define __PT_PARM1_REG regs[4]
+#define __PT_PARM2_REG regs[5]
+#define __PT_PARM3_REG regs[6]
+#define __PT_PARM4_REG regs[7]
+#define __PT_PARM5_REG regs[8]
+#define __PT_RET_REG regs[1]
+#define __PT_FP_REG regs[22]
+#define __PT_RC_REG regs[4]
+#define __PT_SP_REG regs[3]
+#define __PT_IP_REG csr_era
+/* loongarch does not select ARCH_HAS_SYSCALL_WRAPPER.
*/ +#define PT_REGS_SYSCALL_REGS(ctx) ctx + #endif #if defined(bpf_target_defined) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 71e165b09ed5..64841117fbb2 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -688,8 +688,21 @@ int btf__align_of(const struct btf *btf, __u32 id) if (align <= 0) return libbpf_err(align); max_align = max(max_align, align); + + /* if field offset isn't aligned according to field + * type's alignment, then struct must be packed + */ + if (btf_member_bitfield_size(t, i) == 0 && + (m->offset % (8 * align)) != 0) + return 1; } + /* if struct/union size isn't a multiple of its alignment, + * then struct must be packed + */ + if ((t->size % max_align) != 0) + return 1; + return max_align; } default: @@ -990,7 +1003,8 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, err = 0; if (!btf_data) { - err = -ENOENT; + pr_warn("failed to find '%s' ELF section in %s\n", BTF_ELF_SEC, path); + err = -ENODATA; goto done; } btf = btf_new(btf_data->d_buf, btf_data->d_size, base_btf); diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c index deb2bc9a0a7b..580985ee5545 100644 --- a/tools/lib/bpf/btf_dump.c +++ b/tools/lib/bpf/btf_dump.c @@ -13,6 +13,7 @@ #include <ctype.h> #include <endian.h> #include <errno.h> +#include <limits.h> #include <linux/err.h> #include <linux/btf.h> #include <linux/kernel.h> @@ -833,14 +834,9 @@ static bool btf_is_struct_packed(const struct btf *btf, __u32 id, const struct btf_type *t) { const struct btf_member *m; - int align, i, bit_sz; + int max_align = 1, align, i, bit_sz; __u16 vlen; - align = btf__align_of(btf, id); - /* size of a non-packed struct has to be a multiple of its alignment*/ - if (align && t->size % align) - return true; - m = btf_members(t); vlen = btf_vlen(t); /* all non-bitfield fields have to be naturally aligned */ @@ -849,8 +845,11 @@ static bool btf_is_struct_packed(const struct btf *btf, __u32 id, bit_sz = btf_member_bitfield_size(t, i); if 
(align && bit_sz == 0 && m->offset % (8 * align) != 0) return true; + max_align = max(align, max_align); } - + /* size of a non-packed struct has to be a multiple of its alignment */ + if (t->size % max_align != 0) + return true; /* * if original struct was marked as packed, but its layout is * naturally aligned, we'll detect that it's not packed @@ -858,44 +857,97 @@ static bool btf_is_struct_packed(const struct btf *btf, __u32 id, return false; } -static int chip_away_bits(int total, int at_most) -{ - return total % at_most ? : at_most; -} - static void btf_dump_emit_bit_padding(const struct btf_dump *d, - int cur_off, int m_off, int m_bit_sz, - int align, int lvl) + int cur_off, int next_off, int next_align, + bool in_bitfield, int lvl) { - int off_diff = m_off - cur_off; - int ptr_bits = d->ptr_sz * 8; + const struct { + const char *name; + int bits; + } pads[] = { + {"long", d->ptr_sz * 8}, {"int", 32}, {"short", 16}, {"char", 8} + }; + int new_off, pad_bits, bits, i; + const char *pad_type; + + if (cur_off >= next_off) + return; /* no gap */ + + /* For filling out padding we want to take advantage of + * natural alignment rules to minimize unnecessary explicit + * padding. First, we find the largest type (among long, int, + * short, or char) that can be used to force naturally aligned + * boundary. Once determined, we'll use such type to fill in + * the remaining padding gap. In some cases we can rely on + * compiler filling some gaps, but sometimes we need to force + * alignment to close natural alignment with markers like + * `long: 0` (this is always the case for bitfields). Note + * that even if struct itself has, let's say 4-byte alignment + * (i.e., it only uses up to int-aligned types), using `long: + * X;` explicit padding doesn't actually change struct's + * overall alignment requirements, but compiler does take into + * account that type's (long, in this example) natural + * alignment requirements when adding implicit padding. 
We use + * this fact heavily and don't worry about ruining correct + * struct alignment requirement. + */ + for (i = 0; i < ARRAY_SIZE(pads); i++) { + pad_bits = pads[i].bits; + pad_type = pads[i].name; - if (off_diff <= 0) - /* no gap */ - return; - if (m_bit_sz == 0 && off_diff < align * 8) - /* natural padding will take care of a gap */ - return; + new_off = roundup(cur_off, pad_bits); + if (new_off <= next_off) + break; + } - while (off_diff > 0) { - const char *pad_type; - int pad_bits; - - if (ptr_bits > 32 && off_diff > 32) { - pad_type = "long"; - pad_bits = chip_away_bits(off_diff, ptr_bits); - } else if (off_diff > 16) { - pad_type = "int"; - pad_bits = chip_away_bits(off_diff, 32); - } else if (off_diff > 8) { - pad_type = "short"; - pad_bits = chip_away_bits(off_diff, 16); - } else { - pad_type = "char"; - pad_bits = chip_away_bits(off_diff, 8); + if (new_off > cur_off && new_off <= next_off) { + /* We need explicit `<type>: 0` aligning mark if next + * field is right on alignment offset and its + * alignment requirement is less strict than <type>'s + * alignment (so compiler won't naturally align to the + * offset we expect), or if subsequent `<type>: X`, + * will actually completely fit in the remaining hole, + * making compiler basically ignore `<type>: X` + * completely. + */ + if (in_bitfield || + (new_off == next_off && roundup(cur_off, next_align * 8) != new_off) || + (new_off != next_off && next_off - new_off <= new_off - cur_off)) + /* but for bitfields we'll emit explicit bit count */ + btf_dump_printf(d, "\n%s%s: %d;", pfx(lvl), pad_type, + in_bitfield ? new_off - cur_off : 0); + cur_off = new_off; + } + + /* Now we know we start at naturally aligned offset for a chosen + * padding type (long, int, short, or char), and so the rest is just + * a straightforward filling of remaining padding gap with full + * `<type>: sizeof(<type>);` markers, except for the last one, which + * might need smaller than sizeof(<type>) padding. 
+ */ + while (cur_off != next_off) { + bits = min(next_off - cur_off, pad_bits); + if (bits == pad_bits) { + btf_dump_printf(d, "\n%s%s: %d;", pfx(lvl), pad_type, pad_bits); + cur_off += bits; + continue; + } + /* For the remainder padding that doesn't cover entire + * pad_type bit length, we pick the smallest necessary type. + * This is pure aesthetics, we could have just used `long`, + * but having smallest necessary one communicates better the + * scale of the padding gap. + */ + for (i = ARRAY_SIZE(pads) - 1; i >= 0; i--) { + pad_type = pads[i].name; + pad_bits = pads[i].bits; + if (pad_bits < bits) + continue; + + btf_dump_printf(d, "\n%s%s: %d;", pfx(lvl), pad_type, bits); + cur_off += bits; + break; } - btf_dump_printf(d, "\n%s%s: %d;", pfx(lvl), pad_type, pad_bits); - off_diff -= pad_bits; } } @@ -915,9 +967,11 @@ static void btf_dump_emit_struct_def(struct btf_dump *d, { const struct btf_member *m = btf_members(t); bool is_struct = btf_is_struct(t); - int align, i, packed, off = 0; + bool packed, prev_bitfield = false; + int align, i, off = 0; __u16 vlen = btf_vlen(t); + align = btf__align_of(d->btf, id); packed = is_struct ? btf_is_struct_packed(d->btf, id, t) : 0; btf_dump_printf(d, "%s%s%s {", @@ -927,41 +981,47 @@ static void btf_dump_emit_struct_def(struct btf_dump *d, for (i = 0; i < vlen; i++, m++) { const char *fname; - int m_off, m_sz; + int m_off, m_sz, m_align; + bool in_bitfield; fname = btf_name_of(d, m->name_off); m_sz = btf_member_bitfield_size(t, i); m_off = btf_member_bit_offset(t, i); - align = packed ? 1 : btf__align_of(d->btf, m->type); + m_align = packed ? 
1 : btf__align_of(d->btf, m->type); + + in_bitfield = prev_bitfield && m_sz != 0; - btf_dump_emit_bit_padding(d, off, m_off, m_sz, align, lvl + 1); + btf_dump_emit_bit_padding(d, off, m_off, m_align, in_bitfield, lvl + 1); btf_dump_printf(d, "\n%s", pfx(lvl + 1)); btf_dump_emit_type_decl(d, m->type, fname, lvl + 1); if (m_sz) { btf_dump_printf(d, ": %d", m_sz); off = m_off + m_sz; + prev_bitfield = true; } else { m_sz = max((__s64)0, btf__resolve_size(d->btf, m->type)); off = m_off + m_sz * 8; + prev_bitfield = false; } + btf_dump_printf(d, ";"); } /* pad at the end, if necessary */ - if (is_struct) { - align = packed ? 1 : btf__align_of(d->btf, id); - btf_dump_emit_bit_padding(d, off, t->size * 8, 0, align, - lvl + 1); - } + if (is_struct) + btf_dump_emit_bit_padding(d, off, t->size * 8, align, false, lvl + 1); /* * Keep `struct empty {}` on a single line, * only print newline when there are regular or padding fields. */ - if (vlen || t->size) + if (vlen || t->size) { btf_dump_printf(d, "\n"); - btf_dump_printf(d, "%s}", pfx(lvl)); + btf_dump_printf(d, "%s}", pfx(lvl)); + } else { + btf_dump_printf(d, "}"); + } if (packed) btf_dump_printf(d, " __attribute__((packed))"); } @@ -1073,6 +1133,43 @@ static void btf_dump_emit_enum_def(struct btf_dump *d, __u32 id, else btf_dump_emit_enum64_val(d, t, lvl, vlen); btf_dump_printf(d, "\n%s}", pfx(lvl)); + + /* special case enums with special sizes */ + if (t->size == 1) { + /* one-byte enums can be forced with mode(byte) attribute */ + btf_dump_printf(d, " __attribute__((mode(byte)))"); + } else if (t->size == 8 && d->ptr_sz == 8) { + /* enum can be 8-byte sized if one of the enumerator values + * doesn't fit in 32-bit integer, or by adding mode(word) + * attribute (but probably only on 64-bit architectures); do + * our best here to try to satisfy the contract without adding + * unnecessary attributes + */ + bool needs_word_mode; + + if (btf_is_enum(t)) { + /* enum can't represent 64-bit values, so we need word mode */ + 
needs_word_mode = true; + } else { + /* enum64 needs mode(word) if none of its values has + * non-zero upper 32-bits (which means that all values + * fit in 32-bit integers and won't cause compiler to + * bump enum to be 64-bit naturally + */ + int i; + + needs_word_mode = true; + for (i = 0; i < vlen; i++) { + if (btf_enum64(t)[i].val_hi32 != 0) { + needs_word_mode = false; + break; + } + } + } + if (needs_word_mode) + btf_dump_printf(d, " __attribute__((mode(word)))"); + } + } static void btf_dump_emit_fwd_def(struct btf_dump *d, __u32 id, diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 2a82f49ce16f..a5c67a3c93c5 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -9903,7 +9903,7 @@ static int perf_event_open_probe(bool uprobe, bool retprobe, const char *name, char errmsg[STRERR_BUFSIZE]; int type, pfd; - if (ref_ctr_off >= (1ULL << PERF_UPROBE_REF_CTR_OFFSET_BITS)) + if ((__u64)ref_ctr_off >= (1ULL << PERF_UPROBE_REF_CTR_OFFSET_BITS)) return -EINVAL; memset(&attr, 0, attr_sz); diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index eee883f007f9..898db26e42e9 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -96,6 +96,12 @@ enum libbpf_print_level { typedef int (*libbpf_print_fn_t)(enum libbpf_print_level level, const char *, va_list ap); +/** + * @brief **libbpf_set_print()** sets user-provided log callback function to + * be used for libbpf warnings and informational messages. + * @param fn The log print function. If NULL, libbpf won't print anything. + * @return Pointer to old print function. + */ LIBBPF_API libbpf_print_fn_t libbpf_set_print(libbpf_print_fn_t fn); /* Hide internal to user */ @@ -174,6 +180,14 @@ struct bpf_object_open_opts { }; #define bpf_object_open_opts__last_field kernel_log_level +/** + * @brief **bpf_object__open()** creates a bpf_object by opening + * the BPF ELF object file pointed to by the passed path and loading it + * into memory. + * @param path BPF object file path. 
+ * @return pointer to the new bpf_object; or NULL is returned on error, + * error code is stored in errno + */ LIBBPF_API struct bpf_object *bpf_object__open(const char *path); /** @@ -203,10 +217,21 @@ LIBBPF_API struct bpf_object * bpf_object__open_mem(const void *obj_buf, size_t obj_buf_sz, const struct bpf_object_open_opts *opts); -/* Load/unload object into/from kernel */ +/** + * @brief **bpf_object__load()** loads BPF object into kernel. + * @param obj Pointer to a valid BPF object instance returned by + * **bpf_object__open*()** APIs + * @return 0, on success; negative error code, otherwise, error code is + * stored in errno + */ LIBBPF_API int bpf_object__load(struct bpf_object *obj); -LIBBPF_API void bpf_object__close(struct bpf_object *object); +/** + * @brief **bpf_object__close()** closes a BPF object and releases all + * resources. + * @param obj Pointer to a valid BPF object + */ +LIBBPF_API void bpf_object__close(struct bpf_object *obj); /* pin_maps and unpin_maps can both be called with a NULL path, in which case * they will use the pin_path attribute of each map (and ignore all maps that diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index 71bf5691a689..11c36a3c1a9f 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -382,3 +382,6 @@ LIBBPF_1.1.0 { user_ring_buffer__reserve_blocking; user_ring_buffer__submit; } LIBBPF_1.0.0; + +LIBBPF_1.2.0 { +} LIBBPF_1.1.0; diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c index 96f67a772a1b..6b180172ec6b 100644 --- a/tools/lib/bpf/libbpf_errno.c +++ b/tools/lib/bpf/libbpf_errno.c @@ -39,14 +39,14 @@ static const char *libbpf_strerror_table[NR_ERRNO] = { int libbpf_strerror(int err, char *buf, size_t size) { + int ret; + if (!buf || !size) return libbpf_err(-EINVAL); err = err > 0 ? 
err : -err; if (err < __LIBBPF_ERRNO__START) { - int ret; - ret = strerror_r(err, buf, size); buf[size - 1] = '\0'; return libbpf_err_errno(ret); @@ -56,12 +56,20 @@ int libbpf_strerror(int err, char *buf, size_t size) const char *msg; msg = libbpf_strerror_table[ERRNO_OFFSET(err)]; - snprintf(buf, size, "%s", msg); + ret = snprintf(buf, size, "%s", msg); buf[size - 1] = '\0'; + /* The length of the buf and msg is positive. + * A negative number may be returned only when the + * size exceeds INT_MAX. Not likely to appear. + */ + if (ret >= size) + return libbpf_err(-ERANGE); return 0; } - snprintf(buf, size, "Unknown libbpf error %d", err); + ret = snprintf(buf, size, "Unknown libbpf error %d", err); buf[size - 1] = '\0'; + if (ret >= size) + return libbpf_err(-ERANGE); return libbpf_err(-ENOENT); } diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 377642ff51fc..e4d05662a96c 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -543,6 +543,7 @@ static inline int ensure_good_fd(int fd) fd = fcntl(fd, F_DUPFD_CLOEXEC, 3); saved_errno = errno; close(old_fd); + errno = saved_errno; if (fd < 0) { pr_warn("failed to dup FD %d to FD > 2: %d\n", old_fd, -saved_errno); errno = saved_errno; diff --git a/tools/lib/bpf/libbpf_version.h b/tools/lib/bpf/libbpf_version.h index e944f5bce728..1fd2eeac5cfc 100644 --- a/tools/lib/bpf/libbpf_version.h +++ b/tools/lib/bpf/libbpf_version.h @@ -4,6 +4,6 @@ #define __LIBBPF_VERSION_H #define LIBBPF_MAJOR_VERSION 1 -#define LIBBPF_MINOR_VERSION 1 +#define LIBBPF_MINOR_VERSION 2 #endif /* __LIBBPF_VERSION_H */ diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x index 585fcf73c731..3efe091255bf 100644 --- a/tools/testing/selftests/bpf/DENYLIST.s390x +++ b/tools/testing/selftests/bpf/DENYLIST.s390x @@ -26,6 +26,7 @@ get_func_args_test # trampoline get_func_ip_test # get_func_ip_test__attach unexpected error: -524 (trampoline) 
get_stack_raw_tp # user_stack corrupted user stack (no backchain userspace) htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline) +jit_probe_mem # jit_probe_mem__open_and_load unexpected error: -524 (kfunc) kfree_skb # attach fentry unexpected error: -524 (trampoline) kfunc_call # 'bpf_prog_active': not found in kernel BTF (?) kfunc_dynptr_param # JIT does not support calling kernel function (kfunc) diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index c22c43bbee19..205e8c3c346a 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -626,3 +626,6 @@ EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) $(HOST_SCRATCH_DIR) \ liburandom_read.so) .PHONY: docs docs-clean + +# Delete partially updated (corrupted) files on error +.DELETE_ON_ERROR: diff --git a/tools/testing/selftests/bpf/prog_tests/jit_probe_mem.c b/tools/testing/selftests/bpf/prog_tests/jit_probe_mem.c new file mode 100644 index 000000000000..5639428607e6 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/jit_probe_mem.c @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
*/ +#include <test_progs.h> +#include <network_helpers.h> + +#include "jit_probe_mem.skel.h" + +void test_jit_probe_mem(void) +{ + LIBBPF_OPTS(bpf_test_run_opts, opts, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .repeat = 1, + ); + struct jit_probe_mem *skel; + int ret; + + skel = jit_probe_mem__open_and_load(); + if (!ASSERT_OK_PTR(skel, "jit_probe_mem__open_and_load")) + return; + + ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.test_jit_probe_mem), &opts); + ASSERT_OK(ret, "jit_probe_mem ret"); + ASSERT_OK(opts.retval, "jit_probe_mem opts.retval"); + ASSERT_EQ(skel->data->total_sum, 192, "jit_probe_mem total_sum"); + + jit_probe_mem__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/btf_dump_test_case_bitfields.c b/tools/testing/selftests/bpf/progs/btf_dump_test_case_bitfields.c index e5560a656030..e01690618e1e 100644 --- a/tools/testing/selftests/bpf/progs/btf_dump_test_case_bitfields.c +++ b/tools/testing/selftests/bpf/progs/btf_dump_test_case_bitfields.c @@ -53,7 +53,7 @@ struct bitfields_only_mixed_types { */ /* ------ END-EXPECTED-OUTPUT ------ */ struct bitfield_mixed_with_others { - long: 4; /* char is enough as a backing field */ + char: 4; /* char is enough as a backing field */ int a: 4; /* 8-bit implicit padding */ short b; /* combined with previous bitfield */ diff --git a/tools/testing/selftests/bpf/progs/btf_dump_test_case_packing.c b/tools/testing/selftests/bpf/progs/btf_dump_test_case_packing.c index e304b6204bd9..7998f27df7dd 100644 --- a/tools/testing/selftests/bpf/progs/btf_dump_test_case_packing.c +++ b/tools/testing/selftests/bpf/progs/btf_dump_test_case_packing.c @@ -58,7 +58,81 @@ union jump_code_union { } __attribute__((packed)); }; -/*------ END-EXPECTED-OUTPUT ------ */ +/* ----- START-EXPECTED-OUTPUT ----- */ +/* + *struct nested_packed_but_aligned_struct { + * int x1; + * int x2; + *}; + * + *struct outer_implicitly_packed_struct { + * char y1; + * struct nested_packed_but_aligned_struct y2; + *} 
__attribute__((packed)); + * + */ +/* ------ END-EXPECTED-OUTPUT ------ */ + +struct nested_packed_but_aligned_struct { + int x1; + int x2; +} __attribute__((packed)); + +struct outer_implicitly_packed_struct { + char y1; + struct nested_packed_but_aligned_struct y2; +}; +/* ----- START-EXPECTED-OUTPUT ----- */ +/* + *struct usb_ss_ep_comp_descriptor { + * char: 8; + * char bDescriptorType; + * char bMaxBurst; + * short wBytesPerInterval; + *}; + * + *struct usb_host_endpoint { + * long: 64; + * char: 8; + * struct usb_ss_ep_comp_descriptor ss_ep_comp; + * long: 0; + *} __attribute__((packed)); + * + */ +/* ------ END-EXPECTED-OUTPUT ------ */ + +struct usb_ss_ep_comp_descriptor { + char: 8; + char bDescriptorType; + char bMaxBurst; + int: 0; + short wBytesPerInterval; +} __attribute__((packed)); + +struct usb_host_endpoint { + long: 64; + char: 8; + struct usb_ss_ep_comp_descriptor ss_ep_comp; + long: 0; +}; + +/* ----- START-EXPECTED-OUTPUT ----- */ +struct nested_packed_struct { + int a; + char b; +} __attribute__((packed)); + +struct outer_nonpacked_struct { + short a; + struct nested_packed_struct b; +}; + +struct outer_packed_struct { + short a; + struct nested_packed_struct b; +} __attribute__((packed)); + +/* ------ END-EXPECTED-OUTPUT ------ */ int f(struct { struct packed_trailing_space _1; @@ -69,6 +143,10 @@ int f(struct { union union_is_never_packed _6; union union_does_not_need_packing _7; union jump_code_union _8; + struct outer_implicitly_packed_struct _9; + struct usb_host_endpoint _10; + struct outer_nonpacked_struct _11; + struct outer_packed_struct _12; } *_) { return 0; diff --git a/tools/testing/selftests/bpf/progs/btf_dump_test_case_padding.c b/tools/testing/selftests/bpf/progs/btf_dump_test_case_padding.c index 7cb522d22a66..79276fbe454a 100644 --- a/tools/testing/selftests/bpf/progs/btf_dump_test_case_padding.c +++ b/tools/testing/selftests/bpf/progs/btf_dump_test_case_padding.c @@ -19,7 +19,7 @@ struct padded_implicitly { /* *struct 
padded_explicitly { * int a; - * int: 32; + * long: 0; * int b; *}; * @@ -28,41 +28,28 @@ struct padded_implicitly { struct padded_explicitly { int a; - int: 1; /* algo will explicitly pad with full 32 bits here */ + int: 1; /* algo will emit aligning `long: 0;` here */ int b; }; /* ----- START-EXPECTED-OUTPUT ----- */ -/* - *struct padded_a_lot { - * int a; - * long: 32; - * long: 64; - * long: 64; - * int b; - *}; - * - */ -/* ------ END-EXPECTED-OUTPUT ------ */ - struct padded_a_lot { int a; - /* 32 bit of implicit padding here, which algo will make explicit */ long: 64; long: 64; int b; }; +/* ------ END-EXPECTED-OUTPUT ------ */ + /* ----- START-EXPECTED-OUTPUT ----- */ /* *struct padded_cache_line { * int a; - * long: 32; * long: 64; * long: 64; * long: 64; * int b; - * long: 32; * long: 64; * long: 64; * long: 64; @@ -85,7 +72,7 @@ struct padded_cache_line { *struct zone { * int a; * short b; - * short: 16; + * long: 0; * struct zone_padding __pad__; *}; * @@ -108,6 +95,131 @@ struct padding_wo_named_members { long: 64; }; +struct padding_weird_1 { + int a; + long: 64; + short: 16; + short b; +}; + +/* ------ END-EXPECTED-OUTPUT ------ */ + +/* ----- START-EXPECTED-OUTPUT ----- */ +/* + *struct padding_weird_2 { + * long: 56; + * char a; + * long: 56; + * char b; + * char: 8; + *}; + * + */ +/* ------ END-EXPECTED-OUTPUT ------ */ +struct padding_weird_2 { + int: 32; /* these paddings will be collapsed into `long: 56;` */ + short: 16; + char: 8; + char a; + int: 32; /* these paddings will be collapsed into `long: 56;` */ + short: 16; + char: 8; + char b; + char: 8; +}; + +/* ----- START-EXPECTED-OUTPUT ----- */ +struct exact_1byte { + char x; +}; + +struct padded_1byte { + char: 8; +}; + +struct exact_2bytes { + short x; +}; + +struct padded_2bytes { + short: 16; +}; + +struct exact_4bytes { + int x; +}; + +struct padded_4bytes { + int: 32; +}; + +struct exact_8bytes { + long x; +}; + +struct padded_8bytes { + long: 64; +}; + +struct ff_periodic_effect { + 
int: 32;
+	short magnitude;
+	long: 0;
+	short phase;
+	long: 0;
+	int: 32;
+	int custom_len;
+	short *custom_data;
+};
+
+struct ib_wc {
+	long: 64;
+	long: 64;
+	int: 32;
+	int byte_len;
+	void *qp;
+	union {} ex;
+	long: 64;
+	int slid;
+	int wc_flags;
+	long: 64;
+	char smac[6];
+	long: 0;
+	char network_hdr_type;
+};
+
+struct acpi_object_method {
+	long: 64;
+	char: 8;
+	char type;
+	short reference_count;
+	char flags;
+	short: 0;
+	char: 8;
+	char sync_level;
+	long: 64;
+	void *node;
+	void *aml_start;
+	union {} dispatch;
+	long: 64;
+	int aml_length;
+};
+
+struct nested_unpacked {
+	int x;
+};
+
+struct nested_packed {
+	struct nested_unpacked a;
+	char c;
+} __attribute__((packed));
+
+struct outer_mixed_but_unpacked {
+	struct nested_packed b1;
+	short a1;
+	struct nested_packed b2;
+};
+
 /* ------ END-EXPECTED-OUTPUT ------ */
 
 int f(struct {
@@ -117,6 +229,20 @@ int f(struct {
 	struct padded_cache_line _4;
 	struct zone _5;
 	struct padding_wo_named_members _6;
+	struct padding_weird_1 _7;
+	struct padding_weird_2 _8;
+	struct exact_1byte _100;
+	struct padded_1byte _101;
+	struct exact_2bytes _102;
+	struct padded_2bytes _103;
+	struct exact_4bytes _104;
+	struct padded_4bytes _105;
+	struct exact_8bytes _106;
+	struct padded_8bytes _107;
+	struct ff_periodic_effect _200;
+	struct ib_wc _201;
+	struct acpi_object_method _202;
+	struct outer_mixed_but_unpacked _203;
 } *_)
 {
 	return 0;
diff --git a/tools/testing/selftests/bpf/progs/btf_dump_test_case_syntax.c b/tools/testing/selftests/bpf/progs/btf_dump_test_case_syntax.c
index 4ee4748133fe..26fffb02ed10 100644
--- a/tools/testing/selftests/bpf/progs/btf_dump_test_case_syntax.c
+++ b/tools/testing/selftests/bpf/progs/btf_dump_test_case_syntax.c
@@ -25,6 +25,39 @@ typedef enum {
 	H = 2,
 } e3_t;
 
+/* ----- START-EXPECTED-OUTPUT ----- */
+/*
+ *enum e_byte {
+ *	EBYTE_1 = 0,
+ *	EBYTE_2 = 1,
+ *} __attribute__((mode(byte)));
+ *
+ */
+/* ----- END-EXPECTED-OUTPUT ----- */
+enum e_byte {
+	EBYTE_1,
+	EBYTE_2,
+} __attribute__((mode(byte)));
+
+/* ----- START-EXPECTED-OUTPUT ----- */
+/*
+ *enum e_word {
+ *	EWORD_1 = 0LL,
+ *	EWORD_2 = 1LL,
+ *} __attribute__((mode(word)));
+ *
+ */
+/* ----- END-EXPECTED-OUTPUT ----- */
+enum e_word {
+	EWORD_1,
+	EWORD_2,
+} __attribute__((mode(word))); /* force to use 8-byte backing for this enum */
+
+/* ----- START-EXPECTED-OUTPUT ----- */
+enum e_big {
+	EBIG_1 = 1000000000000ULL,
+};
+
 typedef int int_t;
 
 typedef volatile const int * volatile const crazy_ptr_t;
@@ -224,6 +257,9 @@ struct root_struct {
 	enum e2 _2;
 	e2_t _2_1;
 	e3_t _2_2;
+	enum e_byte _100;
+	enum e_word _101;
+	enum e_big _102;
 	struct struct_w_typedefs _3;
 	anon_struct_t _7;
 	struct struct_fwd *_8;
diff --git a/tools/testing/selftests/bpf/progs/jit_probe_mem.c b/tools/testing/selftests/bpf/progs/jit_probe_mem.c
new file mode 100644
index 000000000000..2d2e61470794
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/jit_probe_mem.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+static struct prog_test_ref_kfunc __kptr_ref *v;
+long total_sum = -1;
+
+extern struct prog_test_ref_kfunc *bpf_kfunc_call_test_acquire(unsigned long *sp) __ksym;
+extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) __ksym;
+
+SEC("tc")
+int test_jit_probe_mem(struct __sk_buff *ctx)
+{
+	struct prog_test_ref_kfunc *p;
+	unsigned long zero = 0, sum;
+
+	p = bpf_kfunc_call_test_acquire(&zero);
+	if (!p)
+		return 1;
+
+	p = bpf_kptr_xchg(&v, p);
+	if (p)
+		goto release_out;
+
+	/* Direct map value access of kptr, should be PTR_UNTRUSTED */
+	p = v;
+	if (!p)
+		return 1;
+
+	asm volatile (
+		"r9 = %[p];"
+		"%[sum] = 0;"
+
+		/* r8 = p->a */
+		"r8 = *(u32 *)(r9 + 0);"
+		"%[sum] += r8;"
+
+		/* r8 = p->b */
+		"r8 = *(u32 *)(r9 + 4);"
+		"%[sum] += r8;"
+
+		"r9 += 8;"
+		/* r9 = p->a */
+		"r9 = *(u32 *)(r9 - 8);"
+		"%[sum] += r9;"
+
+		: [sum] "=r"(sum)
+		: [p] "r"(p)
+		: "r8", "r9"
+	);
+
+	total_sum = sum;
+	return 0;
+release_out:
+	bpf_kfunc_call_test_release(p);
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index 98af55f0bcd3..508da4a23c4f 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -82,6 +82,27 @@ int gre_set_tunnel(struct __sk_buff *skb)
 }
 
 SEC("tc")
+int gre_set_tunnel_no_key(struct __sk_buff *skb)
+{
+	int ret;
+	struct bpf_tunnel_key key;
+
+	__builtin_memset(&key, 0x0, sizeof(key));
+	key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
+	key.tunnel_ttl = 64;
+
+	ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key),
+				     BPF_F_ZERO_CSUM_TX | BPF_F_SEQ_NUMBER |
+				     BPF_F_NO_TUNNEL_KEY);
+	if (ret < 0) {
+		log_err(ret);
+		return TC_ACT_SHOT;
+	}
+
+	return TC_ACT_OK;
+}
+
+SEC("tc")
 int gre_get_tunnel(struct __sk_buff *skb)
 {
 	int ret;
diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
index 2eaedc1d9ed3..06857b689c11 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -66,15 +66,20 @@ config_device()
 
 add_gre_tunnel()
 {
+	tun_key=
+	if [ -n "$1" ]; then
+		tun_key="key $1"
+	fi
+
 	# at_ns0 namespace
 	ip netns exec at_ns0 \
-	ip link add dev $DEV_NS type $TYPE seq key 2 \
+	ip link add dev $DEV_NS type $TYPE seq $tun_key \
 		local 172.16.1.100 remote 172.16.1.200
 	ip netns exec at_ns0 ip link set dev $DEV_NS up
 	ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
 
 	# root namespace
-	ip link add dev $DEV type $TYPE key 2 external
+	ip link add dev $DEV type $TYPE $tun_key external
 	ip link set dev $DEV up
 	ip addr add dev $DEV 10.1.1.200/24
 }
@@ -238,7 +243,7 @@ test_gre()
 
 	check $TYPE
 	config_device
-	add_gre_tunnel
+	add_gre_tunnel 2
 	attach_bpf $DEV gre_set_tunnel gre_get_tunnel
 	ping $PING_ARG 10.1.1.100
 	check_err $?
@@ -253,6 +258,30 @@ test_gre()
 	echo -e ${GREEN}"PASS: $TYPE"${NC}
 }
 
+test_gre_no_tunnel_key()
+{
+	TYPE=gre
+	DEV_NS=gre00
+	DEV=gre11
+	ret=0
+
+	check $TYPE
+	config_device
+	add_gre_tunnel
+	attach_bpf $DEV gre_set_tunnel_no_key gre_get_tunnel
+	ping $PING_ARG 10.1.1.100
+	check_err $?
+	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
+	check_err $?
+	cleanup
+
+	if [ $ret -ne 0 ]; then
+		echo -e ${RED}"FAIL: $TYPE"${NC}
+		return 1
+	fi
+	echo -e ${GREEN}"PASS: $TYPE"${NC}
+}
+
 test_ip6gre()
 {
 	TYPE=ip6gre
@@ -589,6 +618,7 @@ cleanup()
 	ip link del ipip6tnl11 2> /dev/null
 	ip link del ip6ip6tnl11 2> /dev/null
 	ip link del gretap11 2> /dev/null
+	ip link del gre11 2> /dev/null
 	ip link del ip6gre11 2> /dev/null
 	ip link del ip6gretap11 2> /dev/null
 	ip link del geneve11 2> /dev/null
@@ -641,6 +671,10 @@ bpf_tunnel_test()
 	test_gre
 	errors=$(( $errors + $? ))
 
+	echo "Testing GRE tunnel (without tunnel keys)..."
+	test_gre_no_tunnel_key
+	errors=$(( $errors + $? ))
+
 	echo "Testing IP6GRE tunnel..."
 	test_ip6gre
 	errors=$(( $errors + $? )) |
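The outer_mixed_but_unpacked case added to btf_dump_test_case_padding.c is easier to follow with the layout spelled out. The sketch below is not part of the patch: it copies the three struct definitions from the expected output and checks their sizes and offsets at compile time. It assumes GCC or Clang (for `__attribute__((packed))`), a C11 compiler, and a typical ABI with 4-byte int and 2-byte short; the interpretation in the comments is a reading of the test, not a statement from the commit.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */

/* Definitions copied from the selftest's expected output. */
struct nested_unpacked {
	int x;
};

struct nested_packed {
	struct nested_unpacked a;
	char c;
} __attribute__((packed));

struct outer_mixed_but_unpacked {
	struct nested_packed b1;
	short a1;
	struct nested_packed b2;
};

/* Packing removes the 3 bytes of tail padding and lowers alignment to 1. */
static_assert(sizeof(struct nested_packed) == 5, "4-byte int + 1-byte char");
static_assert(_Alignof(struct nested_packed) == 1, "packed struct is byte-aligned");

/*
 * The outer struct is laid out naturally: a1 is 2-byte aligned, so one
 * implicit padding byte follows the 5-byte b1. Nothing here requires the
 * outer struct itself to be marked packed.
 */
static_assert(offsetof(struct outer_mixed_but_unpacked, a1) == 6, "a1 after 1 pad byte");
static_assert(offsetof(struct outer_mixed_but_unpacked, b2) == 8, "b2 right after a1");
static_assert(sizeof(struct outer_mixed_but_unpacked) == 14, "13 bytes rounded up to 2");
```

If any of the assumed type sizes differ on a given target, the static assertions fail at build time rather than silently describing the wrong layout.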