Diffstat (limited to 'Documentation/bpf')
 -rw-r--r--  Documentation/bpf/bpf_design_QA.rst                     25
 -rw-r--r--  Documentation/bpf/cpumasks.rst                         393
 -rw-r--r--  Documentation/bpf/graph_ds_impl.rst                    267
 -rw-r--r--  Documentation/bpf/index.rst                              1
 -rw-r--r--  Documentation/bpf/instruction-set.rst                  136
 -rw-r--r--  Documentation/bpf/kfuncs.rst                           219
 -rw-r--r--  Documentation/bpf/libbpf/libbpf_naming_convention.rst    6
 -rw-r--r--  Documentation/bpf/map_sockmap.rst                      498
 -rw-r--r--  Documentation/bpf/map_xskmap.rst                         2
 -rw-r--r--  Documentation/bpf/other.rst                              3
 -rw-r--r--  Documentation/bpf/ringbuf.rst                            4
 -rw-r--r--  Documentation/bpf/verifier.rst                         297
 12 files changed, 1790 insertions, 61 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index cec2371173d7..bfff0e7e37c2 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -208,6 +208,10 @@ data structures and compile with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.
+New BPF functionality is generally added through the use of kfuncs instead of
+new helpers. Kfuncs are not considered part of the stable API, and have their own
+lifecycle expectations as described in :ref:`BPF_kfunc_lifecycle_expectations`.
+
Q: Are tracepoints part of the stable ABI?
------------------------------------------
A: NO. Tracepoints are tied to internal implementation details hence they are
@@ -236,8 +240,8 @@ A: NO. Classic BPF programs are converted into extended BPF instructions.
Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
-A: NO. BPF programs can only call a set of helper functions which
-is defined for every program type.
+A: NO. BPF programs can only call specific functions exposed as BPF helpers or
+kfuncs. The set of available functions is defined for every program type.
Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
@@ -263,7 +267,12 @@ Q: New functionality via kernel modules?
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?
-A: NO.
+A: Yes, through kfuncs and kptrs.
+
+The core BPF functionality such as program types, maps, and helpers cannot be
+extended by kernel modules. However, modules can expose functionality to BPF
+programs by exporting kfuncs (which may return pointers to module-internal
+data structures as kptrs).
Q: Directly calling kernel function is an ABI?
----------------------------------------------
@@ -278,7 +287,8 @@ kernel functions have already been used by other kernel tcp
cc (congestion-control) implementations. If any of these kernel
functions has changed, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed. The same goes for the bpf
-programs and they have to be adjusted accordingly.
+programs and they have to be adjusted accordingly. See
+:ref:`BPF_kfunc_lifecycle_expectations` for details.
Q: Attaching to arbitrary kernel functions is an ABI?
-----------------------------------------------------
@@ -340,6 +350,7 @@ compatibility for these features?
A: NO.
-Unlike map value types, there are no stability guarantees for this case. The
-whole API to work with allocated objects and any support for special fields
-inside them is unstable (since it is exposed through kfuncs).
+Unlike map value types, the API to work with allocated objects and any support
+for special fields inside them is exposed through kfuncs, and thus has the same
+lifecycle expectations as the kfuncs themselves. See
+:ref:`BPF_kfunc_lifecycle_expectations` for details.
diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst
new file mode 100644
index 000000000000..24bef9cbbeee
--- /dev/null
+++ b/Documentation/bpf/cpumasks.rst
@@ -0,0 +1,393 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _cpumasks-header-label:
+
+==================
+BPF cpumask kfuncs
+==================
+
+1. Introduction
+===============
+
+``struct cpumask`` is a bitmap data structure in the kernel whose indices
+reflect the CPUs on the system. Commonly, cpumasks are used to track which CPUs
+a task is affinitized to, but they can also be used to e.g. track which cores
+are associated with a scheduling domain, which cores on a machine are idle,
+etc.
+
+BPF provides programs with a set of :ref:`kfuncs-header-label` that can be
+used to allocate, mutate, query, and free cpumasks.
+
+2. BPF cpumask objects
+======================
+
+There are two different types of cpumasks that can be used by BPF programs.
+
+2.1 ``struct bpf_cpumask *``
+----------------------------
+
+``struct bpf_cpumask *`` is a cpumask that is allocated by BPF, on behalf of a
+BPF program, and whose lifecycle is entirely controlled by BPF. These cpumasks
+are RCU-protected, can be mutated, can be used as kptrs, and can be safely cast
+to a ``struct cpumask *``.
+
+2.1.1 ``struct bpf_cpumask *`` lifecycle
+----------------------------------------
+
+A ``struct bpf_cpumask *`` is allocated, acquired, and released using the
+following functions:
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_create
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_acquire
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_release
+
+For example:
+
+.. code-block:: c
+
+ struct cpumask_map_value {
+ struct bpf_cpumask __kptr_ref *cpumask;
+ };
+
+ struct array_map {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct cpumask_map_value);
+ __uint(max_entries, 65536);
+ } cpumask_map SEC(".maps");
+
+ static int cpumask_map_insert(struct bpf_cpumask *mask, u32 pid)
+ {
+ struct cpumask_map_value local, *v;
+ long status;
+ struct bpf_cpumask *old;
+ u32 key = pid;
+
+ local.cpumask = NULL;
+ status = bpf_map_update_elem(&cpumask_map, &key, &local, 0);
+ if (status) {
+ bpf_cpumask_release(mask);
+ return status;
+ }
+
+ v = bpf_map_lookup_elem(&cpumask_map, &key);
+ if (!v) {
+ bpf_cpumask_release(mask);
+ return -ENOENT;
+ }
+
+ old = bpf_kptr_xchg(&v->cpumask, mask);
+ if (old)
+ bpf_cpumask_release(old);
+
+ return 0;
+ }
+
+ /**
+ * A sample tracepoint showing how a task's cpumask can be queried and
+ * recorded as a kptr.
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(record_task_cpumask, struct task_struct *task, u64 clone_flags)
+ {
+ struct bpf_cpumask *cpumask;
+ int ret;
+
+ cpumask = bpf_cpumask_create();
+ if (!cpumask)
+ return -ENOMEM;
+
+ if (!bpf_cpumask_full(task->cpus_ptr))
+ bpf_printk("task %s has CPU affinity", task->comm);
+
+ bpf_cpumask_copy(cpumask, task->cpus_ptr);
+ return cpumask_map_insert(cpumask, task->pid);
+ }
+
+----
+
+2.1.2 ``struct bpf_cpumask *`` as kptrs
+---------------------------------------
+
+As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can
+also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in
+a map, the reference can be removed from the map with bpf_kptr_xchg(), or
+opportunistically acquired with bpf_cpumask_kptr_get():
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_kptr_get
+
+Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
+
+.. code-block:: c
+
+ /* struct containing the struct bpf_cpumask kptr which is stored in the map. */
+ struct cpumasks_kfunc_map_value {
+ struct bpf_cpumask __kptr_ref *bpf_cpumask;
+ };
+
+ /* The map containing struct cpumasks_kfunc_map_value entries. */
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct cpumasks_kfunc_map_value);
+ __uint(max_entries, 1);
+ } cpumasks_kfunc_map SEC(".maps");
+
+ /* ... */
+
+ /**
+ * A simple example tracepoint program showing how a
+ * struct bpf_cpumask * kptr that is stored in a map can
+ * be acquired using the bpf_cpumask_kptr_get() kfunc.
+ */
+ SEC("tp_btf/cgroup_mkdir")
+ int BPF_PROG(cpumask_kptr_get_example, struct cgroup *cgrp, const char *path)
+ {
+ struct bpf_cpumask *kptr;
+ struct cpumasks_kfunc_map_value *v;
+ u32 key = 0;
+
+ /* Assume a bpf_cpumask * kptr was previously stored in the map. */
+ v = bpf_map_lookup_elem(&cpumasks_kfunc_map, &key);
+ if (!v)
+ return -ENOENT;
+
+ /* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */
+ kptr = bpf_cpumask_kptr_get(&v->bpf_cpumask);
+ if (!kptr)
+ /* If no bpf_cpumask was present in the map, it's because
+ * we're racing with another CPU that removed it with
+ * bpf_kptr_xchg() between the bpf_map_lookup_elem()
+ * above, and our call to bpf_cpumask_kptr_get().
+ * bpf_cpumask_kptr_get() internally safely handles this
+ * race, and will return NULL if the cpumask is no longer
+ * present in the map by the time we invoke the kfunc.
+ */
+ return -EBUSY;
+
+ /* Free the reference we just took above. Note that the
+ * original struct bpf_cpumask * kptr is still in the map. It will
+ * be freed either at a later time if another context deletes
+ * it from the map, or automatically by the BPF subsystem if
+ * it's still present when the map is destroyed.
+ */
+ bpf_cpumask_release(kptr);
+
+ return 0;
+ }
+
+----
+
+2.2 ``struct cpumask``
+----------------------
+
+``struct cpumask`` is the object that actually contains the cpumask bitmap
+being queried, mutated, etc. A ``struct bpf_cpumask`` wraps a ``struct
+cpumask``, which is why it's safe to cast it as such (note however that it is
+**not** safe to cast a ``struct cpumask *`` to a ``struct bpf_cpumask *``, and
+the verifier will reject any program that tries to do so).
+
+As we'll see below, any kfunc that mutates its cpumask argument will take a
+``struct bpf_cpumask *`` as that argument. Any kfunc that simply queries the
+cpumask will instead take a ``struct cpumask *``.
+
+3. cpumask kfuncs
+=================
+
+Above, we described the kfuncs that can be used to allocate, acquire, release,
+etc. a ``struct bpf_cpumask *``. This section of the document will describe the
+kfuncs for mutating and querying cpumasks.
+
+3.1 Mutating cpumasks
+---------------------
+
+Some cpumask kfuncs are "read-only" in that they don't mutate any of their
+arguments, whereas others mutate at least one argument (which means that the
+argument must be a ``struct bpf_cpumask *``, as described above).
+
+This section will describe all of the cpumask kfuncs which mutate at least one
+argument. :ref:`cpumasks-querying-label` below describes the read-only kfuncs.
+
+3.1.1 Setting and clearing CPUs
+-------------------------------
+
+bpf_cpumask_set_cpu() and bpf_cpumask_clear_cpu() can be used to set and clear
+a CPU in a ``struct bpf_cpumask`` respectively:
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_set_cpu bpf_cpumask_clear_cpu
+
+These kfuncs are pretty straightforward, and can be used, for example, as
+follows:
+
+.. code-block:: c
+
+ /**
+ * A sample tracepoint showing how a cpumask can be queried.
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(test_set_clear_cpu, struct task_struct *task, u64 clone_flags)
+ {
+ struct bpf_cpumask *cpumask;
+
+ cpumask = bpf_cpumask_create();
+ if (!cpumask)
+ return -ENOMEM;
+
+ bpf_cpumask_set_cpu(0, cpumask);
+ if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)cpumask))
+ /* Should never happen. */
+ goto release_exit;
+
+ bpf_cpumask_clear_cpu(0, cpumask);
+ if (bpf_cpumask_test_cpu(0, (const struct cpumask *)cpumask))
+ /* Should never happen. */
+ goto release_exit;
+
+ /* struct cpumask * pointers such as task->cpus_ptr can also be queried. */
+ if (bpf_cpumask_test_cpu(0, task->cpus_ptr))
+ bpf_printk("task %s can use CPU %d", task->comm, 0);
+
+ release_exit:
+ bpf_cpumask_release(cpumask);
+ return 0;
+ }
+
+----
+
+bpf_cpumask_test_and_set_cpu() and bpf_cpumask_test_and_clear_cpu() are
+complementary kfuncs that allow callers to atomically test and set (or clear)
+CPUs:
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_test_and_set_cpu bpf_cpumask_test_and_clear_cpu
+
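+For example (a sketch in the same style as the examples above):
+
+.. code-block:: c
+
+    /**
+     * A sample tracepoint showing how a CPU can be atomically tested
+     * and set (or cleared) in a cpumask.
+     */
+    SEC("tp_btf/task_newtask")
+    int BPF_PROG(test_and_set_clear, struct task_struct *task, u64 clone_flags)
+    {
+            struct bpf_cpumask *cpumask;
+
+            cpumask = bpf_cpumask_create();
+            if (!cpumask)
+                    return -ENOMEM;
+
+            /* The CPU was not yet set, so this returns false. */
+            if (bpf_cpumask_test_and_set_cpu(0, cpumask))
+                    /* Should never happen. */
+                    goto release_exit;
+
+            /* The CPU was set just above, so this returns true and clears it. */
+            if (!bpf_cpumask_test_and_clear_cpu(0, cpumask))
+                    /* Should never happen. */
+                    goto release_exit;
+
+    release_exit:
+            bpf_cpumask_release(cpumask);
+            return 0;
+    }
+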
+----
+
+We can also set and clear entire ``struct bpf_cpumask *`` objects in one
+operation using bpf_cpumask_setall() and bpf_cpumask_clear():
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_setall bpf_cpumask_clear
+
+3.1.2 Operations between cpumasks
+---------------------------------
+
+In addition to setting and clearing individual CPUs in a single cpumask,
+callers can also perform bitwise operations between multiple cpumasks using
+bpf_cpumask_and(), bpf_cpumask_or(), and bpf_cpumask_xor():
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_and bpf_cpumask_or bpf_cpumask_xor
+
+The following is an example of how they may be used. Note that some of the
+kfuncs shown in this example will be covered in more detail below.
+
+.. code-block:: c
+
+ /**
+ * A sample tracepoint showing how a cpumask can be mutated using
+ * bitwise operators (and queried).
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(test_and_or_xor, struct task_struct *task, u64 clone_flags)
+ {
+ struct bpf_cpumask *mask1, *mask2, *dst1, *dst2;
+
+ mask1 = bpf_cpumask_create();
+ if (!mask1)
+ return -ENOMEM;
+
+ mask2 = bpf_cpumask_create();
+ if (!mask2) {
+ bpf_cpumask_release(mask1);
+ return -ENOMEM;
+ }
+
+ /* ...Safely create the other two masks, dst1 and dst2... */
+
+ bpf_cpumask_set_cpu(0, mask1);
+ bpf_cpumask_set_cpu(1, mask2);
+ bpf_cpumask_and(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
+ if (!bpf_cpumask_empty((const struct cpumask *)dst1))
+ /* Should never happen. */
+ goto release_exit;
+
+ bpf_cpumask_or(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
+ if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)dst1))
+ /* Should never happen. */
+ goto release_exit;
+
+ if (!bpf_cpumask_test_cpu(1, (const struct cpumask *)dst1))
+ /* Should never happen. */
+ goto release_exit;
+
+ bpf_cpumask_xor(dst2, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
+ if (!bpf_cpumask_equal((const struct cpumask *)dst1,
+ (const struct cpumask *)dst2))
+ /* Should never happen. */
+ goto release_exit;
+
+ release_exit:
+ bpf_cpumask_release(mask1);
+ bpf_cpumask_release(mask2);
+ bpf_cpumask_release(dst1);
+ bpf_cpumask_release(dst2);
+ return 0;
+ }
+
+----
+
+The contents of an entire cpumask may be copied to another using
+bpf_cpumask_copy():
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_copy
+
+----
+
+.. _cpumasks-querying-label:
+
+3.2 Querying cpumasks
+---------------------
+
+In addition to the above kfuncs, there is also a set of read-only kfuncs that
+can be used to query the contents of cpumasks.
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset
+ bpf_cpumask_empty bpf_cpumask_full
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_any bpf_cpumask_any_and
+
+----
+
+Some example usages of these querying kfuncs were shown above. We will not
+replicate those examples here. Note, however, that all of the aforementioned
+kfuncs are tested in `tools/testing/selftests/bpf/progs/cpumask_success.c`_, so
+please take a look there if you're looking for more examples of how they can be
+used.
+
+.. _tools/testing/selftests/bpf/progs/cpumask_success.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c
+
+
+4. Adding BPF cpumask kfuncs
+============================
+
+The set of supported BPF cpumask kfuncs is not (yet) a 1-1 match with the
+cpumask operations in include/linux/cpumask.h. Any of those cpumask operations
+could easily be encapsulated in a new kfunc if and when required. If you'd like
+to support a new cpumask operation, please feel free to submit a patch. If you
+do add a new cpumask kfunc, please document it here, and add any relevant
+selftest testcases to the cpumask selftest suite.
diff --git a/Documentation/bpf/graph_ds_impl.rst b/Documentation/bpf/graph_ds_impl.rst
new file mode 100644
index 000000000000..61274622b71d
--- /dev/null
+++ b/Documentation/bpf/graph_ds_impl.rst
@@ -0,0 +1,267 @@
+=========================
+BPF Graph Data Structures
+=========================
+
+This document describes implementation details of new-style "graph" data
+structures (linked_list, rbtree), with particular focus on the verifier's
+implementation of semantics specific to those data structures.
+
+Although no specific verifier code is referred to in this document, the document
+assumes that the reader has general knowledge of BPF verifier internals, BPF
+maps, and BPF program writing.
+
+Note that the intent of this document is to describe the current state of
+these graph data structures. **No guarantees** of stability for either
+semantics or APIs are made or implied here.
+
+.. contents::
+ :local:
+ :depth: 2
+
+Introduction
+------------
+
+The BPF map API has historically been the main way to expose data structures
+of various types for use within BPF programs. Some data structures fit naturally
+with the map API (HASH, ARRAY), others less so. Consequently, programs
+interacting with the latter group of data structures can be hard to parse
+for kernel programmers without previous BPF experience.
+
+Luckily, some restrictions which necessitated the use of BPF map semantics are
+no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
+BPF allocator, it is now possible to implement BPF data structures whose API
+and semantics more closely match those exposed to the rest of the kernel.
+
+Two such data structures - linked_list and rbtree - have many verification
+details in common. Because both have "root"s ("head" for linked_list) and
+"node"s, the verifier code and this document refer to common functionality
+as "graph_api", "graph_root", "graph_node", etc.
+
+Unless otherwise stated, examples and semantics below apply to both graph data
+structures.
+
+Unstable API
+------------
+
+Data structures implemented using the BPF map API have historically used BPF
+helper functions - either standard map API helpers like ``bpf_map_update_elem``
+or map-specific helpers. The new-style graph data structures instead use kfuncs
+to define their manipulation helpers. Because there are no stability guarantees
+for kfuncs, the API and semantics for these data structures can be evolved in
+a way that breaks backwards compatibility if necessary.
+
+Root and node types for the new data structures are opaquely defined in the
+``uapi/linux/bpf.h`` header.
+
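+For reference, the root and node types are currently defined along these
+lines (a sketch of the opaque header definitions; as these types are
+unstable, their exact layout may change):
+
+.. code-block:: c
+
+    struct bpf_list_head {
+            __u64 :64;
+            __u64 :64;
+    } __attribute__((aligned(8)));
+
+    struct bpf_rb_root {
+            __u64 :64;
+            __u64 :64;
+    } __attribute__((aligned(8)));
+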
+Locking
+-------
+
+The new-style data structures are intrusive and are defined similarly to their
+vanilla kernel counterparts:
+
+.. code-block:: c
+
+ struct node_data {
+ long key;
+ long data;
+ struct bpf_rb_node node;
+ };
+
+ struct bpf_spin_lock glock;
+ struct bpf_rb_root groot __contains(node_data, node);
+
+The "root" type for both linked_list and rbtree expects to be in a map_value
+which also contains a ``bpf_spin_lock`` - in the above example both global
+variables are placed in a single-value arraymap. The verifier considers this
+spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
+the same map_value and will enforce that the correct lock is held when
+verifying BPF programs that manipulate the tree. Since this lock checking
+happens at verification time, there is no runtime penalty.
+
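+For example, a program adding a node to ``groot`` must hold ``glock`` across
+the operation. A minimal sketch follows (``node_less`` is an assumed
+comparator, and ``container_of()`` is assumed to be available, e.g. from the
+BPF selftests headers; later examples in this document elide the comparator
+argument to ``bpf_rbtree_add`` for brevity):
+
+.. code-block:: c
+
+    static bool node_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+    {
+            struct node_data *node_a, *node_b;
+
+            node_a = container_of(a, struct node_data, node);
+            node_b = container_of(b, struct node_data, node);
+
+            return node_a->key < node_b->key;
+    }
+
+    SEC("tc")
+    int add_node(void *ctx)
+    {
+            struct node_data *n = bpf_obj_new(typeof(*n));
+
+            if (!n)
+                    return 1;
+
+            n->key = 5;
+
+            /* The verifier enforces that glock is held here, since it
+             * shares a map_value with groot. */
+            bpf_spin_lock(&glock);
+            bpf_rbtree_add(&groot, &n->node, node_less);
+            bpf_spin_unlock(&glock);
+
+            return 0;
+    }
+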
+Non-owning references
+---------------------
+
+**Motivation**
+
+Consider the following BPF code:
+
+.. code-block:: c
+
+ struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
+
+ bpf_spin_lock(&lock);
+
+ bpf_rbtree_add(&tree, n); /* PASSED */
+
+ bpf_spin_unlock(&lock);
+
+From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
+has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
+``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
+program has ownership of the pointee's (object pointed to by ``n``) lifetime.
+The BPF program must pass off ownership before exiting - either via
+``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
+``bpf_rbtree_add``.
+
+(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
+"ownership is acquired" and "ownership is passed", respectively)
+
+What should the verifier do with ``n`` after ownership is passed off? If the
+object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
+should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
+the object is no longer valid. The underlying memory may have been reused for
+some other allocation, unmapped, etc.
+
+When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
+obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
+but that would result in programs with useful, common coding patterns being
+rejected, e.g.:
+
+.. code-block:: c
+
+ int x;
+ struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
+
+ bpf_spin_lock(&lock);
+
+ bpf_rbtree_add(&tree, n); /* PASSED */
+ x = n->data;
+ n->data = 42;
+
+ bpf_spin_unlock(&lock);
+
+Both the read from and write to ``n->data`` would be rejected. The verifier
+can do better, though, by taking advantage of two details:
+
+ * Graph data structure APIs can only be used when the ``bpf_spin_lock``
+ associated with the graph root is held
+
+ * Both graph data structures have pointer stability
+
+ * Because graph nodes are allocated with ``bpf_obj_new`` and
+ adding / removing from the root involves fiddling with the
+ ``bpf_{list,rb}_node`` field of the node struct, a graph node will
+ remain at the same address after either operation.
+
+Because the associated ``bpf_spin_lock`` must be held by any program adding
+or removing, if we're in the critical section bounded by that lock, we know
+that no other program can add or remove until the end of the critical section.
+This combined with pointer stability means that, until the critical section
+ends, we can safely access the graph node through ``n`` even after it was used
+to pass ownership.
+
+The verifier considers such a reference a *non-owning reference*. The ref
+returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
+Both terms currently only have meaning in the context of graph nodes and the
+graph API.
+
+**Details**
+
+Let's enumerate the properties of both types of references.
+
+*owning reference*
+
+ * This reference controls the lifetime of the pointee
+
+ * Ownership of pointee must be 'released' by passing it to some graph API
+ kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
+
+ * If not released before program ends, verifier considers program invalid
+
+ * Access to the pointee's memory will not page fault
+
+*non-owning reference*
+
+ * This reference does not own the pointee
+
+ * It cannot be used to add the graph node to a graph root, nor ``free``'d via
+ ``bpf_obj_drop``
+
+ * No explicit control of lifetime, but can infer valid lifetime based on
+ non-owning ref existence (see explanation below)
+
+ * Access to the pointee's memory will not page fault
+
+From the verifier's perspective, non-owning references can only exist
+between spin_lock and spin_unlock. Why? After spin_unlock another program
+can do arbitrary operations on the data structure like removing and ``free``-ing
+via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
+``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
+Or the memory could go away.
+
+To prevent this logic violation all non-owning references are invalidated by the
+verifier after a critical section ends. This is necessary to ensure the "will
+not page fault" property of non-owning references. So if the verifier hasn't
+invalidated a non-owning ref, accessing it will not page fault.
+
+Currently ``bpf_obj_drop`` is not allowed in the critical section, so
+if there's a valid non-owning ref, we must be in a critical section, and can
+conclude that the ref's memory hasn't been dropped-and-``free``'d or
+dropped-and-reused.
+
+Any reference to a node that is in an rbtree _must_ be non-owning, since
+the tree has control of the pointee's lifetime. Similarly, any ref to a node
+that isn't in rbtree _must_ be owning. This results in a nice property:
+graph API add / remove implementations don't need to check if a node
+has already been added (or already removed), as the ownership model
+allows the verifier to prevent such a state from being valid by simply checking
+types.
+
+However, pointer aliasing poses an issue for the above "nice property".
+Consider the following example:
+
+.. code-block:: c
+
+ struct node_data *n, *m, *o, *p;
+ n = bpf_obj_new(typeof(*n)); /* 1 */
+
+ bpf_spin_lock(&lock);
+
+ bpf_rbtree_add(&tree, n); /* 2 */
+ m = bpf_rbtree_first(&tree); /* 3 */
+
+ o = bpf_rbtree_remove(&tree, n); /* 4 */
+ p = bpf_rbtree_remove(&tree, m); /* 5 */
+
+ bpf_spin_unlock(&lock);
+
+ bpf_obj_drop(o);
+ bpf_obj_drop(p); /* 6 */
+
+Assume the tree is empty before this program runs. If we track verifier state
+changes here using the numbered comments above:
+
+ 1) n is an owning reference
+
+ 2) n is a non-owning reference, it's been added to the tree
+
+ 3) n and m are non-owning references, they both point to the same node
+
+ 4) o is an owning reference, n and m non-owning, all point to same node
+
+ 5) o and p are owning, n and m non-owning, all point to the same node
+
+ 6) a double-free has occurred, since o and p point to same node and o was
+ ``free``'d in previous statement
+
+States 4 and 5 violate our "nice property", as there are non-owning refs to
+a node which is not in an rbtree. Statement 5 will try to remove a node which
+has already been removed as a result of this violation. State 6 is a dangerous
+double-free.
+
+At a minimum we should prevent state 6 from being possible. If we can't also
+prevent state 5 then we must abandon our "nice property" and check whether a
+node has already been removed at runtime.
+
+We prevent both by generalizing the "invalidate non-owning references" behavior
+of ``bpf_spin_unlock`` and doing similar invalidation after
+``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
+
+ * takes an arbitrary node argument
+
+ * removes it from the data structure
+
+ * returns an owning reference to the removed node
+
+may result in a state where some other non-owning reference points to the same
+node. So ``remove``-type kfuncs must be considered a non-owning reference
+invalidation point as well.
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index b81533d8b061..dbb39e8f9889 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -20,6 +20,7 @@ that goes into great technical depth about the BPF Architecture.
syscall_api
helpers
kfuncs
+ cpumasks
programs
maps
bpf_prog_run
diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst
index e672d5ec6cc7..af515de5fc38 100644
--- a/Documentation/bpf/instruction-set.rst
+++ b/Documentation/bpf/instruction-set.rst
@@ -7,6 +7,11 @@ eBPF Instruction Set Specification, v1.0
This document specifies version 1.0 of the eBPF instruction set.
+Documentation conventions
+=========================
+
+For brevity, this document uses the type notation "u64", "u32", etc.
+to mean an unsigned integer whose width is the specified number of bits.
Registers and calling convention
================================
@@ -30,20 +35,56 @@ Instruction encoding
eBPF has two instruction encodings:
* the basic instruction encoding, which uses 64 bits to encode an instruction
-* the wide instruction encoding, which appends a second 64-bit immediate value
- (imm64) after the basic instruction for a total of 128 bits.
+* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
+ constant) value after the basic instruction for a total of 128 bits.
+
+The basic instruction encoding is as follows, where MSB and LSB mean the most significant
+bits and least significant bits, respectively:
+
+============= ======= ======= ======= ============
+32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
+============= ======= ======= ======= ============
+imm offset src_reg dst_reg opcode
+============= ======= ======= ======= ============
+
+**imm**
+ signed integer immediate value
-The basic instruction encoding looks as follows:
+**offset**
+ signed integer offset used with pointer arithmetic
-============= ======= =============== ==================== ============
-32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
-============= ======= =============== ==================== ============
-immediate offset source register destination register opcode
-============= ======= =============== ==================== ============
+**src_reg**
+ the source register number (0-10), except where otherwise specified
+ (`64-bit immediate instructions`_ reuse this field for other purposes)
+
+**dst_reg**
+ destination register number (0-10)
+
+**opcode**
+ operation to perform
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
+As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
+instruction uses a 64-bit immediate value that is constructed as follows.
+The 64 bits following the basic instruction contain a pseudo instruction
+using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
+and imm containing the high 32 bits of the immediate value.
+
+================= ==================
+64 bits (MSB) 64 bits (LSB)
+================= ==================
+basic instruction pseudo instruction
+================= ==================
+
+Thus the 64-bit immediate value is constructed as follows::
+
+ imm64 = (next_imm << 32) | imm
+
+where 'next_imm' refers to the imm value of the pseudo instruction
+following the basic instruction.
+
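+In C, a decoder might assemble the value as follows (a hypothetical helper,
+not kernel code; the cast ensures the two halves combine without sign
+extension)::
+
+    static uint64_t bpf_imm64(uint32_t imm, uint32_t next_imm)
+    {
+            return ((uint64_t)next_imm << 32) | imm;
+    }
+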
Instruction classes
-------------------
@@ -71,27 +112,32 @@ For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` an
============== ====== =================
4 bits (MSB) 1 bit 3 bits (LSB)
============== ====== =================
-operation code source instruction class
+code source instruction class
============== ====== =================
-The 4th bit encodes the source operand:
+**code**
+ the operation code, whose meaning varies by instruction class
- ====== ===== ========================================
- source value description
- ====== ===== ========================================
- BPF_K 0x00 use 32-bit immediate as source operand
- BPF_X 0x08 use 'src_reg' register as source operand
- ====== ===== ========================================
+**source**
+ the source operand location, which unless otherwise specified is one of:
-The four MSB bits store the operation code.
+ ====== ===== ==============================================
+ source value description
+ ====== ===== ==============================================
+ BPF_K 0x00 use 32-bit 'imm' value as source operand
+ BPF_X 0x08 use 'src_reg' register value as source operand
+ ====== ===== ==============================================
+**instruction class**
+ the instruction class (see `Instruction classes`_)
Arithmetic instructions
-----------------------
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
-The 'code' field encodes the operation as below:
+The 'code' field encodes the operation as below, where 'src' and 'dst' refer
+to the values of the source and destination registers, respectively.
======== ===== ==========================================================
code value description
@@ -99,35 +145,49 @@ code value description
BPF_ADD 0x00 dst += src
BPF_SUB 0x10 dst -= src
BPF_MUL 0x20 dst \*= src
-BPF_DIV 0x30 dst /= src
+BPF_DIV 0x30 dst = (src != 0) ? (dst / src) : 0
BPF_OR 0x40 dst \|= src
BPF_AND 0x50 dst &= src
BPF_LSH 0x60 dst <<= src
BPF_RSH 0x70 dst >>= src
BPF_NEG 0x80 dst = ~src
-BPF_MOD 0x90 dst %= src
+BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst
BPF_XOR 0xa0 dst ^= src
BPF_MOV 0xb0 dst = src
BPF_ARSH 0xc0 sign extending shift right
BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
======== ===== ==========================================================
+Underflow and overflow are allowed during arithmetic operations, meaning
+the 64-bit or 32-bit value will wrap. If eBPF program execution would
+result in division by zero, the destination register is instead set to zero.
+If execution would result in modulo by zero, for ``BPF_ALU64`` the value of
+the destination register is unchanged whereas for ``BPF_ALU`` the upper
+32 bits of the destination register are zeroed.
+
``BPF_ADD | BPF_X | BPF_ALU`` means::
- dst_reg = (u32) dst_reg + (u32) src_reg;
+ dst = (u32) ((u32) dst + (u32) src)
+
+where '(u32)' indicates that the upper 32 bits are zeroed.
``BPF_ADD | BPF_X | BPF_ALU64`` means::
- dst_reg = dst_reg + src_reg
+ dst = dst + src
``BPF_XOR | BPF_K | BPF_ALU`` means::
- dst_reg = (u32) dst_reg ^ (u32) imm32
+ dst = (u32) dst ^ (u32) imm32
``BPF_XOR | BPF_K | BPF_ALU64`` means::
- dst_reg = dst_reg ^ imm32
+ dst = dst ^ imm32
+Also note that the division and modulo operations are unsigned. Thus, for
+``BPF_ALU``, 'imm' is first interpreted as an unsigned 32-bit value, whereas
+for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result
+interpreted as an unsigned 64-bit value. There are no instructions for
+signed division or modulo.
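+
+For example, the unsigned ``BPF_ALU64`` division and modulo semantics
+described above can be sketched in C as (hypothetical helpers, not kernel
+code)::
+
+    static uint64_t alu64_div(uint64_t dst, uint64_t src)
+    {
+            return src != 0 ? dst / src : 0;
+    }
+
+    static uint64_t alu64_mod(uint64_t dst, uint64_t src)
+    {
+            return src != 0 ? dst % src : dst;
+    }
+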
Byte swap instructions
~~~~~~~~~~~~~~~~~~~~~~
@@ -155,11 +215,11 @@ Examples:
``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
- dst_reg = htole16(dst_reg)
+ dst = htole16(dst)
``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
- dst_reg = htobe64(dst_reg)
+ dst = htobe64(dst)
Jump instructions
-----------------
@@ -234,15 +294,15 @@ instructions that transfer data between a register and memory.
``BPF_MEM | <size> | BPF_STX`` means::
- *(size *) (dst_reg + off) = src_reg
+ *(size *) (dst + offset) = src
``BPF_MEM | <size> | BPF_ST`` means::
- *(size *) (dst_reg + off) = imm32
+ *(size *) (dst + offset) = imm32
``BPF_MEM | <size> | BPF_LDX`` means::
- dst_reg = *(size *) (src_reg + off)
+ dst = *(size *) (src + offset)
Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
@@ -276,11 +336,11 @@ BPF_XOR 0xa0 atomic xor
``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
- *(u32 *)(dst_reg + off16) += src_reg
+ *(u32 *)(dst + offset) += src
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means::
- *(u64 *)(dst_reg + off16) += src_reg
+ *(u64 *)(dst + offset) += src
In addition to the simple atomic operations, there also is a modifier and
two complex atomic operations:
@@ -295,16 +355,16 @@ BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``BPF_FETCH`` flag
-is set, then the operation also overwrites ``src_reg`` with the value that
+is set, then the operation also overwrites ``src`` with the value that
was in memory before it was modified.
-The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
-addressed by ``dst_reg + off``.
+The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
+addressed by ``dst + offset``.
The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
-``dst_reg + off`` with ``R0``. If they match, the value addressed by
-``dst_reg + off`` is replaced with ``src_reg``. In either case, the
-value that was at ``dst_reg + off`` before the operation is zero-extended
+``dst + offset`` with ``R0``. If they match, the value addressed by
+``dst + offset`` is replaced with ``src``. In either case, the
+value that was at ``dst + offset`` before the operation is zero-extended
and loaded back to ``R0``.
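+
+Expressed as C-like pseudocode, ``BPF_ATOMIC | BPF_DW | BPF_STX`` with
+'imm' = BPF_CMPXCHG means (with the whole sequence performed atomically)::
+
+    u64 old = *(u64 *)(dst + offset);
+
+    if (old == R0)
+            *(u64 *)(dst + offset) = src;
+    R0 = old;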
64-bit immediate instructions
@@ -317,7 +377,7 @@ There is currently only one such instruction.
``BPF_LD | BPF_DW | BPF_IMM`` means::
- dst_reg = imm64
+ dst = imm64
Legacy BPF Packet access instructions
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 9fd7fb539f85..ca96ef3f6896 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -1,3 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _kfuncs-header-label:
+
=============================
BPF Kernel Functions (kfuncs)
=============================
@@ -9,7 +13,7 @@ BPF Kernel Functions or more commonly known as kfuncs are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
-kernel.
+kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information.
2. Defining a kfunc
===================
@@ -37,7 +41,7 @@ An example is given below::
__diag_ignore_all("-Wmissing-prototypes",
"Global kfuncs as their definitions will be in BTF");
- struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
+ __bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
{
return find_get_task_by_vpid(nr);
}
@@ -62,7 +66,7 @@ kfunc with a __tag, where tag may be one of the supported annotations.
This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::
- void bpf_memzero(void *mem, int mem__sz)
+ __bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
{
...
}
@@ -82,7 +86,7 @@ safety of the program.
An example is given below::
- void *bpf_obj_new(u32 local_type_id__k, ...)
+ __bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
{
...
}
@@ -121,6 +125,20 @@ flags on a set of kfuncs as follows::
This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.
+kfunc definitions should also always be annotated with the ``__bpf_kfunc``
+macro. This prevents issues such as the compiler inlining the kfunc if it's a
+static kernel function, or the function being elided in an LTO build as it's
+not used in the rest of the kernel. Developers should not manually add
+annotations to their kfunc to prevent these issues. If an annotation is
+required to prevent such an issue with your kfunc, it is a bug and should be
+added to the definition of the macro so that other kfuncs are similarly
+protected. An example is given below::
+
+ __bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
+ {
+ ...
+ }
+
2.4.1 KF_ACQUIRE flag
---------------------
@@ -163,7 +181,8 @@ KF_ACQUIRE and KF_RET_NULL flags.
The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
-offset, and without having been obtained from walking another pointer).
+offset, and without having been obtained from walking another pointer, with one
+exception described below).
There are two types of pointers to kernel objects which are considered "valid":
@@ -176,6 +195,25 @@ KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.
+As mentioned above, a nested pointer obtained from walking a trusted pointer is
+no longer trusted, with one exception. If a struct type has a field that is
+guaranteed to be valid as long as its parent pointer is trusted, the
+``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
+follows:
+
+.. code-block:: c
+
+ BTF_TYPE_SAFE_NESTED(struct task_struct) {
+ const cpumask_t *cpus_ptr;
+ };
+
+In other words, you must:
+
+1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.
+
+2. Specify the type and name of the trusted nested field. This field must match
+ the field in the original type definition exactly.
+
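+With the above declaration in place, a BPF program can pass
+``task->cpus_ptr`` to a KF_TRUSTED_ARGS kfunc even though the pointer was
+obtained by walking a trusted ``struct task_struct *``. A sketch (assuming
+the cpumask kfuncs described in Documentation/bpf/cpumasks.rst are available
+to the program type):
+
+.. code-block:: c
+
+    SEC("tp_btf/task_newtask")
+    int BPF_PROG(nested_trust, struct task_struct *task, u64 clone_flags)
+    {
+            /* task is trusted, and cpus_ptr is declared safe via
+             * BTF_TYPE_SAFE_NESTED, so it may be passed on. */
+            if (bpf_cpumask_test_cpu(0, task->cpus_ptr))
+                    bpf_printk("task %s can run on CPU 0", task->comm);
+
+            return 0;
+    }
+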
2.4.6 KF_SLEEPABLE flag
-----------------------
@@ -200,6 +238,28 @@ single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have reference count of 0 and the kfunc must take this
into consideration.
+.. _KF_deprecated_flag:
+
+2.4.9 KF_DEPRECATED flag
+------------------------
+
+The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
+changed or removed in a subsequent kernel release. A kfunc that is
+marked with KF_DEPRECATED should also have any relevant information
+captured in its kernel doc. Such information typically includes the
+kfunc's expected remaining lifespan, a recommendation for new
+functionality that can replace it if any is available, and possibly a
+rationale for why it is being removed.
+
+Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be
+supported and have its KF_DEPRECATED flag removed, it is likely to be far more
+difficult to remove a KF_DEPRECATED flag after it's been added than it is to
+prevent it from being added in the first place. As described in
+:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are
+encouraged to make their use-cases known as early as possible, and participate
+in upstream discussions regarding whether to keep, change, deprecate, or remove
+those kfuncs if and when such discussions occur.
+
2.5 Registering the kfuncs
--------------------------
@@ -223,14 +283,150 @@ type. An example is shown below::
}
late_initcall(init_subsystem);
-3. Core kfuncs
+2.6 Specifying no-cast aliases with ___init
+--------------------------------------------
+
+The verifier will always enforce that the BTF type of a pointer passed to a
+kfunc by a BPF program matches the type of pointer specified in the kfunc
+definition. The verifier does, however, allow types that are equivalent
+according to the C standard to be passed to the same kfunc arg, even if their
+BTF_IDs differ.
+
+For example, for the following type definition:
+
+.. code-block:: c
+
+ struct bpf_cpumask {
+ cpumask_t cpumask;
+ refcount_t usage;
+ };
+
+The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
+taking a ``cpumask_t *`` (``cpumask_t`` is a typedef of ``struct cpumask``). For
+instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed
+to bpf_cpumask_test_cpu().
+
+In some cases, this type-aliasing behavior is not desired. ``struct
+nf_conn___init`` is one such example:
+
+.. code-block:: c
+
+ struct nf_conn___init {
+ struct nf_conn ct;
+ };
+
+The C standard would consider these types to be equivalent, but it would not
+always be safe to pass either type to a trusted kfunc. ``struct
+nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
+*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
+nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
+nf_conn *`` (e.g. ``bpf_ct_change_timeout()``).
+
+In order to accommodate such requirements, the verifier will enforce strict
+PTR_TO_BTF_ID type matching if two types have the exact same name, with one
+being suffixed with ``___init``.
+
+.. _BPF_kfunc_lifecycle_expectations:
+
+3. kfunc lifecycle expectations
+===============================
+
+kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the
+strict stability restrictions associated with kernel <-> user UAPIs. This means
+they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be
+modified or removed by a maintainer of the subsystem they're defined in when
+it's deemed necessary.
+
+Like any other change to the kernel, maintainers will not change or remove a
+kfunc without having a reasonable justification. Whether or not they'll choose
+to change a kfunc will ultimately depend on a variety of factors, such as how
+widely used the kfunc is, how long the kfunc has been in the kernel, whether an
+alternative kfunc exists, what the norm is in terms of stability for the
+subsystem in question, and of course what the technical cost is of continuing
+to support the kfunc.
+
+There are several implications of this:
+
+a) kfuncs that are widely used or have been in the kernel for a long time will
+ be more difficult to justify being changed or removed by a maintainer. In
+ other words, kfuncs that are known to have a lot of users and provide
+ significant value provide stronger incentives for maintainers to invest the
+ time and complexity in supporting them. It is therefore important for
+ developers that are using kfuncs in their BPF programs to communicate and
+ explain how and why those kfuncs are being used, and to participate in
+ discussions regarding those kfuncs when they occur upstream.
+
+b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs
+ that call kfuncs are generally not part of the kernel tree. This means that
+ refactoring cannot typically change callers in-place when a kfunc changes,
+ as is done for e.g. an upstreamed driver being updated in place when a
+ kernel symbol is changed.
+
+ Unlike with regular kernel symbols, this is expected behavior for BPF
+ symbols, and out-of-tree BPF programs that use kfuncs should be considered
+ relevant to discussions and decisions around modifying and removing those
+ kfuncs. The BPF community will take an active role in participating in
+ upstream discussions when necessary to ensure that the perspectives of such
+ users are taken into account.
+
+c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and
+ will not ever hard-block a change in the kernel purely for stability
+ reasons. That being said, kfuncs are features that are meant to solve
+ problems and provide value to users. The decision of whether to change or
+ remove a kfunc is a multivariate technical decision that is made on a
+ case-by-case basis, and which is informed by data points such as those
+ mentioned above. It is expected that a kfunc being removed or changed with
+ no warning will not be a common occurrence or take place without sound
+ justification, but it is a possibility that must be accepted if one is to
+ use kfuncs.
+
+3.1 kfunc deprecation
+---------------------
+
+As described above, while sometimes a maintainer may find that a kfunc must be
+changed or removed immediately to accommodate some changes in their subsystem,
+usually kfuncs will be able to accommodate a longer and more measured
+deprecation process. For example, if a new kfunc comes along which provides
+superior functionality to an existing kfunc, the existing kfunc may be
+deprecated for some period of time to allow users to migrate their BPF programs
+to use the new one. Or, if a kfunc has no known users, a decision may be made
+to remove the kfunc (without providing an alternative API) after some
+deprecation period so as to provide users with a window to notify the kfunc
+maintainer if it turns out that the kfunc is actually being used.
+
+It's expected that the common case will be that kfuncs will go through a
+deprecation period rather than being changed or removed without warning. As
+described in :ref:`KF_deprecated_flag`, the kfunc framework provides the
+KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been
+deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following
+procedure is followed for removal:
+
+1. Any relevant information for deprecated kfuncs is documented in the kfunc's
+ kernel docs. This documentation will typically include the kfunc's expected
+ remaining lifespan, a recommendation for new functionality that can replace
+ the usage of the deprecated function (or an explanation as to why no such
+ replacement exists), etc.
+
+2. The deprecated kfunc is kept in the kernel for some period of time after it
+ was first marked as deprecated. This time period will be chosen on a
+ case-by-case basis, and will typically depend on how widespread the use of
+ the kfunc is, how long it has been in the kernel, and how hard it is to move
+ to alternatives. This deprecation time period is "best effort", and as
+ described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may
+ sometimes dictate that the kfunc be removed before the full intended
+ deprecation period has elapsed.
+
+3. After the deprecation period the kfunc will be removed. At this point, BPF
+ programs calling the kfunc will be rejected by the verifier.
+
+4. Core kfuncs
==============
The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.
-3.1 struct task_struct * kfuncs
+4.1 struct task_struct * kfuncs
-------------------------------
There are a number of kfuncs that allow ``struct task_struct *`` objects to be
@@ -306,7 +502,7 @@ Here is an example of it being used:
return 0;
}
-3.2 struct cgroup * kfuncs
+4.2 struct cgroup * kfuncs
--------------------------
``struct cgroup *`` objects also have acquire and release functions:
@@ -420,3 +616,10 @@ the verifier. bpf_cgroup_ancestor() can be used as follows:
bpf_cgroup_release(parent);
return 0;
}
+
+4.3 struct cpumask * kfuncs
+---------------------------
+
+BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
+destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
+for more details.
diff --git a/Documentation/bpf/libbpf/libbpf_naming_convention.rst b/Documentation/bpf/libbpf/libbpf_naming_convention.rst
index c5ac97f3d4c4..b5b41b61b3c0 100644
--- a/Documentation/bpf/libbpf/libbpf_naming_convention.rst
+++ b/Documentation/bpf/libbpf/libbpf_naming_convention.rst
@@ -83,8 +83,8 @@ This prevents from accidentally exporting a symbol, that is not supposed
to be a part of ABI what, in turn, improves both libbpf developer- and
user-experiences.
-ABI versionning
----------------
+ABI versioning
+--------------
To make future ABI extensions possible libbpf ABI is versioned.
Versioning is implemented by ``libbpf.map`` version script that is
@@ -148,7 +148,7 @@ API documentation convention
The libbpf API is documented via comments above definitions in
header files. These comments can be rendered by doxygen and sphinx
for well organized html output. This section describes the
-convention in which these comments should be formated.
+convention in which these comments should be formatted.
Here is an example from btf.h:
diff --git a/Documentation/bpf/map_sockmap.rst b/Documentation/bpf/map_sockmap.rst
new file mode 100644
index 000000000000..cc92047c6630
--- /dev/null
+++ b/Documentation/bpf/map_sockmap.rst
@@ -0,0 +1,498 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright Red Hat
+
+==============================================
+BPF_MAP_TYPE_SOCKMAP and BPF_MAP_TYPE_SOCKHASH
+==============================================
+
+.. note::
+ - ``BPF_MAP_TYPE_SOCKMAP`` was introduced in kernel version 4.14
+ - ``BPF_MAP_TYPE_SOCKHASH`` was introduced in kernel version 4.18
+
+``BPF_MAP_TYPE_SOCKMAP`` and ``BPF_MAP_TYPE_SOCKHASH`` maps can be used to
+redirect skbs between sockets or to apply policy at the socket level based on
+the result of a BPF (verdict) program with the help of the BPF helpers
+``bpf_sk_redirect_map()``, ``bpf_sk_redirect_hash()``,
+``bpf_msg_redirect_map()`` and ``bpf_msg_redirect_hash()``.
+
+``BPF_MAP_TYPE_SOCKMAP`` is backed by an array that uses an integer key as the
+index to look up a reference to a ``struct sock``. The map values are socket
+descriptors. Similarly, ``BPF_MAP_TYPE_SOCKHASH`` is a hash backed BPF map that
+holds references to sockets via their socket descriptors.
+
+.. note::
+ The value type is either __u32 or __u64; the latter (__u64) is to support
+ returning socket cookies to userspace. Returning the ``struct sock *`` that
+ the map holds to user-space is neither safe nor useful.
+
+These maps may have BPF programs attached to them, specifically a parser program
+and a verdict program. The parser program determines how much data has been
+parsed and therefore how much data needs to be queued to come to a verdict. The
+verdict program is essentially the redirect program and can return a verdict
+of ``__SK_DROP``, ``__SK_PASS``, or ``__SK_REDIRECT``.
+
+When a socket is inserted into one of these maps, its socket callbacks are
+replaced and a ``struct sk_psock`` is attached to it. Additionally, this
+``sk_psock`` inherits the programs that are attached to the map.
+
+A sock object may be in multiple maps, but can only inherit a single
+parse or verdict program. If adding a sock object to a map would result
+in having multiple parser programs, the update will return an EBUSY error.
+
+The supported programs to attach to these maps are:
+
+.. code-block:: c
+
+ struct sk_psock_progs {
+ struct bpf_prog *msg_parser;
+ struct bpf_prog *stream_parser;
+ struct bpf_prog *stream_verdict;
+ struct bpf_prog *skb_verdict;
+ };
+
+.. note::
+ Users are not allowed to attach ``stream_verdict`` and ``skb_verdict``
+ programs to the same map.
+
+The attach types for the map programs are:
+
+- ``msg_parser`` program - ``BPF_SK_MSG_VERDICT``.
+- ``stream_parser`` program - ``BPF_SK_SKB_STREAM_PARSER``.
+- ``stream_verdict`` program - ``BPF_SK_SKB_STREAM_VERDICT``.
+- ``skb_verdict`` program - ``BPF_SK_SKB_VERDICT``.
+
+There are additional helpers available to use with the parser and verdict
+programs: ``bpf_msg_apply_bytes()`` and ``bpf_msg_cork_bytes()``. With
+``bpf_msg_apply_bytes()`` BPF programs can tell the infrastructure how many
+bytes the given verdict should apply to. The helper ``bpf_msg_cork_bytes()``
+handles a different case where a BPF program cannot reach a verdict on a msg
+until it receives more bytes AND the program doesn't want to forward the packet
+until it is known to be good.
+
+Finally, the helpers ``bpf_msg_pull_data()`` and ``bpf_msg_push_data()`` are
+available to ``BPF_PROG_TYPE_SK_MSG`` BPF programs to pull in data and set the
+start and end pointers to given values or to add metadata to the ``struct
+sk_msg_buff *msg``.
+
+All these helpers will be described in more detail below.
+
+Usage
+=====
+Kernel BPF
+----------
+bpf_msg_redirect_map()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
+
+This helper is used in programs implementing policies at the socket level. If
+the message ``msg`` is allowed to pass (i.e., if the verdict BPF program
+returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
+can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
+to select the ingress path otherwise the egress path is selected. This is the
+only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
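+For example, a minimal verdict program (a sketch; ``sock_map`` is an assumed
+``BPF_MAP_TYPE_SOCKMAP`` defined elsewhere):
+
+.. code-block:: c
+
+    SEC("sk_msg")
+    int prog_msg_verdict(struct sk_msg_md *msg)
+    {
+            /* Redirect every message to the socket stored at index 0,
+             * on that socket's ingress path. */
+            return bpf_msg_redirect_map(msg, &sock_map, 0, BPF_F_INGRESS);
+    }
+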
+bpf_sk_redirect_map()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
+can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
+to select the ingress path otherwise the egress path is selected. This is the
+only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
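+For example, a stream verdict program (a sketch; ``sock_map`` is an assumed
+``BPF_MAP_TYPE_SOCKMAP``):
+
+.. code-block:: c
+
+    SEC("sk_skb/stream_verdict")
+    int prog_stream_verdict(struct __sk_buff *skb)
+    {
+            /* Deliver every skb to the socket stored at index 0, on
+             * that socket's egress path (no BPF_F_INGRESS flag). */
+            return bpf_sk_redirect_map(skb, &sock_map, 0, 0);
+    }
+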
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Socket entries of type ``struct sock *`` can be retrieved using the
+``bpf_map_lookup_elem()`` helper.
+
+bpf_sock_map_update()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+
+Add an entry to, or update a ``map`` referencing sockets. The ``skops`` is used
+as a new value for the entry associated to ``key``. The ``flags`` argument can
+be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+If the ``map`` has BPF programs (parser and verdict), those will be inherited
+by the socket being added. If the socket is already attached to BPF programs,
+this results in an error.
+
+Returns 0 on success, or a negative error in case of failure.
+
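+For example, a sockops program can insert established sockets into a map (a
+sketch; ``sock_map`` is an assumed ``BPF_MAP_TYPE_SOCKMAP``):
+
+.. code-block:: c
+
+    SEC("sockops")
+    int prog_sockops(struct bpf_sock_ops *skops)
+    {
+            __u32 key = 0;
+
+            /* Track newly established passive connections at index 0. */
+            if (skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
+                    bpf_sock_map_update(skops, &sock_map, &key, BPF_NOEXIST);
+
+            return 0;
+    }
+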
+bpf_sock_hash_update()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+
+Add an entry to, or update a sockhash ``map`` referencing sockets. The ``skops``
+is used as a new value for the entry associated with ``key``.
+
+The ``flags`` argument can be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+If the ``map`` has BPF programs (parser and verdict), those will be inherited
+by the socket being added. If the socket is already attached to BPF programs,
+this results in an error.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_msg_redirect_hash()
+^^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+
+This helper is used in programs implementing policies at the socket level. If
+the message ``msg`` is allowed to pass (i.e., if the verdict BPF program returns
+``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
+interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
+``flags`` is used to select the ingress path otherwise the egress path is
+selected. This is the only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_sk_redirect_hash()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+
+This helper is used in programs implementing policies at the skb socket level.
+If the sk_buff ``skb`` is allowed to pass (i.e., if the verdict BPF program
+returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
+interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
+``flags`` is used to select the ingress path otherwise the egress path is
+selected. This is the only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_msg_apply_bytes()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
+
+For socket policies, apply the verdict of the BPF program to the next ``bytes``
+(number of bytes) of message ``msg``. For example, this helper can be used in
+the following cases:
+
+- A single ``sendmsg()`` or ``sendfile()`` system call contains multiple
+ logical messages that the BPF program is supposed to read and for which it
+ should apply a verdict.
+- A BPF program only cares to read the first ``bytes`` of a ``msg``. If the
+ message has a large payload, then setting up and calling the BPF program
+ repeatedly for all bytes, even though the verdict is already known, would
+ create unnecessary overhead.
+
+Returns 0.
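+
+As a sketch of the second case above, a verdict program can decide based on the
+start of the payload and then let the verdict cover the whole message, so the
+program is not re-run for the remaining bytes (the ``size`` field of ``struct
+sk_msg_md`` holds the total message size):
+
+.. code-block:: c
+
+    SEC("sk_msg")
+    int bpf_prog_msg_verdict(struct sk_msg_md *msg)
+    {
+        /* Apply the verdict to all bytes of the message at once. */
+        bpf_msg_apply_bytes(msg, msg->size);
+        return SK_PASS;
+    }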
+
+bpf_msg_cork_bytes()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
+
+For socket policies, prevent the execution of the verdict BPF program for
+message ``msg`` until the number of ``bytes`` have been accumulated.
+
+This can be used when one needs a specific number of bytes before a verdict can
+be assigned, even if the data spans multiple ``sendmsg()`` or ``sendfile()``
+calls.
+
+Returns 0.
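+
+A minimal sketch, assuming a hypothetical protocol whose 8-byte header must be
+complete before a verdict can be made:
+
+.. code-block:: c
+
+    SEC("sk_msg")
+    int bpf_prog_msg_verdict(struct sk_msg_md *msg)
+    {
+        /* Hold the data until at least 8 bytes have accumulated; the
+         * program is then run again with the full header available.
+         */
+        if (msg->data + 8 > msg->data_end)
+            bpf_msg_cork_bytes(msg, 8);
+
+        return SK_PASS;
+    }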
+
+bpf_msg_pull_data()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
+
+For socket policies, pull in non-linear data from user space for ``msg`` and set
+pointers ``msg->data`` and ``msg->data_end`` to ``start`` and ``end`` bytes
+offsets into ``msg``, respectively.
+
+If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only
+parse data that the (``data``, ``data_end``) pointers have already consumed.
+For ``sendmsg()`` hooks this is likely the first scatterlist element. But for
+calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be
+the range (**0**, **0**) because the data is shared with user space and by
+default the objective is to avoid allowing user space to modify data while (or
+after) the BPF verdict is being decided. This helper can be used to pull in
+data and to set the start and end pointers to given values. Data will be copied
+if necessary (i.e., if data was not linear and if start and end pointers do not
+point to the same chunk).
+
+A call to this helper may change the underlying packet buffer. Therefore, at
+load time, all checks on pointers previously done by the verifier are
+invalidated and must be performed again if the helper is used in combination
+with direct packet access.
+
+All values for ``flags`` are reserved for future usage, and must be left at
+zero.
+
+Returns 0 on success, or a negative error in case of failure.
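+
+A minimal sketch: linearize the first 8 bytes of the message so they are
+directly accessible through ``data``/``data_end``, then inspect a hypothetical
+one-byte opcode.
+
+.. code-block:: c
+
+    SEC("sk_msg")
+    int bpf_prog_msg_verdict(struct sk_msg_md *msg)
+    {
+        void *data, *data_end;
+
+        /* Pull in bytes [0, 8) and update data/data_end. */
+        if (bpf_msg_pull_data(msg, 0, 8, 0))
+            return SK_DROP;
+
+        data = msg->data;
+        data_end = msg->data_end;
+
+        /* The buffer may have changed: bounds must be re-checked. */
+        if (data + 8 > data_end)
+            return SK_DROP;
+
+        return ((__u8 *)data)[0] == 0 ? SK_DROP : SK_PASS;
+    }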
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Look up a socket entry of type ``struct sock *`` in a sockmap or sockhash map.
+
+Returns the socket entry associated with ``key``, or NULL if no entry was found.
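+
+A sketch of a lookup from a verdict program follows; the map name and key are
+illustrative, and note that the socket reference acquired by the lookup must be
+released with ``bpf_sk_release()``.
+
+.. code-block:: c
+
+    SEC("sk_skb/stream_verdict")
+    int bpf_prog_verdict(struct __sk_buff *skb)
+    {
+        __u32 key = 0;
+        struct bpf_sock *sk;
+
+        sk = bpf_map_lookup_elem(&sock_map_rx, &key);
+        if (!sk)
+            return SK_DROP;
+
+        /* Inspect the socket, then drop the acquired reference. */
+        bpf_sk_release(sk);
+        return SK_PASS;
+    }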
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Add or update a socket entry in a sockmap or sockhash.
+
+The ``flags`` argument can be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Delete a socket entry from a sockmap or a sockhash.
+
+Returns 0 on success, or a negative error in case of failure.
+
+User space
+----------
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
+
+Sockmap entries can be added or updated using the ``bpf_map_update_elem()``
+function. The ``key`` parameter is the index in the sockmap array and the
+``value`` parameter is the file descriptor (FD) of the socket.
+
+Under the hood, the sockmap update function uses the socket FD value to
+retrieve the associated socket and its attached psock.
+
+The ``flags`` argument can be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value)
+
+Sockmap entries can be retrieved using the ``bpf_map_lookup_elem()`` function.
+
+.. note::
+ The entry returned is a socket cookie rather than a socket itself.
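+
+A short sketch, assuming ``map_fd`` is the FD of a sockmap with an entry at
+index 0:
+
+.. code-block:: c
+
+    __u32 key = 0;
+    __u64 cookie;
+
+    /* The value read back is the socket cookie, not a socket FD. */
+    if (!bpf_map_lookup_elem(map_fd, &key, &cookie))
+        printf("socket cookie at index 0: %llu\n", cookie);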
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_delete_elem(int fd, const void *key)
+
+Sockmap entries can be deleted using the ``bpf_map_delete_elem()``
+function.
+
+Returns 0 on success, or a negative error in case of failure.
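+
+A short sketch, assuming ``map_fd`` is the FD of a sockmap with an entry at
+index 0:
+
+.. code-block:: c
+
+    __u32 key = 0;
+
+    if (bpf_map_delete_elem(map_fd, &key))
+        fprintf(stderr, "Failed to delete sockmap entry: %s\n", strerror(errno));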
+
+Examples
+========
+
+Kernel BPF
+----------
+Several examples of the use of sockmap APIs can be found in:
+
+- `tools/testing/selftests/bpf/progs/test_sockmap_kern.h`_
+- `tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`_
+- `tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`_
+- `tools/testing/selftests/bpf/progs/test_sockmap_listen.c`_
+- `tools/testing/selftests/bpf/progs/test_sockmap_update.c`_
+
+The following code snippet shows how to declare a sockmap.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_SOCKMAP);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, __u64);
+ } sock_map_rx SEC(".maps");
+
+The following code snippet shows a sample parser program.
+
+.. code-block:: c
+
+ SEC("sk_skb/stream_parser")
+ int bpf_prog_parser(struct __sk_buff *skb)
+ {
+ return skb->len;
+ }
+
+The following code snippet shows a simple verdict program that interacts with a
+sockmap to redirect traffic to another socket based on the local port.
+
+.. code-block:: c
+
+ SEC("sk_skb/stream_verdict")
+ int bpf_prog_verdict(struct __sk_buff *skb)
+ {
+ __u32 lport = skb->local_port;
+ __u32 idx = 0;
+
+ if (lport == 10000)
+ return bpf_sk_redirect_map(skb, &sock_map_rx, idx, 0);
+
+ return SK_PASS;
+ }
+
+The following code snippet shows how to declare a sockhash map.
+
+.. code-block:: c
+
+ struct socket_key {
+ __u32 src_ip;
+ __u32 dst_ip;
+ __u32 src_port;
+ __u32 dst_port;
+ };
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_SOCKHASH);
+ __uint(max_entries, 1);
+ __type(key, struct socket_key);
+ __type(value, __u64);
+ } sock_hash_rx SEC(".maps");
+
+The following code snippet shows a simple verdict program that interacts with a
+sockhash to redirect traffic to another socket based on a hash of some of the
+skb parameters.
+
+.. code-block:: c
+
+ static inline
+ void extract_socket_key(struct __sk_buff *skb, struct socket_key *key)
+ {
+ key->src_ip = skb->remote_ip4;
+ key->dst_ip = skb->local_ip4;
+ key->src_port = skb->remote_port >> 16;
+ key->dst_port = (bpf_htonl(skb->local_port)) >> 16;
+ }
+
+ SEC("sk_skb/stream_verdict")
+ int bpf_prog_verdict(struct __sk_buff *skb)
+ {
+ struct socket_key key;
+
+ extract_socket_key(skb, &key);
+
+ return bpf_sk_redirect_hash(skb, &sock_hash_rx, &key, 0);
+ }
+
+User space
+----------
+Several examples of the use of sockmap APIs can be found in:
+
+- `tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`_
+- `tools/testing/selftests/bpf/test_sockmap.c`_
+- `tools/testing/selftests/bpf/test_maps.c`_
+
+The following code sample shows how to create a sockmap, attach a parser and
+verdict program, as well as add a socket entry.
+
+.. code-block:: c
+
+ int create_sample_sockmap(int sock, int parse_prog_fd, int verdict_prog_fd)
+ {
+ int index = 0;
+ int map, err;
+
+ map = bpf_map_create(BPF_MAP_TYPE_SOCKMAP, NULL, sizeof(int), sizeof(int), 1, NULL);
+ if (map < 0) {
+ fprintf(stderr, "Failed to create sockmap: %s\n", strerror(errno));
+ return -1;
+ }
+
+ err = bpf_prog_attach(parse_prog_fd, map, BPF_SK_SKB_STREAM_PARSER, 0);
+	if (err) {
+ fprintf(stderr, "Failed to attach_parser_prog_to_map: %s\n", strerror(errno));
+ goto out;
+ }
+
+ err = bpf_prog_attach(verdict_prog_fd, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (err) {
+ fprintf(stderr, "Failed to attach_verdict_prog_to_map: %s\n", strerror(errno));
+ goto out;
+ }
+
+ err = bpf_map_update_elem(map, &index, &sock, BPF_NOEXIST);
+ if (err) {
+ fprintf(stderr, "Failed to update sockmap: %s\n", strerror(errno));
+ goto out;
+ }
+
+ out:
+ close(map);
+ return err;
+ }
+
+References
+===========
+
+- https://github.com/jrfastab/linux-kernel-xdp/commit/c89fd73cb9d2d7f3c716c3e00836f07b1aeb261f
+- https://lwn.net/Articles/731133/
+- http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf
+- https://lwn.net/Articles/748628/
+- https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com/
+
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_kern.h`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
+.. _`tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
+.. _`tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c
+.. _`tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+.. _`tools/testing/selftests/bpf/test_sockmap.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_sockmap.c
+.. _`tools/testing/selftests/bpf/test_maps.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_maps.c
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_listen.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_update.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_update.c
diff --git a/Documentation/bpf/map_xskmap.rst b/Documentation/bpf/map_xskmap.rst
index 7093b8208451..dc143edd9233 100644
--- a/Documentation/bpf/map_xskmap.rst
+++ b/Documentation/bpf/map_xskmap.rst
@@ -178,7 +178,7 @@ The following code snippet shows how to update an XSKMAP with an XSK entry.
For an example on how create AF_XDP sockets, please see the AF_XDP-example and
AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
-For a detailed explaination of the AF_XDP interface please see:
+For a detailed explanation of the AF_XDP interface please see:
- `libxdp-readme`_.
- `AF_XDP`_ kernel documentation.
diff --git a/Documentation/bpf/other.rst b/Documentation/bpf/other.rst
index 3d61963403b4..7e6b12018802 100644
--- a/Documentation/bpf/other.rst
+++ b/Documentation/bpf/other.rst
@@ -6,4 +6,5 @@ Other
:maxdepth: 1
ringbuf
- llvm_reloc \ No newline at end of file
+ llvm_reloc
+ graph_ds_impl
diff --git a/Documentation/bpf/ringbuf.rst b/Documentation/bpf/ringbuf.rst
index 6a615cd62bda..a99cd05d79d4 100644
--- a/Documentation/bpf/ringbuf.rst
+++ b/Documentation/bpf/ringbuf.rst
@@ -124,7 +124,7 @@ buffer. Currently 4 are supported:
- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer;
- ``BPF_RB_RING_SIZE`` returns the size of ring buffer;
-- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical possition
+- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position
of consumer/producer, respectively.
Returned values are momentarily snapshots of ring buffer state and could be
@@ -146,7 +146,7 @@ Design and Implementation
This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
-means that if BPF program was interruped by another BPF program sharing the
+means that if BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during
diff --git a/Documentation/bpf/verifier.rst b/Documentation/bpf/verifier.rst
index d4326caf01f9..f0ec19db301c 100644
--- a/Documentation/bpf/verifier.rst
+++ b/Documentation/bpf/verifier.rst
@@ -192,7 +192,7 @@ checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses. For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
-over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
+over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.
@@ -316,6 +316,301 @@ Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().
+Some technical details of the state pruning implementation are described below.
+
+Register liveness tracking
+--------------------------
+
+In order to make state pruning effective, liveness state is tracked for each
+register and stack slot. The basic idea is to track which registers and stack
+slots are actually used during subsequent execution of the program, until
+program exit is reached. Registers and stack slots that were never used can be
+removed from the cached state, thus making more states equivalent to a cached
+state. This can be illustrated by the following program::
+
+ 0: call bpf_get_prandom_u32()
+ 1: r1 = 0
+ 2: if r0 == 0 goto +1
+ 3: r0 = 1
+ --- checkpoint ---
+ 4: r0 = r1
+ 5: exit
+
+Suppose that a state cache entry is created at instruction #4 (such entries are
+also called "checkpoints" in the text below). The verifier could reach the
+instruction with one of two possible register states:
+
+* r0 = 1, r1 = 0
+* r0 = 0, r1 = 0
+
+However, only the value of register ``r1`` is important to successfully finish
+verification. The goal of the liveness tracking algorithm is to spot this fact
+and figure out that both states are actually equivalent.
+
+Data structures
+~~~~~~~~~~~~~~~
+
+Liveness is tracked using the following data structures::
+
+ enum bpf_reg_liveness {
+ REG_LIVE_NONE = 0,
+ REG_LIVE_READ32 = 0x1,
+ REG_LIVE_READ64 = 0x2,
+ REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
+ REG_LIVE_WRITTEN = 0x4,
+ REG_LIVE_DONE = 0x8,
+ };
+
+ struct bpf_reg_state {
+ ...
+ struct bpf_reg_state *parent;
+ ...
+ enum bpf_reg_liveness live;
+ ...
+ };
+
+ struct bpf_stack_state {
+ struct bpf_reg_state spilled_ptr;
+ ...
+ };
+
+ struct bpf_func_state {
+ struct bpf_reg_state regs[MAX_BPF_REG];
+ ...
+ struct bpf_stack_state *stack;
+  };
+
+ struct bpf_verifier_state {
+ struct bpf_func_state *frame[MAX_CALL_FRAMES];
+ struct bpf_verifier_state *parent;
+ ...
+  };
+
+* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
+ verifier state creation;
+
+* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
+  defined by some instruction verified between this verifier state's parent and
+  the verifier state itself;
+
+* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
+  is read by some child state of this verifier state;
+
+* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
+  processing the same verifier state multiple times and for some sanity checks;
+
+* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
+  values using a bitwise OR.
+
+Register parentage chains
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to propagate information between parent and child states, a *register
+parentage chain* is established. Each register or stack slot is linked to a
+corresponding register or stack slot in its parent state via a ``->parent``
+pointer. This link is established upon state creation in ``is_state_visited()``
+and might be modified by ``set_callee_state()`` called from
+``__check_func_call()``.
+
+The rules for correspondence between registers / stack slots are as follows:
+
+* For the current stack frame, registers and stack slots of the new state are
+ linked to the registers and stack slots of the parent state with the same
+ indices.
+
+* For the outer stack frames, only callee-saved registers (r6-r9) and stack
+  slots are linked to the registers and stack slots of the parent state with the
+  same indices.
+
+* When a function call is processed, a new ``struct bpf_func_state`` instance
+  is allocated; it encapsulates a new set of registers and stack slots. For
+  this new frame, parent links for r6-r9 and stack slots are set to nil, while
+  parent links for r1-r5 are set to match the caller's r1-r5 parent links.
+
+This can be illustrated by the following diagram (arrows stand for
+``->parent`` pointers)::
+
+ ... ; Frame #0, some instructions
+ --- checkpoint #0 ---
+ 1 : r6 = 42 ; Frame #0
+ --- checkpoint #1 ---
+ 2 : call foo() ; Frame #0
+ ... ; Frame #1, instructions from foo()
+ --- checkpoint #2 ---
+ ... ; Frame #1, instructions from foo()
+ --- checkpoint #3 ---
+ exit ; Frame #1, return from foo()
+ 3 : r1 = r6 ; Frame #0 <- current state
+
+ +-------------------------------+-------------------------------+
+ | Frame #0 | Frame #1 |
+ Checkpoint +-------------------------------+-------------------------------+
+ #0 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+
+ ^ ^ ^ ^
+ | | | |
+ Checkpoint +-------------------------------+
+ #1 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+
+ ^ ^ ^
+ |_______|_______|_______________
+ | | |
+ nil nil | | | nil nil
+ | | | | | | |
+ Checkpoint +-------------------------------+-------------------------------+
+ #2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+-------------------------------+
+ ^ ^ ^ ^ ^
+ nil nil | | | | |
+ | | | | | | |
+ Checkpoint +-------------------------------+-------------------------------+
+ #3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+-------------------------------+
+ ^ ^
+ nil nil | |
+ | | | |
+ Current +-------------------------------+
+ state | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+
+ \
+ r6 read mark is propagated via these links
+ all the way up to checkpoint #1.
+ The checkpoint #1 contains a write mark for r6
+ because of instruction (1), thus read propagation
+ does not reach checkpoint #0 (see section below).
+
+Liveness marks tracking
+~~~~~~~~~~~~~~~~~~~~~~~
+
+For each processed instruction, the verifier tracks read and written registers
+and stack slots. The main idea of the algorithm is that read marks propagate
+back along the state parentage chain until they hit a write mark, which 'screens
+off' earlier states from the read. The information about reads is propagated by
+the function ``mark_reg_read()``, which can be summarized as follows::
+
+ mark_reg_read(struct bpf_reg_state *state, ...):
+ parent = state->parent
+ while parent:
+ if state->live & REG_LIVE_WRITTEN:
+ break
+ if parent->live & REG_LIVE_READ64:
+ break
+ parent->live |= REG_LIVE_READ64
+ state = parent
+ parent = state->parent
+
+Notes:
+
+* The read marks are applied to the **parent** state while write marks are
+ applied to the **current** state. The write mark on a register or stack slot
+ means that it is updated by some instruction in the straight-line code leading
+ from the parent state to the current state.
+
+* Details about REG_LIVE_READ32 are omitted.
+
+* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
+ might override the first parent link. Please refer to the comments in the
+ ``propagate_liveness()`` and ``mark_reg_read()`` source code for further
+ details.
+
+Because stack writes could have different sizes, ``REG_LIVE_WRITTEN`` marks are
+applied conservatively: stack slots are marked as written only if the write size
+corresponds to the size of the register, e.g. see function ``save_register_state()``.
+
+Consider the following example::
+
+  0: *(u64 *)(r10 - 8) = 0   ; define 8 bytes of fp-8
+  --- checkpoint #0 ---
+  1: *(u32 *)(r10 - 8) = 1   ; redefine lower 4 bytes
+  2: r1 = *(u32 *)(r10 - 8)  ; read lower 4 bytes defined at (1)
+  3: r2 = *(u32 *)(r10 - 4)  ; read upper 4 bytes defined at (0)
+
+As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Were
+it otherwise, the algorithm above would not be able to propagate the read mark
+from (3) to checkpoint #0.
+
+Once the ``BPF_EXIT`` instruction is reached, ``update_branch_counts()`` is
+called to update the ``->branches`` counter for each verifier state in a chain
+of parent verifier states. When the ``->branches`` counter reaches zero the
+verifier state becomes a valid entry in a set of cached verifier states.
+
+Each entry of the verifier states cache is post-processed by a function
+``clean_live_states()``. This function marks all registers and stack slots
+without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
+Registers/stack slots marked in this way are ignored in function ``stacksafe()``
+called from ``states_equal()`` when a state cache entry is considered for
+equivalence with a current state.
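+
+This post-processing step could be summarized by the following pseudocode
+(simplified; refer to ``clean_verifier_state()`` in the kernel sources for the
+actual details)::
+
+  clean_verifier_state(struct bpf_verifier_state *st):
+    for each frame in st:
+      for each register in frame:
+        if not (register->live & REG_LIVE_READ):
+          mark register as NOT_INIT
+      for each stack slot in frame:
+        if not (slot->live & REG_LIVE_READ):
+          mark slot as STACK_INVALID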
+
+Now it is possible to explain how the example from the beginning of the section
+works::
+
+ 0: call bpf_get_prandom_u32()
+ 1: r1 = 0
+ 2: if r0 == 0 goto +1
+ 3: r0 = 1
+ --- checkpoint[0] ---
+ 4: r0 = r1
+ 5: exit
+
+* At instruction #2 a branching point is reached and the state ``{ r0 == 0, r1 == 0, pc == 4 }``
+  is pushed to the states processing queue (pc stands for program counter).
+
+* At instruction #4:
+
+ * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
+ * ``checkpoint[0].r0`` is marked as written;
+ * ``checkpoint[0].r1`` is marked as read;
+
+* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
+  by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
+  read mark and all other registers and stack slots are marked as ``NOT_INIT``
+  or ``STACK_INVALID``.
+
+* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
+  and is compared against the cached state ``{ r1 == 0, pc == 4 }``; the states
+  are considered equivalent.
+
+.. _read_marks_for_cache_hits:
+
+Read marks propagation for cache hits
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Another point is the handling of read marks when a previously verified state is
+found in the states cache. Upon a cache hit, the verifier must behave in the
+same way as if the current state had been verified to the program exit. This
+means that all read marks present on the registers and stack slots of the
+cached state must be propagated over the parentage chain of the current state.
+The example below shows why this is important. Function
+``propagate_liveness()`` handles this case.
+
+Consider the following state parentage chain (S is a starting state, A-E are
+derived states, -> arrows show which state is derived from which)::
+
+ r1 read
+ <------------- A[r1] == 0
+ C[r1] == 0
+ S ---> A ---> B ---> exit E[r1] == 1
+ |
+ ` ---> C ---> D
+ |
+ ` ---> E ^
+ |___ suppose all these
+ ^ states are at insn #Y
+ |
+ suppose all these
+ states are at insn #X
+
+* The chain of states ``S -> A -> B -> exit`` is verified first.
+
+* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
+ propagated up to state ``A``.
+
+* When the chain of states ``C -> D`` is verified, state ``D`` turns out to be
+  equivalent to state ``B``.
+
+* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
+ ``C`` might get mistakenly marked as equivalent to state ``E`` even though
+ values for register ``r1`` differ between ``C`` and ``E``.
+
Understanding eBPF verifier messages
====================================