Diffstat (limited to 'Documentation/core-api')
39 files changed, 2652 insertions, 466 deletions
diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst index 792bbf9939e1..19d89f92bf8d 100644 --- a/Documentation/core-api/assoc_array.rst +++ b/Documentation/core-api/assoc_array.rst @@ -92,18 +92,18 @@ There are two functions for dealing with the script: void assoc_array_apply_edit(struct assoc_array_edit *edit); -This will perform the edit functions, interpolating various write barriers -to permit accesses under the RCU read lock to continue. The edit script -will then be passed to ``call_rcu()`` to free it and any dead stuff it points -to. + This will perform the edit functions, interpolating various write barriers + to permit accesses under the RCU read lock to continue. The edit script + will then be passed to ``call_rcu()`` to free it and any dead stuff it + points to. 2. Cancel an edit script:: void assoc_array_cancel_edit(struct assoc_array_edit *edit); -This frees the edit script and all preallocated memory immediately. If -this was for insertion, the new object is _not_ released by this function, -but must rather be released by the caller. + This frees the edit script and all preallocated memory immediately. If + this was for insertion, the new object is *not* released by this function, + but must rather be released by the caller. These functions are guaranteed not to fail. @@ -123,43 +123,43 @@ This points to a number of methods, all of which need to be provided: unsigned long (*get_key_chunk)(const void *index_key, int level); -This should return a chunk of caller-supplied index key starting at the -*bit* position given by the level argument. The level argument will be a -multiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return -``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. + This should return a chunk of caller-supplied index key starting at the + *bit* position given by the level argument. The level argument will be a + multiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return + ``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. 2. Get a chunk of an object's index key:: unsigned long (*get_object_key_chunk)(const void *object, int level); -As the previous function, but gets its data from an object in the array -rather than from a caller-supplied index key. + As the previous function, but gets its data from an object in the array + rather than from a caller-supplied index key. 3. See if this is the object we're looking for:: bool (*compare_object)(const void *object, const void *index_key); -Compare the object against an index key and return ``true`` if it matches and -``false`` if it doesn't. + Compare the object against an index key and return ``true`` if it matches + and ``false`` if it doesn't. 4. Diff the index keys of two objects:: int (*diff_objects)(const void *object, const void *index_key); -Return the bit position at which the index key of the specified object -differs from the given index key or -1 if they are the same. + Return the bit position at which the index key of the specified object + differs from the given index key or -1 if they are the same. 5. Free an object:: void (*free_object)(void *object); -Free the specified object. Note that this may be called an RCU grace period -after ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be -necessary on module unloading. + Free the specified object. 
Note that this may be called an RCU grace period + after ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may + be necessary on module unloading. Manipulation Functions @@ -171,7 +171,7 @@ There are a number of functions for manipulating an associative array: void assoc_array_init(struct assoc_array *array); -This initialises the base structure for an associative array. It can't fail. + This initialises the base structure for an associative array. It can't fail. 2. Insert/replace an object in an associative array:: @@ -182,21 +182,21 @@ This initialises the base structure for an associative array. It can't fail. const void *index_key, void *object); -This inserts the given object into the array. Note that the least -significant bit of the pointer must be zero as it's used to type-mark -pointers internally. + This inserts the given object into the array. Note that the least + significant bit of the pointer must be zero as it's used to type-mark + pointers internally. -If an object already exists for that key then it will be replaced with the -new object and the old one will be freed automatically. + If an object already exists for that key then it will be replaced with the + new object and the old one will be freed automatically. -The ``index_key`` argument should hold index key information and is -passed to the methods in the ops table when they are called. + The ``index_key`` argument should hold index key information and is + passed to the methods in the ops table when they are called. -This function makes no alteration to the array itself, but rather returns -an edit script that must be applied. ``-ENOMEM`` is returned in the case of -an out-of-memory error. + This function makes no alteration to the array itself, but rather returns + an edit script that must be applied. ``-ENOMEM`` is returned in the case of + an out-of-memory error. -The caller should lock exclusively against other modifiers of the array. + The caller should lock exclusively against other modifiers of the array. 3. Delete an object from an associative array:: @@ -206,15 +206,15 @@ The caller should lock exclusively against other modifiers of the array. const struct assoc_array_ops *ops, const void *index_key); -This deletes an object that matches the specified data from the array. + This deletes an object that matches the specified data from the array. -The ``index_key`` argument should hold index key information and is -passed to the methods in the ops table when they are called. + The ``index_key`` argument should hold index key information and is + passed to the methods in the ops table when they are called. -This function makes no alteration to the array itself, but rather returns -an edit script that must be applied. ``-ENOMEM`` is returned in the case of -an out-of-memory error. ``NULL`` will be returned if the specified object is -not found within the array. + This function makes no alteration to the array itself, but rather returns + an edit script that must be applied. ``-ENOMEM`` is returned in the case of + an out-of-memory error. ``NULL`` will be returned if the specified object + is not found within the array. The caller should lock exclusively against other modifiers of the array. @@ -225,14 +225,14 @@ The caller should lock exclusively against other modifiers of the array. assoc_array_clear(struct assoc_array *array, const struct assoc_array_ops *ops); -This deletes all the objects from an associative array and leaves it -completely empty. 
+ This deletes all the objects from an associative array and leaves it + completely empty. -This function makes no alteration to the array itself, but rather returns -an edit script that must be applied. ``-ENOMEM`` is returned in the case of -an out-of-memory error. + This function makes no alteration to the array itself, but rather returns + an edit script that must be applied. ``-ENOMEM`` is returned in the case of + an out-of-memory error. -The caller should lock exclusively against other modifiers of the array. + The caller should lock exclusively against other modifiers of the array. 5. Destroy an associative array, deleting all objects:: @@ -240,14 +240,14 @@ The caller should lock exclusively against other modifiers of the array. void assoc_array_destroy(struct assoc_array *array, const struct assoc_array_ops *ops); -This destroys the contents of the associative array and leaves it -completely empty. It is not permitted for another thread to be traversing -the array under the RCU read lock at the same time as this function is -destroying it as no RCU deferral is performed on memory release - -something that would require memory to be allocated. + This destroys the contents of the associative array and leaves it + completely empty. It is not permitted for another thread to be traversing + the array under the RCU read lock at the same time as this function is + destroying it as no RCU deferral is performed on memory release - + something that would require memory to be allocated. -The caller should lock exclusively against other modifiers and accessors -of the array. + The caller should lock exclusively against other modifiers and accessors + of the array. 6. Garbage collect an associative array:: @@ -257,24 +257,24 @@ of the array. bool (*iterator)(void *object, void *iterator_data), void *iterator_data); -This iterates over the objects in an associative array and passes each one to -``iterator()``. If ``iterator()`` returns ``true``, the object is kept. If it -returns ``false``, the object will be freed. If the ``iterator()`` function -returns ``true``, it must perform any appropriate refcount incrementing on the -object before returning. + This iterates over the objects in an associative array and passes each one + to ``iterator()``. If ``iterator()`` returns ``true``, the object is kept. + If it returns ``false``, the object will be freed. If the ``iterator()`` + function returns ``true``, it must perform any appropriate refcount + incrementing on the object before returning. -The internal tree will be packed down if possible as part of the iteration -to reduce the number of nodes in it. + The internal tree will be packed down if possible as part of the iteration + to reduce the number of nodes in it. -The ``iterator_data`` is passed directly to ``iterator()`` and is otherwise -ignored by the function. + The ``iterator_data`` is passed directly to ``iterator()`` and is otherwise + ignored by the function. -The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't -enough memory. + The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't + enough memory. -It is possible for other threads to iterate over or search the array under -the RCU read lock while this function is in progress. The caller should -lock exclusively against other modifiers of the array. + It is possible for other threads to iterate over or search the array under + the RCU read lock while this function is in progress. 
The caller should + lock exclusively against other modifiers of the array. Access Functions @@ -289,19 +289,19 @@ There are two functions for accessing an associative array: void *iterator_data), void *iterator_data); -This passes each object in the array to the iterator callback function. -``iterator_data`` is private data for that function. + This passes each object in the array to the iterator callback function. + ``iterator_data`` is private data for that function. -This may be used on an array at the same time as the array is being -modified, provided the RCU read lock is held. Under such circumstances, -it is possible for the iteration function to see some objects twice. If -this is a problem, then modification should be locked against. The -iteration algorithm should not, however, miss any objects. + This may be used on an array at the same time as the array is being + modified, provided the RCU read lock is held. Under such circumstances, + it is possible for the iteration function to see some objects twice. If + this is a problem, then modification should be locked against. The + iteration algorithm should not, however, miss any objects. -The function will return ``0`` if no objects were in the array or else it will -return the result of the last iterator function called. Iteration stops -immediately if any call to the iteration function results in a non-zero -return. + The function will return ``0`` if no objects were in the array or else it + will return the result of the last iterator function called. Iteration + stops immediately if any call to the iteration function results in a + non-zero return. 2. Find an object in an associative array:: @@ -310,14 +310,14 @@ return. const struct assoc_array_ops *ops, const void *index_key); -This walks through the array's internal tree directly to the object -specified by the index key.. + This walks through the array's internal tree directly to the object + specified by the index key. -This may be used on an array at the same time as the array is being -modified, provided the RCU read lock is held. + This may be used on an array at the same time as the array is being + modified, provided the RCU read lock is held. -The function will return the object if found (and set ``*_type`` to the object -type) or will return ``NULL`` if the object was not found. + The function will return the object if found (and set ``*_type`` to the + object type) or will return ``NULL`` if the object was not found. Index Key Form @@ -399,10 +399,11 @@ fixed levels. For example:: In the above example, there are 7 nodes (A-G), each with 16 slots (0-f). 
Assuming no other meta data nodes in the tree, the key space is divided -thusly:: +thusly: + =========== ==== KEY PREFIX NODE - ========== ==== + =========== ==== 137* D 138* E 13[0-69-f]* C @@ -410,10 +411,12 @@ thusly:: e6* G e[0-57-f]* F [02-df]* A + =========== ==== So, for instance, keys with the following example index keys will be found in -the appropriate nodes:: +the appropriate nodes: + =============== ======= ==== INDEX KEY PREFIX NODE =============== ======= ==== 13694892892489 13 C @@ -422,12 +425,13 @@ the appropriate nodes:: 138bbb89003093 138 E 1394879524789 12 C 1458952489 1 B - 9431809de993ba - A - b4542910809cd - A + 9431809de993ba \- A + b4542910809cd \- A e5284310def98 e F e68428974237 e6 G e7fffcbd443 e F - f3842239082 - A + f3842239082 \- A + =============== ======= ==== To save memory, if a node can hold all the leaves in its portion of keyspace, then the node will have all those leaves in it and will not have any metadata @@ -441,8 +445,9 @@ metadata pointer. If the metadata pointer is there, any leaf whose key matches the metadata key prefix must be in the subtree that the metadata pointer points to. -In the above example list of index keys, node A will contain:: +In the above example list of index keys, node A will contain: + ==== =============== ================== SLOT CONTENT INDEX KEY (PREFIX) ==== =============== ================== 1 PTR TO NODE B 1* @@ -450,11 +455,16 @@ In the above example list of index keys, node A will contain:: any LEAF b4542910809cd e PTR TO NODE F e* any LEAF f3842239082 + ==== =============== ================== -and node B:: +and node B: - 3 PTR TO NODE C 13* - any LEAF 1458952489 + ==== =============== ================== + SLOT CONTENT INDEX KEY (PREFIX) + ==== =============== ================== + 3 PTR TO NODE C 13* + any LEAF 1458952489 + ==== =============== ================== Shortcuts diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index e1b0eeabbb5e..9b4afca9fd09 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -8,7 +8,7 @@ CPU hotplug in the Kernel Srivatsa Vaddagiri <vatsa@in.ibm.com>, Ashok Raj <ashok.raj@intel.com>, Joel Schopp <jschopp@austin.ibm.com>, - Thomas Gleixner <tglx@linutronix.de> + Thomas Gleixner <tglx@kernel.org> Introduction ============ diff --git a/Documentation/core-api/dma-api-howto.rst b/Documentation/core-api/dma-api-howto.rst index 0bf31b6c4383..e97743ab0f26 100644 --- a/Documentation/core-api/dma-api-howto.rst +++ b/Documentation/core-api/dma-api-howto.rst @@ -146,6 +146,58 @@ What about block I/O and networking buffers? The block I/O and networking subsystems make sure that the buffers they use are valid for you to DMA from/to. +__dma_from_device_group_begin/end annotations +============================================= + +As explained previously, when a structure contains a DMA_FROM_DEVICE / +DMA_BIDIRECTIONAL buffer (device writes to memory) alongside fields that the +CPU writes to, cache line sharing between the DMA buffer and CPU-written fields +can cause data corruption on CPUs with DMA-incoherent caches. 
+ +The ``__dma_from_device_group_begin(GROUP)/__dma_from_device_group_end(GROUP)`` +macros ensure proper alignment to prevent this:: + + struct my_device { + spinlock_t lock1; + __dma_from_device_group_begin(); + char dma_buffer1[16]; + char dma_buffer2[16]; + __dma_from_device_group_end(); + spinlock_t lock2; + }; + +To isolate a DMA buffer from adjacent fields, use +``__dma_from_device_group_begin(GROUP)`` before the first DMA buffer +field and ``__dma_from_device_group_end(GROUP)`` after the last DMA +buffer field (with the same GROUP name). This protects both the head +and tail of the buffer from cache line sharing. + +The GROUP parameter is an optional identifier that names the DMA buffer group +(in case you have several in the same structure):: + + struct my_device { + spinlock_t lock1; + __dma_from_device_group_begin(buffer1); + char dma_buffer1[16]; + __dma_from_device_group_end(buffer1); + spinlock_t lock2; + __dma_from_device_group_begin(buffer2); + char dma_buffer2[16]; + __dma_from_device_group_end(buffer2); + }; + +On cache-coherent platforms these macros expand to zero-length array markers. +On non-coherent platforms, they also ensure the minimal DMA alignment, which +can be as large as 128 bytes. + +.. note:: + + It is allowed (though somewhat fragile) to include extra fields, not + intended for DMA from the device, within the group (in order to pack the + structure tightly) - but only as long as the CPU does not write these + fields while any fields in the group are mapped for DMA_FROM_DEVICE or + DMA_BIDIRECTIONAL. + DMA addressing capabilities =========================== @@ -155,7 +207,7 @@ a device with limitations, it needs to be decreased. Special note about PCI: PCI-X specification requires PCI-X devices to support 64-bit addressing (DAC) for all transactions. And at least one platform (SGI -SN2) requires 64-bit consistent allocations to operate correctly when the IO +SN2) requires 64-bit coherent allocations to operate correctly when the IO bus is in PCI-X mode. For correct operation, you must set the DMA mask to inform the kernel about @@ -174,7 +226,7 @@ used instead: int dma_set_mask(struct device *dev, u64 mask); - The setup for consistent allocations is performed via a call + The setup for coherent allocations is performed via a call to dma_set_coherent_mask():: int dma_set_coherent_mask(struct device *dev, u64 mask); @@ -241,7 +293,7 @@ it would look like this:: The coherent mask will always be able to set the same or a smaller mask as the streaming mask. However for the rare case that a device driver only -uses consistent allocations, one would have to check the return value from +uses coherent allocations, one would have to check the return value from dma_set_coherent_mask(). Finally, if your device can only drive the low 24-bits of @@ -298,20 +350,20 @@ Types of DMA mappings There are two types of DMA mappings: -- Consistent DMA mappings which are usually mapped at driver +- Coherent DMA mappings which are usually mapped at driver initialization, unmapped at the end and for which the hardware should guarantee that the device and the CPU can access the data in parallel and will see updates made by each other without any explicit software flushing. - Think of "consistent" as "synchronous" or "coherent". + Think of "coherent" as "synchronous". - The current default is to return consistent memory in the low 32 + The current default is to return coherent memory in the low 32 bits of the DMA space. 
However, for future compatibility you should - set the consistent mask even if this default is fine for your + set the coherent mask even if this default is fine for your driver. - Good examples of what to use consistent mappings for are: + Good examples of what to use coherent mappings for are: - Network card DMA ring descriptors. - SCSI adapter mailbox command data structures. @@ -320,13 +372,13 @@ There are two types of DMA mappings: The invariant these examples all require is that any CPU store to memory is immediately visible to the device, and vice - versa. Consistent mappings guarantee this. + versa. Coherent mappings guarantee this. .. important:: - Consistent DMA memory does not preclude the usage of + Coherent DMA memory does not preclude the usage of proper memory barriers. The CPU may reorder stores to - consistent memory just as it may normal memory. Example: + coherent memory just as it may normal memory. Example: if it is important for the device to see the first word of a descriptor updated before the second, you must do something like:: @@ -365,10 +417,10 @@ Also, systems with caches that aren't DMA-coherent will work better when the underlying buffers don't share cache lines with other data. -Using Consistent DMA mappings -============================= +Using Coherent DMA mappings +=========================== -To allocate and map large (PAGE_SIZE or so) consistent DMA regions, +To allocate and map large (PAGE_SIZE or so) coherent DMA regions, you should do:: dma_addr_t dma_handle; @@ -385,10 +437,10 @@ __get_free_pages() (but takes size instead of a page order). If your driver needs regions sized smaller than a page, you may prefer using the dma_pool interface, described below. -The consistent DMA mapping interfaces, will by default return a DMA address +The coherent DMA mapping interfaces, will by default return a DMA address which is 32-bit addressable. Even if the device indicates (via the DMA mask) -that it may address the upper 32-bits, consistent allocation will only -return > 32-bit addresses for DMA if the consistent DMA mask has been +that it may address the upper 32-bits, coherent allocation will only +return > 32-bit addresses for DMA if the coherent DMA mask has been explicitly changed via dma_set_coherent_mask(). This is true of the dma_pool interface as well. @@ -497,7 +549,7 @@ program address space. Such platforms can and do report errors in the kernel logs when the DMA controller hardware detects violation of the permission setting. -Only streaming mappings specify a direction, consistent mappings +Only streaming mappings specify a direction, coherent mappings implicitly have a direction attribute setting of DMA_BIDIRECTIONAL. diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst index 8e3cce3d0a23..ca75b3541679 100644 --- a/Documentation/core-api/dma-api.rst +++ b/Documentation/core-api/dma-api.rst @@ -8,15 +8,15 @@ This document describes the DMA API. For a more gentle introduction of the API (and actual examples), see Documentation/core-api/dma-api-howto.rst. This API is split into two pieces. Part I describes the basic API. -Part II describes extensions for supporting non-consistent memory +Part II describes extensions for supporting non-coherent memory machines. Unless you know that your driver absolutely has to support -non-consistent platforms (this is usually only legacy platforms) you +non-coherent platforms (this is usually only legacy platforms) you should only use the API described in part I. 
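To make the preceding discussion concrete, here is a minimal sketch of the probe-time pattern described above: set the streaming and coherent masks, then allocate a coherent descriptor ring. The ``struct my_dev`` container, ``my_dev_setup_dma()`` and ``MY_RING_BYTES`` are illustrative names, not part of the documented API::

	#include <linux/dma-mapping.h>

	#define MY_RING_BYTES	4096	/* hypothetical descriptor ring size */

	struct my_dev {
		struct device *dev;
		void *ring_cpu;		/* CPU virtual address of the ring */
		dma_addr_t ring_dma;	/* DMA address programmed into the device */
	};

	static int my_dev_setup_dma(struct my_dev *md)
	{
		int err;

		/* Prefer 64-bit addressing; fall back to 32 bits if refused. */
		err = dma_set_mask_and_coherent(md->dev, DMA_BIT_MASK(64));
		if (err)
			err = dma_set_mask_and_coherent(md->dev, DMA_BIT_MASK(32));
		if (err)
			return err;	/* device cannot do DMA on this platform */

		/* Coherent mapping: CPU and device see each other's writes
		 * without explicit sync calls (memory barriers may still be
		 * needed for ordering, as noted above). */
		md->ring_cpu = dma_alloc_coherent(md->dev, MY_RING_BYTES,
						  &md->ring_dma, GFP_KERNEL);
		if (!md->ring_cpu)
			return -ENOMEM;
		return 0;
	}

	static void my_dev_teardown_dma(struct my_dev *md)
	{
		/* Parameters must match the allocation exactly. */
		dma_free_coherent(md->dev, MY_RING_BYTES,
				  md->ring_cpu, md->ring_dma);
	}
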
-Part I - dma_API +Part I - DMA API ---------------- -To get the dma_API, you must #include <linux/dma-mapping.h>. This +To get the DMA API, you must #include <linux/dma-mapping.h>. This provides dma_addr_t and the interfaces described below. A dma_addr_t can hold any valid DMA address for the platform. It can be @@ -33,13 +33,13 @@ Part Ia - Using large DMA-coherent buffers dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flag) -Consistent memory is memory for which a write by either the device or +Coherent memory is memory for which a write by either the device or the processor can immediately be read by the processor or device without having to worry about caching effects. (You may however need to make sure to flush the processor's write buffers before telling devices to read that memory.) -This routine allocates a region of <size> bytes of consistent memory. +This routine allocates a region of <size> bytes of coherent memory. It returns a pointer to the allocated region (in the processor's virtual address space) or NULL if the allocation failed. @@ -48,15 +48,14 @@ It also returns a <dma_handle> which may be cast to an unsigned integer the same width as the bus and given to the device as the DMA address base of the region. -Note: consistent memory can be expensive on some platforms, and the +Note: coherent memory can be expensive on some platforms, and the minimum allocation length may be as big as a page, so you should -consolidate your requests for consistent memory as much as possible. +consolidate your requests for coherent memory as much as possible. The simplest way to do that is to use the dma_pool calls (see below). -The flag parameter (dma_alloc_coherent() only) allows the caller to -specify the ``GFP_`` flags (see kmalloc()) for the allocation (the -implementation may choose to ignore flags that affect the location of -the returned memory, like GFP_DMA). +The flag parameter allows the caller to specify the ``GFP_`` flags (see +kmalloc()) for the allocation (the implementation may ignore flags that affect +the location of the returned memory, like GFP_DMA). :: @@ -64,19 +63,18 @@ the returned memory, like GFP_DMA). dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_handle) -Free a region of consistent memory you previously allocated. dev, -size and dma_handle must all be the same as those passed into -dma_alloc_coherent(). cpu_addr must be the virtual address returned by -the dma_alloc_coherent(). +Free a previously allocated region of coherent memory. dev, size and dma_handle +must all be the same as those passed into dma_alloc_coherent(). cpu_addr must +be the virtual address returned by dma_alloc_coherent(). -Note that unlike their sibling allocation calls, these routines -may only be called with IRQs enabled. +Note that unlike the sibling allocation call, this routine may only be called +with IRQs enabled. Part Ib - Using small DMA-coherent buffers ------------------------------------------ -To get this part of the dma_API, you must #include <linux/dmapool.h> +To get this part of the DMA API, you must #include <linux/dmapool.h> Many drivers need lots of small DMA-coherent memory regions for DMA descriptors or I/O buffers. Rather than allocating in units of a page @@ -85,78 +83,29 @@ much like a struct kmem_cache, except that they use the DMA-coherent allocator, not __get_free_pages(). Also, they understand common hardware constraints for alignment, like queue heads needing to be aligned on N-byte boundaries. 
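As a brief orientation before the generated function documentation that follows, the typical pool lifecycle looks like this. This is only a sketch with illustrative names (``my_alloc_desc()``, the ``"my_descs"`` pool), not a definitive driver recipe::

	#include <linux/dmapool.h>

	static int my_alloc_desc(struct device *dev)
	{
		struct dma_pool *pool;
		dma_addr_t desc_dma;
		void *desc;

		/* 64-byte blocks, 16-byte alignment, no boundary-crossing
		 * restriction (last argument 0). May sleep. */
		pool = dma_pool_create("my_descs", dev, 64, 16, 0);
		if (!pool)
			return -ENOMEM;

		/* Zeroed allocation; returns both CPU and DMA addresses. */
		desc = dma_pool_zalloc(pool, GFP_KERNEL, &desc_dma);
		if (!desc) {
			dma_pool_destroy(pool);
			return -ENOMEM;
		}

		/* ... hand desc_dma to the device, use desc from the CPU ... */

		dma_pool_free(pool, desc, desc_dma);
		dma_pool_destroy(pool);	/* everything must be freed back first */
		return 0;
	}
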
+.. kernel-doc:: mm/dmapool.c + :export: -:: - - struct dma_pool * - dma_pool_create(const char *name, struct device *dev, - size_t size, size_t align, size_t alloc); - -dma_pool_create() initializes a pool of DMA-coherent buffers -for use with a given device. It must be called in a context which -can sleep. - -The "name" is for diagnostics (like a struct kmem_cache name); dev and size -are like what you'd pass to dma_alloc_coherent(). The device's hardware -alignment requirement for this type of data is "align" (which is expressed -in bytes, and must be a power of two). If your device has no boundary -crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated -from this pool must not cross 4KByte boundaries. - -:: - - void * - dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, - dma_addr_t *handle) - -Wraps dma_pool_alloc() and also zeroes the returned memory if the -allocation attempt succeeded. - - -:: - - void * - dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags, - dma_addr_t *dma_handle); - -This allocates memory from the pool; the returned memory will meet the -size and alignment requirements specified at creation time. Pass -GFP_ATOMIC to prevent blocking, or if it's permitted (not -in_interrupt, not holding SMP locks), pass GFP_KERNEL to allow -blocking. Like dma_alloc_coherent(), this returns two values: an -address usable by the CPU, and the DMA address usable by the pool's -device. - -:: - - void - dma_pool_free(struct dma_pool *pool, void *vaddr, - dma_addr_t addr); - -This puts memory back into the pool. The pool is what was passed to -dma_pool_alloc(); the CPU (vaddr) and DMA addresses are what -were returned when that routine allocated the memory being freed. - -:: - - void - dma_pool_destroy(struct dma_pool *pool); - -dma_pool_destroy() frees the resources of the pool. It must be -called in a context which can sleep. Make sure you've freed all allocated -memory back to the pool before you destroy it. +.. kernel-doc:: include/linux/dmapool.h Part Ic - DMA addressing limitations ------------------------------------ +DMA mask is a bit mask of the addressable region for the device. In other words, +if applying the DMA mask (a bitwise AND operation) to the DMA address of a +memory region does not clear any bits in the address, then the device can +perform DMA to that memory region. + +All the below functions which set a DMA mask may fail if the requested mask +cannot be used with the device, or if the device is not capable of doing DMA. + :: int dma_set_mask_and_coherent(struct device *dev, u64 mask) -Checks to see if the mask is possible and updates the device -streaming and coherent DMA mask parameters if it is. +Updates both streaming and coherent DMA masks. Returns: 0 if successful and a negative error if not. @@ -165,8 +114,7 @@ Returns: 0 if successful and a negative error if not. int dma_set_mask(struct device *dev, u64 mask) -Checks to see if the mask is possible and updates the device -parameters if it is. +Updates only the streaming DMA mask. Returns: 0 if successful and a negative error if not. @@ -175,8 +123,7 @@ Returns: 0 if successful and a negative error if not. int dma_set_coherent_mask(struct device *dev, u64 mask) -Checks to see if the mask is possible and updates the device -parameters if it is. +Updates only the coherent DMA mask. Returns: 0 if successful and a negative error if not. @@ -231,12 +178,32 @@ transfer memory ownership. Returns %false if those calls can be skipped. 
unsigned long dma_get_merge_boundary(struct device *dev); -Returns the DMA merge boundary. If the device cannot merge any the DMA address +Returns the DMA merge boundary. If the device cannot merge any DMA address segments, the function returns 0. Part Id - Streaming DMA mappings -------------------------------- +Streaming DMA allows to map an existing buffer for DMA transfers and then +unmap it when finished. Map functions are not guaranteed to succeed, so the +return value must be checked. + +.. note:: + + In particular, mapping may fail for memory not addressable by the + device, e.g. if it is not within the DMA mask of the device and/or a + connecting bus bridge. Streaming DMA functions try to overcome such + addressing constraints, either by using an IOMMU (a device which maps + I/O DMA addresses to physical memory addresses), or by copying the + data to/from a bounce buffer if the kernel is configured with a + :doc:`SWIOTLB <swiotlb>`. However, these methods are not always + available, and even if they are, they may still fail for a number of + reasons. + + In short, a device driver may need to be wary of where buffers are + located in physical memory, especially if the DMA mask is less than 32 + bits. + :: dma_addr_t @@ -246,9 +213,7 @@ Part Id - Streaming DMA mappings Maps a piece of processor virtual memory so it can be accessed by the device and returns the DMA address of the memory. -The direction for both APIs may be converted freely by casting. -However the dma_API uses a strongly typed enumerator for its -direction: +The DMA API uses a strongly typed enumerator for its direction: ======================= ============================================= DMA_NONE no direction (used for debugging) @@ -259,31 +224,13 @@ DMA_BIDIRECTIONAL direction isn't known .. note:: - Not all memory regions in a machine can be mapped by this API. - Further, contiguous kernel virtual space may not be contiguous as + Contiguous kernel virtual space may not be contiguous as physical memory. Since this API does not provide any scatter/gather capability, it will fail if the user tries to map a non-physically contiguous piece of memory. For this reason, memory to be mapped by this API should be obtained from sources which guarantee it to be physically contiguous (like kmalloc). - Further, the DMA address of the memory must be within the - dma_mask of the device (the dma_mask is a bit mask of the - addressable region for the device, i.e., if the DMA address of - the memory ANDed with the dma_mask is still equal to the DMA - address, then the device can perform DMA to the memory). To - ensure that the memory allocated by kmalloc is within the dma_mask, - the driver may specify various platform-dependent flags to restrict - the DMA address range of the allocation (e.g., on x86, GFP_DMA - guarantees to be within the first 16MB of available DMA addresses, - as required by ISA devices). - - Note also that the above constraints on physical contiguity and - dma_mask may not apply if the platform has an IOMMU (a device which - maps an I/O DMA address to a physical memory address). However, to be - portable, device driver writers may *not* assume that such an IOMMU - exists. - .. warning:: Memory coherency operates at a granularity called the cache @@ -325,8 +272,7 @@ DMA_BIDIRECTIONAL direction isn't known enum dma_data_direction direction) Unmaps the region previously mapped. All the parameters passed in -must be identical to those passed in (and returned) by the mapping -API. 
+must be identical to those passed to (and returned by) dma_map_single(). :: @@ -376,10 +322,10 @@ action (e.g. reduce current DMA mapping usage or delay and try again later). dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction) -Returns: the number of DMA address segments mapped (this may be shorter -than <nents> passed in if some elements of the scatter/gather list are -physically or virtually adjacent and an IOMMU maps them with a single -entry). +Maps a scatter/gather list for DMA. Returns the number of DMA address segments +mapped, which may be smaller than <nents> passed in if several consecutive +sglist entries are merged (e.g. with an IOMMU, or if some adjacent segments +just happen to be physically contiguous). Please note that the sg cannot be mapped again if it has been mapped once. The mapping process is allowed to destroy information in the sg. @@ -403,9 +349,8 @@ With scatterlists, you use the resulting mapping like this:: where nents is the number of entries in the sglist. The implementation is free to merge several consecutive sglist entries -into one (e.g. with an IOMMU, or if several pages just happen to be -physically contiguous) and returns the actual number of sg entries it -mapped them to. On failure 0, is returned. +into one. The returned number is the actual number of sg entries it +mapped them to. On failure, 0 is returned. Then you should loop count times (note: this can be less than nents times) and use sg_dma_address() and sg_dma_len() macros where you previously @@ -530,6 +475,77 @@ routines, e.g.::: .... } +Part Ie - IOVA-based DMA mappings +--------------------------------- + +These APIs allow a very efficient mapping when using an IOMMU. They are an +optional path that requires extra code and are only recommended for drivers +where DMA mapping performance, or the space usage for storing the DMA addresses +matter. All the considerations from the previous section apply here as well. + +:: + + bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state, + phys_addr_t phys, size_t size); + +Is used to try to allocate IOVA space for mapping operation. If it returns +false this API can't be used for the given device and the normal streaming +DMA mapping API should be used. The ``struct dma_iova_state`` is allocated +by the driver and must be kept around until unmap time. + +:: + + static inline bool dma_use_iova(struct dma_iova_state *state) + +Can be used by the driver to check if the IOVA-based API is used after a +call to dma_iova_try_alloc. This can be useful in the unmap path. + +:: + + int dma_iova_link(struct device *dev, struct dma_iova_state *state, + phys_addr_t phys, size_t offset, size_t size, + enum dma_data_direction dir, unsigned long attrs); + +Is used to link ranges to the IOVA previously allocated. The start of all +but the first call to dma_iova_link for a given state must be aligned +to the DMA merge boundary returned by ``dma_get_merge_boundary())``, and +the size of all but the last range must be aligned to the DMA merge boundary +as well. + +:: + + int dma_iova_sync(struct device *dev, struct dma_iova_state *state, + size_t offset, size_t size); + +Must be called to sync the IOMMU page tables for IOVA-range mapped by one or +more calls to ``dma_iova_link()``. 
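Putting the above calls together, a one-shot mapping of a physically fragmented buffer might look like the following sketch. The chunk arrays are hypothetical driver state, and teardown uses ``dma_iova_destroy()``, described just below::

	#include <linux/dma-mapping.h>

	static int my_map_chunks(struct device *dev,
				 struct dma_iova_state *state,
				 phys_addr_t *phys, size_t *len,
				 int nchunks, size_t total)
	{
		size_t offset = 0;
		int i, ret;

		if (!dma_iova_try_alloc(dev, state, phys[0], total))
			return -EOPNOTSUPP;	/* use the streaming API instead */

		for (i = 0; i < nchunks; i++) {
			/* All but the first start, and all but the last size,
			 * must be aligned to dma_get_merge_boundary(). */
			ret = dma_iova_link(dev, state, phys[i], offset,
					    len[i], DMA_TO_DEVICE, 0);
			if (ret)
				goto err_destroy;
			offset += len[i];
		}

		ret = dma_iova_sync(dev, state, 0, offset);
		if (ret)
			goto err_destroy;
		return 0;

	err_destroy:
		dma_iova_destroy(dev, state, offset, DMA_TO_DEVICE, 0);
		return ret;
	}
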
+ +For drivers that use a one-shot mapping, all ranges can be unmapped and the +IOVA freed by calling: + +:: + + void dma_iova_destroy(struct device *dev, struct dma_iova_state *state, + size_t mapped_len, enum dma_data_direction dir, + unsigned long attrs); + +Alternatively drivers can dynamically manage the IOVA space by unmapping +and mapping individual regions. In that case + +:: + + void dma_iova_unlink(struct device *dev, struct dma_iova_state *state, + size_t offset, size_t size, enum dma_data_direction dir, + unsigned long attrs); + +is used to unmap a range previously mapped, and + +:: + + void dma_iova_free(struct device *dev, struct dma_iova_state *state); + +is used to free the IOVA space. All regions must have been unmapped using +``dma_iova_unlink()`` before calling ``dma_iova_free()``. Part II - Non-coherent DMA allocations -------------------------------------- @@ -704,19 +720,19 @@ memory or doing partial flushes. of two for easy alignment. -Part III - Debug drivers use of the DMA-API +Part III - Debug drivers use of the DMA API ------------------------------------------- -The DMA-API as described above has some constraints. DMA addresses must be +The DMA API as described above has some constraints. DMA addresses must be released with the corresponding function with the same size for example. With the advent of hardware IOMMUs it becomes more and more important that drivers do not violate those constraints. In the worst case such a violation can result in data corruption up to destroyed filesystems. -To debug drivers and find bugs in the usage of the DMA-API checking code can +To debug drivers and find bugs in the usage of the DMA API checking code can be compiled into the kernel which will tell the developer about those violations. If your architecture supports it you can select the "Enable -debugging of DMA-API usage" option in your kernel configuration. Enabling this +debugging of DMA API usage" option in your kernel configuration. Enabling this option has a performance impact. Do not enable it in production kernels. If you boot the resulting kernel will contain code which does some bookkeeping @@ -745,7 +761,7 @@ example warning message may look like this:: [<ffffffff80235177>] find_busiest_group+0x207/0x8a0 [<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50 [<ffffffff803c7ea3>] check_unmap+0x203/0x490 - [<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50 + [<ffffffff803c8259>] debug_dma_unmap_phys+0x49/0x50 [<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0 [<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0 [<ffffffff8026df84>] handle_IRQ_event+0x34/0x70 @@ -755,7 +771,7 @@ example warning message may look like this:: <EOI> <4>---[ end trace f6435a98e2a38c0e ]--- The driver developer can find the driver and the device including a stacktrace -of the DMA-API call which caused this warning. +of the DMA API call which caused this warning. Per default only the first error will result in a warning message. All other errors will only silently counted. This limitation exist to prevent the code @@ -763,7 +779,7 @@ from flooding your kernel log. To support debugging a device driver this can be disabled via debugfs. See the debugfs interface documentation below for details. -The debugfs directory for the DMA-API debugging code is called dma-api/. In +The debugfs directory for the DMA API debugging code is called dma-api/. 
In this directory the following files can currently be found: =============================== =============================================== @@ -811,7 +827,7 @@ dma-api/driver_filter You can write a name of a driver into this file If you have this code compiled into your kernel it will be enabled by default. If you want to boot without the bookkeeping anyway you can provide -'dma_debug=off' as a boot parameter. This will disable DMA-API debugging. +'dma_debug=off' as a boot parameter. This will disable DMA API debugging. Notice that you can not enable it again at runtime. You have to reboot to do so. @@ -839,8 +855,14 @@ that a driver may be leaking mappings. dma-debug interface debug_dma_mapping_error() to debug drivers that fail to check DMA mapping errors on addresses returned by dma_map_single() and dma_map_page() interfaces. This interface clears a flag set by -debug_dma_map_page() to indicate that dma_mapping_error() has been called by +debug_dma_map_phys() to indicate that dma_mapping_error() has been called by the driver. When driver does unmap, debug_dma_unmap() checks the flag and if this flag is still set, prints warning message that includes call trace that leads up to the unmap. This interface can be called from dma_mapping_error() routines to enable DMA mapping error check debugging. + +Functions and structures +======================== + +.. kernel-doc:: include/linux/scatterlist.h +.. kernel-doc:: lib/scatterlist.c diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst index 1887d92e8e92..123c8468d58f 100644 --- a/Documentation/core-api/dma-attributes.rst +++ b/Documentation/core-api/dma-attributes.rst @@ -130,3 +130,52 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged subsystem that the buffer is fully accessible at the elevated privilege level (and ideally inaccessible or at least read-only at the lesser-privileged levels). + +DMA_ATTR_MMIO +------------- + +This attribute indicates the physical address is not normal system +memory. It may not be used with kmap*()/phys_to_virt()/phys_to_page() +functions, it may not be cacheable, and access using CPU load/store +instructions may not be allowed. + +Usually this will be used to describe MMIO addresses, or other non-cacheable +register addresses. When DMA mapping this sort of address we call +the operation Peer to Peer as a one device is DMA'ing to another device. +For PCI devices the p2pdma APIs must be used to determine if +DMA_ATTR_MMIO is appropriate. + +For architectures that require cache flushing for DMA coherence +DMA_ATTR_MMIO will not perform any cache flushing. The address +provided must never be mapped cacheable into the CPU. + +DMA_ATTR_DEBUGGING_IGNORE_CACHELINES +------------------------------------ + +This attribute indicates that CPU cache lines may overlap for buffers mapped +with DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. + +Such overlap may occur when callers map multiple small buffers that reside +within the same cache line. In this case, callers must guarantee that the CPU +will not dirty these cache lines after the mappings are established. When this +condition is met, multiple buffers can safely share a cache line without risking +data corruption. + +All mappings that share a cache line must set this attribute to suppress DMA +debug warnings about overlapping mappings. 
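For illustration, here is a hedged sketch of two small DMA_FROM_DEVICE buffers that deliberately share a cache line; the ``struct my_shared`` layout and ``my_map_shared()`` are hypothetical. Note the attribute only silences the dma-debug overlap check - the caller still has to guarantee the CPU does not dirty the line while either mapping is live::

	#include <linux/dma-mapping.h>

	struct my_shared {
		char rx_a[32];	/* device writes here */
		char rx_b[32];	/* and here; same cache line as rx_a */
	};

	static int my_map_shared(struct device *dev, struct my_shared *s,
				 dma_addr_t *a, dma_addr_t *b)
	{
		*a = dma_map_single_attrs(dev, s->rx_a, sizeof(s->rx_a),
					  DMA_FROM_DEVICE,
					  DMA_ATTR_DEBUGGING_IGNORE_CACHELINES);
		if (dma_mapping_error(dev, *a))
			return -ENOMEM;

		*b = dma_map_single_attrs(dev, s->rx_b, sizeof(s->rx_b),
					  DMA_FROM_DEVICE,
					  DMA_ATTR_DEBUGGING_IGNORE_CACHELINES);
		if (dma_mapping_error(dev, *b)) {
			dma_unmap_single_attrs(dev, *a, sizeof(s->rx_a),
					       DMA_FROM_DEVICE,
					       DMA_ATTR_DEBUGGING_IGNORE_CACHELINES);
			return -ENOMEM;
		}
		return 0;
	}
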
+ +DMA_ATTR_REQUIRE_COHERENT +------------------------- + +DMA mapping requests with the DMA_ATTR_REQUIRE_COHERENT fail on any +system where SWIOTLB or cache management is required. This should only +be used to support uAPI designs that require continuous HW DMA +coherence with userspace processes, for example RDMA and DRM. At a +minimum the memory being mapped must be userspace memory from +pin_user_pages() or similar. + +Drivers should consider using dma_mmap_pages() instead of this +interface when building their uAPIs, when possible. + +It must never be used in an in-kernel driver that only works with +kernel memory. diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst index a15f9b1767a2..71d8eedc0549 100644 --- a/Documentation/core-api/entry.rst +++ b/Documentation/core-api/entry.rst @@ -105,7 +105,7 @@ has to do extra work between the various steps. In such cases it has to ensure that enter_from_user_mode() is called first on entry and exit_to_user_mode() is called last on exit. -Do not nest syscalls. Nested systcalls will cause RCU and/or context tracking +Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking to print a warning. KVM @@ -115,8 +115,8 @@ Entering or exiting guest mode is very similar to syscalls. From the host kernel point of view the CPU goes off into user space when entering the guest and returns to the kernel on exit. -kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() -and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode(). +guest_state_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() +and guest_state_exit_irqoff() is the KVM variant of enter_from_user_mode(). The state operations have the same ordering. Task work handling is done separately for guest at the boundary of the diff --git a/Documentation/core-api/folio_queue.rst b/Documentation/core-api/folio_queue.rst index 1fe7a9bc4b8d..b7628896d2b6 100644 --- a/Documentation/core-api/folio_queue.rst +++ b/Documentation/core-api/folio_queue.rst @@ -44,7 +44,7 @@ Each segment in the list also stores: * the size of each folio and * three 1-bit marks per folio, -but hese should not be accessed directly as the underlying data structure may +but these should not be accessed directly as the underlying data structure may change, but rather the access functions outlined below should be used. The facility can be made accessible by:: @@ -151,19 +151,16 @@ The marks can be set by:: void folioq_mark(struct folio_queue *folioq, unsigned int slot); void folioq_mark2(struct folio_queue *folioq, unsigned int slot); - void folioq_mark3(struct folio_queue *folioq, unsigned int slot); Cleared by:: void folioq_unmark(struct folio_queue *folioq, unsigned int slot); void folioq_unmark2(struct folio_queue *folioq, unsigned int slot); - void folioq_unmark3(struct folio_queue *folioq, unsigned int slot); And the marks can be queried by:: bool folioq_is_marked(const struct folio_queue *folioq, unsigned int slot); bool folioq_is_marked2(const struct folio_queue *folioq, unsigned int slot); - bool folioq_is_marked3(const struct folio_queue *folioq, unsigned int slot); The marks can be used for any purpose and are not interpreted by this API. diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst index 25f94dfd66fa..b16d751d4b98 100644 --- a/Documentation/core-api/genericirq.rst +++ b/Documentation/core-api/genericirq.rst @@ -410,8 +410,6 @@ which are used in the generic IRQ layer. .. 
kernel-doc:: include/linux/interrupt.h :internal: -.. kernel-doc:: include/linux/irqdomain.h - Public Functions Provided ========================= @@ -441,6 +439,6 @@ Credits The following people have contributed to this document: -1. Thomas Gleixner tglx@linutronix.de +1. Thomas Gleixner tglx@kernel.org 2. Ingo Molnar mingo@elte.hu diff --git a/Documentation/core-api/housekeeping.rst b/Documentation/core-api/housekeeping.rst new file mode 100644 index 000000000000..92c6e53cea75 --- /dev/null +++ b/Documentation/core-api/housekeeping.rst @@ -0,0 +1,111 @@ +====================================== +Housekeeping +====================================== + + +CPU Isolation moves away kernel work that may otherwise run on any CPU. +The purpose of its related features is to reduce the OS jitter that some +extreme workloads can't stand, such as in some DPDK usecases. + +The kernel work moved away by CPU isolation is commonly described as +"housekeeping" because it includes ground work that performs cleanups, +statistics maintainance and actions relying on them, memory release, +various deferrals etc... + +Sometimes housekeeping is just some unbound work (unbound workqueues, +unbound timers, ...) that gets easily assigned to non-isolated CPUs. +But sometimes housekeeping is tied to a specific CPU and requires +elaborate tricks to be offloaded to non-isolated CPUs (RCU_NOCB, remote +scheduler tick, etc...). + +Thus, a housekeeping CPU can be considered as the reverse of an isolated +CPU. It is simply a CPU that can execute housekeeping work. There must +always be at least one online housekeeping CPU at any time. The CPUs that +are not isolated are automatically assigned as housekeeping. + +Housekeeping is currently divided in four features described +by the ``enum hk_type type``: + +1. HK_TYPE_DOMAIN matches the work moved away by scheduler domain + isolation performed through ``isolcpus=domain`` boot parameter or + isolated cpuset partitions in cgroup v2. This includes scheduler + load balancing, unbound workqueues and timers. + +2. HK_TYPE_KERNEL_NOISE matches the work moved away by tick isolation + performed through ``nohz_full=`` or ``isolcpus=nohz`` boot + parameters. This includes remote scheduler tick, vmstat and lockup + watchdog. + +3. HK_TYPE_MANAGED_IRQ matches the IRQ handlers moved away by managed + IRQ isolation performed through ``isolcpus=managed_irq``. + +4. HK_TYPE_DOMAIN_BOOT matches the work moved away by scheduler domain + isolation performed through ``isolcpus=domain`` only. It is similar + to HK_TYPE_DOMAIN except it ignores the isolation performed by + cpusets. + + +Housekeeping cpumasks +================================= + +Housekeeping cpumasks include the CPUs that can execute the work moved +away by the matching isolation feature. These cpumasks are returned by +the following function:: + + const struct cpumask *housekeeping_cpumask(enum hk_type type) + +By default, if neither ``nohz_full=``, nor ``isolcpus``, nor cpuset's +isolated partitions are used, which covers most usecases, this function +returns the cpu_possible_mask. + +Otherwise the function returns the cpumask complement of the isolation +feature. 
For example: + +With isolcpus=domain,7 the following will return a mask with all possible +CPUs except 7:: + + housekeeping_cpumask(HK_TYPE_DOMAIN) + +Similarly with nohz_full=5,6 the following will return a mask with all +possible CPUs except 5,6:: + + housekeeping_cpumask(HK_TYPE_KERNEL_NOISE) + + +Synchronization against cpusets +================================= + +Cpuset can modify the HK_TYPE_DOMAIN housekeeping cpumask while creating, +modifying or deleting an isolated partition. + +The users of HK_TYPE_DOMAIN cpumask must then make sure to synchronize +properly against cpuset in order to make sure that: + +1. The cpumask snapshot stays coherent. + +2. No housekeeping work is queued on a newly made isolated CPU. + +3. Pending housekeeping work that was queued to a non isolated + CPU which just turned isolated through cpuset must be flushed + before the related created/modified isolated partition is made + available to userspace. + +This synchronization is maintained by an RCU based scheme. The cpuset update +side waits for an RCU grace period after updating the HK_TYPE_DOMAIN +cpumask and before flushing pending works. On the read side, care must be +taken to gather the housekeeping target election and the work enqueue within +the same RCU read side critical section. + +A typical layout example would look like this on the update side +(``housekeeping_update()``):: + + rcu_assign_pointer(housekeeping_cpumasks[type], trial); + synchronize_rcu(); + flush_workqueue(example_workqueue); + +And then on the read side:: + + rcu_read_lock(); + cpu = housekeeping_any_cpu(HK_TYPE_DOMAIN); + queue_work_on(cpu, example_workqueue, work); + rcu_read_unlock(); diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index e9789bd381d8..13769d5c40bf 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -24,6 +24,8 @@ it. printk-index symbol-namespaces asm-annotations + real-time/index + housekeeping.rst Data structures and low-level utilities ======================================= @@ -54,6 +56,7 @@ Library functionality that is used throughout the kernel. union_find min_heap parser + list Low level entry and exit ======================== @@ -115,6 +118,7 @@ more memory-management documentation in Documentation/mm/index.rst. pin_user_pages boot-time-mm gfp_mask-from-fs-io + kho/index Interfaces for kernel debugging =============================== @@ -135,11 +139,5 @@ Documents that don't fit elsewhere or which have yet to be categorized. :maxdepth: 1 librs + liveupdate netlink - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/core-api/irq/concepts.rst b/Documentation/core-api/irq/concepts.rst index 4273806a606b..7c4564f3cbdf 100644 --- a/Documentation/core-api/irq/concepts.rst +++ b/Documentation/core-api/irq/concepts.rst @@ -2,23 +2,24 @@ What is an IRQ? =============== -An IRQ is an interrupt request from a device. -Currently they can come in over a pin, or over a packet. -Several devices may be connected to the same pin thus -sharing an IRQ. +An IRQ is an interrupt request from a device. Currently, they can come +in over a pin, or over a packet. Several devices may be connected to +the same pin thus sharing an IRQ. Such as on legacy PCI bus: All devices +typically share 4 lanes/pins. Note that each device can request an +interrupt on each of the lanes. An IRQ number is a kernel identifier used to talk about a hardware -interrupt source. 
Typically this is an index into the global irq_desc -array, but except for what linux/interrupt.h implements the details -are architecture specific. +interrupt source. Typically, this is an index into the global irq_desc +array or sparse_irqs tree. But except for what linux/interrupt.h +implements, the details are architecture specific. An IRQ number is an enumeration of the possible interrupt sources on a -machine. Typically what is enumerated is the number of input pins on -all of the interrupt controller in the system. In the case of ISA -what is enumerated are the 16 input pins on the two i8259 interrupt -controllers. +machine. Typically, what is enumerated is the number of input pins on +all of the interrupt controllers in the system. In the case of ISA, +what is enumerated are the 8 input pins on each of the two i8259 +interrupt controllers. Architectures can assign additional meaning to the IRQ numbers, and -are encouraged to in the case where there is any manual configuration -of the hardware involved. The ISA IRQs are a classic example of +are encouraged to in the case where there is any manual configuration +of the hardware involved. The ISA IRQs are a classic example of assigning this kind of additional meaning. diff --git a/Documentation/core-api/irq/index.rst b/Documentation/core-api/irq/index.rst index 0d65d11e5420..13bd24dd2b1c 100644 --- a/Documentation/core-api/irq/index.rst +++ b/Documentation/core-api/irq/index.rst @@ -9,3 +9,4 @@ IRQs irq-affinity irq-domain irqflags-tracing + managed_irq diff --git a/Documentation/core-api/irq/irq-affinity.rst b/Documentation/core-api/irq/irq-affinity.rst index 29da5000836a..9cb460cf60b6 100644 --- a/Documentation/core-api/irq/irq-affinity.rst +++ b/Documentation/core-api/irq/irq-affinity.rst @@ -9,9 +9,9 @@ ChangeLog: /proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify which target CPUs are permitted for a given IRQ source. It's a bitmask -(smp_affinity) or cpu list (smp_affinity_list) of allowed CPUs. It's not +(smp_affinity) or CPU list (smp_affinity_list) of allowed CPUs. It's not allowed to turn off all CPUs, and if an IRQ controller does not support -IRQ affinity then the value will not change from the default of all cpus. +IRQ affinity then the value will not change from the default of all CPUs. /proc/irq/default_smp_affinity specifies default affinity mask that applies to all non-active IRQs. Once IRQ is allocated/activated its affinity bitmask @@ -60,7 +60,7 @@ Now lets restrict that IRQ to CPU(4-7). This time around IRQ44 was delivered only to the last four processors. i.e counters for the CPU0-3 did not change. -Here is an example of limiting that same irq (44) to cpus 1024 to 1031:: +Here is an example of limiting that same IRQ (44) to CPUs 1024 to 1031:: [root@moon 44]# echo 1024-1031 > smp_affinity_list [root@moon 44]# cat smp_affinity_list diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst index f88a6ee67a35..68eb2612e8a4 100644 --- a/Documentation/core-api/irq/irq-domain.rst +++ b/Documentation/core-api/irq/irq-domain.rst @@ -1,75 +1,91 @@ =============================================== -The irq_domain interrupt number mapping library +The irq_domain Interrupt Number Mapping Library =============================================== The current design of the Linux kernel uses a single large number -space where each separate IRQ source is assigned a different number. 
-This is simple when there is only one interrupt controller, but in
-systems with multiple interrupt controllers the kernel must ensure
+space where each separate IRQ source is assigned a unique number.
+This is simple when there is only one interrupt controller. But in
+systems with multiple interrupt controllers, the kernel must ensure
 that each one gets assigned non-overlapping allocations of Linux IRQ
 numbers.
 
 The number of interrupt controllers registered as unique irqchips
-show a rising tendency: for example subdrivers of different kinds
+shows a rising tendency. For example, subdrivers of different kinds
 such as GPIO controllers avoid reimplementing identical callback
 mechanisms as the IRQ core system by modelling their interrupt
-handlers as irqchips, i.e. in effect cascading interrupt controllers.
+handlers as irqchips, i.e., in effect cascading interrupt controllers.
 
-Here the interrupt number loose all kind of correspondence to
-hardware interrupt numbers: whereas in the past, IRQ numbers could
-be chosen so they matched the hardware IRQ line into the root
-interrupt controller (i.e. the component actually fireing the
-interrupt line to the CPU) nowadays this number is just a number.
+So in the past, IRQ numbers could be chosen so that they match the
+hardware IRQ line into the root interrupt controller (i.e. the
+component actually firing the interrupt line to the CPU). Nowadays,
+this number is just a number, with no relationship to the hardware
+interrupt numbers.
 
-For this reason we need a mechanism to separate controller-local
-interrupt numbers, called hardware irq's, from Linux IRQ numbers.
+For this reason, we need a mechanism to separate controller-local
+interrupt numbers, called hardware IRQs, from Linux IRQ numbers.
 
 The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of
-irq numbers, but they don't provide any support for reverse mapping of
+IRQ numbers, but they don't provide any support for reverse mapping of
 the controller-local IRQ (hwirq) number into the Linux IRQ number
 space.
 
-The irq_domain library adds mapping between hwirq and IRQ numbers on
-top of the irq_alloc_desc*() API. An irq_domain to manage mapping is
-preferred over interrupt controller drivers open coding their own
+The irq_domain library adds a mapping between hwirq and IRQ numbers on
+top of the irq_alloc_desc*() API. An irq_domain to manage the mapping
+is preferred over interrupt controller drivers open coding their own
 reverse mapping scheme.
 
-irq_domain also implements translation from an abstract irq_fwspec
-structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
-be easily extended to support other IRQ topology data sources.
+irq_domain also implements a translation from an abstract struct
+irq_fwspec to hwirq numbers (Device Tree, non-DT firmware node, ACPI
+GSI, and software node so far), and can be easily extended to support
+other IRQ topology data sources. The implementation is performed
+without any extra platform support code.
 
-irq_domain usage
+irq_domain Usage
 ================
-
-An interrupt controller driver creates and registers an irq_domain by
-calling one of the irq_domain_add_*() or irq_domain_create_*() functions
-(each mapping method has a different allocator function, more on that later).
-The function will return a pointer to the irq_domain on success. The caller
-must provide the allocator function with an irq_domain_ops structure.
+
+struct irq_domain could be defined as an irq domain controller.
That +is, it handles the mapping between hardware and virtual interrupt +numbers for a given interrupt domain. The domain structure is +generally created by the PIC code for a given PIC instance (though a +domain can cover more than one PIC if they have a flat number model). +It is the domain callbacks that are responsible for setting the +irq_chip on a given irq_desc after it has been mapped. + +The host code and data structures use a fwnode_handle pointer to +identify the domain. In some cases, and in order to preserve source +code compatibility, this fwnode pointer is "upgraded" to a DT +device_node. For those firmware infrastructures that do not provide a +unique identifier for an interrupt controller, the irq_domain code +offers a fwnode allocator. + +An interrupt controller driver creates and registers a struct irq_domain +by calling one of the irq_domain_create_*() functions (each mapping +method has a different allocator function, more on that later). The +function will return a pointer to the struct irq_domain on success. The +caller must provide the allocator function with a struct irq_domain_ops +pointer. In most cases, the irq_domain will begin empty without any mappings between hwirq and IRQ numbers. Mappings are added to the irq_domain by calling irq_create_mapping() which accepts the irq_domain and a -hwirq number as arguments. If a mapping for the hwirq doesn't already -exist then it will allocate a new Linux irq_desc, associate it with -the hwirq, and call the .map() callback so the driver can perform any -required hardware setup. +hwirq number as arguments. If a mapping for the hwirq doesn't already +exist, irq_create_mapping() allocates a new Linux irq_desc, associates +it with the hwirq, and calls the :c:member:`irq_domain_ops.map()` +callback. In there, the driver can perform any required hardware +setup. Once a mapping has been established, it can be retrieved or used via a variety of methods: - irq_resolve_mapping() returns a pointer to the irq_desc structure - for a given domain and hwirq number, and NULL if there was no + for a given domain and hwirq number, or NULL if there was no mapping. - irq_find_mapping() returns a Linux IRQ number for a given domain and - hwirq number, and 0 if there was no mapping -- irq_linear_revmap() is now identical to irq_find_mapping(), and is - deprecated + hwirq number, or 0 if there was no mapping - generic_handle_domain_irq() handles an interrupt described by a domain and a hwirq number -Note that irq domain lookups must happen in contexts that are -compatible with a RCU read-side critical section. +Note that irq_domain lookups must happen in contexts that are +compatible with an RCU read-side critical section. The irq_create_mapping() function must be called *at least once* before any call to irq_find_mapping(), lest the descriptor will not @@ -77,13 +93,14 @@ be allocated. If the driver has the Linux IRQ number or the irq_data pointer, and needs to know the associated hwirq number (such as in the irq_chip -callbacks) then it can be directly obtained from irq_data->hwirq. +callbacks) then it can be directly obtained from +:c:member:`irq_data.hwirq`. -Types of irq_domain mappings +Types of irq_domain Mappings ============================ There are several mechanisms available for reverse mapping from hwirq -to Linux irq, and each mechanism uses a different allocation function. +to Linux IRQ, and each mechanism uses a different allocation function. Which reverse map type should be used depends on the use case. 
Each of the reverse map types are described below: @@ -92,48 +109,36 @@ Linear :: - irq_domain_add_linear() irq_domain_create_linear() -The linear reverse map maintains a fixed size table indexed by the +The linear reverse map maintains a fixed-size table indexed by the hwirq number. When a hwirq is mapped, an irq_desc is allocated for the hwirq, and the IRQ number is stored in the table. The Linear map is a good choice when the maximum number of hwirqs is fixed and a relatively small number (~ < 256). The advantages of this -map are fixed time lookup for IRQ numbers, and irq_descs are only +map are fixed-time lookup for IRQ numbers, and irq_descs are only allocated for in-use IRQs. The disadvantage is that the table must be as large as the largest possible hwirq number. -irq_domain_add_linear() and irq_domain_create_linear() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - -The majority of drivers should use the linear map. +The majority of drivers should use the Linear map. Tree ---- :: - irq_domain_add_tree() irq_domain_create_tree() The irq_domain maintains a radix tree map from hwirq numbers to Linux IRQs. When an hwirq is mapped, an irq_desc is allocated and the hwirq is used as the lookup key for the radix tree. -The tree map is a good choice if the hwirq number can be very large +The Tree map is a good choice if the hwirq number can be very large since it doesn't need to allocate a table as large as the largest hwirq number. The disadvantage is that hwirq to IRQ number lookup is dependent on how many entries are in the table. -irq_domain_add_tree() and irq_domain_create_tree() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - Very few drivers should need this mapping. No Map @@ -141,7 +146,7 @@ No Map :: - irq_domain_add_nomap() + irq_domain_create_nomap() The No Map mapping is to be used when the hwirq number is programmable in the hardware. In this case it is best to program the @@ -159,17 +164,15 @@ Legacy :: - irq_domain_add_simple() - irq_domain_add_legacy() irq_domain_create_simple() irq_domain_create_legacy() The Legacy mapping is a special case for drivers that already have a range of irq_descs allocated for the hwirqs. It is used when the -driver cannot be immediately converted to use the linear mapping. For +driver cannot be immediately converted to use the Linear mapping. For example, many embedded system board support files use a set of #defines for IRQ numbers that are passed to struct device registrations. In that -case the Linux IRQ numbers cannot be dynamically assigned and the legacy +case the Linux IRQ numbers cannot be dynamically assigned and the Legacy mapping should be used. As the name implies, the \*_legacy() functions are deprecated and only @@ -177,25 +180,25 @@ exist to ease the support of ancient platforms. No new users should be added. Same goes for the \*_simple() functions when their use results in the legacy behaviour. -The legacy map assumes a contiguous range of IRQ numbers has already +The Legacy map assumes a contiguous range of IRQ numbers has already been allocated for the controller and that the IRQ number can be calculated by adding a fixed offset to the hwirq number, and visa-versa. 
The disadvantage is that it requires the interrupt controller to manage IRQ allocations and it requires an irq_desc to be allocated for every hwirq, even if it is unused. -The legacy map should only be used if fixed IRQ mappings must be -supported. For example, ISA controllers would use the legacy map for +The Legacy map should only be used if fixed IRQ mappings must be +supported. For example, ISA controllers would use the Legacy map for mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ numbers. -Most users of legacy mappings should use irq_domain_add_simple() or -irq_domain_create_simple() which will use a legacy domain only if an IRQ range -is supplied by the system and will otherwise use a linear domain mapping. -The semantics of this call are such that if an IRQ range is specified then -descriptors will be allocated on-the-fly for it, and if no range is -specified it will fall through to irq_domain_add_linear() or -irq_domain_create_linear() which means *no* irq descriptors will be allocated. +Most users of legacy mappings should use irq_domain_create_simple() +which will use a legacy domain only if an IRQ range is supplied by the +system and will otherwise use a linear domain mapping. The semantics of +this call are such that if an IRQ range is specified then descriptors +will be allocated on-the-fly for it, and if no range is specified it +will fall through to irq_domain_create_linear() which means *no* IRQ +descriptors will be allocated. A typical use case for simple domains is where an irqchip provider is supporting both dynamic and static IRQ assignments. @@ -206,18 +209,12 @@ that the driver using the simple domain call irq_create_mapping() before any irq_find_mapping() since the latter will actually work for the static IRQ assignment case. -irq_domain_add_simple() and irq_domain_create_simple() as well as -irq_domain_add_legacy() and irq_domain_create_legacy() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - -Hierarchy IRQ domain +Hierarchy IRQ Domain -------------------- On some architectures, there may be multiple interrupt controllers involved in delivering an interrupt from the device to the target CPU. -Let's look at a typical interrupt delivering path on x86 platforms:: +Let's look at a typical interrupt delivery path on x86 platforms:: Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU @@ -230,8 +227,8 @@ There are three interrupt controllers involved: To support such a hardware topology and make software architecture match hardware architecture, an irq_domain data structure is built for each interrupt controller and those irq_domains are organized into hierarchy. -When building irq_domain hierarchy, the irq_domain near to the device is -child and the irq_domain near to CPU is parent. So a hierarchy structure +When building irq_domain hierarchy, the irq_domain nearest the device is +child and the irq_domain nearest the CPU is parent. So a hierarchy structure as below will be built for the example above:: CPU Vector irq_domain (root irq_domain to manage CPU vectors) @@ -253,20 +250,40 @@ There are four major interfaces to use hierarchy irq_domain: 4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware to stop delivering the interrupt. 
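+
+As a rough sketch of how these interfaces come together in a child domain
+(``my_chip`` and ``my_translate()`` are made-up names for illustration; a
+real driver derives the hwirq from the firmware spec passed in *arg*), an
+irq_domain_ops.alloc() implementation typically allocates the parent
+resources first and then sets up its own level::
+
+	static int my_domain_alloc(struct irq_domain *domain, unsigned int virq,
+				   unsigned int nr_irqs, void *arg)
+	{
+		irq_hw_number_t hwirq = my_translate(arg);	/* made-up helper */
+		int i, ret;
+
+		/* Allocate the interrupts at the parent level first */
+		ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
+		if (ret)
+			return ret;
+
+		/* Attach this domain's hwirq and irq_chip to each irq_data */
+		for (i = 0; i < nr_irqs; i++)
+			irq_domain_set_hwirq_and_chip(domain, virq + i,
+						      hwirq + i, &my_chip, NULL);
+		return 0;
+	}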
-Following changes are needed to support hierarchy irq_domain: +The following is needed to support hierarchy irq_domain: -1) a new field 'parent' is added to struct irq_domain; it's used to +1) The :c:member:`parent` field in struct irq_domain is used to maintain irq_domain hierarchy information. -2) a new field 'parent_data' is added to struct irq_data; it's used to - build hierarchy irq_data to match hierarchy irq_domains. The irq_data - is used to store irq_domain pointer and hardware irq number. -3) new callbacks are added to struct irq_domain_ops to support hierarchy - irq_domain operations. - -With support of hierarchy irq_domain and hierarchy irq_data ready, an -irq_domain structure is built for each interrupt controller, and an +2) The :c:member:`parent_data` field in struct irq_data is used to + build hierarchy irq_data to match hierarchy irq_domains. The + irq_data is used to store irq_domain pointer and hardware irq + number. +3) The :c:member:`alloc()`, :c:member:`free()`, and other callbacks in + struct irq_domain_ops to support hierarchy irq_domain operations. + +With the support of hierarchy irq_domain and hierarchy irq_data ready, +an irq_domain structure is built for each interrupt controller, and an irq_data structure is allocated for each irq_domain associated with an -IRQ. Now we could go one step further to support stacked(hierarchy) +IRQ. + +For an interrupt controller driver to support hierarchy irq_domain, it +needs to: + +1) Implement irq_domain_ops.alloc() and irq_domain_ops.free() +2) Optionally, implement irq_domain_ops.activate() and + irq_domain_ops.deactivate(). +3) Optionally, implement an irq_chip to manage the interrupt controller + hardware. +4) There is no need to implement irq_domain_ops.map() and + irq_domain_ops.unmap(). They are unused with hierarchy irq_domain. + +Note the hierarchy irq_domain is in no way x86-specific, and is +heavily used to support other architectures, such as ARM, ARM64 etc. + +Stacked irq_chip +~~~~~~~~~~~~~~~~ + +Now, we could go one step further to support stacked (hierarchy) irq_chip. That is, an irq_chip is associated with each irq_data along the hierarchy. A child irq_chip may implement a required action by itself or by cooperating with its parent irq_chip. @@ -276,22 +293,28 @@ with the hardware managed by itself and may ask for services from its parent irq_chip when needed. So we could achieve a much cleaner software architecture. -For an interrupt controller driver to support hierarchy irq_domain, it -needs to: - -1) Implement irq_domain_ops.alloc and irq_domain_ops.free -2) Optionally implement irq_domain_ops.activate and - irq_domain_ops.deactivate. -3) Optionally implement an irq_chip to manage the interrupt controller - hardware. -4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap, - they are unused with hierarchy irq_domain. - -Hierarchy irq_domain is in no way x86 specific, and is heavily used to -support other architectures, such as ARM, ARM64 etc. - Debugging ========= Most of the internals of the IRQ subsystem are exposed in debugfs by turning CONFIG_GENERIC_IRQ_DEBUGFS on. + +Structures and Public Functions Provided +======================================== + +This chapter contains the autogenerated documentation of the structures +and exported kernel API functions which are used for IRQ domains. + +.. kernel-doc:: include/linux/irqdomain.h + +.. 
kernel-doc:: kernel/irq/irqdomain.c
+   :export:
+
+Internal Functions Provided
+===========================
+
+This chapter contains the autogenerated documentation of the internal
+functions.
+
+.. kernel-doc:: kernel/irq/irqdomain.c
+   :internal:
diff --git a/Documentation/core-api/irq/managed_irq.rst b/Documentation/core-api/irq/managed_irq.rst
new file mode 100644
index 000000000000..05e295f3c289
--- /dev/null
+++ b/Documentation/core-api/irq/managed_irq.rst
@@ -0,0 +1,116 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+Affinity managed interrupts
+===========================
+
+The IRQ core provides support for managing interrupts according to a specified
+CPU affinity. Under normal operation, an interrupt is associated with a
+particular CPU. If that CPU is taken offline, the interrupt is migrated to
+another online CPU.
+
+Devices with large numbers of interrupt vectors can stress the available vector
+space. For example, an NVMe device with 128 I/O queues typically requests one
+interrupt per queue on systems with at least 128 CPUs. Two such devices
+therefore request 256 interrupts. On x86, the interrupt vector space is
+notoriously low, providing only 256 vectors per CPU, and the kernel reserves a
+subset of these, further reducing the number available for device interrupts.
+In practice this is not an issue because the interrupts are distributed across
+many CPUs, so each CPU only receives a small number of vectors.
+
+During system suspend, however, all secondary CPUs are taken offline and all
+interrupts are migrated to the single CPU that remains online. This can exhaust
+the available interrupt vectors on that CPU and cause the suspend operation to
+fail.
+
+Affinity-managed interrupts address this limitation. Each interrupt is assigned
+a CPU affinity mask that specifies the set of CPUs on which the interrupt may
+be targeted. When a CPU in the mask goes offline, the interrupt is moved to the
+next CPU in the mask. If the last CPU in the mask goes offline, the interrupt
+is shut down. Drivers using affinity-managed interrupts must ensure that the
+associated queue is quiesced before the interrupt is disabled so that no
+further interrupts are generated. When a CPU in the affinity mask comes back
+online, the interrupt is re-enabled.
+
+Implementation
+--------------
+
+Devices must provide per-instance interrupts, such as per-I/O-queue interrupts
+for storage devices like NVMe. The driver allocates interrupt vectors with the
+required affinity settings using struct irq_affinity. For MSI-X devices, this
+is done via pci_alloc_irq_vectors_affinity() with the PCI_IRQ_AFFINITY flag
+set.
+
+Based on the provided affinity information, the IRQ core attempts to spread the
+interrupts evenly across the system. The affinity masks are computed during
+this allocation step, but the final IRQ assignment is performed when
+request_irq() is invoked.
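+
+As a rough sketch of this allocation step (``nr_queues``, ``my_queue_isr``
+and ``queues`` are made-up names, and error handling is omitted), a PCI
+driver might do::
+
+	struct irq_affinity affd = {
+		.pre_vectors = 1,	/* one non-managed vector, e.g. admin queue */
+	};
+	int i, nvecs;
+
+	/* Spread nr_queues managed vectors plus the pre-vector over all CPUs */
+	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
+					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+					       &affd);
+
+	/* The affinity of each vector was already fixed at allocation time */
+	for (i = 1; i < nvecs; i++)
+		request_irq(pci_irq_vector(pdev, i), my_queue_isr, 0,
+			    "my-queue", &queues[i - 1]);
+
+Vectors accounted as pre_vectors (or post_vectors) in struct irq_affinity are
+excluded from the spreading and get the default affinity, which is how drivers
+typically reserve an admin or configuration interrupt.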
+
+Isolated CPUs
+-------------
+
+The affinity of managed interrupts is handled entirely in the kernel and cannot
+be modified from user space through the /proc interfaces. The managed_irq
+sub-parameter of the isolcpus boot option specifies a CPU mask that managed
+interrupts should attempt to avoid. This isolation is best-effort and only
+applies if the automatically assigned interrupt mask also contains online CPUs
+outside the avoided mask. If the requested mask contains only isolated CPUs,
+the setting has no effect.
+
+CPUs listed in the avoided mask remain part of the interrupt's affinity mask.
+This means that if all non-isolated CPUs go offline while isolated CPUs remain
+online, the interrupt will be assigned to one of the isolated CPUs.
+
+The following examples assume a system with 8 CPUs.
+
+- A QEMU instance is booted with "-device virtio-scsi-pci".
+  The MSI-X device exposes 11 interrupts: 3 "management" interrupts and 8
+  "queue" interrupts. The driver requests the 8 queue interrupts, each of which
+  is affine to exactly one CPU. If that CPU goes offline, the interrupt is shut
+  down.
+
+  Assuming interrupt 48 is one of the queue interrupts, the following appears::
+
+    /proc/irq/48/effective_affinity_list:7
+    /proc/irq/48/smp_affinity_list:7
+
+  This indicates that the interrupt is served only by CPU7. Shutting down CPU7
+  does not migrate the interrupt to another CPU::
+
+    /proc/irq/48/effective_affinity_list:0
+    /proc/irq/48/smp_affinity_list:7
+
+  This can be verified via the debugfs interface
+  (/sys/kernel/debug/irq/irqs/48). The dstate field will include
+  IRQD_IRQ_DISABLED, IRQD_IRQ_MASKED and IRQD_MANAGED_SHUTDOWN.
+
+- A QEMU instance is booted with "-device virtio-scsi-pci,num_queues=2"
+  and the kernel command line includes:
+  "irqaffinity=0,1 isolcpus=domain,2-7 isolcpus=managed_irq,1-3,5-7".
+  The MSI-X device exposes 5 interrupts: 3 management interrupts and 2 queue
+  interrupts. The management interrupts follow the irqaffinity= setting. The
+  queue interrupts are spread across available CPUs::
+
+    /proc/irq/47/effective_affinity_list:0
+    /proc/irq/47/smp_affinity_list:0-3
+    /proc/irq/48/effective_affinity_list:4
+    /proc/irq/48/smp_affinity_list:4-7
+
+  The two queue interrupts are evenly distributed. Interrupt 48 is placed on
+  CPU4 because the managed_irq mask avoids CPUs 5-7 when possible.
+
+  Replacing the managed_irq argument with "isolcpus=managed_irq,1-3,4-5,7"
+  results in::
+
+    /proc/irq/48/effective_affinity_list:6
+    /proc/irq/48/smp_affinity_list:4-7
+
+  Interrupt 48 is now served on CPU6 because the system avoids CPUs 4, 5 and
+  7. If CPU6 is taken offline, the interrupt migrates to one of the "isolated"
+  CPUs::
+
+    /proc/irq/48/effective_affinity_list:7
+    /proc/irq/48/smp_affinity_list:4-7
+
+  The interrupt is shut down once all CPUs listed in its smp_affinity mask are
+  offline.
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index ae92a2571388..e8211c4ca662 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -3,12 +3,6 @@
 The Linux Kernel API
 ====================
 
-List Management Functions
-=========================
-
-.. kernel-doc:: include/linux/list.h
-   :internal:
-
 Basic C Library Functions
 =========================
 
@@ -136,26 +130,28 @@ Arithmetic Overflow Checking
 CRC Functions
 -------------
 
-.. kernel-doc:: lib/crc4.c
+.. kernel-doc:: lib/crc/crc4.c
    :export:
 
-.. kernel-doc:: lib/crc7.c
+.. kernel-doc:: lib/crc/crc7.c
    :export:
 
-.. kernel-doc:: lib/crc8.c
+.. kernel-doc:: lib/crc/crc8.c
    :export:
 
-.. kernel-doc:: lib/crc16.c
+.. kernel-doc:: lib/crc/crc16.c
    :export:
 
-.. kernel-doc:: lib/crc32.c
-
-.. kernel-doc:: lib/crc-ccitt.c
+.. kernel-doc:: lib/crc/crc-ccitt.c
    :export:
 
-.. kernel-doc:: lib/crc-itu-t.c
+.. kernel-doc:: lib/crc/crc-itu-t.c
    :export:
 
+.. kernel-doc:: include/linux/crc32.h
+
+.. kernel-doc:: include/linux/crc64.h
+
 Base 2 log and power Functions
 ------------------------------
diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst
new file mode 100644
index 000000000000..799d743105a6
--- /dev/null
+++ b/Documentation/core-api/kho/abi.rst
@@ -0,0 +1,34 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+==================
+Kexec Handover ABI
+==================
+
+Core Kexec Handover ABI
+=======================
+
+.. kernel-doc:: include/linux/kho/abi/kexec_handover.h
+   :doc: Kexec Handover ABI
+
+vmalloc preservation ABI
+========================
+
+.. kernel-doc:: include/linux/kho/abi/kexec_handover.h
+   :doc: Kexec Handover ABI for vmalloc Preservation
+
+memblock preservation ABI
+=========================
+
+.. kernel-doc:: include/linux/kho/abi/memblock.h
+   :doc: memblock kexec handover ABI
+
+KHO persistent memory tracker ABI
+=================================
+
+.. kernel-doc:: include/linux/kho/abi/kexec_handover.h
+   :doc: KHO persistent memory tracker
+
+See Also
+========
+
+- :doc:`/admin-guide/mm/kho`
diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst
new file mode 100644
index 000000000000..0a2dee4f8e7d
--- /dev/null
+++ b/Documentation/core-api/kho/index.rst
@@ -0,0 +1,89 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+.. _kho-concepts:
+
+========================
+Kexec Handover Subsystem
+========================
+
+Overview
+========
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory
+regions, which could contain serialized system states, across kexec.
+
+KHO uses a :ref:`flattened device tree (FDT) <kho_fdt>` to pass information
+about the preserved state from the pre-kexec kernel to the post-kexec kernel,
+and :ref:`scratch memory regions <kho_scratch>` to ensure integrity of the
+preserved memory.
+
+.. _kho_fdt:
+
+KHO FDT
+=======
+
+Every KHO kexec carries a KHO-specific flattened device tree (FDT) blob that
+describes the preserved state. The FDT includes properties describing preserved
+memory regions and nodes that hold subsystem-specific state.
+
+The preserved memory regions contain either serialized subsystem states, or
+in-memory data that shall not be touched across kexec. After KHO, subsystems
+can retrieve and restore the preserved state from the KHO FDT.
+
+Subsystems participating in KHO can define their own format for state
+serialization and preservation.
+
+The KHO FDT and the structures defined by the subsystems form an ABI between
+the pre-kexec and post-kexec kernels. This ABI is defined by the header files
+in the ``include/linux/kho/abi`` directory.
+
+.. toctree::
+   :maxdepth: 1
+
+   abi.rst
+
+.. _kho_scratch:
+
+Scratch Regions
+===============
+
+To boot into kexec, we need to have a physically contiguous memory range that
+contains no handed over memory. Kexec then places the target kernel and initrd
+into that region. The new kernel exclusively uses this region for memory
+allocations during boot, up to the initialization of the page allocator.
+
+We guarantee that we always have such regions through the scratch regions: on
+first boot, KHO allocates several physically contiguous memory regions. Since
+after kexec these regions will be used by early memory allocations, there is a
+scratch region per NUMA node plus a scratch region to satisfy allocation
+requests that do not require a particular NUMA node assignment.
+By default, the size of the scratch regions is calculated based on the amount
+of memory allocated during boot.
+The ``kho_scratch`` kernel command line option may be used to explicitly
+define the size of the scratch regions.
+The scratch regions are declared as CMA when the page allocator is initialized
+so that their memory can be used during the system lifetime. CMA gives us the
+guarantee that no handover pages land in that region, because handover pages
+must be at a static physical memory location and CMA enforces that only
+movable pages can be located inside.
+
+After a KHO kexec, we ignore the ``kho_scratch`` kernel command line option and
+instead reuse the exact same regions that were originally allocated. This
+allows us to recursively execute any number of KHO kexecs. Because we used
+these regions for boot memory allocations and as target memory for kexec
+blobs, some parts of the regions may be reserved. These reservations are
+irrelevant for the next KHO, because kexec can overwrite even the original
+kernel.
+
+Kexec Handover Radix Tree
+=========================
+
+.. kernel-doc:: include/linux/kho_radix_tree.h
+   :doc: Kexec Handover Radix Tree
+
+Public API
+==========
+
+.. kernel-doc:: kernel/liveupdate/kexec_handover.c
+   :export:
+
+See Also
+========
+
+- :doc:`/admin-guide/mm/kho`
diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst
index 7310247310a0..5f6c61bc03bf 100644
--- a/Documentation/core-api/kobject.rst
+++ b/Documentation/core-api/kobject.rst
@@ -78,7 +78,7 @@ just a matter of using the kobj member. Code that works with kobjects will
 often have the opposite problem, however: given a struct kobject pointer,
 what is the pointer to the containing structure? You must avoid tricks (such
 as assuming that the kobject is at the beginning of the structure)
-and, instead, use the container_of() macro, found in ``<linux/kernel.h>``::
+and, instead, use the container_of() macro, found in ``<linux/container_of.h>``::
 
    container_of(ptr, type, member)
diff --git a/Documentation/core-api/librs.rst b/Documentation/core-api/librs.rst
index 6010f5bc5bf9..0d88893dbc03 100644
--- a/Documentation/core-api/librs.rst
+++ b/Documentation/core-api/librs.rst
@@ -209,4 +209,4 @@ testing. Thanks a lot.
 
 The following people have contributed to this document:
 
-Thomas Gleixner\ tglx@linutronix.de
+Thomas Gleixner\ tglx@kernel.org
diff --git a/Documentation/core-api/list.rst b/Documentation/core-api/list.rst
new file mode 100644
index 000000000000..241464ca0549
--- /dev/null
+++ b/Documentation/core-api/list.rst
@@ -0,0 +1,785 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=====================
+Linked Lists in Linux
+=====================
+
+:Author: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
+
+.. contents::
+
+Introduction
+============
+
+Linked lists are one of the most basic data structures used in many programs.
+The Linux kernel implements several different flavours of linked lists. The
+purpose of this document is not to explain linked lists in general, but to show
+new kernel developers how to use the Linux kernel implementations of linked
+lists.
+
+Please note that while linked lists certainly are ubiquitous, they are rarely
+the best data structure to use in cases where a simple array doesn't already
+suffice. In particular, due to their poor data locality, linked lists are a bad
+choice in situations where performance is a consideration. Familiarizing
+oneself with other in-kernel generic data structures, especially for concurrent
+accesses, is highly encouraged.
+ +Linux implementation of doubly linked lists +=========================================== + +Linux's linked list implementations can be used by including the header file +``<linux/list.h>``. + +The doubly-linked list will likely be the most familiar to many readers. It's a +list that can efficiently be traversed forwards and backwards. + +The Linux kernel's doubly-linked list is circular in nature. This means that to +get from the head node to the tail, we can just travel one edge backwards. +Similarly, to get from the tail node to the head, we can simply travel forwards +"beyond" the tail and arrive back at the head. + +Declaring a node +---------------- + +A node in a doubly-linked list is declared by adding a struct list_head +member to the data structure you wish to be contained in the list: + +.. code-block:: c + + struct clown { + unsigned long long shoe_size; + const char *name; + struct list_head node; /* the aforementioned member */ + }; + +This may be an unfamiliar approach to some, as the classical explanation of a +linked list is a list node data structure with pointers to the previous and next +list node, as well the payload data. Linux chooses this approach because it +allows for generic list modification code regardless of what data structure is +contained within the list. Since the struct list_head member is not a pointer +but part of the data structure proper, the container_of() pattern can be used by +the list implementation to access the payload data regardless of its type, while +staying oblivious to what said type actually is. + +Declaring and initializing a list +--------------------------------- + +A doubly-linked list can then be declared as just another struct list_head, +and initialized with the LIST_HEAD_INIT() macro during initial assignment, or +with the INIT_LIST_HEAD() function later: + +.. code-block:: c + + struct clown_car { + int tyre_pressure[4]; + struct list_head clowns; /* Looks like a node! */ + }; + + /* ... Somewhere later in our driver ... */ + + static int circus_init(struct circus_priv *circus) + { + struct clown_car other_car = { + .tyre_pressure = {10, 12, 11, 9}, + .clowns = LIST_HEAD_INIT(other_car.clowns) + }; + + INIT_LIST_HEAD(&circus->car.clowns); + + return 0; + } + +A further point of confusion to some may be that the list itself doesn't really +have its own type. The concept of the entire linked list and a +struct list_head member that points to other entries in the list are one and +the same. + +Adding nodes to the list +------------------------ + +Adding a node to the linked list is done through the list_add() macro. + +We'll return to our clown car example to illustrate how nodes get added to the +list: + +.. code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + struct clown_car *car = &circus->car; + struct clown *grock; + struct clown *dimitri; + + /* State 1 */ + + grock = kzalloc(sizeof(*grock), GFP_KERNEL); + if (!grock) + return -ENOMEM; + grock->name = "Grock"; + grock->shoe_size = 1000; + + /* Note that we're adding the "node" member */ + list_add(&grock->node, &car->clowns); + + /* State 2 */ + + dimitri = kzalloc(sizeof(*dimitri), GFP_KERNEL); + if (!dimitri) + return -ENOMEM; + dimitri->name = "Dimitri"; + dimitri->shoe_size = 50; + + list_add(&dimitri->node, &car->clowns); + + /* State 3 */ + + return 0; + } + +In State 1, our list of clowns is still empty:: + + .------. + v | + .--------. | + | clowns |--' + '--------' + +This diagram shows the singular "clowns" node pointing at itself. 
In this +diagram, and all following diagrams, only the forward edges are shown, to aid in +clarity. + +In State 2, we've added Grock after the list head:: + + .--------------------. + v | + .--------. .-------. | + | clowns |---->| Grock |--' + '--------' '-------' + +This diagram shows the "clowns" node pointing at a new node labeled "Grock". +The Grock node is pointing back at the "clowns" node. + +In State 3, we've added Dimitri after the list head, resulting in the following:: + + .------------------------------------. + v | + .--------. .---------. .-------. | + | clowns |---->| Dimitri |---->| Grock |--' + '--------' '---------' '-------' + +This diagram shows the "clowns" node pointing at a new node labeled "Dimitri", +which then points at the node labeled "Grock". The "Grock" node still points +back at the "clowns" node. + +If we wanted to have Dimitri inserted at the end of the list instead, we'd use +list_add_tail(). Our code would then look like this: + +.. code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + /* ... */ + + list_add_tail(&dimitri->node, &car->clowns); + + /* State 3b */ + + return 0; + } + +This results in the following list:: + + .------------------------------------. + v | + .--------. .-------. .---------. | + | clowns |---->| Grock |---->| Dimitri |--' + '--------' '-------' '---------' + +This diagram shows the "clowns" node pointing at the node labeled "Grock", +which points at the new node labeled "Dimitri". The node labeled "Dimitri" +points back at the "clowns" node. + +Traversing the list +------------------- + +To iterate the list, we can loop through all nodes within the list with +list_for_each(). + +In our clown example, this results in the following somewhat awkward code: + +.. code-block:: c + + static unsigned long long circus_get_max_shoe_size(struct circus_priv *circus) + { + unsigned long long res = 0; + struct clown *e; + struct list_head *cur; + + list_for_each(cur, &circus->car.clowns) { + e = list_entry(cur, struct clown, node); + if (e->shoe_size > res) + res = e->shoe_size; + } + + return res; + } + +The list_entry() macro internally uses the aforementioned container_of() to +retrieve the data structure instance that ``node`` is a member of. + +Note how the additional list_entry() call is a little awkward here. It's only +there because we're iterating through the ``node`` members, but we really want +to iterate through the payload, i.e. the ``struct clown`` that contains each +node's struct list_head. For this reason, there is a second macro: +list_for_each_entry() + +Using it would change our code to something like this: + +.. code-block:: c + + static unsigned long long circus_get_max_shoe_size(struct circus_priv *circus) + { + unsigned long long res = 0; + struct clown *e; + + list_for_each_entry(e, &circus->car.clowns, node) { + if (e->shoe_size > res) + res = e->shoe_size; + } + + return res; + } + +This eliminates the need for the list_entry() step, and our loop cursor is now +of the type of our payload. The macro is given the member name that corresponds +to the list's struct list_head within the clown data structure so that it can +still walk the list. + +Removing nodes from the list +---------------------------- + +The list_del() function can be used to remove entries from the list. It not only +removes the given entry from the list, but poisons the entry's ``prev`` and +``next`` pointers, so that unintended use of the entry after removal does not +go unnoticed. 
+ +We can extend our previous example to remove one of the entries: + +.. code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + /* ... */ + + list_add(&dimitri->node, &car->clowns); + + /* State 3 */ + + list_del(&dimitri->node); + + /* State 4 */ + + return 0; + } + +The result of this would be this:: + + .--------------------. + v | + .--------. .-------. | .---------. + | clowns |---->| Grock |--' | Dimitri | + '--------' '-------' '---------' + +This diagram shows the "clowns" node pointing at the node labeled "Grock", +which points back at the "clowns" node. Off to the side is a lone node labeled +"Dimitri", which has no arrows pointing anywhere. + +Note how the Dimitri node does not point to itself; its pointers are +intentionally set to a "poison" value that the list code refuses to traverse. + +If we wanted to reinitialize the removed node instead to make it point at itself +again like an empty list head, we can use list_del_init() instead: + +.. code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + /* ... */ + + list_add(&dimitri->node, &car->clowns); + + /* State 3 */ + + list_del_init(&dimitri->node); + + /* State 4b */ + + return 0; + } + +This results in the deleted node pointing to itself again:: + + .--------------------. .-------. + v | v | + .--------. .-------. | .---------. | + | clowns |---->| Grock |--' | Dimitri |--' + '--------' '-------' '---------' + +This diagram shows the "clowns" node pointing at the node labeled "Grock", +which points back at the "clowns" node. Off to the side is a lone node labeled +"Dimitri", which points to itself. + +Traversing whilst removing nodes +-------------------------------- + +Deleting entries while we're traversing the list will cause problems if we use +list_for_each() and list_for_each_entry(), as deleting the current entry would +modify the ``next`` pointer of it, which means the traversal can't properly +advance to the next list entry. + +There is a solution to this however: list_for_each_safe() and +list_for_each_entry_safe(). These take an additional parameter of a pointer to +a struct list_head to use as temporary storage for the next entry during +iteration, solving the issue. + +An example of how to use it: + +.. code-block:: c + + static void circus_eject_insufficient_clowns(struct circus_priv *circus) + { + struct clown *e; + struct clown *n; /* temporary storage for safe iteration */ + + list_for_each_entry_safe(e, n, &circus->car.clowns, node) { + if (e->shoe_size < 500) + list_del(&e->node); + } + } + +Proper memory management (i.e. freeing the deleted node while making sure +nothing still references it) in this case is left as an exercise to the reader. + +Cutting a list +-------------- + +There are two helper functions to cut lists with. Both take elements from the +list ``head``, and replace the contents of the list ``list``. + +The first such function is list_cut_position(). It removes all list entries from +``head`` up to and including ``entry``, placing them in ``list`` instead. + +In this example, it's assumed we start with the following list:: + + .----------------------------------------------------------------. + v | + .--------. .-------. .---------. .-----. .---------. 
| + | clowns |---->| Grock |---->| Dimitri |---->| Pic |---->| Alfredo |--' + '--------' '-------' '---------' '-----' '---------' + +With the following code, every clown up to and including "Pic" is moved from +the "clowns" list head to a separate struct list_head initialized at local +stack variable ``retirement``: + +.. code-block:: c + + static void circus_retire_clowns(struct circus_priv *circus) + { + struct list_head retirement = LIST_HEAD_INIT(retirement); + struct clown *grock, *dimitri, *pic, *alfredo; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + list_cut_position(&retirement, &car->clowns, &pic->node); + + /* State 1 */ + } + +The resulting ``car->clowns`` list would be this:: + + .----------------------. + v | + .--------. .---------. | + | clowns |---->| Alfredo |--' + '--------' '---------' + +Meanwhile, the ``retirement`` list is transformed to the following:: + + .--------------------------------------------------. + v | + .------------. .-------. .---------. .-----. | + | retirement |---->| Grock |---->| Dimitri |---->| Pic |--' + '------------' '-------' '---------' '-----' + +The second function, list_cut_before(), is much the same, except it cuts before +the ``entry`` node, i.e. it removes all list entries from ``head`` up to but +excluding ``entry``, placing them in ``list`` instead. This example assumes the +same initial starting list as the previous example: + +.. code-block:: c + + static void circus_retire_clowns(struct circus_priv *circus) + { + struct list_head retirement = LIST_HEAD_INIT(retirement); + struct clown *grock, *dimitri, *pic, *alfredo; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + list_cut_before(&retirement, &car->clowns, &pic->node); + + /* State 1b */ + } + +The resulting ``car->clowns`` list would be this:: + + .----------------------------------. + v | + .--------. .-----. .---------. | + | clowns |---->| Pic |---->| Alfredo |--' + '--------' '-----' '---------' + +Meanwhile, the ``retirement`` list is transformed to the following:: + + .--------------------------------------. + v | + .------------. .-------. .---------. | + | retirement |---->| Grock |---->| Dimitri |--' + '------------' '-------' '---------' + +It should be noted that both functions will destroy links to any existing nodes +in the destination ``struct list_head *list``. + +Moving entries and partial lists +-------------------------------- + +The list_move() and list_move_tail() functions can be used to move an entry +from one list to another, to either the start or end respectively. + +In the following example, we'll assume we start with two lists ("clowns" and +"sidewalk" in the following initial state "State 0":: + + .----------------------------------------------------------------. + v | + .--------. .-------. .---------. .-----. .---------. | + | clowns |---->| Grock |---->| Dimitri |---->| Pic |---->| Alfredo |--' + '--------' '-------' '---------' '-----' '---------' + + .-------------------. + v | + .----------. .-----. | + | sidewalk |---->| Pio |--' + '----------' '-----' + +We apply the following example code to the two lists: + +.. code-block:: c + + static void circus_clowns_exit_car(struct circus_priv *circus) + { + struct list_head sidewalk = LIST_HEAD_INIT(sidewalk); + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... 
*/ + + /* State 0 */ + + list_move(&pic->node, &sidewalk); + + /* State 1 */ + + list_move_tail(&dimitri->node, &sidewalk); + + /* State 2 */ + } + +In State 1, we arrive at the following situation:: + + .-----------------------------------------------------. + | | + v | + .--------. .-------. .---------. .---------. | + | clowns |---->| Grock |---->| Dimitri |---->| Alfredo |--' + '--------' '-------' '---------' '---------' + + .-------------------------------. + v | + .----------. .-----. .-----. | + | sidewalk |---->| Pic |---->| Pio |--' + '----------' '-----' '-----' + +In State 2, after we've moved Dimitri to the tail of sidewalk, the situation +changes as follows:: + + .-------------------------------------. + | | + v | + .--------. .-------. .---------. | + | clowns |---->| Grock |---->| Alfredo |--' + '--------' '-------' '---------' + + .-----------------------------------------------. + v | + .----------. .-----. .-----. .---------. | + | sidewalk |---->| Pic |---->| Pio |---->| Dimitri |--' + '----------' '-----' '-----' '---------' + +As long as the source and destination list head are part of the same list, we +can also efficiently bulk move a segment of the list to the tail end of the +list. We continue the previous example by adding a list_bulk_move_tail() after +State 2, moving Pic and Pio to the tail end of the sidewalk list. + +.. code-block:: c + + static void circus_clowns_exit_car(struct circus_priv *circus) + { + struct list_head sidewalk = LIST_HEAD_INIT(sidewalk); + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + /* State 0 */ + + list_move(&pic->node, &sidewalk); + + /* State 1 */ + + list_move_tail(&dimitri->node, &sidewalk); + + /* State 2 */ + + list_bulk_move_tail(&sidewalk, &pic->node, &pio->node); + + /* State 3 */ + } + +For the sake of brevity, only the altered "sidewalk" list at State 3 is depicted +in the following diagram:: + + .-----------------------------------------------. + v | + .----------. .---------. .-----. .-----. | + | sidewalk |---->| Dimitri |---->| Pic |---->| Pio |--' + '----------' '---------' '-----' '-----' + +Do note that list_bulk_move_tail() does not do any checking as to whether all +three supplied ``struct list_head *`` parameters really do belong to the same +list. If you use it outside the constraints the documentation gives, then the +result is a matter between you and the implementation. + +Rotating entries +---------------- + +A common write operation on lists, especially when using them as queues, is +to rotate it. A list rotation means entries at the front are sent to the back. + +For rotation, Linux provides us with two functions: list_rotate_left() and +list_rotate_to_front(). The former can be pictured like a bicycle chain, taking +the entry after the supplied ``struct list_head *`` and moving it to the tail, +which in essence means the entire list, due to its circular nature, rotates by +one position. + +The latter, list_rotate_to_front(), takes the same concept one step further: +instead of advancing the list by one entry, it advances it *until* the specified +entry is the new front. + +In the following example, our starting state, State 0, is the following:: + + .-----------------------------------------------------------------. + v | + .--------. .-------. .---------. .-----. .---------. .-----. 
| + | clowns |-->| Grock |-->| Dimitri |-->| Pic |-->| Alfredo |-->| Pio |-' + '--------' '-------' '---------' '-----' '---------' '-----' + +The example code being used to demonstrate list rotations is the following: + +.. code-block:: c + + static void circus_clowns_rotate(struct circus_priv *circus) + { + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + /* State 0 */ + + list_rotate_left(&car->clowns); + + /* State 1 */ + + list_rotate_to_front(&alfredo->node, &car->clowns); + + /* State 2 */ + + } + +In State 1, we arrive at the following situation:: + + .-----------------------------------------------------------------. + v | + .--------. .---------. .-----. .---------. .-----. .-------. | + | clowns |-->| Dimitri |-->| Pic |-->| Alfredo |-->| Pio |-->| Grock |-' + '--------' '---------' '-----' '---------' '-----' '-------' + +Next, after the list_rotate_to_front() call, we arrive in the following +State 2:: + + .-----------------------------------------------------------------. + v | + .--------. .---------. .-----. .-------. .---------. .-----. | + | clowns |-->| Alfredo |-->| Pio |-->| Grock |-->| Dimitri |-->| Pic |-' + '--------' '---------' '-----' '-------' '---------' '-----' + +As is hopefully evident from the diagrams, the entries in front of "Alfredo" +were cycled to the tail end of the list. + +Swapping entries +---------------- + +Another common operation is that two entries need to be swapped with each other. + +For this, Linux provides us with list_swap(). + +In the following example, we have a list with three entries, and swap two of +them. This is our starting state in "State 0":: + + .-----------------------------------------. + v | + .--------. .-------. .---------. .-----. | + | clowns |-->| Grock |-->| Dimitri |-->| Pic |-' + '--------' '-------' '---------' '-----' + +.. code-block:: c + + static void circus_clowns_swap(struct circus_priv *circus) + { + struct clown *grock, *dimitri, *pic; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + /* State 0 */ + + list_swap(&dimitri->node, &pic->node); + + /* State 1 */ + } + +The resulting list at State 1 is the following:: + + .-----------------------------------------. + v | + .--------. .-------. .-----. .---------. | + | clowns |-->| Grock |-->| Pic |-->| Dimitri |-' + '--------' '-------' '-----' '---------' + +As is evident by comparing the diagrams, the "Pic" and "Dimitri" nodes have +traded places. + +Splicing two lists together +--------------------------- + +Say we have two lists, in the following example one represented by a list head +we call "knie" and one we call "stey". In a hypothetical circus acquisition, +the two list of clowns should be spliced together. The following is our +situation in "State 0":: + + .-----------------------------------------. + | | + v | + .------. .-------. .---------. .-----. | + | knie |-->| Grock |-->| Dimitri |-->| Pic |--' + '------' '-------' '---------' '-----' + + .-----------------------------. + v | + .------. .---------. .-----. | + | stey |-->| Alfredo |-->| Pio |--' + '------' '---------' '-----' + +The function to splice these two lists together is list_splice(). Our example +code is as follows: + +.. 
code-block:: c + + static void circus_clowns_splice(void) + { + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct list_head knie = LIST_HEAD_INIT(knie); + struct list_head stey = LIST_HEAD_INIT(stey); + + /* ... Clown allocation and initialization here ... */ + + list_add_tail(&grock->node, &knie); + list_add_tail(&dimitri->node, &knie); + list_add_tail(&pic->node, &knie); + list_add_tail(&alfredo->node, &stey); + list_add_tail(&pio->node, &stey); + + /* State 0 */ + + list_splice(&stey, &dimitri->node); + + /* State 1 */ + } + +The list_splice() call here adds all the entries in ``stey`` to the list +``dimitri``'s ``node`` list_head is in, after the ``node`` of ``dimitri``. A +somewhat surprising diagram of the resulting "State 1" follows:: + + .-----------------------------------------------------------------. + | | + v | + .------. .-------. .---------. .---------. .-----. .-----. | + | knie |-->| Grock |-->| Dimitri |-->| Alfredo |-->| Pio |-->| Pic |--' + '------' '-------' '---------' '---------' '-----' '-----' + ^ + .-------------------------------' + | + .------. | + | stey |--' + '------' + +Traversing the ``stey`` list no longer results in correct behavior. A call of +list_for_each() on ``stey`` results in an infinite loop, as it never returns +back to the ``stey`` list head. + +This is because list_splice() did not reinitialize the list_head it took +entries from, leaving its pointer pointing into what is now a different list. + +If we want to avoid this situation, list_splice_init() can be used. It does the +same thing as list_splice(), except reinitalizes the donor list_head after the +transplant. + +Concurrency considerations +-------------------------- + +Concurrent access and modification of a list needs to be protected with a lock +in most cases. Alternatively and preferably, one may use the RCU primitives for +lists in read-mostly use-cases, where read accesses to the list are common but +modifications to the list less so. See Documentation/RCU/listRCU.rst for more +details. + +Further reading +--------------- + +* `How does the kernel implements Linked Lists? - KernelNewbies <https://kernelnewbies.org/FAQ/LinkedLists>`_ + +Full List API +============= + +.. kernel-doc:: include/linux/list.h + :internal: + +Private List API +================ + +.. kernel-doc:: include/linux/list_private.h + :doc: Private List Primitives + +.. kernel-doc:: include/linux/list_private.h + :internal: diff --git a/Documentation/core-api/liveupdate.rst b/Documentation/core-api/liveupdate.rst new file mode 100644 index 000000000000..5a292d0f3706 --- /dev/null +++ b/Documentation/core-api/liveupdate.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +Live Update Orchestrator +======================== +:Author: Pasha Tatashin <pasha.tatashin@soleen.com> + +.. kernel-doc:: kernel/liveupdate/luo_core.c + :doc: Live Update Orchestrator (LUO) + +LUO Sessions +============ +.. kernel-doc:: kernel/liveupdate/luo_session.c + :doc: LUO Sessions + +LUO Preserving File Descriptors +=============================== +.. kernel-doc:: kernel/liveupdate/luo_file.c + :doc: LUO File Descriptors + +LUO File Lifecycle Bound Global Data +==================================== +.. kernel-doc:: kernel/liveupdate/luo_flb.c + :doc: LUO File Lifecycle Bound Global Data + +Live Update Orchestrator ABI +============================ +.. kernel-doc:: include/linux/kho/abi/luo.h + :doc: Live Update Orchestrator ABI + +The following types of file descriptors can be preserved + +.. 
toctree::
+   :maxdepth: 1
+
+   ../mm/memfd_preservation
+
+Public API
+==========
+.. kernel-doc:: include/linux/liveupdate.h
+
+.. kernel-doc:: include/linux/kho/abi/luo.h
+   :functions:
+
+.. kernel-doc:: kernel/liveupdate/luo_core.c
+   :export:
+
+.. kernel-doc:: kernel/liveupdate/luo_flb.c
+   :export:
+
+.. kernel-doc:: kernel/liveupdate/luo_file.c
+   :export:
+
+Internal API
+============
+.. kernel-doc:: kernel/liveupdate/luo_core.c
+   :internal:
+
+.. kernel-doc:: kernel/liveupdate/luo_flb.c
+   :internal:
+
+.. kernel-doc:: kernel/liveupdate/luo_session.c
+   :internal:
+
+.. kernel-doc:: kernel/liveupdate/luo_file.c
+   :internal:
+
+See Also
+========
+
+- :doc:`Live Update uAPI </userspace-api/liveupdate>`
+- :doc:`/core-api/kho/index`
diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
index 682259ee633a..46b0490f5319 100644
--- a/Documentation/core-api/memory-hotplug.rst
+++ b/Documentation/core-api/memory-hotplug.rst
@@ -9,6 +9,9 @@ Memory hotplug event notifier
 
 Hotplugging events are sent to a notification queue.
 
+Memory notifier
+---------------
+
 There are six types of notification defined in ``include/linux/memory.h``:
 
 MEM_GOING_ONLINE
@@ -56,20 +59,18 @@ The third argument (arg) passes a pointer of struct memory_notify::
 
 	struct memory_notify {
 		unsigned long start_pfn;
 		unsigned long nr_pages;
-		int status_change_nid_normal;
-		int status_change_nid;
 	}
 
 - start_pfn is start_pfn of online/offline memory.
 - nr_pages is # of pages of online/offline memory.
 
-- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
-  is (will be) set/clear, if this is -1, then nodemask status is not changed.
-- status_change_nid is set node id when N_MEMORY of nodemask is (will be)
-  set/clear. It means a new(memoryless) node gets new memory by online and a
-  node loses all memory. If this is -1, then nodemask status is not changed.
-  If status_changed_nid* >= 0, callback should create/discard structures for the
-  node if necessary.
+It is possible to get notified for MEM_CANCEL_ONLINE without having been
+notified for MEM_GOING_ONLINE, and the same applies to MEM_CANCEL_OFFLINE and
+MEM_GOING_OFFLINE.
+This can happen when a consumer fails, meaning we break the callchain and
+stop calling the remaining consumers of the notifier.
+It is therefore important that users of memory_notify make no assumptions and
+are prepared to handle such cases.
 
 The callback routine shall return one of the values
 NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
@@ -83,6 +84,78 @@ further processing of the notification queue.
 
 NOTIFY_STOP stops further processing of the notification queue.
 
+Numa node notifier
+------------------
+
+There are six types of notification defined in ``include/linux/node.h``:
+
+NODE_ADDING_FIRST_MEMORY
+  Generated before memory becomes available to this node for the first time.
+
+NODE_CANCEL_ADDING_FIRST_MEMORY
+  Generated if NODE_ADDING_FIRST_MEMORY fails.
+
+NODE_ADDED_FIRST_MEMORY
+  Generated when memory has become available to this node for the first time.
+
+NODE_REMOVING_LAST_MEMORY
+  Generated when the last memory available to this node is about to be
+  offlined.
+
+NODE_CANCEL_REMOVING_LAST_MEMORY
+  Generated if NODE_REMOVING_LAST_MEMORY fails.
+
+NODE_REMOVED_LAST_MEMORY
+  Generated when the last memory available to this node has been offlined.
+
+A callback routine can be registered by calling::
+
+  hotplug_node_notifier(callback_func, priority)
+
+Callback functions with higher values of priority are called before callback
+functions with lower values.
+
+A callback function must have the following prototype::
+
+  int callback_func(struct notifier_block *self,
+                    unsigned long action, void *arg);
+
+The first argument of the callback function (self) is a pointer to the block
+of the notifier chain that points to the callback function itself.
+The second argument (action) is one of the event types described above.
+The third argument (arg) passes a pointer of struct node_notify::
+
+  struct node_notify {
+      int nid;
+  }
+
+- nid is the node to or from which we are adding or removing memory.
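+
+As an illustrative sketch (``my_prepare_node_data()`` and
+``my_discard_node_data()`` are hypothetical helpers, not kernel API), a
+consumer that maintains per-node structures might implement the callback as::
+
+  static int my_node_callback(struct notifier_block *self,
+                              unsigned long action, void *arg)
+  {
+      struct node_notify *nn = arg;
+
+      switch (action) {
+      case NODE_ADDING_FIRST_MEMORY:
+          /* Returning NOTIFY_BAD here cancels the hotplug operation */
+          if (my_prepare_node_data(nn->nid))
+              return NOTIFY_BAD;
+          break;
+      case NODE_CANCEL_ADDING_FIRST_MEMORY:
+      case NODE_REMOVED_LAST_MEMORY:
+          /* Tear down whatever NODE_ADDING_FIRST_MEMORY set up */
+          my_discard_node_data(nn->nid);
+          break;
+      }
+      return NOTIFY_OK;
+  }
+
+It would then be registered from an init path with, for example,
+``hotplug_node_notifier(my_node_callback, 0)``.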
+
+A callback routine can be registered by calling::
+
+  hotplug_node_notifier(callback_func, priority)
+
+Callback functions with higher values of priority are called before callback
+functions with lower values.
+
+A callback function must have the following prototype::
+
+  int callback_func(
+    struct notifier_block *self, unsigned long action, void *arg);
+
+The first argument of the callback function (self) is a pointer to the block
+of the notifier chain that points to the callback function itself.
+The second argument (action) is one of the event types described above.
+The third argument (arg) passes a pointer of struct node_notify::
+
+  struct node_notify {
+    int nid;
+  }
+
+- nid is the node to which memory is being added, or from which it is being
+  removed.
+
+It is possible to get notified for NODE_CANCEL_ADDING_FIRST_MEMORY without
+having been notified for NODE_ADDING_FIRST_MEMORY, and the same applies to
+NODE_CANCEL_REMOVING_LAST_MEMORY and NODE_REMOVING_LAST_MEMORY.
+This can happen when a consumer fails, in which case the call chain is broken
+and the remaining consumers of the notifier are not called.
+It is then important that users of node_notify make no assumptions and are
+prepared to handle such cases.
+
+The callback routine shall return one of the values
+NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP
+defined in ``include/linux/notifier.h``.
+
+NOTIFY_DONE and NOTIFY_OK have no effect on the further processing.
+
+NOTIFY_BAD is used as a response to the NODE_ADDING_FIRST_MEMORY,
+NODE_REMOVING_LAST_MEMORY, NODE_ADDED_FIRST_MEMORY or
+NODE_REMOVED_LAST_MEMORY action to cancel hotplugging.
+It stops further processing of the notification queue.
+
+NOTIFY_STOP stops further processing of the notification queue.
+
+Please note that the callback must not fail for NODE_ADDED_FIRST_MEMORY /
+NODE_REMOVED_LAST_MEMORY, as the memory_hotplug code cannot roll back at that
+point anymore.
+
Locking Internals
=================
diff --git a/Documentation/core-api/min_heap.rst b/Documentation/core-api/min_heap.rst
index 683bc6d09f00..9f57766581df
--- a/Documentation/core-api/min_heap.rst
+++ b/Documentation/core-api/min_heap.rst
@@ -47,8 +47,8 @@ Example:

  #define MIN_HEAP_PREALLOCATED(_type, _name, _nr)
  struct _name {
-     int nr;    /* Number of elements in the heap */
-     int size;  /* Maximum number of elements that can be held */
+     size_t nr;    /* Number of elements in the heap */
+     size_t size;  /* Maximum number of elements that can be held */
      _type *data;  /* Pointer to the heap data */
      _type preallocated[_nr];  /* Static preallocated array */
  }
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index af8151db88b2..aabdd3cba58e
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -91,12 +91,6 @@ Memory pools
.. kernel-doc:: mm/mempool.c
   :export:

-DMA pools
-=========
-
-.. kernel-doc:: mm/dmapool.c
-   :export:
-
More Memory Management Functions
================================

@@ -124,7 +118,6 @@ More Memory Management Functions
.. kernel-doc:: mm/memremap.c
.. kernel-doc:: mm/hugetlb.c
.. kernel-doc:: mm/swap.c
-.. kernel-doc:: mm/zpool.c
.. kernel-doc:: mm/memcontrol.c
.. #kernel-doc:: mm/memory-tiers.c (build warnings)
.. kernel-doc:: mm/shmem.c
@@ -137,6 +130,5 @@ More Memory Management Functions
.. kernel-doc:: mm/vmscan.c
.. kernel-doc:: mm/memory_hotplug.c
.. kernel-doc:: mm/mmu_notifier.c
-.. kernel-doc:: mm/balloon_compaction.c
+.. kernel-doc:: mm/balloon.c
.. kernel-doc:: mm/huge_memory.c
-.. kernel-doc:: mm/io-mapping.c
diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst
index 0ce2078c8e13..f68f1e08fef9
--- a/Documentation/core-api/packing.rst
+++ b/Documentation/core-api/packing.rst
@@ -319,7 +319,7 @@ Here is an example of how to use the fields APIs:

  #define SIZE 13

-  typdef struct __packed { u8 buf[SIZE]; } packed_buf_t;
+  typedef struct __packed { u8 buf[SIZE]; } packed_buf_t;

  static const struct packed_field_u8 fields[] = {
          PACKED_FIELD(100, 90, struct data, field1),
diff --git a/Documentation/core-api/printk-basics.rst b/Documentation/core-api/printk-basics.rst
index 2dde24ca7d9f..48eaff0ce44c
--- a/Documentation/core-api/printk-basics.rst
+++ b/Documentation/core-api/printk-basics.rst
@@ -103,6 +103,42 @@ For debugging purposes there are also two conditionally-compiled macros:
pr_debug() and pr_devel(), which are compiled-out unless ``DEBUG`` (or also
``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined.

+Avoiding lockups from excessive printk() use
+============================================
+
+.. note::
+
+   This section is relevant only for legacy console drivers (those not
+   using the nbcon API) and !PREEMPT_RT kernels. Once all console drivers
+   are updated to nbcon, this documentation can be removed.
+
+Using ``printk()`` in hot paths (such as interrupt handlers, timer
+callbacks, or high-frequency network receive routines) with legacy
+consoles (e.g., ``console=ttyS0``) may cause lockups. Legacy consoles
+synchronously acquire ``console_sem`` and block while flushing messages,
+potentially disabling interrupts long enough to trigger hard or soft
+lockup detectors.
+
+To avoid this:
+
+- Use rate-limited variants (e.g., ``pr_*_ratelimited()``) or one-time
+  macros (e.g., ``pr_*_once()``) to reduce message frequency.
+- Assign lower log levels (e.g., ``KERN_DEBUG``) to non-essential messages
+  and filter console output via ``console_loglevel``.
+- Use ``printk_deferred()`` to log messages immediately to the ringbuffer
+  and defer console printing. This is a workaround for legacy consoles.
+- Port legacy console drivers to the non-blocking ``nbcon`` API (indicated
+  by ``CON_NBCON``). This is the preferred solution, as nbcon consoles
+  offload message printing to a dedicated kernel thread.
+
+For temporary debugging, ``trace_printk()`` can be used, but it must not
+appear in mainline code. See ``Documentation/trace/debugging.rst`` for
+more information.
+
+If more permanent output is needed in a hot path, trace events can be used.
+See ``Documentation/trace/events.rst`` and
+``samples/trace_events/trace-events-sample.[ch]``.
+
Function reference
==================
diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index ecccc0473da9..c0b1b6089307
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -521,7 +521,7 @@ Fwnode handles

%pfw[fP]

-For printing information on fwnode handles. The default is to print the full
+For printing information on an fwnode_handle. The default is to print the full
node name, including the path. The modifiers are functionally equivalent to
%pOF above.
@@ -547,11 +547,13 @@ Time and date
  %pt[RT]s   YYYY-mm-dd HH:MM:SS
  %pt[RT]d   YYYY-mm-dd
  %pt[RT]t   HH:MM:SS
-  %pt[RT][dt][r][s]
+  %ptSp      <seconds>.<nanoseconds>
+  %pt[RST][dt][r][s]

For printing date and time as represented by::

-  R  struct rtc_time structure
+  R  content of struct rtc_time
+  S  content of struct timespec64
  T  time64_t type

in human readable format.
@@ -563,6 +565,11 @@ The %pt[RT]s (space) will override ISO 8601 separator by using ' ' (space)
instead of 'T' (Capital T) between date and time. It won't have any effect
when date or time is omitted.

+The %ptSp is equivalent to %lld.%09ld for the content of a struct timespec64.
+When the other specifiers are given, it becomes the respective equivalent of
+%ptT[dt][r][s].%09ld. In other words, the seconds are printed in the
+human-readable format, followed by a dot and the nanoseconds.
+
Passed by reference.

struct clk
@@ -571,9 +578,8 @@ struct clk
::

  %pC   pll1
-  %pCn  pll1

-For printing struct clk structures. %pC and %pCn print the name of the clock
+For printing struct clk structures. %pC prints the name of the clock
(Common Clock Framework) or a unique 32-bit ID (legacy clock framework).

Passed by reference.
@@ -648,6 +654,38 @@ Examples::

  %p4cc Y10  little-endian (0x20303159)
  %p4cc NV12 big-endian (0xb231564e)

+Generic FourCC code
+-------------------
+
+::
+
+  %p4c[h[R]lb]  gP00 (0x67503030)
+
+Print a generic FourCC code, as both ASCII characters and its numerical
+value as hexadecimal.
+
+The generic FourCC code is always printed in the big-endian format,
+the most significant byte first. This is the opposite of V4L/DRM FourCCs.
+
+The additional ``h``, ``hR``, ``l``, and ``b`` specifiers define what
+endianness is used to load the stored bytes. The data might be interpreted
+using the host, reversed host byte order, little-endian, or big-endian.
+
+Passed by reference.
+
+Examples for a little-endian machine, given &(u32)0x67503030::
+
+  %p4ch   gP00 (0x67503030)
+  %p4chR  00Pg (0x30305067)
+  %p4cl   gP00 (0x67503030)
+  %p4cb   00Pg (0x30305067)
+
+Examples for a big-endian machine, given &(u32)0x67503030::
+
+  %p4ch   gP00 (0x67503030)
+  %p4chR  00Pg (0x30305067)
+  %p4cl   00Pg (0x30305067)
+  %p4cb   gP00 (0x67503030)
+
Rust
----

@@ -661,7 +699,7 @@ Do *not* use it from C.
Thanks
======

-If you add other %p extensions, please extend <lib/test_printf.c> with
-one or more test cases, if at all feasible.
+If you add other %p extensions, please extend <lib/tests/printf_kunit.c>
+with one or more test cases, if at all feasible.

Thank you for your cooperation and attention.
diff --git a/Documentation/core-api/rbtree.rst b/Documentation/core-api/rbtree.rst
index ed1a9fbc779e..cce80e19087b
--- a/Documentation/core-api/rbtree.rst
+++ b/Documentation/core-api/rbtree.rst
@@ -197,7 +197,7 @@ Cached rbtrees
--------------

Computing the leftmost (smallest) node is quite a common task for binary
-search trees, such as for traversals or users relying on a the particular
+search trees, such as for traversals or users relying on the particular
order for their own logic. To this end, users can use 'struct rb_root_cached'
to optimize O(logN) rb_first() calls to a simple pointer fetch avoiding
potentially expensive tree iterations. This is done at negligible runtime
@@ -255,7 +255,7 @@ affected subtrees.

When erasing a node, the user must call rb_erase_augmented() instead of
rb_erase(). rb_erase_augmented() calls back into user provided functions
-to updated the augmented information on affected subtrees.
+to update the augmented information on affected subtrees.

In both cases, the callbacks are provided through struct rb_augment_callbacks.
3 callbacks must be defined:
@@ -293,7 +293,7 @@ way making it possible to do efficient lookup and exact match.

This "extra information" stored in each node is the maximum hi
(max_hi) value among all the nodes that are its descendants. This
-information can be maintained at each node just be looking at the node
+information can be maintained at each node just by looking at the node
and its immediate children. And this will be used in O(log n) lookup
for lowest match (lowest start address among all possible matches)
with something like::
diff --git a/Documentation/core-api/real-time/architecture-porting.rst b/Documentation/core-api/real-time/architecture-porting.rst
new file mode 100644
index 000000000000..c9a39d708866
--- /dev/null
+++ b/Documentation/core-api/real-time/architecture-porting.rst
@@ -0,0 +1,110 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+Porting an architecture to support PREEMPT_RT
+=============================================
+
+:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+
+This list outlines the architecture-specific requirements that must be
+implemented in order to enable PREEMPT_RT. Once all required features are
+implemented, ARCH_SUPPORTS_RT can be selected in the architecture's Kconfig
+to make PREEMPT_RT selectable.
+Many prerequisites (genirq support for example) are enforced by the common code
+and are omitted here.
+
+The optional features are not strictly required, but they are worth
+considering.
+
+Requirements
+------------
+
+Forced threaded interrupts
+  CONFIG_IRQ_FORCED_THREADING must be selected. Any interrupts that must
+  remain in hard-IRQ context must be marked with IRQF_NO_THREAD. This
+  requirement applies for instance to clocksource event interrupts,
+  perf interrupts and cascading interrupt-controller handlers.
+
+PREEMPTION support
+  Kernel preemption must be supported and requires that
+  CONFIG_ARCH_NO_PREEMPT remain unselected. Scheduling requests, such as those
+  issued from an interrupt or other exception handler, must be processed
+  immediately.
+
+POSIX CPU timers and KVM
+  POSIX CPU timers must expire from thread context rather than directly within
+  the timer interrupt. This behavior is enabled by setting the configuration
+  option CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK.
+  When virtualization support, such as KVM, is enabled,
+  CONFIG_VIRT_XFER_TO_GUEST_WORK must also be set to ensure
+  that any pending work, such as POSIX timer expiration, is handled before
+  transitioning into guest mode.
+
+Hard-IRQ and Soft-IRQ stacks
+  Soft interrupts are handled in the thread context in which they are raised. If
+  a soft interrupt is triggered from hard-IRQ context, its execution is deferred
+  to the ksoftirqd thread. Preemption is never disabled during soft interrupt
+  handling, which makes soft interrupts preemptible.
+  If an architecture provides a custom __do_softirq() implementation that uses a
+  separate stack, it must select CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK. The
+  functionality should only be enabled when CONFIG_SOFTIRQ_ON_OWN_STACK is set.
+
+FPU and SIMD access in kernel mode
+  FPU and SIMD registers are typically not used in kernel mode and are therefore
+  not saved during kernel preemption. As a result, any kernel code that uses
+  these registers must be enclosed within a kernel_fpu_begin() and
+  kernel_fpu_end() section.
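+
+  For illustration, a minimal sketch of such a section (using the x86 flavour
+  of the API; foo_simd_op() and its payload are made-up)::
+
+    #include <asm/fpu/api.h>
+
+    static void foo_simd_op(void)
+    {
+            kernel_fpu_begin();
+            /* ... code that uses FPU/SIMD registers ... */
+            kernel_fpu_end();
+    }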
+  The kernel_fpu_begin() function usually invokes local_bh_disable() to prevent
+  interruptions from softirqs and to disable regular preemption. This allows the
+  protected code to run safely in both thread and softirq contexts.
+  On PREEMPT_RT kernels, however, kernel_fpu_begin() must not call
+  local_bh_disable(). Instead, it should use preempt_disable(), since softirqs
+  are always handled in thread context under PREEMPT_RT. In this case, disabling
+  preemption alone is sufficient.
+  The crypto subsystem operates on memory pages and requires users to "walk and
+  map" these pages while processing a request. This operation must occur outside
+  the kernel_fpu_begin()/kernel_fpu_end() section because it requires preemption
+  to be enabled. These preemption points are generally sufficient to avoid
+  excessive scheduling latency.
+
+Exception handlers
+  Exception handlers, such as the page fault handler, typically enable interrupts
+  early, before invoking any generic code to process the exception. This is
+  necessary because handling a page fault may involve operations that can sleep.
+  Enabling interrupts is especially important on PREEMPT_RT, where certain
+  locks, such as spinlock_t, become sleepable. For example, handling an
+  invalid opcode may result in sending a SIGILL signal to the user task. A
+  debug exception will send a SIGTRAP signal.
+  In both cases, if the exception occurred in user space, it is safe to enable
+  interrupts early. Sending a signal requires both interrupts and kernel
+  preemption to be enabled.
+
+Optional features
+-----------------
+
+Timer and clocksource
+  A high-resolution clocksource and clockevents device are recommended. The
+  clockevents device should support the CLOCK_EVT_FEAT_ONESHOT feature for
+  optimal timer behavior. In most cases, microsecond-level accuracy is
+  sufficient.
+
+Lazy preemption
+  This mechanism allows an in-kernel scheduling request for non-real-time tasks
+  to be delayed until the task is about to return to user space. It helps avoid
+  preempting a task that holds a sleeping lock at the time of the scheduling
+  request.
+  With CONFIG_GENERIC_IRQ_ENTRY enabled, supporting this feature requires
+  defining a bit for TIF_NEED_RESCHED_LAZY, preferably near TIF_NEED_RESCHED.
+
+Serial console with NBCON
+  With PREEMPT_RT enabled, all console output is handled by a dedicated thread
+  rather than directly from the context in which printk() is invoked. This design
+  allows printk() to be safely used in atomic contexts.
+  However, this also means that if the kernel crashes and cannot switch to the
+  printing thread, no output will be visible, preventing the system from printing
+  its final messages.
+  There are exceptions for immediate output, such as during panic() handling. To
+  support this, the console driver must implement new-style lock handling. This
+  involves setting the CON_NBCON flag in console::flags and providing
+  implementations for the write_atomic, write_thread, device_lock, and
+  device_unlock callbacks.
diff --git a/Documentation/core-api/real-time/differences.rst b/Documentation/core-api/real-time/differences.rst
new file mode 100644
index 000000000000..a129570dab5a
--- /dev/null
+++ b/Documentation/core-api/real-time/differences.rst
@@ -0,0 +1,242 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+How realtime kernels differ
+===========================
+
+:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+
+Preface
+=======
+
+With forced-threaded interrupts and sleeping spin locks, code paths that
+previously caused long scheduling latencies have been made preemptible and
+moved into process context. This allows the scheduler to manage them more
+effectively and respond to higher-priority tasks with reduced latency.
+
+The following chapters provide an overview of key differences between a
+PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.
+
+Locking
+=======
+
+Spinning locks such as spinlock_t are used to provide synchronization for data
+structures accessed from both interrupt context and process context. For this
+reason, locking functions are also available with the _irq() or _irqsave()
+suffixes, which disable interrupts before acquiring the lock. This ensures that
+the lock can be safely acquired in process context when interrupts are enabled.
+
+However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
+run in hard IRQ context. As a result, there is no need to disable interrupts as
+part of the locking procedure when using spinlock_t.
+
+For low-level core components such as interrupt handling, the scheduler, or the
+timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
+traditional semantics: it disables preemption and, when used with _irq() or
+_irqsave(), also disables interrupts. This ensures proper synchronization in
+critical sections that must remain non-preemptible or with interrupts disabled.
+
+Execution context
+=================
+
+Interrupt handling in a PREEMPT_RT system is invoked in process context through
+the use of threaded interrupts. Other parts of the kernel also shift their
+execution into threaded context by different mechanisms. The goal is to keep
+execution paths preemptible, allowing the scheduler to interrupt them when a
+higher-priority task needs to run.
+
+Below is an overview of the kernel subsystems involved in this transition to
+threaded, preemptible execution.
+
+Interrupt handling
+------------------
+
+All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
+interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
+IRQF_ONESHOT flags.
+
+The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
+registered using request_threaded_irq() and providing only a threaded handler.
+Its purpose is to keep the interrupt line masked until the threaded handler has
+completed.
+
+If a primary handler is also provided in this case, it is essential that the
+handler does not acquire any sleeping locks, as it will not be threaded. The
+handler should be minimal and must avoid introducing delays, such as
+busy-waiting on hardware registers.
+
+
+Soft interrupts, bottom half handling
+-------------------------------------
+
+Soft interrupts are raised by the interrupt handler and are executed after the
+handler returns. Since they run in thread context, they can be preempted by
+other threads. Do not assume that softirq context runs with preemption
+disabled. This means you must not rely on mechanisms like local_bh_disable() in
+process context to protect per-CPU variables. Because softirq handlers are
+preemptible under PREEMPT_RT, this approach does not provide reliable
+synchronization.
+
+If this kind of protection is required for performance reasons, consider using
+local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
+verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
+necessary locking to ensure proper protection.
+
+Using local_lock_nested_bh() also makes the locking scope explicit and easier
+for readers and maintainers to understand.
+
+
+per-CPU variables
+-----------------
+
+Protecting access to per-CPU variables solely by using preempt_disable() should
+be avoided, especially if the critical section has unbounded runtime or may
+call APIs that can sleep.
+
+If using a spinlock_t is considered too costly for performance reasons,
+consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces
+no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
+that the lock is only acquired in process context and never from softirq or
+hard IRQ context.
+
+On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
+which provides safe local protection for per-CPU data while keeping the system
+preemptible.
+
+Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
+to protect per-CPU data by relying on implicit preemption disabling. If this
+inherited preemption disabling is essential, and if local_lock_t cannot be used
+due to performance constraints, brevity of the code, or abstraction boundaries
+within an API, then preempt_disable_nested() may be a suitable alternative. On
+non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
+disabled. On PREEMPT_RT, it explicitly disables preemption.
+
+Timers
+------
+
+By default, an hrtimer is executed in hard interrupt context. The exception is
+timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
+softirq context.
+
+On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
+softirq context by default, typically within the ktimersd thread. This thread
+runs at the lowest real-time priority, ensuring it executes before any
+SCHED_OTHER tasks but does not interfere with higher-priority real-time
+threads. To explicitly request execution in hard interrupt context on
+PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.
+
+Memory allocation
+-----------------
+
+The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
+gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
+necessary to use GFP_ATOMIC when allocating memory from interrupt context or
+from sections where preemption is disabled. This is because the allocator must
+not sleep in these contexts waiting for memory to become available.
+
+However, this approach does not work on PREEMPT_RT kernels. The memory
+allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
+acquired when preemption is disabled. Fortunately, this is generally not a
+problem, because PREEMPT_RT moves most contexts that would traditionally run
+with preemption or interrupts disabled into threaded context, where sleeping is
+allowed.
+
+What remains problematic is code that explicitly disables preemption or
+interrupts. In such cases, memory allocation must be performed outside the
+critical section.
+
+This restriction also applies to memory deallocation routines such as kfree()
+and free_pages(), which may also involve internal locking and must not be
+called from non-preemptible contexts.
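+
+As a rough sketch of the resulting pattern (struct foo, foo_lock and foo_list
+are made-up names), allocation moves in front of the non-preemptible region::
+
+  struct foo *f;
+  unsigned long flags;
+
+  f = kmalloc(sizeof(*f), GFP_KERNEL);
+  if (!f)
+          return -ENOMEM;
+
+  raw_spin_lock_irqsave(&foo_lock, flags);
+  /* Only publish the object here; no allocation or freeing. */
+  list_add(&f->list, &foo_list);
+  raw_spin_unlock_irqrestore(&foo_lock, flags);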
+
+IRQ work
+--------
+
+The irq_work API provides a mechanism to schedule a callback in interrupt
+context. It is designed for use in contexts where traditional scheduling is not
+possible, such as from within NMI handlers or from inside the scheduler, where
+using a workqueue would be unsafe.
+
+On non-PREEMPT_RT systems, all irq_work items are executed immediately in
+interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
+timer tick but are still executed in interrupt context.
+
+On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
+may acquire sleeping locks or have unbounded execution time, they are handled
+in thread context by a per-CPU irq_work kernel thread. This thread runs at the
+lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
+but does not interfere with higher-priority real-time threads.
+
+The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
+executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
+deferred until the next timer tick and are also executed by the per-CPU
+irq_work thread.
+
+RCU callbacks
+-------------
+
+RCU callbacks are invoked by default in softirq context. Their execution is
+important because, depending on the use case, they either free memory or ensure
+progress in state transitions. Running these callbacks as part of the softirq
+chain can lead to undesired situations, such as contention for CPU resources
+with other SCHED_OTHER tasks when executed within ksoftirqd.
+
+To avoid running callbacks in softirq context, the RCU subsystem provides a
+mechanism to execute them in process context instead. This behavior can be
+enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
+setting is enforced in kernels configured with PREEMPT_RT.
+
+Spin until ready
+================
+
+The "spin until ready" pattern involves repeatedly checking (spinning on) the
+state of a data structure until it becomes available. This pattern assumes that
+preemption, soft interrupts, or interrupts are disabled. If the data structure
+is marked busy, it is presumed to be in use by another CPU, and spinning should
+eventually succeed as that CPU makes progress.
+
+Some examples are hrtimer_cancel() and timer_delete_sync(). These functions
+cancel timers that execute with interrupts or soft interrupts disabled. If a
+thread attempts to cancel a timer and finds it active, spinning until the
+callback completes is safe because the callback can only run on another CPU and
+will eventually finish.
+
+On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
+introduces a challenge: a higher-priority thread attempting to cancel the timer
+may preempt the timer callback thread. Since the scheduler cannot migrate the
+callback thread to another CPU due to affinity constraints, spinning can result
+in livelock even on multiprocessor systems.
+
+To avoid this, both the canceling and callback sides must use a handshake
+mechanism that supports priority inheritance. This allows the canceling thread
+to suspend until the callback completes, ensuring forward progress without
+risking livelock.
+
+In order to solve the problem at the API level, the sequence locks were extended
+to allow a proper handover between the spinning reader and the possibly blocked
+writer.
+
+Sequence locks
+--------------
+
+Sequence counters and sequential locks are documented in
+Documentation/locking/seqlock.rst.
+
+The interface has been extended to ensure proper preemption states for the
+writer and spinning reader contexts. This is achieved by embedding the writer
+serialization lock directly into the sequence counter type, resulting in
+composite types such as seqcount_spinlock_t or seqcount_mutex_t.
+
+These composite types allow readers to detect an ongoing write and actively
+boost the writer's priority to help it complete its update instead of spinning
+and waiting for its completion.
+
+If the plain seqcount_t is used, extra care must be taken to synchronize the
+reader with the writer during updates. The writer must ensure its update is
+serialized and non-preemptible relative to the reader. This cannot be achieved
+using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
+preemption. In such cases, using seqcount_spinlock_t is the preferred solution.
+
+However, if there is no spinning involved, i.e., if the reader only needs to
+detect whether a write has started and not serialize against it, then using
+seqcount_t is reasonable.
diff --git a/Documentation/core-api/real-time/hardware.rst b/Documentation/core-api/real-time/hardware.rst
new file mode 100644
index 000000000000..19f9bb3786e0
--- /dev/null
+++ b/Documentation/core-api/real-time/hardware.rst
@@ -0,0 +1,132 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Considering hardware
+====================
+
+:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+
+The way a workload is handled can be influenced by the hardware it runs on.
+Key components include the CPU, memory, and the buses that connect them.
+These resources are shared among all applications on the system.
+As a result, heavy utilization of one resource by a single application
+can affect the deterministic handling of workloads in other applications.
+
+Below is a brief overview.
+
+System memory and cache
+-----------------------
+
+Main memory and the associated caches are the most common shared resources among
+tasks in a system. One task can dominate the available caches, forcing another
+task to wait until a cache line is written back to main memory before it can
+proceed. The impact of this contention varies based on write patterns and the
+size of the caches available. Larger caches may reduce stalls because more lines
+can be buffered before being written back. Conversely, certain write patterns
+may trigger the cache controller to flush many lines at once, causing
+applications to stall until the operation completes.
+
+This issue can be partly mitigated if applications do not share the same CPU
+cache. The kernel is aware of the cache topology and exports this information to
+user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
+project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.
+
+Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
+is minimized, bottlenecks can still occur when accessing system memory. Memory
+is used not only by the CPU but also by peripheral devices via DMA, such as
+graphics cards or network adapters.
+
+In some cases, cache and memory bottlenecks can be controlled if the hardware
+provides the necessary support. On x86 systems, Intel offers Cache Allocation
+Technology (CAT), which enables cache partitioning among applications and
+provides control over the interconnect. AMD provides similar functionality under
+Platform Quality of Service (PQoS).
+On Arm64, the equivalent is Memory System Resource Partitioning and Monitoring
+(MPAM).
+
+These features can be configured through the Linux Resource Control interface.
+For details, see Documentation/filesystems/resctrl.rst.
+
+The perf tool can be used to monitor cache behavior. It can analyze
+cache misses of an application and compare how they change under
+different workloads on a neighboring CPU. Even more powerful, the perf
+c2c tool can help identify cache-to-cache issues, where multiple CPU
+cores repeatedly access and modify data on the same cache line.
+
+Hardware buses
+--------------
+
+Real-time systems often need to access hardware directly to perform their work.
+Any latency in this process is undesirable, as it can affect the outcome of the
+task. For example, on an I/O bus, a changed output may not become immediately
+visible but instead appear with variable delay depending on the latency of the
+bus used for communication.
+
+A bus such as PCI is relatively simple because register accesses are routed
+directly to the connected device. In the worst case, a read operation stalls the
+CPU until the device responds.
+
+A bus such as USB is more complex, involving multiple layers. A register read
+or write is wrapped in a USB Request Block (URB), which is then sent by the
+USB host controller to the device. Timing and latency are influenced by the
+underlying USB bus. Requests cannot be sent immediately; they must align with
+the next frame boundary according to the endpoint type and the host controller's
+scheduling rules. This can introduce delays and additional latency. For example,
+a network device connected via USB may still deliver sufficient throughput, but
+the added latency when sending or receiving packets may fail to meet the
+requirements of certain real-time use cases.
+
+Additional restrictions on bus latency can arise from power management. For
+instance, PCIe with Active State Power Management (ASPM) enabled can suspend
+the link between the device and the host. While this behavior is beneficial for
+power savings, it delays device access and adds latency to responses. This issue
+is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
+affected by power management mechanisms.
+
+Virtualization
+--------------
+
+In a virtualized environment such as KVM, each guest CPU is represented as a
+thread on the host. If such a thread runs with real-time priority, the system
+should be tested to confirm it can sustain this behavior over extended periods.
+Because of its priority, the thread will not be preempted by lower-priority
+threads (such as SCHED_OTHER), which may then receive no CPU time. This can
+cause problems if a lower-priority thread is pinned to a CPU already occupied by
+a real-time task and unable to make progress. Even if a CPU has been isolated,
+the system may still (accidentally) start a per-CPU thread on that CPU.
+Ensuring that a guest CPU goes idle is difficult, as it requires avoiding both
+task scheduling and interrupt handling. Furthermore, if the guest CPU does go
+idle but the guest system is booted with the option **idle=poll**, the guest
+CPU will never enter an idle state and will instead spin until an event
+arrives.
+
+Device handling introduces additional considerations. Emulated PCI devices or
+VirtIO devices require a counterpart on the host to complete requests. This
+adds latency because the host must intercept and either process the request
+directly or schedule a thread for its completion.
+These delays can be avoided if the required PCI device is passed directly
+through to the guest. Some devices, such as networking or storage controllers,
+support the PCIe SR-IOV feature. SR-IOV allows a single PCIe device to be
+divided into multiple virtual functions, which can then be assigned to
+different guests.
+
+Networking
+----------
+
+For low-latency networking, the full networking stack may be undesirable, as it
+can introduce additional sources of delay. In this context, XDP can be used
+as a shortcut to bypass much of the stack while still relying on the kernel's
+network driver.
+
+The requirements are that the network driver must support XDP, preferably using
+an "skb pool", and that the application must use an XDP socket. Additional
+configuration may involve BPF filters, tuning networking queues, or configuring
+qdiscs for time-based transmission. These techniques are often
+applied in Time-Sensitive Networking (TSN) environments.
+
+Documenting all required steps exceeds the scope of this text. For detailed
+guidance, see the TSN documentation at https://tsn.readthedocs.io.
+
+Another useful resource is the Linux Real-Time Communication Testbench
+(https://github.com/Linutronix/RTC-Testbench).
+The goal of this project is to validate real-time network communication. It can
+be thought of as a "cyclictest" for networking and also serves as a starting
+point for application development.
diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
new file mode 100644
index 000000000000..f08d2395a22c
--- /dev/null
+++ b/Documentation/core-api/real-time/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Real-time preemption
+=====================
+
+This documentation is intended for Linux kernel developers and contributors
+interested in the inner workings of PREEMPT_RT. It explains key concepts and
+the required changes compared to a non-PREEMPT_RT configuration.
+
+.. toctree::
+   :maxdepth: 2
+
+   theory
+   differences
+   hardware
+   architecture-porting
diff --git a/Documentation/core-api/real-time/theory.rst b/Documentation/core-api/real-time/theory.rst
new file mode 100644
index 000000000000..43d0120737f8
--- /dev/null
+++ b/Documentation/core-api/real-time/theory.rst
@@ -0,0 +1,116 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Theory of operation
+=====================
+
+:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+
+Preface
+=======
+
+PREEMPT_RT transforms the Linux kernel into a real-time kernel. It achieves
+this by replacing locking primitives, such as spinlock_t, with a preemptible
+and priority-inheritance aware implementation known as rtmutex, and by enforcing
+the use of threaded interrupts. As a result, the kernel becomes fully
+preemptible, with the exception of a few critical code paths, including entry
+code, the scheduler, and low-level interrupt handling routines.
+
+This transformation places the majority of kernel execution contexts under the
+control of the scheduler and significantly increases the number of preemption
+points. Consequently, it reduces the latency between a high-priority task
+becoming runnable and its actual execution on the CPU.
+
+Scheduling
+==========
+
+The core principles of Linux scheduling and the associated user-space API are
+documented in the `sched(7) <https://man7.org/linux/man-pages/man7/sched.7.html>`_
+man page.
+By default, the Linux kernel uses the SCHED_OTHER scheduling policy.
+Under this policy, a task is preempted when the scheduler determines that it
+has consumed a fair share of CPU time relative to other runnable tasks.
+However, the policy does not guarantee immediate preemption when a new
+SCHED_OTHER task becomes runnable. The currently running task may continue
+executing.
+
+This behavior differs from that of real-time scheduling policies such as
+SCHED_FIFO. When a task with a real-time policy becomes runnable, the
+scheduler immediately selects it for execution if it has a higher priority than
+the currently running task. The task continues to run until it voluntarily
+yields the CPU, typically by blocking on an event.
+
+Sleeping spin locks
+===================
+
+The various lock types and their behavior under real-time configurations are
+described in detail in Documentation/locking/locktypes.rst.
+In a non-PREEMPT_RT configuration, a spinlock_t is acquired by first disabling
+preemption and then actively spinning until the lock becomes available. Once
+the lock is released, preemption is enabled. From a real-time perspective,
+this approach is undesirable because disabling preemption prevents the
+scheduler from switching to a higher-priority task, potentially increasing
+latency.
+
+To address this, PREEMPT_RT replaces spinning locks with sleeping spin locks
+that do not disable preemption. On PREEMPT_RT, spinlock_t is implemented using
+rtmutex. Instead of spinning, a task attempting to acquire a contended lock
+disables CPU migration, donates its priority to the lock owner (priority
+inheritance), and voluntarily schedules out while waiting for the lock to
+become available.
+
+Disabling CPU migration provides the same effect as disabling preemption, while
+still allowing preemption and ensuring that the task continues to run on the
+same CPU while holding a sleeping lock.
+
+Priority inheritance
+====================
+
+Lock types such as spinlock_t and struct mutex in a PREEMPT_RT-enabled kernel
+are implemented on top of rtmutex, which provides support for priority
+inheritance (PI). When a task blocks on such a lock, the PI mechanism
+temporarily propagates the blocked task's scheduling parameters to the lock
+owner.
+
+For example, if a SCHED_FIFO task A blocks on a lock currently held by a
+SCHED_OTHER task B, task A's scheduling policy and priority are temporarily
+inherited by task B. After this inheritance, task A is put to sleep while
+waiting for the lock, and task B effectively becomes the highest-priority task
+in the system. This allows B to continue executing, make progress, and
+eventually release the lock.
+
+Once B releases the lock, it reverts to its original scheduling parameters, and
+task A can resume execution.
+
+Threaded interrupts
+===================
+
+Interrupt handlers are another source of code that executes with preemption
+disabled and outside the control of the scheduler. To bring interrupt handling
+under scheduler control, PREEMPT_RT enforces threaded interrupt handlers.
+
+With forced threading, interrupt handling is split into two stages. The first
+stage, the primary handler, is executed in IRQ context with interrupts disabled.
+Its sole responsibility is to wake the associated threaded handler. The second
+stage, the threaded handler, is the function passed to request_irq() as the
+interrupt handler. It runs in process context, scheduled by the kernel.
+
+From waking the interrupt thread until threaded handling is completed, the
+interrupt source is masked in the interrupt controller.
+This ensures that the device interrupt remains pending but does not retrigger
+the CPU, allowing the system to exit IRQ context and handle the interrupt in a
+scheduled thread.
+
+By default, the threaded handler executes with the SCHED_FIFO scheduling policy
+and a priority of 50 (MAX_RT_PRIO / 2), which is midway between the minimum and
+maximum real-time priorities.
+
+If the threaded interrupt handler raises any soft interrupts during its
+execution, those soft interrupt routines are invoked after the threaded handler
+completes, within the same thread. Preemption remains enabled during the
+execution of the soft interrupt handler.
+
+Summary
+=======
+
+By using sleeping locks and forced-threaded interrupts, PREEMPT_RT
+significantly reduces sections of code where interrupts or preemption is
+disabled, allowing the scheduler to preempt the current execution context and
+switch to a higher-priority task.
diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst
index 79a009ce11df..94e628c1eb49
--- a/Documentation/core-api/refcount-vs-atomic.rst
+++ b/Documentation/core-api/refcount-vs-atomic.rst
@@ -86,7 +86,19 @@ Memory ordering guarantee changes:

 * none (both fully unordered)

-case 2) - increment-based ops that return no value
+case 2) - non-"Read/Modify/Write" (RMW) ops with release ordering
+-----------------------------------------------------------------
+
+Function changes:
+
+ * atomic_set_release() --> refcount_set_release()
+
+Memory ordering guarantee changes:
+
+ * none (both provide RELEASE ordering)
+
+
+case 3) - increment-based ops that return no value
--------------------------------------------------

Function changes:
@@ -98,7 +110,7 @@ Memory ordering guarantee changes:

 * none (both fully unordered)

-case 3) - decrement-based RMW ops that return no value
+case 4) - decrement-based RMW ops that return no value
------------------------------------------------------

Function changes:
@@ -110,7 +122,7 @@ Memory ordering guarantee changes:

 * fully unordered --> RELEASE ordering

-case 4) - increment-based RMW ops that return a value
+case 5) - increment-based RMW ops that return a value
-----------------------------------------------------

Function changes:
@@ -126,7 +138,20 @@ Memory ordering guarantees changes:
   result of obtaining pointer to the object!

-case 5) - generic dec/sub decrement-based RMW ops that return a value
+case 6) - increment-based RMW ops with acquire ordering that return a value
+---------------------------------------------------------------------------
+
+Function changes:
+
+ * atomic_inc_not_zero() --> refcount_inc_not_zero_acquire()
+ * no atomic counterpart --> refcount_add_not_zero_acquire()
+
+Memory ordering guarantees changes:
+
+ * fully ordered --> ACQUIRE ordering on success
+
+
+case 7) - generic dec/sub decrement-based RMW ops that return a value
---------------------------------------------------------------------

Function changes:
@@ -139,7 +164,7 @@ Memory ordering guarantees changes:

 * fully ordered --> RELEASE ordering + ACQUIRE ordering on success

-case 6) other decrement-based RMW ops that return a value
+case 8) other decrement-based RMW ops that return a value
---------------------------------------------------------

Function changes:
@@ -154,7 +179,7 @@ Memory ordering guarantees changes:

.. note:: atomic_add_unless() only provides full order on success.
-case 7) - lock-based RMW
+case 9) - lock-based RMW
------------------------

Function changes:
diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst
index 06f766a6aab2..2304d5bffcce
--- a/Documentation/core-api/symbol-namespaces.rst
+++ b/Documentation/core-api/symbol-namespaces.rst
@@ -6,18 +6,8 @@ The following document describes how to use Symbol Namespaces to structure the
export surface of in-kernel symbols exported through the family of
EXPORT_SYMBOL() macros.

-.. Table of Contents
-
-	=== 1 Introduction
-	=== 2 How to define Symbol Namespaces
-	   --- 2.1 Using the EXPORT_SYMBOL macros
-	   --- 2.2 Using the DEFAULT_SYMBOL_NAMESPACE define
-	=== 3 How to use Symbols exported in Namespaces
-	=== 4 Loading Modules that use namespaced Symbols
-	=== 5 Automatically creating MODULE_IMPORT_NS statements
-
-1. Introduction
-===============
+Introduction
+============

Symbol Namespaces have been introduced as a means to structure the export
surface of the in-kernel API. It allows subsystem maintainers to partition
@@ -28,15 +18,18 @@ kernel. As of today, modules that make use of symbols exported into namespaces,
are required to import the namespace. Otherwise the kernel will, depending on
its configuration, reject loading the module or warn about a missing import.

-2. How to define Symbol Namespaces
-==================================
+Additionally, it is possible to put symbols into a module namespace, strictly
+limiting which modules are allowed to use these symbols.
+
+How to define Symbol Namespaces
+===============================

Symbols can be exported into namespace using different methods. All of them are
changing the way EXPORT_SYMBOL and friends are instrumented to create ksymtab
entries.

-2.1 Using the EXPORT_SYMBOL macros
-==================================
+Using the EXPORT_SYMBOL macros
+------------------------------

In addition to the macros EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(), that allow
exporting of kernel symbols to the kernel symbol table, variants of these are
@@ -54,8 +47,8 @@ refer to ``NULL``. There is no default namespace if none is defined. ``modpost``
and kernel/module/main.c make use the namespace at build time or module load
time, respectively.

-2.2 Using the DEFAULT_SYMBOL_NAMESPACE define
-=============================================
+Using the DEFAULT_SYMBOL_NAMESPACE define
+-----------------------------------------

Defining namespaces for all symbols of a subsystem can be very verbose and may
become hard to maintain. Therefore a default define (DEFAULT_SYMBOL_NAMESPACE)
@@ -83,8 +76,25 @@ unit as preprocessor statement. The above example would then read::

within the corresponding compilation unit before the #include for
<linux/export.h>. Typically it's placed before the first #include statement.

-3. How to use Symbols exported in Namespaces
-============================================
+Using the EXPORT_SYMBOL_FOR_MODULES() macro
+-------------------------------------------
+
+Symbols exported using this macro are put into a module namespace. This
+namespace cannot be imported. These exports are GPL-only as they are only
+intended for in-tree modules.
+
+The macro takes a comma-separated list of module names, allowing only those
+modules to access this symbol. Simple tail-globs are supported.
+
+For example::
+
+  EXPORT_SYMBOL_FOR_MODULES(preempt_notifier_inc, "kvm,kvm-*")
+
+will limit usage of this symbol to modules whose name matches the given
+patterns.
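+
+Putting the export side together, a minimal sketch of a provider (the module
+foo, the symbol foo_do_thing() and the namespace FOO_CORE are made-up names)::
+
+  /* drivers/foo/foo_core.c */
+  #include <linux/export.h>
+  #include <linux/module.h>
+
+  int foo_do_thing(void)
+  {
+          /* ... do the actual work ... */
+          return 0;
+  }
+  EXPORT_SYMBOL_NS_GPL(foo_do_thing, "FOO_CORE");
+
+Consumers of foo_do_thing() then have to import FOO_CORE as described in the
+next section.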
+
+How to use Symbols exported in Namespaces
+=========================================

In order to use symbols that are exported into namespaces, kernel modules
need to explicitly import these namespaces. Otherwise the kernel might reject to
@@ -104,13 +114,17 @@ inspected with modinfo::

  import_ns:      USB_STORAGE
  [...]

+For modules that are currently loaded, imported namespaces are also available
+via sysfs::
+
+  $ cat /sys/module/ums_karma/import_ns
+  USB_STORAGE

It is advisable to add the MODULE_IMPORT_NS() statement close to other module
-metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). Refer to section
-5. for a way to create missing import statements automatically.
+metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE().

-4. Loading Modules that use namespaced Symbols
-==============================================
+Loading Modules that use namespaced Symbols
+===========================================

At module loading time (e.g. ``insmod``), the kernel will check each symbol
referenced from the module for its availability and whether the namespace it
@@ -121,8 +135,8 @@ allow loading of modules that don't satisfy this precondition, a configuration
option is available: Setting MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=y will
enable loading regardless, but will emit a warning.

-5. Automatically creating MODULE_IMPORT_NS statements
-=====================================================
+Automatically creating MODULE_IMPORT_NS statements
+==================================================

Missing namespaces imports can easily be detected at build time. In fact,
modpost will emit a warning if a module uses a symbol from a namespace
@@ -154,3 +168,6 @@ in-tree modules::

You can also run nsdeps for external module builds. A typical usage is::

  $ make -C <path_to_kernel_src> M=$PWD nsdeps
+
+Note: nsdeps will happily generate an import statement for a module namespace,
+which will not work and generates build and runtime failures.
diff --git a/Documentation/core-api/this_cpu_ops.rst b/Documentation/core-api/this_cpu_ops.rst
index 91acbcf30e9b..533ac5dd5750
--- a/Documentation/core-api/this_cpu_ops.rst
+++ b/Documentation/core-api/this_cpu_ops.rst
@@ -138,12 +138,22 @@ get_cpu/put_cpu sequence requires. No processor number is
available. Instead, the offset of the local per cpu area is simply
added to the per cpu offset.

-Note that this operation is usually used in a code segment when
-preemption has been disabled. The pointer is then used to
-access local per cpu data in a critical section. When preemption
-is re-enabled this pointer is usually no longer useful since it may
-no longer point to per cpu data of the current processor.
-
+Note that this operation can only be used in code segments where
+smp_processor_id() may be used, for example, where preemption has been
+disabled. The pointer is then used to access local per cpu data in a
+critical section. When preemption is re-enabled this pointer is usually
+no longer useful since it may no longer point to per cpu data of the
+current processor.
+
+The special cases where it makes sense to obtain a per-CPU pointer in
+preemptible code are addressed by raw_cpu_ptr(), but such use cases need
+to handle two different CPUs accessing the same per cpu variable, which
+might well be that of a third CPU. These use cases are typically
+performance optimizations.
+For example, SRCU implements a pair of counters as a pair of per-CPU
+variables, and rcu_read_lock_nmisafe() uses raw_cpu_ptr() to get a pointer to
+some CPU's counter, and uses atomic_long_inc() to handle migration between the
+raw_cpu_ptr() and the atomic_long_inc().

Per cpu variables and offsets
-----------------------------
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index e295835fc116..411e1b28b8de
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -183,6 +183,12 @@ resources, scheduled and executed.
  BH work items cannot sleep. All other features such as delayed queueing,
  flushing and canceling are supported.

+``WQ_PERCPU``
+  Work items queued to a per-cpu wq are bound to a specific CPU.
+  This flag is the right choice when CPU locality is important.
+
+  This flag is the complement of ``WQ_UNBOUND``.
+
``WQ_UNBOUND``
  Work items queued to an unbound wq are served by the special
  worker-pools which host workers which are not bound to any
@@ -372,9 +378,9 @@ Affinity Scopes

An unbound workqueue groups CPUs according to its affinity scope to improve
cache locality. For example, if a workqueue is using the default affinity
-scope of "cache", it will group CPUs according to last level cache
-boundaries. A work item queued on the workqueue will be assigned to a worker
-on one of the CPUs which share the last level cache with the issuing CPU.
+scope of "cache_shard", it will group CPUs into sub-LLC shards. A work item
+queued on the workqueue will be assigned to a worker on one of the CPUs
+within the same shard as the issuing CPU.

Once started, the worker may or may not be allowed to move outside the scope
depending on the ``affinity_strict`` setting of the scope.
@@ -396,7 +402,13 @@ Workqueue currently supports the following affinity scopes.
``cache``
  CPUs are grouped according to cache boundaries. Which specific cache
  boundary is used is determined by the arch code. L3 is used in a lot of
-  cases. This is the default affinity scope.
+  cases.
+
+``cache_shard``
+  CPUs are grouped into sub-LLC shards of at most ``wq_cache_shard_size``
+  cores (default 8, tunable via the ``workqueue.cache_shard_size`` boot
+  parameter). Shards are always split on core (SMT group) boundaries.
+  This is the default affinity scope.

``numa``
  CPUs are grouped according to NUMA boundaries.
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index f6a3eef4fe7f..c6c91cbd0c3c
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -489,7 +489,19 @@ Storing ``NULL`` into any index of a multi-index entry will set the entry
at every index to ``NULL`` and dissolve the tie. A multi-index entry can
be split into entries occupying smaller ranges by calling
xas_split_alloc() without the xa_lock held, followed by taking the lock
-and calling xas_split().
+and calling xas_split(), or by calling xas_try_split() with the xa_lock
+held. The difference between xas_split_alloc()+xas_split() and
+xas_try_split() is that xas_split_alloc() + xas_split() split the entry
+from the original order to the new order in one shot uniformly, whereas
+xas_try_split() iteratively splits the entry containing the index
+non-uniformly.
+For example, to split an order-9 entry, which takes 2^(9-6)=8 slots,
+assuming ``XA_CHUNK_SHIFT`` is 6, xas_split_alloc() + xas_split() need
+8 xa_nodes.
+xas_try_split() splits the order-9 entry into 2 order-8 entries, then splits
+one order-8 entry, based on the given index, into 2 order-7 entries, ..., and
+splits one order-1 entry into 2 order-0 entries. If a new xa_node is needed
+while splitting an order-6 entry, xas_try_split() will try to allocate one if
+possible. As a result, xas_try_split() would only need 1 xa_node instead of 8.

Functions and structures
========================
