Age | Commit message (Collapse) | Author |
|
Currently, we lookup sk_keys from the entire struct net_namespace, which
may contain multiple MCTP net IDs. In those cases we want to distinguish
between endpoints with the same EID but different net ID.
Add the net ID data to the struct mctp_sk_key, populate on add and
filter on this during route lookup.
For the ioctl interface, we use a default net of
MCTP_INITIAL_DEFAULT_NET (ie., what will be in use for single-net
configurations), but we'll extend the ioctl interface to provide
net-specific tag allocation in an upcoming change.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We have a double-swap of local and peer addresses in
mctp_alloc_local_tag; the arguments in both call sites are swapped, but
there is also a swap in the implementation of alloc_local_tag. This is
opaque because we're using source/dest address references, which don't
match the local/peer semantics.
Avoid this confusion by naming the arguments as 'local' and 'peer', and
remove the double swap. The calling order now matches mctp_key_alloc.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We want to re-organize the struct sock layout. The sk_peek_off
field location is problematic, as most protocols want it in the
RX read area, while UDP wants it on a cacheline different from
sk_receive_queue.
Create a local (inside udp_sock) copy of the 'peek offset is enabled'
flag and place it inside the same cacheline of reader_queue.
Check such flag before reading sk_peek_off. This will save potential
false sharing and cache misses in the fast-path.
Tested under UDP flood with small packets. The struct sock layout
update causes a 4% performance drop, and this patch restores completely
the original tput.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/67ab679c15fbf49fa05b3ffe05d91c47ab84f147.1708426665.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dst is transferred to the flow object, route object does not own it
anymore. Reset dst in route object, otherwise if flow_offload_add()
fails, error path releases dst twice, leading to a refcount underflow.
Fixes: a3c90f7a2323 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add a driver for the Marvell 88Q2220. This driver allows to detect the
link, switch between 100BASE-T1 and 1000BASE-T1 and switch between
master and slave mode. Autonegotiation is supported.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Gregor Herburger <gregor.herburger@ew.tq-group.com>
Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com>
Link: https://lore.kernel.org/r/20240218075753.18067-6-dima.fedrau@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Extend helper functions mii_t1_adv_m_mod_linkmode_t and
linkmode_adv_to_mii_t1_adv_m_t to support 100BT1 and 1000BT1 linkmode
advertisements.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com>
Link: https://lore.kernel.org/r/20240218075753.18067-3-dima.fedrau@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Added constants for advertising 100BT1 and 1000BT1 in register BASE-T1
auto-negotiation advertisement register [31:16] (Register 7.515)
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com>
Link: https://lore.kernel.org/r/20240218075753.18067-2-dima.fedrau@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the
following kernel warning appears:
WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8
Call trace:
kiocb_set_cancel_fn+0x9c/0xa8
ffs_epfile_read_iter+0x144/0x1d0
io_read+0x19c/0x498
io_issue_sqe+0x118/0x27c
io_submit_sqes+0x25c/0x5fc
__arm64_sys_io_uring_enter+0x104/0xab0
invoke_syscall+0x58/0x11c
el0_svc_common+0xb4/0xf4
do_el0_svc+0x2c/0xb0
el0_svc+0x2c/0xa4
el0t_64_sync_handler+0x68/0xb4
el0t_64_sync+0x1a4/0x1a8
Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is
submitted by libaio.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the existing ML element parsing helpers and add a new
one for this (ieee80211_mle_get_mld_id).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240216135047.4da47b1f035b.I437a5570ac456449facb0b147851ef24a1e473c2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Align the prototype of ieee80211_mle_get_bss_param_ch_cnt()
to also take a u8 * like the other functions, and make it
return -1 when the field isn't found, so that mac80211 can
check that instead of explicitly open-coding the check.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240216135047.583309181bc3.Ia61cb0b4fc034d5ac8fcfaf6f6fb2e115fadafe7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Make cfg80211_inform_bss_frame_data() call the existing
cfg80211_inform_bss_data() after parsing the frame in the
appropriate way, so we have less code duplication. This
required introducing a new CFG80211_BSS_FTYPE_S1G_BEACON,
but that can be used by other drivers as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240216135047.874aed1eff5f.Ib7d88d126eec50c64763251a78cb432bb5df14df@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The KHZ_PER_GHZ might be used by others (with the name aligned
with similar constants). Define it in units.h and convert
wireless to use it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://msgid.link/20240215154136.630029-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Some drivers need the data in it, so move it to the link conf,
which is exposed to the driver.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240206164849.6fe9782b87b4.Ifbffef638f07ca7f5c2b27f40d2cf2942d21de0b@changeid
[remove bss pointer from internal struct, update docs]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Currently, function to check if beacon countdown is complete uses deflink
to fetch the beacon and check the counter. However, with MLO, there is
a need to check the counter for the beacon in a particular link.
Add support to use link_id in order to fetch the beacon from a particular
link data.
Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Link: https://msgid.link/20240216144621.514385-2-quic_adisi@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Up until now we have managed not to have the mdio-bcm-unimac manage its
clock except during probe and suspend/resume. This works most of the
time, except where it does not.
With a fully modular build, we can get into a situation whereby the
GENET driver is fully registered, and so is the mdio-bcm-unimac driver,
however the Ethernet PHY driver is not yet, because it depends on a
resource that is not yet available (e.g.: GPIO provider). In that state,
the network device is not usable yet, and so to conserve power, the
GENET driver will have turned off its "main" clock which feeds its MDIO
controller.
When the PHY driver finally probes however, we make an access to the PHY
registers to e.g.: disable interrupts, and this causes a bus error
within the MDIO controller space because the MDIO controller clock(s)
are turned off.
To remedy that, we manage the clock around all of the I/O accesses to
the hardware which are done exclusively during read, write and clock
divider configuration.
This ensures that the register space is accessible, and this also
ensures that there are not unnecessarily elevated reference counts
keeping the clocks active when the network device is administratively
turned off. It would be the case with the previous way of managing the
clock.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove documentation of non-existent children field
from the Kernel doc for struct framer_ops.
Introduced by 82c944d05b1a ("net: wan: Add framer framework support")
Signed-off-by: Simon Horman <horms@kernel.org>
Acked-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.9
The second "new features" pull request for v6.9. Lots of iwlwifi and
stack changes this time. And naturally smaller changes to other drivers.
We also twice merged wireless into wireless-next to avoid conflicts
between the trees.
Major changes:
stack
* mac80211: negotiated TTLM request support
* SPP A-MSDU support
* mac80211: wider bandwidth OFDMA config support
iwlwifi
* kunit tests
* bump FW API to 89 for AX/BZ/SC devices
* enable SPP A-MSDUs
* support for new devices
ath12k
* refactoring in preparation for Multi-Link Operation (MLO) support
* 1024 Block Ack window size support
* provide firmware wmi logs via a trace event
ath11k
* 36 bit DMA mask support
* support 6 GHz station power modes: Low Power Indoor (LPI), Standard
Power) SP and Very Low Power (VLP)
rtl8xxxu
* TP-Link TL-WN823N V2 support
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
No need to keep this in the core, move it to the nfnetlink_queue module.
nf_reroute is moved too, there were no other callers.
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Persistent exec_queues delays explicit destruction of exec_queues
until they are done executing, but destruction on process exit
is still immediate. It turns out no UMD is relying on this
functionality, so remove it. If there turns out to be a use-case
in the future, let's re-add.
Persistent exec_queues were never used for LR VMs
v2:
- Don't add an "UNUSED" define for the missing property
(Lucas, Rodrigo)
v3:
- Remove the remaining struct xe_exec_queue::persistent state
(Niranjana, Lucas)
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240209113444.8396-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit f1a9abc0cf311375695bede1590364864c05976d)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
|
|
Pick up CXL CPER notification removal for v6.8-rc6, to return in a later
merge window.
|
|
Initial tests with the CXL CPER implementation identified that error
reports were being duplicated in the log and the trace event [1]. Then
it was discovered that the notification handler took sleeping locks
while the GHES event handling runs in spin_lock_irqsave() context [2]
While the duplicate reporting was fixed in v6.8-rc4, the fix for the
sleeping-lock-vs-atomic collision would enjoy more time to settle and
gain some test cycles. Given how late it is in the development cycle,
remove the CXL hookup for now and try again during the next merge
window.
Note that end result is that v6.8 does not emit CXL CPER payloads to the
kernel log, but this is in line with the CXL trend to move error
reporting to trace events instead of the kernel log.
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: http://lore.kernel.org/r/20240108165855.00002f5a@Huawei.com [1]
Closes: http://lore.kernel.org/r/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain [2]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
The xlate callbacks are supposed to translate of_phandle_args to proper
provider without modifying the of_phandle_args. Make the argument
pointer to const for code safety and readability.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240217100306.86740-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull rdma fixes from Jason Gunthorpe:
"Mostly irdma and bnxt_re fixes:
- Missing error unwind in hf1
- For bnxt - fix fenching behavior to work on new chips, fail
unsupported SRQ resize back to userspace, propogate SRQ FW failure
back to userspace.
- Correctly fail unsupported SRQ resize back to userspace in bnxt
- Adjust a memcpy in mlx5 to not overflow a struct field.
- Prevent userspace from triggering mlx5 fw syndrome logging from
sysfs
- Use the correct access mode for MLX5_IB_METHOD_DEVX_OBJ_MODIFY to
avoid a userspace failure on modify
- For irdma - Don't UAF a concurrent tasklet during destroy, prevent
userspace from issuing invalid QP attrs, fix a possible CQ
overflow, capture a missing HW async error event
- sendmsg() triggerable memory access crash in hfi1
- Fix the srpt_service_guid parameter to not crash due to missing
function pointer
- Don't leak objects in error unwind in qedr
- Don't weirdly cast function pointers in srpt"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/srpt: fix function pointer cast warnings
RDMA/qedr: Fix qedr_create_user_qp error flow
RDMA/srpt: Support specifying the srpt_service_guid parameter
IB/hfi1: Fix sdma.h tx->num_descs off-by-one error
RDMA/irdma: Add AE for too many RNRS
RDMA/irdma: Set the CQ read threshold for GEN 1
RDMA/irdma: Validate max_send_wr and max_recv_wr
RDMA/irdma: Fix KASAN issue with tasklet
RDMA/mlx5: Relax DEVX access upon modify commands
IB/mlx5: Don't expose debugfs entries for RRoCE general parameters if not supported
RDMA/mlx5: Fix fortify source warning while accessing Eth segment
RDMA/bnxt_re: Add a missing check in bnxt_qplib_query_srq
RDMA/bnxt_re: Return error for SRQ resize
RDMA/bnxt_re: Fix unconditional fence for newer adapters
RDMA/bnxt_re: Remove a redundant check inside bnxt_re_vf_res_config
RDMA/bnxt_re: Avoid creating fence MR for newer adapters
IB/hfi1: Fix a memleak in init_credit_return
|
|
When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
swapin the same entry at the same time, they get different pages (A, B).
Before one thread (T0) finishes the swapin and installs page (A) to the
PTE, another thread (T1) could finish swapin of page (B), swap_free the
entry, then swap out the possibly modified page reusing the same entry.
It breaks the pte_same check in (T0) because PTE value is unchanged,
causing ABA problem. Thread (T0) will install a stalled page (A) into the
PTE and cause data corruption.
One possible callstack is like this:
CPU0 CPU1
---- ----
do_swap_page() do_swap_page() with same entry
<direct swapin path> <direct swapin path>
<alloc page A> <alloc page B>
swap_read_folio() <- read to page A swap_read_folio() <- read to page B
<slow on later locks or interrupt> <finished swapin first>
... set_pte_at()
swap_free() <- entry is free
<write to page B, now page A stalled>
<swap out page B to same swap entry>
pte_same() <- Check pass, PTE seems
unchanged, but page A
is stalled!
swap_free() <- page B content lost!
set_pte_at() <- staled page A installed!
And besides, for ZRAM, swap_free() allows the swap device to discard the
entry content, so even if page (B) is not modified, if swap_read_folio()
on CPU0 happens later than swap_free() on CPU1, it may also cause data
loss.
To fix this, reuse swapcache_prepare which will pin the swap entry using
the cache flag, and allow only one thread to swap it in, also prevent any
parallel code from putting the entry in the cache. Release the pin after
PT unlocked.
Racers just loop and wait since it's a rare and very short event. A
schedule_timeout_uninterruptible(1) call is added to avoid repeated page
faults wasting too much CPU, causing livelock or adding too much noise to
perf statistics. A similar livelock issue was described in commit
029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead")
Reproducer:
This race issue can be triggered easily using a well constructed
reproducer and patched brd (with a delay in read path) [1]:
With latest 6.8 mainline, race caused data loss can be observed easily:
$ gcc -g -lpthread test-thread-swap-race.c && ./a.out
Polulating 32MB of memory region...
Keep swapping out...
Starting round 0...
Spawning 65536 workers...
32746 workers spawned, wait for done...
Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
Round 0 Failed, 15 data loss!
This reproducer spawns multiple threads sharing the same memory region
using a small swap device. Every two threads updates mapped pages one by
one in opposite direction trying to create a race, with one dedicated
thread keep swapping out the data out using madvise.
The reproducer created a reproduce rate of about once every 5 minutes, so
the race should be totally possible in production.
After this patch, I ran the reproducer for over a few hundred rounds and
no data loss observed.
Performance overhead is minimal, microbenchmark swapin 10G from 32G
zram:
Before: 10934698 us
After: 11157121 us
Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
[kasong@tencent.com: v4]
Link: https://lkml.kernel.org/r/20240219082040.7495-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzbot managed to trigger following splat:
BUG: KASAN: use-after-free in __skb_flow_dissect+0x4a3b/0x5e50
Read of size 1 at addr ffff888208a4000e by task a.out/2313
[..]
__skb_flow_dissect+0x4a3b/0x5e50
__skb_get_hash+0xb4/0x400
ip_tunnel_xmit+0x77e/0x26f0
ipip_tunnel_xmit+0x298/0x410
..
Analysis shows that the skb has a valid ->head, but bogus ->data
pointer.
skb->data gets its bogus value via the neigh layer, which does:
1556 __skb_pull(skb, skb_network_offset(skb));
... and the skb was already dodgy at this point:
skb_network_offset(skb) returns a negative value due to an
earlier overflow of skb->network_header (u16). __skb_pull thus
"adjusts" skb->data by a huge offset, pointing outside skb->head
area.
Allow debug builds to splat when we try to pull/push more than
INT_MAX bytes.
After this, the syzkaller reproducer yields a more precise splat
before the flow dissector attempts to read off skb->data memory:
WARNING: CPU: 5 PID: 2313 at include/linux/skbuff.h:2653 neigh_connected_output+0x28e/0x400
ip_finish_output2+0xb25/0xed0
iptunnel_xmit+0x4ff/0x870
ipgre_xmit+0x78e/0xbb0
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240216113700.23013-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Last major reorg happened in commit 9115e8cd2a0c ("net: reorganize
struct sock for better data locality")
Since then, many changes have been done.
Before SO_PEEK_OFF support is added to TCP, we need
to move sk_peek_off to a better location.
It is time to make another pass, and add six groups,
without explicit alignment.
- sock_write_rx (following sk_refcnt) read-write fields in rx path.
- sock_read_rx read-mostly fields in rx path.
- sock_read_rxtx read-mostly fields in both rx and tx paths.
- sock_write_rxtx read-write fields in both rx and tx paths.
- sock_write_tx read-write fields in tx paths.
- sock_read_tx read-mostly fields in tx paths.
Results on TCP_RR benchmarks seem to show a gain (4 to 5 %).
It is possible UDP needs a change, because sk_peek_off
shares a cache line with sk_receive_queue.
If this the case, we can exchange roles of sk->sk_receive
and up->reader_queue queues.
After this change, we have the following layout:
struct sock {
struct sock_common __sk_common; /* 0 0x88 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
__u8 __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */
atomic_t sk_drops; /* 0x88 0x4 */
__s32 sk_peek_off; /* 0x8c 0x4 */
struct sk_buff_head sk_error_queue; /* 0x90 0x18 */
struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct {
atomic_t rmem_alloc; /* 0xc0 0x4 */
int len; /* 0xc4 0x4 */
struct sk_buff * head; /* 0xc8 0x8 */
struct sk_buff * tail; /* 0xd0 0x8 */
} sk_backlog; /* 0xc0 0x18 */
struct {
atomic_t rmem_alloc; /* 0 0x4 */
int len; /* 0x4 0x4 */
struct sk_buff * head; /* 0x8 0x8 */
struct sk_buff * tail; /* 0x10 0x8 */
/* size: 24, cachelines: 1, members: 4 */
/* last cacheline: 24 bytes */
};
__u8 __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */
__u8 __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */
rcu * sk_rx_dst; /* 0xd8 0x8 */
int sk_rx_dst_ifindex; /* 0xe0 0x4 */
u32 sk_rx_dst_cookie; /* 0xe4 0x4 */
unsigned int sk_ll_usec; /* 0xe8 0x4 */
unsigned int sk_napi_id; /* 0xec 0x4 */
u16 sk_busy_poll_budget; /* 0xf0 0x2 */
u8 sk_prefer_busy_poll; /* 0xf2 0x1 */
u8 sk_userlocks; /* 0xf3 0x1 */
int sk_rcvbuf; /* 0xf4 0x4 */
rcu * sk_filter; /* 0xf8 0x8 */
/* --- cacheline 4 boundary (256 bytes) --- */
union {
rcu * sk_wq; /* 0x100 0x8 */
struct socket_wq * sk_wq_raw; /* 0x100 0x8 */
}; /* 0x100 0x8 */
union {
rcu * sk_wq; /* 0 0x8 */
struct socket_wq * sk_wq_raw; /* 0 0x8 */
};
void (*sk_data_ready)(struct sock *); /* 0x108 0x8 */
long sk_rcvtimeo; /* 0x110 0x8 */
int sk_rcvlowat; /* 0x118 0x4 */
__u8 __cacheline_group_end__sock_read_rx[0]; /* 0x11c 0 */
__u8 __cacheline_group_begin__sock_read_rxtx[0]; /* 0x11c 0 */
int sk_err; /* 0x11c 0x4 */
struct socket * sk_socket; /* 0x120 0x8 */
struct mem_cgroup * sk_memcg; /* 0x128 0x8 */
rcu * sk_policy[2]; /* 0x130 0x10 */
/* --- cacheline 5 boundary (320 bytes) --- */
__u8 __cacheline_group_end__sock_read_rxtx[0]; /* 0x140 0 */
__u8 __cacheline_group_begin__sock_write_rxtx[0]; /* 0x140 0 */
socket_lock_t sk_lock; /* 0x140 0x20 */
u32 sk_reserved_mem; /* 0x160 0x4 */
int sk_forward_alloc; /* 0x164 0x4 */
u32 sk_tsflags; /* 0x168 0x4 */
__u8 __cacheline_group_end__sock_write_rxtx[0]; /* 0x16c 0 */
__u8 __cacheline_group_begin__sock_write_tx[0]; /* 0x16c 0 */
int sk_write_pending; /* 0x16c 0x4 */
atomic_t sk_omem_alloc; /* 0x170 0x4 */
int sk_sndbuf; /* 0x174 0x4 */
int sk_wmem_queued; /* 0x178 0x4 */
refcount_t sk_wmem_alloc; /* 0x17c 0x4 */
/* --- cacheline 6 boundary (384 bytes) --- */
unsigned long sk_tsq_flags; /* 0x180 0x8 */
union {
struct sk_buff * sk_send_head; /* 0x188 0x8 */
struct rb_root tcp_rtx_queue; /* 0x188 0x8 */
}; /* 0x188 0x8 */
union {
struct sk_buff * sk_send_head; /* 0 0x8 */
struct rb_root tcp_rtx_queue; /* 0 0x8 */
};
struct sk_buff_head sk_write_queue; /* 0x190 0x18 */
u32 sk_dst_pending_confirm; /* 0x1a8 0x4 */
u32 sk_pacing_status; /* 0x1ac 0x4 */
struct page_frag sk_frag; /* 0x1b0 0x10 */
/* --- cacheline 7 boundary (448 bytes) --- */
struct timer_list sk_timer; /* 0x1c0 0x28 */
/* XXX last struct has 4 bytes of padding */
unsigned long sk_pacing_rate; /* 0x1e8 0x8 */
atomic_t sk_zckey; /* 0x1f0 0x4 */
atomic_t sk_tskey; /* 0x1f4 0x4 */
__u8 __cacheline_group_end__sock_write_tx[0]; /* 0x1f8 0 */
__u8 __cacheline_group_begin__sock_read_tx[0]; /* 0x1f8 0 */
unsigned long sk_max_pacing_rate; /* 0x1f8 0x8 */
/* --- cacheline 8 boundary (512 bytes) --- */
long sk_sndtimeo; /* 0x200 0x8 */
u32 sk_priority; /* 0x208 0x4 */
u32 sk_mark; /* 0x20c 0x4 */
rcu * sk_dst_cache; /* 0x210 0x8 */
netdev_features_t sk_route_caps; /* 0x218 0x8 */
u16 sk_gso_type; /* 0x220 0x2 */
u16 sk_gso_max_segs; /* 0x222 0x2 */
unsigned int sk_gso_max_size; /* 0x224 0x4 */
gfp_t sk_allocation; /* 0x228 0x4 */
u32 sk_txhash; /* 0x22c 0x4 */
u8 sk_pacing_shift; /* 0x230 0x1 */
bool sk_use_task_frag; /* 0x231 0x1 */
__u8 __cacheline_group_end__sock_read_tx[0]; /* 0x232 0 */
u8 sk_gso_disabled:1; /* 0x232: 0 0x1 */
u8 sk_kern_sock:1; /* 0x232:0x1 0x1 */
u8 sk_no_check_tx:1; /* 0x232:0x2 0x1 */
u8 sk_no_check_rx:1; /* 0x232:0x3 0x1 */
/* XXX 4 bits hole, try to pack */
u8 sk_shutdown; /* 0x233 0x1 */
u16 sk_type; /* 0x234 0x2 */
u16 sk_protocol; /* 0x236 0x2 */
unsigned long sk_lingertime; /* 0x238 0x8 */
/* --- cacheline 9 boundary (576 bytes) --- */
struct proto * sk_prot_creator; /* 0x240 0x8 */
rwlock_t sk_callback_lock; /* 0x248 0x8 */
int sk_err_soft; /* 0x250 0x4 */
u32 sk_ack_backlog; /* 0x254 0x4 */
u32 sk_max_ack_backlog; /* 0x258 0x4 */
kuid_t sk_uid; /* 0x25c 0x4 */
spinlock_t sk_peer_lock; /* 0x260 0x4 */
int sk_bind_phc; /* 0x264 0x4 */
struct pid * sk_peer_pid; /* 0x268 0x8 */
const struct cred * sk_peer_cred; /* 0x270 0x8 */
ktime_t sk_stamp; /* 0x278 0x8 */
/* --- cacheline 10 boundary (640 bytes) --- */
int sk_disconnects; /* 0x280 0x4 */
u8 sk_txrehash; /* 0x284 0x1 */
u8 sk_clockid; /* 0x285 0x1 */
u8 sk_txtime_deadline_mode:1; /* 0x286: 0 0x1 */
u8 sk_txtime_report_errors:1; /* 0x286:0x1 0x1 */
u8 sk_txtime_unused:6; /* 0x286:0x2 0x1 */
/* XXX 1 byte hole, try to pack */
void * sk_user_data; /* 0x288 0x8 */
void * sk_security; /* 0x290 0x8 */
struct sock_cgroup_data sk_cgrp_data; /* 0x298 0x8 */
void (*sk_state_change)(struct sock *); /* 0x2a0 0x8 */
void (*sk_write_space)(struct sock *); /* 0x2a8 0x8 */
void (*sk_error_report)(struct sock *); /* 0x2b0 0x8 */
int (*sk_backlog_rcv)(struct sock *, struct sk_buff *); /* 0x2b8 0x8 */
/* --- cacheline 11 boundary (704 bytes) --- */
void (*sk_destruct)(struct sock *); /* 0x2c0 0x8 */
rcu * sk_reuseport_cb; /* 0x2c8 0x8 */
rcu * sk_bpf_storage; /* 0x2d0 0x8 */
struct callback_head sk_rcu __attribute__((__aligned__(8))); /* 0x2d8 0x10 */
netns_tracker ns_tracker; /* 0x2e8 0x8 */
/* size: 752, cachelines: 12, members: 105 */
/* sum members: 749, holes: 1, sum holes: 1 */
/* sum bitfield members: 12 bits, bit holes: 1, sum bit holes: 4 bits */
/* paddings: 1, sum paddings: 4 */
/* forced alignments: 1 */
/* last cacheline: 48 bytes */
};
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240216162006.2342759-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Use struct netmem* instead of page in skb_frag_t. Currently struct
netmem* is always a struct page underneath, but the abstraction
allows efforts to add support for skb frags not backed by pages.
There is unfortunately 1 instance where the skb_frag_t is assumed to be
a exactly a bio_vec in kcm. For this case, WARN_ON_ONCE and return error
before doing a cast.
Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
that the API can be used to create netmem skbs.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add the netmem_ref type, an abstraction for network memory.
To add support for new memory types to the net stack, we must first
abstract the current memory type. Currently parts of the net stack
use struct page directly:
- page_pool
- drivers
- skb_frag_t
Originally the plan was to reuse struct page* for the new memory types,
and to set the LSB on the page* to indicate it's not really a page.
However, for compiler type checking we need to introduce a new type.
netmem_ref is introduced to abstract the underlying memory type.
Currently it's a no-op abstraction that is always a struct page
underneath. In parallel there is an undergoing effort to add support
for devmem to the net stack:
https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrymina@google.com/
netmem_ref can be pointers to different underlying memory types, and the
low bits are set to indicate the memory type. Helpers are provided
to convert netmem pointers to the underlying memory type (currently only
struct page). In the devmem series helpers are provided so that calling
code can use netmem without worrying about the underlying memory type
unless absolutely necessary.
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Use global percpu page_pool_recycle_stats counter for system page_pool
allocator instead of allocating a separate percpu variable for each
(also percpu) page pool instance.
Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/87f572425e98faea3da45f76c3c68815c01a20ee.1708075412.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that direct recycling is performed basing on pool->cpuid when set,
memory leaks are possible:
1. A pool is destroyed.
2. Alloc cache is emptied (it's done only once).
3. pool->cpuid is still set.
4. napi_pp_put_page() does direct recycling basing on pool->cpuid.
5. Now alloc cache is not empty, but it won't ever be freed.
In order to avoid that, rewrite pool->cpuid to -1 when unlinking NAPI to
make sure no direct recycling will be possible after emptying the cache.
This involves a bit of overhead as pool->cpuid now must be accessed
via READ_ONCE() to avoid partial reads.
Rename page_pool_unlink_napi() -> page_pool_disable_direct_recycling()
to reflect what it actually does and unexport it.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240215113905.96817-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct tc_pedit.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since the introduction of the subflow ULP diag interface, the
dump callback accessed all the subflow data with lockless.
We need either to annotate all the read and write operation accordingly,
or acquire the subflow socket lock. Let's do latter, even if slower, to
avoid a diffstat havoc.
Fixes: 5147dfb50832 ("mptcp: allow dumping subflow context to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As a prerequisite for adding EEE CAP2 register support, complement
PHY_EEE_CAP1_FEATURES with PHY_EEE_CAP2_FEATURES.
For now only 2500baseT and 5000baseT modes are supported.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This adds helpers for accessing the EEE CAP2 registers.
For now only 2500baseT and 5000baseT modes are supported.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / miscdriver fixes from Greg KH:
"Here is a small set of char/misc and IIO driver fixes for 6.8-rc5.
Included in here are:
- lots of iio driver fixes for reported issues
- nvmem device naming fixup for reported problem
- interconnect driver fixes for reported issues
All of these have been in linux-next for a while with no reported the
issues (the nvmem patch was included in a different branch in
linux-next before sent to me for inclusion here)"
* tag 'char-misc-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
nvmem: include bit index in cell sysfs file name
iio: adc: ad4130: only set GPIO_CTRL if pin is unused
iio: adc: ad4130: zero-initialize clock init data
interconnect: qcom: x1e80100: Add missing ACV enable_mask
interconnect: qcom: sm8650: Use correct ACV enable_mask
iio: accel: bma400: Fix a compilation problem
iio: commom: st_sensors: ensure proper DMA alignment
iio: hid-sensor-als: Return 0 for HID_USAGE_SENSOR_TIME_TIMESTAMP
iio: move LIGHT_UVA and LIGHT_UVB to the end of iio_modifier
staging: iio: ad5933: fix type mismatch regression
iio: humidity: hdc3020: fix temperature offset
iio: adc: ad7091r8: Fix error code in ad7091r8_gpio_setup()
iio: adc: ad_sigma_delta: ensure proper DMA alignment
iio: imu: adis: ensure proper DMA alignment
iio: humidity: hdc3020: Add Makefile, Kconfig and MAINTAINERS entry
iio: imu: bno055: serdev requires REGMAP
iio: magnetometer: rm3100: add boundary check for the value read from RM3100_REG_TMRC
iio: pressure: bmp280: Add missing bmp085 to SPI id table
iio: core: fix memleak in iio_device_register_sysfs
interconnect: qcom: sm8550: Enable sync_state
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial fixes from Greg KH:
"Here are three small tty and serial driver fixes for 6.8-rc5:
- revert a 8250_pci1xxxx off-by-one change that was incorrect
- two changes to fix the transmit path of the mxs-auart driver,
fixing a regression in the 6.2 release
All of these have been in linux-next for over a week with no reported
issues"
* tag 'tty-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: mxs-auart: fix tx
serial: core: introduce uart_port_tx_flags()
serial: 8250_pci1xxxx: partially revert off by one patch
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt fixes from Greg KH:
"Here are two small fixes for 6.8-rc5:
- thunderbolt to fix a reported issue on many platforms
- dwc3 driver revert of a commit that caused problems in -rc1
Both of these changes have been in linux-next for over a week with no
reported issues"
* tag 'usb-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
Revert "usb: dwc3: Support EBC feature of DWC_usb31"
thunderbolt: Fix setting the CNS bit in ROUTER_CS_5
|
|
numa_fill_memblks() fills in the gaps in numa_meminfo memblks over a
physical address range. To do so, it first creates a list of existing
memblks that overlap that address range. The issue is that it is off
by one when comparing to the end of the address range, so memblks
that do not overlap are selected.
The impact of selecting a memblk that does not actually overlap is
that an existing memblk may be filled when the expected action is to
do nothing and return NUMA_NO_MEMBLK to the caller. The caller can
then add a new NUMA node and memblk.
Replace the broken open-coded search for address overlap with the
memblock helper memblock_addrs_overlap(). Update the kernel doc
and in code comments.
Suggested by: "Huang, Ying" <ying.huang@intel.com>
Fixes: 8f012db27c95 ("x86/numa: Introduce numa_fill_memblks()")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/10a3e6109c34c21a8dd4c513cf63df63481a2b07.1705085543.git.alison.schofield@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Pull block fixes from Jens Axboe:
"Just an nvme pull request via Keith:
- Fabrics connection error handling (Chaitanya)
- Use relaxed effects to reduce unnecessary queue freezes (Keith)"
* tag 'block-6.8-2024-02-16' of git://git.kernel.dk/linux:
nvmet: remove superfluous initialization
nvme: implement support for relaxed effects
nvme-fabrics: fix I/O connect error handling
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix the #ifndef that didn't have the 'CONFIG_' prefix on
HAVE_DYNAMIC_FTRACE_WITH_REGS
The fix to have dynamic trampolines work with x86 broke arm64 as the
config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the
previous fix was to fix.
- Fix tracing_on state
The code to test if "tracing_on" is set incorrectly used
ring_buffer_record_is_on() which returns false if the ring buffer
isn't able to be written to.
But the ring buffer disable has several bits that disable it. One is
internal disabling which is used for resizing and other modifications
of the ring buffer. But the "tracing_on" user space visible flag
should only report if tracing is actually on and not internally
disabled, as this can cause confusion as writing "1" when it is
disabled will not enable it.
Instead use ring_buffer_record_is_set_on() which shows the user space
visible settings.
- Fix a false positive kmemleak on saved cmdlines
Now that the saved_cmdlines structure is allocated via alloc_page()
and not via kmalloc() it has become invisible to kmemleak. The
allocation done to one of its pointers was flagged as a dangling
allocation leak. Make kmemleak aware of this allocation and free.
- Fix synthetic event dynamic strings
An update that cleaned up the synthetic event code removed the return
value of trace_string(), and had it return zero instead of the
length, causing dynamic strings in the synthetic event to always have
zero size.
- Clean up documentation and header files for seq_buf
* tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
seq_buf: Fix kernel documentation
seq_buf: Don't use "proxy" headers
tracing/synthetic: Fix trace_string() return value
tracing: Inform kmemleak of saved_cmdlines allocation
tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of device-specific fixes. It became a bit bigger than
wished, but all look reasonably small and safe to apply.
- A few Cirrus Logic CS35L56 and CS42L43 driver fixes
- ASoC SOF fixes and workarounds
- Various ASoC Intel fixes
- Lots of HD-, USB-audio and AMD ACP quirks"
* tag 'sound-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (33 commits)
ALSA: usb-audio: More relaxed check of MIDI jack names
ALSA: hda/realtek: fix mute/micmute LED For HP mt645
ALSA: hda/realtek: cs35l41: Fix order and duplicates in quirks table
ALSA: hda/realtek: cs35l41: Fix device ID / model name
ALSA: hda/realtek: cs35l41: Add internal speaker support for ASUS UM3402 with missing DSD
ASoC: cs35l56: Workaround for ACPI with broken spk-id-gpios property
ALSA: hda: Add Lenovo Legion 7i gen7 sound quirk
ASoC: SOF: IPC3: fix message bounds on ipc ops
ASoC: SOF: ipc4-pcm: Workaround for crashed firmware on system suspend
ASoC: q6dsp: fix event handler prototype
ASoC: SOF: Intel: pci-lnl: Change the topology path to intel/sof-ipc4-tplg
ASoC: SOF: Intel: pci-tgl: Change the default paths and firmware names
ASoC: amd: yc: Fix non-functional mic on Lenovo 82UU
ASoC: rt5645: Add DMI quirk for inverted jack-detect on MeeGoPad T8
ASoC: rt5645: Make LattePanda board DMI match more precise
ASoC: SOF: amd: Fix locking in ACP IRQ handler
ASoC: rt5645: Fix deadlock in rt5645_jack_detect_work()
ASoC: Intel: cht_bsw_rt5645: Cleanup codec_name handling
ASoC: Intel: Boards: Fix NULL pointer deref in BYT/CHT boards
ASoC: cs35l56: Remove default from IRQ1_CFG register
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- add missing stubs for functions that are not built with GPIOLIB
disabled
* tag 'gpio-fixes-for-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: add gpio_device_get_label() stub for !GPIOLIB
gpiolib: add gpio_device_get_base() stub for !GPIOLIB
gpiolib: add gpiod_to_gpio_device() stub for !GPIOLIB
|
|
Before this change, generation of the list of MDB events to replay
would race against the creation of new group memberships, either from
the IGMP/MLD snooping logic or from user configuration.
While new memberships are immediately visible to walkers of
br->mdb_list, the notification of their existence to switchdev event
subscribers is deferred until a later point in time. So if a replay
list was generated during a time that overlapped with such a window,
it would also contain a replay of the not-yet-delivered event.
The driver would thus receive two copies of what the bridge internally
considered to be one single event. On destruction of the bridge, only
a single membership deletion event was therefore sent. As a
consequence of this, drivers which reference count memberships (at
least DSA), would be left with orphan groups in their hardware
database when the bridge was destroyed.
This is only an issue when replaying additions. While deletion events
may still be pending on the deferred queue, they will already have
been removed from br->mdb_list, so no duplicates can be generated in
that scenario.
To a user this meant that old group memberships, from a bridge in
which a port was previously attached, could be reanimated (in
hardware) when the port joined a new bridge, without the new bridge's
knowledge.
For example, on an mv88e6xxx system, create a snooping bridge and
immediately add a port to it:
root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
> ip link set dev x3 up master br0
And then destroy the bridge:
root@infix-06-0b-00:~$ ip link del dev br0
root@infix-06-0b-00:~$ mvls atu
ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a
DEV:0 Marvell 88E6393X
33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . .
33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . .
ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a
root@infix-06-0b-00:~$
The two IPv6 groups remain in the hardware database because the
port (x3) is notified of the host's membership twice: once via the
original event and once via a replay. Since only a single delete
notification is sent, the count remains at 1 when the bridge is
destroyed.
Then add the same port (or another port belonging to the same hardware
domain) to a new bridge, this time with snooping disabled:
root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
> ip link set dev x3 up master br1
All multicast, including the two IPv6 groups from br0, should now be
flooded, according to the policy of br1. But instead the old
memberships are still active in the hardware database, causing the
switch to only forward traffic to those groups towards the CPU (port
0).
Eliminate the race in two steps:
1. Grab the write-side lock of the MDB while generating the replay
list.
This prevents new memberships from showing up while we are generating
the replay list. But it leaves the scenario in which a deferred event
was already generated, but not delivered, before we grabbed the
lock. Therefore:
2. Make sure that no deferred version of a replay event is already
enqueued to the switchdev deferred queue, before adding it to the
replay list, when replaying additions.
Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Stanislaw Gruszka says:
====================
thermal/netlink/intel_hfi: Enable HFI feature only when required
The patchset introduces a new genetlink family bind/unbind callbacks
and thermal/netlink notifications, which allow drivers to send netlink
multicast events based on the presence of actual user-space consumers.
This functionality optimizes resource usage by allowing disabling
of features when not needed.
v1: https://lore.kernel.org/linux-pm/20240131120535.933424-1-stanislaw.gruszka@linux.intel.com//
v2: https://lore.kernel.org/linux-pm/20240206133605.1518373-1-stanislaw.gruszka@linux.intel.com/
v3: https://lore.kernel.org/linux-pm/20240209120625.1775017-1-stanislaw.gruszka@linux.intel.com/
====================
Link: https://lore.kernel.org/r/20240212161615.161935-1-stanislaw.gruszka@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add genetlink family bind()/unbind() callbacks when adding/removing
multicast group to/from netlink client socket via setsockopt() or
bind() syscall.
They can be used to track if consumers of netlink multicast messages
emerge or disappear. Thus, a client implementing callbacks, can now
send events only when there are active consumers, preventing unnecessary
work when none exist.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240212161615.161935-2-stanislaw.gruszka@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
net/core/dev.c
9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()")
723de3ebef03 ("net: free altname using an RCU callback")
net/unix/garbage.c
11498715f266 ("af_unix: Remove io_uring code for GC.")
25236c91b5ab ("af_unix: Fix task hung while purging oob_skb in GC.")
drivers/net/ethernet/renesas/ravb_main.c
ed4adc07207d ("net: ravb: Count packets instead of descriptors in GbEth RX path"
)
c2da9408579d ("ravb: Add Rx checksum offload support for GbEth")
net/mptcp/protocol.c
bdd70eb68913 ("mptcp: drop the push_pending field")
28e5c1380506 ("mptcp: annotate lockless accesses around read-mostly fields")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit c92a6b5d6335 ("scsi: core: Query VPD size before getting full
page") removed the logic which checks whether a VPD page is present on
the supported pages list before asking for the page itself. That was
done because SPC helpfully states "The Supported VPD Pages VPD page
list may or may not include all the VPD pages that are able to be
returned by the device server". Testing had revealed a few devices
that supported some of the 0xBn pages but didn't actually list them in
page 0.
Julian Sikorski bisected a problem with his drive resetting during
discovery to the commit above. As it turns out, this particular drive
firmware will crash if we attempt to fetch page 0xB9.
Various approaches were attempted to work around this. In the end,
reinstating the logic that consults VPD page 0 before fetching any
other page was the path of least resistance. A firmware update for the
devices which originally compelled us to remove the check has since
been released.
Link: https://lore.kernel.org/r/20240214221411.2888112-1-martin.petersen@oracle.com
Fixes: c92a6b5d6335 ("scsi: core: Query VPD size before getting full page")
Cc: stable@vger.kernel.org
Cc: Bart Van Assche <bvanassche@acm.org>
Reported-by: Julian Sikorski <belegdol@gmail.com>
Tested-by: Julian Sikorski <belegdol@gmail.com>
Reviewed-by: Lee Duncan <lee.duncan@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from can, wireless and netfilter.
Current release - regressions:
- af_unix: fix task hung while purging oob_skb in GC
- pds_core: do not try to run health-thread in VF path
Current release - new code bugs:
- sched: act_mirred: don't zero blockid when net device is being
deleted
Previous releases - regressions:
- netfilter:
- nat: restore default DNAT behavior
- nf_tables: fix bidirectional offload, broken when unidirectional
offload support was added
- openvswitch: limit the number of recursions from action sets
- eth: i40e: do not allow untrusted VF to remove administratively set
MAC address
Previous releases - always broken:
- tls: fix races and bugs in use of async crypto
- mptcp: prevent data races on some of the main socket fields, fix
races in fastopen handling
- dpll: fix possible deadlock during netlink dump operation
- dsa: lan966x: fix crash when adding interface under a lag when some
of the ports are disabled
- can: j1939: prevent deadlock by changing j1939_socks_lock to rwlock
Misc:
- a handful of fixes and reliability improvements for selftests
- fix sysfs documentation missing net/ in paths
- finish the work of squashing the missing MODULE_DESCRIPTION()
warnings in networking"
* tag 'net-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (92 commits)
net: fill in MODULE_DESCRIPTION()s for missing arcnet
net: fill in MODULE_DESCRIPTION()s for mdio_devres
net: fill in MODULE_DESCRIPTION()s for ppp
net: fill in MODULE_DESCRIPTION()s for fddik/skfp
net: fill in MODULE_DESCRIPTION()s for plip
net: fill in MODULE_DESCRIPTION()s for ieee802154/fakelb
net: fill in MODULE_DESCRIPTION()s for xen-netback
net: ravb: Count packets instead of descriptors in GbEth RX path
pppoe: Fix memory leak in pppoe_sendmsg()
net: sctp: fix skb leak in sctp_inq_free()
net: bcmasp: Handle RX buffer allocation failure
net-timestamp: make sk_tskey more predictable in error path
selftests: tls: increase the wait in poll_partial_rec_async
ice: Add check for lport extraction to LAG init
netfilter: nf_tables: fix bidirectional offload regression
netfilter: nat: restore default DNAT behavior
netfilter: nft_set_pipapo: fix missing : in kdoc
igc: Remove temporary workaround
igb: Fix string truncation warnings in igb_set_fw_version
can: netlink: Fix TDCO calculation using the old data bittiming
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Fixes and simple cleanups:
- use a proper flexible array instead of a one-element array in order
to avoid array-bounds sanitizer errors
- add NULL pointer checks after allocating memory
- use memdup_array_user() instead of open-coding it
- fix a rare race condition in Xen event channel allocation code
- make struct bus_type instances const
- make kerneldoc inline comments match reality"
* tag 'for-linus-6.8a-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: close evtchn after mapping cleanup
xen/gntalloc: Replace UAPI 1-element array
xen: balloon: make balloon_subsys const
xen: pcpu: make xen_pcpu_subsys const
xen/privcmd: Use memdup_array_user() in alloc_ioreq()
x86/xen: Add some null pointer checking to smp.c
xen/xenbus: document will_handle argument for xenbus_watch_path()
|
|
In commit 4356e9f841f7 ("work around gcc bugs with 'asm goto' with
outputs") I did the gcc workaround unconditionally, because the cause of
the bad code generation wasn't entirely clear.
In the meantime, Jakub Jelinek debugged the issue, and has come up with
a fix in gcc [2], which also got backported to the still maintained
branches of gcc-11, gcc-12 and gcc-13.
Note that while the fix technically wasn't in the original gcc-14
branch, Jakub says:
"while it is true that no GCC 14 snapshots until today (or whenever the
fix will be committed) have the fix, for GCC trunk it is up to the
distros to use the latest snapshot if they use it at all and would
allow better testing of the kernel code without the workaround, so
that if there are other issues they won't be discovered years later.
Most userland code doesn't actually use asm goto with outputs..."
so we will consider gcc-14 to be fixed - if somebody is using gcc
snapshots of the gcc-14 before the fix, they should upgrade.
Note that while the bug goes back to gcc-11, in practice other gcc
changes seem to have effectively hidden it since gcc-12.1 as per a
bisect by Jakub. So even a gcc-14 snapshot without the fix likely
doesn't show actual problems.
Also, make the default 'asm_goto_output()' macro mark the asm as
volatile by hand, because of an unrelated gcc issue [1] where it doesn't
match the documented behavior ("asm goto is always volatile").
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103979 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921 [2]
Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/
Requested-by: Jakub Jelinek <jakub@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Andrew Pinski <quic_apinski@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|