summaryrefslogtreecommitdiff
path: root/net/core/dev.h
AgeCommit message (Collapse)Author
2026-04-21net: introduce ndo_set_rx_mode_async and netdev_rx_mode_workStanislav Fomichev
Add ndo_set_rx_mode_async callback that drivers can implement instead of the legacy ndo_set_rx_mode. The legacy callback runs under the netif_addr_lock spinlock with BHs disabled, preventing drivers from sleeping. The async variant runs from a work queue with rtnl_lock and netdev_lock_ops held, in fully sleepable context. When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules netdev_rx_mode_work instead of calling the driver inline. The work function takes two snapshots of each address list (uc/mc) under the addr_lock, then drops the lock and calls the driver with the work copies. After the driver returns, it reconciles the snapshots back to the real lists under the lock. Add netif_rx_mode_sync() to opportunistically execute the pending workqueue update inline, so that rx mode changes are committed before returning to userspace: - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK) - dev_set_promiscuity - dev_set_allmulti - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI - do_setlink (RTM_SETLINK) Note that some deep hierarchies still do skip the lower updates via: - dev_uc_sync - dev_mc_sync If we do end up hitting user-visible issues, we can add more calls to netif_rx_mode_sync in specific places. But hopefully we should not, the actual user-visible lists are still synced, it's that just HW state that might be lagging. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-21net: add address list snapshot and reconciliation infrastructureStanislav Fomichev
Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile() for use by the upcoming ndo_set_rx_mode_async callback. The async rx_mode path needs to snapshot the device's unicast and multicast address lists under the addr_lock, hand those snapshots to the driver (which may sleep), and then propagate any sync_cnt changes back to the real lists. Two identical snapshots are taken: a work copy for the driver to pass to __hw_addr_sync_dev() and a reference copy to compute deltas against. __hw_addr_list_reconcile() walks the reference snapshot comparing each entry against the work snapshot to determine what the driver synced or unsynced. It then applies those deltas to the real list, handling concurrent modifications: - If the real entry was concurrently removed but the driver synced it to hardware (delta > 0), re-insert a stale entry so the next work run properly unsyncs it from hardware. - If the entry still exists, apply the delta normally. An entry whose refcount drops to zero is removed. # dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter # dev_addr_test_snapshot_benchmark.speed: slow Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-2-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-12net: fix reference tracker mismanagement in netdev_put_lock()Jakub Kicinski
dev_put() releases a reference which didn't have a tracker. References without a tracker are accounted in the tracking code as "no_tracker". We can't free the tracker and then call dev_put(). The references themselves will be fine but the tracking code will think it's a double-release: refcount_t: decrement hit 0; leaking memory. IOW commit under fixes confused dev_put() (release never tracked reference) with __dev_put() (just release the reference, skipping the reference tracking infra). Since __netdev_put_lock() uses dev_put() we can't feed a previously tracked netdev ref into it. Let's flip things around. netdev_put(dev, NULL) is the same as dev_put(dev) so make netdev_put_lock() the real function and have __netdev_put_lock() feed it a NULL tracker for all the cases that were untracked. Fixes: d04686d9bc86 ("net: Implement netdev_nl_queue_create_doit") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20260410153600.1984522-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09net: remove the netif_get_rx_queue_lease_locked() helpersJakub Kicinski
The netif_get_rx_queue_lease_locked() API hides the locking and the descend onto the leased queue. Making the code harder to follow (at least to me). Remove the API and open code the descend a bit. Most of the code now looks like: if (!leased) return __helper(x); hw_rxq = .. netdev_lock(hw_rxq->dev); ret = __helper(x); netdev_unlock(hw_rxq->dev); return ret; Of course if we have more code paths that need the wrapping we may need to revisit. For now, IMHO, having to know what netif_get_rx_queue_lease_locked() does is not worth the 20LoC it saves. Link: https://patch.msgid.link/20260408151251.72bd2482@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09net: Proxy netif_mp_{open,close}_rxq for leased queuesDavid Wei
When a process in a container wants to setup a memory provider, it will use the virtual netdev and a leased rxq, and call netif_mp_{open,close}_rxq to try and restart the queue. At this point, proxy the queue restart on the real rxq in the physical netdev. For memory providers (io_uring zero-copy rx and devmem), it causes the real rxq in the physical netdev to be filled from a memory provider that has DMA mapped memory from a process within a container. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-7-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09net: Implement netdev_nl_queue_create_doitDaniel Borkmann
Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ynl --family netdev --output-json --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-10net: add xmit recursion limit to tunnel xmit functionsWeiming Shi
Tunnel xmit functions (iptunnel_xmit, ip6tunnel_xmit) lack their own recursion limit. When a bond device in broadcast mode has GRE tap interfaces as slaves, and those GRE tunnels route back through the bond, multicast/broadcast traffic triggers infinite recursion between bond_xmit_broadcast() and ip_tunnel_xmit()/ip6_tnl_xmit(), causing kernel stack overflow. The existing XMIT_RECURSION_LIMIT (8) in the no-qdisc path is not sufficient because tunnel recursion involves route lookups and full IP output, consuming much more stack per level. Use a lower limit of 4 (IP_TUNNEL_RECURSION_LIMIT) to prevent overflow. Add recursion detection using dev_xmit_recursion helpers directly in iptunnel_xmit() and ip6tunnel_xmit() to cover all IPv4/IPv6 tunnel paths including UDP encapsulated tunnels (VXLAN, Geneve, etc.). Move dev_xmit_recursion helpers from net/core/dev.h to public header include/linux/netdevice.h so they can be used by tunnel code. BUG: KASAN: stack-out-of-bounds in blake2s.constprop.0+0xe7/0x160 Write of size 32 at addr ffff88810033fed0 by task kworker/0:1/11 Workqueue: mld mld_ifc_work Call Trace: <TASK> __build_flow_key.constprop.0 (net/ipv4/route.c:515) ip_rt_update_pmtu (net/ipv4/route.c:1073) iptunnel_xmit (net/ipv4/ip_tunnel_core.c:84) ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847) gre_tap_xmit (net/ipv4/ip_gre.c:779) dev_hard_start_xmit (net/core/dev.c:3887) sch_direct_xmit (net/sched/sch_generic.c:347) __dev_queue_xmit (net/core/dev.c:4802) bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312) bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279) bond_start_xmit (drivers/net/bonding/bond_main.c:5530) dev_hard_start_xmit (net/core/dev.c:3887) __dev_queue_xmit (net/core/dev.c:4841) ip_finish_output2 (net/ipv4/ip_output.c:237) ip_output (net/ipv4/ip_output.c:438) iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86) gre_tap_xmit (net/ipv4/ip_gre.c:779) dev_hard_start_xmit (net/core/dev.c:3887) sch_direct_xmit (net/sched/sch_generic.c:347) __dev_queue_xmit (net/core/dev.c:4802) bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312) bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279) bond_start_xmit (drivers/net/bonding/bond_main.c:5530) dev_hard_start_xmit (net/core/dev.c:3887) __dev_queue_xmit (net/core/dev.c:4841) ip_finish_output2 (net/ipv4/ip_output.c:237) ip_output (net/ipv4/ip_output.c:438) iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86) ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847) gre_tap_xmit (net/ipv4/ip_gre.c:779) dev_hard_start_xmit (net/core/dev.c:3887) sch_direct_xmit (net/sched/sch_generic.c:347) __dev_queue_xmit (net/core/dev.c:4802) bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312) bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279) bond_start_xmit (drivers/net/bonding/bond_main.c:5530) dev_hard_start_xmit (net/core/dev.c:3887) __dev_queue_xmit (net/core/dev.c:4841) mld_sendpack mld_ifc_work process_one_work worker_thread </TASK> Fixes: 745e20f1b626 ("net: add a recursion limit in xmit path") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260306160133.3852900-2-bestswngs@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-23net: add queue config validation callbackJakub Kicinski
I imagine (tm) that as the number of per-queue configuration options grows some of them may conflict for certain drivers. While the drivers can obviously do all the validation locally doing so is fairly inconvenient as the config is fed to drivers piecemeal via different ops (for different params and NIC-wide vs per-queue). Add a centralized callback for validating the queue config in queue ops. The callback gets invoked before memory provider is installed, and in the future should also be called when ring params are modified. The validation is done after each layer of configuration. Since we can't fail MP un-binding we must make sure that the config is valid both before and after MP overrides are applied. This is moot for now since the set of MP and device configs are disjoint. It will matter significantly in the future, so adding it now so that we don't forget.. Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'"Jakub Kicinski
This reverts commit 77b9c4a438fc66e2ab004c411056b3fb71a54f2c, reversing changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497: 931420a2fc36 ("selftests/net: Add netkit container tests") ab771c938d9a ("selftests/net: Make NetDrvContEnv support queue leasing") 6be87fbb2776 ("selftests/net: Add env for container based tests") 61d99ce3dfc2 ("selftests/net: Add bpf skb forwarding program") 920da3634194 ("netkit: Add xsk support for af_xdp applications") eef51113f8af ("netkit: Add netkit notifier to check for unregistering devices") b5ef109d22d4 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create") b5c3fa4a0b16 ("netkit: Add single device mode for netkit") 0073d2fd679d ("xsk: Proxy pool management for leased queues") 1ecea95dd3b5 ("xsk: Extend xsk_rcv_check validation") 804bf334d08a ("net: Proxy netdev_queue_get_dma_dev for leased queues") 0caa9a8ddec3 ("net: Proxy net_mp_{open,close}_rxq for leased queues") ff8889ff9107 ("net, ethtool: Disallow leased real rxqs to be resized") 9e2103f36110 ("net: Add lease info to queue-get response") 31127deddef4 ("net: Implement netdev_nl_queue_create_doit") a5546e18f77c ("net: Add queue-create operation") The series will conflict with io_uring work, and the code needs more polish. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20net: Implement netdev_nl_queue_create_doitDaniel Borkmann
Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-03Merge tag 'for-6.19/io_uring-20251201' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Unify how task_work cancelations are detected, placing it in the task_work running state rather than needing to check the task state - Series cleaning up and moving the cancelation code to where it belongs, in cancel.c - Cleanup of waitid and futex argument handling - Add support for mixed sized SQEs. 6.18 added support for mixed sized CQEs, improving flexibility and efficiency of workloads that need big CQEs. This adds similar support for SQEs, where the occasional need for a 128b SQE doesn't necessitate having all SQEs be 128b in size - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx features are available. And both return the ring size information to help with allocation size calculation for user provided rings like IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER - Zcrx updates for 6.19. It includes a bunch of small patches, IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing zcrx b/w multiple io_uring instances - Series cleaning up ring initializations, notable deduplicating ring size and offset calculations. It also moves most of the checking before doing any allocations, making the code simpler - Add support for getsockname and getpeername, which is mostly a trivial hookup after a bit of refactoring on the networking side - Various fixes and cleanups * tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring: Introduce getsockname io_uring cmd socket: Split out a getsockname helper for io_uring socket: Unify getsockname and getpeername implementation io_uring/query: drop unused io_handle_query_entry() ctx arg io_uring/kbuf: remove obsolete buf_nr_pages and update comments io_uring/register: use correct location for io_rings_layout io_uring/zcrx: share an ifq between rings io_uring/zcrx: add io_fill_zcrx_offsets() io_uring/zcrx: export zcrx via a file io_uring/zcrx: move io_zcrx_scrub() and dependencies up io_uring/zcrx: count zcrx users io_uring/zcrx: add sync refill queue flushing io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL io_uring/zcrx: elide passing msg flags io_uring/zcrx: use folio_nr_pages() instead of shift operation io_uring/zcrx: convert to use netmem_desc io_uring/query: introduce rings info query io_uring/query: introduce zcrx query io_uring: move cq/sq user offset init around io_uring: pre-calculate scq layout ...
2025-11-11net: export netdev_get_by_index_lock()David Wei
Need to call netdev_get_by_index_lock() from io_uring/zcrx.c, but it is currently private to net. Export the function in linux/netdevice.h. Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03net: Extend NAPI threaded polling to allow kthread based busy pollingSamiullah Khawaja
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to enable and disable threaded busy polling. When threaded busy polling is enabled for a NAPI, enable NAPI_STATE_THREADED also. When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to signal napi_complete_done not to rearm interrupts. Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread go to sleep. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-30net: add NUMA awareness to skb_attempt_defer_free()Eric Dumazet
Instead of sharing sd->defer_list & sd->defer_count with many cpus, add one pair for each NUMA node. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250928084934.3266948-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-12net: prevent deadlocks when enabling NAPIs with mixed kthread configJakub Kicinski
The following order of calls currently deadlocks if: - device has threaded=1; and - NAPI has persistent config with threaded=0. netif_napi_add_weight_config() dev->threaded == 1 napi_kthread_create() napi_enable() napi_restore_config() napi_set_threaded(0) napi_stop_kthread() while (NAPIF_STATE_SCHED) msleep(20) We deadlock because disabled NAPI has STATE_SCHED set. Creating a thread in netif_napi_add() just to destroy it in napi_disable() is fairly ugly in the first place. Let's read both the device config and the NAPI config in netif_napi_add(). Fixes: e6d76268813d ("net: Update threaded state in napi config in netif_set_threaded") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20250809001205.1147153-4-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-24net: define an enum for the napi threaded stateSamiullah Khawaja
Instead of using '0' and '1' for napi threaded state use an enum with 'disabled' and 'enabled' states. Tested: ./tools/testing/selftests/net/nl_netdev.py TAP version 13 1..7 ok 1 nl_netdev.empty_check ok 2 nl_netdev.lo_check ok 3 nl_netdev.page_pool_check ok 4 nl_netdev.napi_list_check ok 5 nl_netdev.dev_set_threaded ok 6 nl_netdev.napi_set_threaded ok 7 nl_netdev.nsim_rxq_reset_down # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20250723013031.2911384-4-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24net: Use netif_threaded_enable instead of netif_set_threaded in driversSamiullah Khawaja
Prepare for adding an enum type for NAPI threaded states by adding netif_threaded_enable API. De-export the existing netif_set_threaded API and only use it internally. Update existing drivers to use netif_threaded_enable instead of the de-exported netif_set_threaded. Note that dev_set_threaded used by mt76 debugfs file is unchanged. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20250723013031.2911384-3-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14Add support to set NAPI threaded for individual NAPISamiullah Khawaja
A net device has a threaded sysctl that can be used to enable threaded NAPI polling on all of the NAPI contexts under that device. Allow enabling threaded NAPI polling at individual NAPI level using netlink. Extend the netlink operation `napi-set` and allow setting the threaded attribute of a NAPI. This will enable the threaded polling on a NAPI context. Add a test in `nl_netdev.py` that verifies various cases of threaded NAPI being set at NAPI and at device level. Tested ./tools/testing/selftests/net/nl_netdev.py TAP version 13 1..7 ok 1 nl_netdev.empty_check ok 2 nl_netdev.lo_check ok 3 nl_netdev.page_pool_check ok 4 nl_netdev.napi_list_check ok 5 nl_netdev.dev_set_threaded ok 6 nl_netdev.napi_set_threaded ok 7 nl_netdev.nsim_rxq_reset_down # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250710211203.3979655-1-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09netdev: add "ops compat locking" helpersJakub Kicinski
Add helpers to "lock a netdev in a backward-compatible way", which for ops-locked netdevs will mean take the instance lock. For drivers which haven't opted into the ops locking we'll take rtnl_lock. The scoped foreach is dropping and re-taking the lock for each device, even if prev and next are both under rtnl_lock. I hope that's fine since we expect that netdev nl to be mostly supported by modern drivers, and modern drivers should also opt into the instance locking. Note that these helpers are mostly needed for queue related state, because drivers modify queue config in their ops in a non-atomic way. Or differently put, queue changes don't have a clear-cut API like NAPI configuration. Any state that can should just use the instance lock directly, not the "compat" hacks. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: avoid potential race between netdev_get_by_index_lock() and netns switchJakub Kicinski
netdev_get_by_index_lock() performs following steps: rcu_lock(); dev = lookup(netns, ifindex); dev_get(dev); rcu_unlock(); [... lock & validate the dev ...] return dev Validation right now only checks if the device is registered but since the lookup is netns-aware we must also protect against the device switching netns right after we dropped the RCU lock. Otherwise the caller in netns1 may get a pointer to a device which has just switched to netns2. We can't hold the lock for the entire netns change process (because of the NETDEV_UNREGISTER notifier), and there's no existing marking to indicate that the netns is unlisted because of netns move, so add one. AFAIU none of the existing netdev_get_by_index_lock() callers can suffer from this problem (NAPI code double checks the netns membership and other callers are either under rtnl_lock or not ns-sensitive), so this patch does not have to be treated as a fix. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-08net: rps: remove kfree_rcu_mightsleep() useEric Dumazet
Add an rcu_head to sd_flow_limit and rps_sock_flow_table structs to use the more conventional and predictable k[v]free_rcu(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250407163602.170356-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-08net: rps: annotate data-races around (struct sd_flow_limit)->countEric Dumazet
softnet_seq_show() can read fl->count while another cpu updates this field from skb_flow_limit(). Make this field an 'unsigned int', as its only consumer only deals with 32 bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250407163602.170356-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-08net: rps: change skb_flow_limit() hash functionEric Dumazet
As explained in commit f3483c8e1da6 ("net: rfs: hash function change"), masking low order bits of skb_get_hash(skb) has low entropy. A NIC with 32 RX queues uses the 5 low order bits of rss key to select a queue. This means all packets landing to a given queue share the same 5 low order bits. Switch to hash_32() to reduce hash collisions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250407163602.170356-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-19Revert "net: replace dev_addr_sem with netdev instance lock"Stanislav Fomichev
This reverts commit df43d8bf10316a7c3b1e47e3cc0057a54df4a5b8. Cc: Kohei Enju <enjuk@amazon.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Fixes: df43d8bf1031 ("net: replace dev_addr_sem with netdev instance lock") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250312190513.1252045-2-sdf@fomichev.me Tested-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-08net: move misc netdev_lock flavors to a separate headerJakub Kicinski
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: replace dev_addr_sem with netdev instance lockStanislav Fomichev
Lockdep reports possible circular dependency in [0]. Instead of fixing the ordering, replace global dev_addr_sem with netdev instance lock. Most of the paths that set/get mac are RTNL protected. Two places where it's not, convert to explicit locking: - sysfs address_show - dev_get_mac_address via dev_ioctl 0: https://netdev-3.bots.linux.dev/vmksft-forwarding-dbg/results/993321/24-router-bridge-1d-lag-sh/stderr Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-12-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: hold netdev instance lock during rtnetlink operationsStanislav Fomichev
To preserve the atomicity, hold the lock while applying multiple attributes. The major issue with a full conversion to the instance lock are software nesting devices (bonding/team/vrf/etc). Those devices call into the core stack for their lower (potentially real hw) devices. To avoid explicitly wrapping all those places into instance lock/unlock, introduce new API boundaries: - (some) existing dev_xxx calls are now considered "external" (to drivers) APIs and they transparently grab the instance lock if needed (dev_api.c) - new netif_xxx calls are internal core stack API (naming is sketchy, I've tried netdev_xxx_locked per Jakub's suggestion, but it feels a bit verbose; but happy to get back to this naming scheme if this is the preference) This avoids touching most of the existing ioctl/sysfs/drivers paths. Note the special handling of ndo_xxx_slave operations: I exploit the fact that none of the drivers that call these functions need/use instance lock. At the same time, they use dev_xxx APIs, so the lower device has to be unlocked. Changes in unregister_netdevice_many_notify (to protect dev->state with instance lock) trigger lockdep - the loop over close_list (mostly from cleanup_net) introduces spurious ordering issues. netdev_lock_cmp_fn has a justification on why it's ok to suppress for now. Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-7-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: hold netdev instance lock during ndo_open/ndo_stopStanislav Fomichev
For the drivers that use shaper API, switch to the mode where core stack holds the netdev lock. This affects two drivers: * iavf - already grabs netdev lock in ndo_open/ndo_stop, so mostly remove these * netdevsim - switch to _locked APIs to avoid deadlock iavf_close diff is a bit confusing, the existing call looks like this: iavf_close() { netdev_lock() .. netdev_unlock() wait_event_timeout(down_waitqueue) } I change it to the following: netdev_lock() iavf_close() { .. netdev_unlock() wait_event_timeout(down_waitqueue) netdev_lock() // reusing this lock call } netdev_unlock() Since I'm reusing existing netdev_lock call, so it looks like I only add netdev_unlock. Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-07net: page_pool: avoid false positive warning if NAPI was never addedJakub Kicinski
We expect NAPI to be in disabled state when page pool is torn down. But it is also legal if the NAPI is completely uninitialized. Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250206225638.1387810-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: make netdev netlink ops hold netdev_lock()Jakub Kicinski
In prep for dropping rtnl_lock, start locking netdev->lock in netlink genl ops. We need to be using netdev->up instead of flags & IFF_UP. We can remove the RCU lock protection for the NAPI since NAPI list is protected by netdev->lock already. Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: add netdev->up protected by netdev_lock()Jakub Kicinski
Some uAPI (netdev netlink) hide net_device's sub-objects while the interface is down to ensure uniform behavior across drivers. To remove the rtnl_lock dependency from those uAPIs we need a way to safely tell if the device is down or up. Add an indication of whether device is open or closed, protected by netdev->lock. The semantics are the same as IFF_UP, but taking netdev_lock around every write to ->flags would be a lot of code churn. We don't want to blanket the entire open / close path by netdev_lock, because it will prevent us from applying it to specific structures - core helpers won't be able to take that lock from any function called by the drivers on open/close paths. So the state of the flag is "pessimistic", as in it may report false negatives, but never false positives. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: add helpers for lookup and walking netdevs under netdev_lock()Jakub Kicinski
Add helpers for accessing netdevs under netdev_lock(). There's some careful handling needed to find the device and lock it safely, without it getting unregistered, and without taking rtnl_lock (the latter being the whole point of the new locking, after all). Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-10net: hide the definition of dev_get_by_napi_id()Jakub Kicinski
There are no module callers of dev_get_by_napi_id(), and commit d1cacd747768 ("netdev: prevent accessing NAPI instances from another namespace") proves that getting NAPI by id needs to be done with care. So hide dev_get_by_napi_id(). Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250110004924.3212260-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc7). Conflicts: a42d71e322a8 ("net_sched: sch_cake: Add drop reasons") 737d4d91d35b ("sched: sch_cake: add bounds checks to host bulk flow fairness counts") Adjacent changes: drivers/net/ethernet/meta/fbnic/fbnic.h 3a856ab34726 ("eth: fbnic: add IRQ reuse support") 95978931d55f ("eth: fbnic: Revert "eth: fbnic: Add hardware monitoring support via HWMON interface"") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-08netdev: prevent accessing NAPI instances from another namespaceJakub Kicinski
The NAPI IDs were not fully exposed to user space prior to the netlink API, so they were never namespaced. The netlink API must ensure that at the very least NAPI instance belongs to the same netns as the owner of the genl sock. napi_by_id() can become static now, but it needs to move because of dev_get_by_napi_id(). Cc: stable@vger.kernel.org Fixes: 1287c1ae0fc2 ("netdev-genl: Support setting per-NAPI config values") Fixes: 27f91aaf49b3 ("netdev-genl: Add netlink framework functions for napi") Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250106180137.1861472-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16net: Make net_hwtstamp_validate accessibleKory Maincent
Make the net_hwtstamp_validate function accessible in prevision to use it from ethtool to validate the hwtstamp configuration before setting it. Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16net: Make dev_get_hwtstamp_phylib accessibleKory Maincent
Make the dev_get_hwtstamp_phylib function accessible in prevision to use it from ethtool to read the hwtstamp current configuration. Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-11-11net: Add napi_struct parameter irq_suspend_timeoutMartin Karsten
Add a per-NAPI IRQ suspension parameter, which can be get/set with netdev-genl. This patch doesn't change any behavior but prepares the code for other changes in the following commits which use irq_suspend_timeout as a timeout for IRQ suspension. Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Joe Damato <jdamato@fastly.com> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://patch.msgid.link/20241109050245.191288-2-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: napi: Add napi_configJoe Damato
Add a persistent NAPI config area for NAPI configuration to the core. Drivers opt-in to setting the persistent config for a NAPI by passing an index when calling netif_napi_add_config. napi_config is allocated in alloc_netdev_mqs, freed in free_netdev (after the NAPIs are deleted). Drivers which call netif_napi_add_config will have persistent per-NAPI settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings. Per-NAPI settings are saved in napi_disable and restored in napi_enable. Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241011184527.16393-6-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: napi: Make gro_flush_timeout per-NAPIJoe Damato
Allow per-NAPI gro_flush_timeout setting. The existing sysfs parameter is respected; writes to sysfs will write to all NAPI structs for the device and the net_device gro_flush_timeout field. Reads from sysfs will read from the net_device field. The ability to set gro_flush_timeout on specific NAPI instances will be added in a later commit, via netdev-genl. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-4-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: napi: Make napi_defer_hard_irqs per-NAPIJoe Damato
Add defer_hard_irqs to napi_struct in preparation for per-NAPI settings. The existing sysfs parameter is respected; writes to sysfs will write to all NAPI structs for the device and the net_device defer_hard_irq field. Reads from sysfs show the net_device field. The ability to set defer_hard_irqs on specific NAPI instances will be added in a later commit, via netdev-genl. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-2-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10net-shapers: implement shaper cleanup on queue deletionPaolo Abeni
hook into netif_set_real_num_tx_queues() to cleanup any shaper configured on top of the to-be-destroyed TX queues. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/6da4ee03cae2b2a757d7b59e88baf09cc94c5ef1.1728460186.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10net-shapers: implement NL get operationPaolo Abeni
Introduce the basic infrastructure to implement the net-shaper core functionality. Each network devices carries a net-shaper cache, the NL get() operation fetches the data from such cache. The cache is initially empty, will be fill by the set()/group() operation implemented later and is destroyed at device cleanup time. The net_shaper_fill_handle(), net_shaper_ctx_init(), and net_shaper_generic_pre() implementations handle generic index type attributes, despite the current caller always pass a constant value to avoid more noise in later patches using them with different attributes. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/ddd10fd645a9367803ad02fca4a5664ea5ace170.1728460186.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24net: softnet_data: Make xmit per task.Sebastian Andrzej Siewior
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in local_bh_disable() there is no guarantee that only one device is transmitting at a time. With preemption and multiple senders it is possible that the per-CPU `recursion' counter gets incremented by different threads and exceeds XMIT_RECURSION_LIMIT leading to a false positive recursion alert. The `more' member is subject to similar problems if set by one thread for one driver and wrongly used by another driver within another thread. Instead of adding a lock to protect the per-CPU variable it is simpler to make xmit per-task. Sending and receiving skbs happens always in thread context anyway. Having a lock to protected the per-CPU counter would block/ serialize two sending threads needlessly. It would also require a recursive lock to ensure that the owner can increment the counter further. Make the softnet_data.xmit a task_struct member on PREEMPT_RT. Add needed wrapper. Cc: Ben Segall <bsegall@google.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20240620132727.660738-9-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17net: Move dev_set_hwtstamp_phylib to net/core/dev.hKory Maincent
This declaration was added to the header to be called from ethtool. ethtool is separated from core for code organization but it is not really a separate entity, it controls very core things. As ethtool is an internal stuff it is not wise to have it in netdevice.h. Move the declaration to net/core/dev.h instead. Remove the EXPORT_SYMBOL_GPL call as ethtool can not be built as a module. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://lore.kernel.org/r/20240612-feature_ptp_netnext-v15-2-b2a086257b63@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30net: move sysctl_skb_defer_max to net_hotdataEric Dumazet
sysctl_skb_defer_max is used in TCP fast path, move it to net_hodata. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-01net: move dev_xmit_recursion() helpers to net/core/dev.hEric Dumazet
Move dev_xmit_recursion() and friends to net/core/dev.h They are only used from net/core/dev.c and net/core/filter.c. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: move kick_defer_list_purge() to net/core/dev.hEric Dumazet
kick_defer_list_purge() is defined in net/core/dev.c and used from net/core/skubff.c Because we need softnet_data, include <linux/netdevice.h> from net/core/dev.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-07net: move netdev_tstamp_prequeue into net_hotdataEric Dumazet
netdev_tstamp_prequeue is used in rx path. Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07net: move netdev_budget and netdev_budget to net_hotdataEric Dumazet
netdev_budget and netdev_budget are used in rx path (net_rx_action()) Move them into net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>