summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
33 hoursMerge tag 'net-next-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Continue Netlink conversions to per-namespace RTNL lock (IPv4 routing, routing rules, routing next hops, ARP ioctls) - Continue extending the use of netdev instance locks. As a driver opt-in protect queue operations and (in due course) ethtool operations with the instance lock and not RTNL lock. - Support collecting TCP timestamps (data submitted, sent, acked) in BPF, allowing for transparent (to the application) and lower overhead tracking of TCP RPC performance. - Tweak existing networking Rx zero-copy infra to support zero-copy Rx via io_uring. - Optimize MPTCP performance in single subflow mode by 29%. - Enable GRO on packets which went thru XDP CPU redirect (were queued for processing on a different CPU). Improving TCP stream performance up to 2x. - Improve performance of contended connect() by 200% by searching for an available 4-tuple under RCU rather than a spin lock. Bring an additional 229% improvement by tweaking hash distribution. - Avoid unconditionally touching sk_tsflags on RX, improving performance under UDP flood by as much as 10%. - Avoid skb_clone() dance in ping_rcv() to improve performance under ping flood. - Avoid FIB lookup in netfilter if socket is available, 20% perf win. - Rework network device creation (in-kernel) API to more clearly identify network namespaces and their roles. There are up to 4 namespace roles but we used to have just 2 netns pointer arguments, interpreted differently based on context. - Use sysfs_break_active_protection() instead of trylock to avoid deadlocks between unregistering objects and sysfs access. - Add a new sysctl and sockopt for capping max retransmit timeout in TCP. - Support masking port and DSCP in routing rule matches. - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST. - Support specifying at what time packet should be sent on AF_XDP sockets. - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users. - Add Netlink YAML spec for WiFi (nl80211) and conntrack. - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols which only need to be exported when IPv6 support is built as a module. - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to normal bridging. - Allow users to specify source port range for GENEVE tunnels. - netconsole: allow attaching kernel release, CPU ID and task name to messages as metadata Driver API: - Continue rework / fixing of Energy Efficient Ethernet (EEE) across the SW layers. Delegate the responsibilities to phylink where possible. Improve its handling in phylib. - Support symmetric OR-XOR RSS hashing algorithm. - Support tracking and preserving IRQ affinity by NAPI itself. - Support loopback mode speed selection for interface selftests. Device drivers: - Remove the IBM LCS driver for s390 - Remove the sb1000 cable modem driver - Add support for SFP module access over SMBus - Add MCTP transport driver for MCTP-over-USB - Enable XDP metadata support in multiple drivers - Ethernet high-speed NICs: - Broadcom (bnxt): - add PCIe TLP Processing Hints (TPH) support for new AMD platforms - support dumping RoCE queue state for debug - opt into instance locking - Intel (100G, ice, idpf): - ice: rework MSI-X IRQ management and distribution - ice: support for E830 devices - iavf: add support for Rx timestamping - iavf: opt into instance locking - nVidia/Mellanox: - mlx4: use page pool memory allocator for Rx - mlx5: support for one PTP device per hardware clock - mlx5: support for 200Gbps per-lane link modes - mlx5: move IPSec policy check after decryption - AMD/Solarflare: - support FW flashing via devlink - Cisco (enic): - use page pool memory allocator for Rx - enable 32, 64 byte CQEs - get max rx/tx ring size from the device - Meta (fbnic): - support flow steering and RSS configuration - report queue stats - support TCP segmentation - support IRQ coalescing - support ring size configuration - Marvell/Cavium: - support AF_XDP - Wangxun: - support for PTP clock and timestamping - Huawei (hibmcge): - checksum offload - add more statistics - Ethernet virtual: - VirtIO net: - aggressively suppress Tx completions, improve perf by 96% with 1 CPU and 55% with 2 CPUs - expose NAPI to IRQ mapping and persist NAPI settings - Google (gve): - support XDP in DQO RDA Queue Format - opt into instance locking - Microsoft vNIC: - support BIG TCP - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - cleanup Tx and Tx clock setting and other link-focused cleanups - enable SGMII and 2500BASEX mode switching for Intel platforms - support Sophgo SG2044 - Broadcom switches (b53): - support for BCM53101 - TI: - iep: add perout configuration support - icssg: support XDP - Cadence (macb): - implement BQL - Xilinx (axinet): - support dynamic IRQ moderation and changing coalescing at runtime - implement BQL - report standard stats - MediaTek: - support phylink managed EEE - Intel: - igc: don't restart the interface on every XDP program change - RealTek (r8169): - support reading registers of internal PHYs directly - increase max jumbo packet size on RTL8125/RTL8126 - Airoha: - support for RISC-V NPU packet processing unit - enable scatter-gather and support MTU up to 9kB - Tehuti (tn40xx): - support cards with TN4010 MAC and an Aquantia AQR105 PHY - Ethernet PHYs: - support for TJA1102S, TJA1121 - dp83tg720: add randomized polling intervals for link detection - dp83822: support changing the transmit amplitude voltage - support for LEDs on 88q2xxx - CAN: - canxl: support Remote Request Substitution bit access - flexcan: add S32G2/S32G3 SoC - WiFi: - remove cooked monitor support - strict mode for better AP testing - basic EPCS support - OMI RX bandwidth reduction support - batman-adv: add support for jumbo frames - WiFi drivers: - RealTek (rtw88): - support RTL8814AE and RTL8814AU - RealTek (rtw89): - switch using wiphy_lock and wiphy_work - add BB context to manipulate two PHY as preparation of MLO - improve BT-coexistence mechanism to play A2DP smoothly - Intel (iwlwifi): - add new iwlmld sub-driver for latest HW/FW combinations - MediaTek (mt76): - preparation for mt7996 Multi-Link Operation (MLO) support - Qualcomm/Atheros (ath12k): - continued work on MLO - Silabs (wfx): - Wake-on-WLAN support - Bluetooth: - add support for skb TX SND/COMPLETION timestamping - hci_core: enable buffer flow control for SCO/eSCO - coredump: log devcd dumps into the monitor - Bluetooth drivers: - intel: add support to configure TX power - nxp: handle bootloader error during cmd5 and cmd7" * tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits) unix: fix up for "apparmor: add fine grained af_unix mediation" mctp: Fix incorrect tx flow invalidation condition in mctp-i2c net: usb: asix: ax88772: Increase phy_name size net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string net: libwx: fix Tx L4 checksum net: libwx: fix Tx descriptor content for some tunnel packets atm: Fix NULL pointer dereference net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus net: phy: aquantia: add essential functions to aqr105 driver net: phy: aquantia: search for firmware-name in fwnode net: phy: aquantia: add probe function to aqr105 for firmware loading net: phy: Add swnode support to mdiobus_scan gve: add XDP DROP and PASS support for DQ gve: update XDP allocation path support RX buffer posting gve: merge packet buffer size fields gve: update GQ RX to use buf_size gve: introduce config-based allocation for XDP gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics ...
46 hoursMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in late fixes to prepare for the 6.15 net-next PR. No conflicts, adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 919f9f497dbc ("eth: bnxt: fix out-of-range access of vnic_info array") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
46 hoursunix: fix up for "apparmor: add fine grained af_unix mediation"Stephen Rothwell
After merging the apparmor tree, today's linux-next build (x86_64 allmodconfig) failed like this: security/apparmor/af_unix.c: In function 'unix_state_double_lock': security/apparmor/af_unix.c:627:17: error: implicit declaration of function 'unix_state_lock'; did you mean 'unix_state_double_lock'? [-Wimplicit-function-declaration] 627 | unix_state_lock(sk1); | ^~~~~~~~~~~~~~~ | unix_state_double_lock security/apparmor/af_unix.c: In function 'unix_state_double_unlock': security/apparmor/af_unix.c:642:17: error: implicit declaration of function 'unix_state_unlock'; did you mean 'unix_state_double_lock'? [-Wimplicit-function-declaration] 642 | unix_state_unlock(sk1); | ^~~~~~~~~~~~~~~~~ | unix_state_double_lock Caused by commit c05e705812d1 ("apparmor: add fine grained af_unix mediation") interacting with commit 84960bf24031 ("af_unix: Move internal definitions to net/unix/.") from the net-next tree. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://patch.msgid.link/20250326150148.72d9138d@canb.auug.org.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysMerge tag 'crc-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: "Another set of improvements to the kernel's CRC (cyclic redundancy check) code: - Rework the CRC64 library functions to be directly optimized, like what I did last cycle for the CRC32 and CRC-T10DIF library functions - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ support and acceleration for crc64_be and crc64_nvme - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for crc_t10dif, crc64_be, and crc64_nvme - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they are no longer needed there - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect - Add kunit test cases for crc64_nvme and crc7 - Eliminate redundant functions for calculating the Castagnoli CRC32, settling on just crc32c() - Remove unnecessary prompts from some of the CRC kconfig options - Further optimize the x86 crc32c code" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits) x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512 lib/crc: remove unnecessary prompt for CONFIG_CRC64 lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C lib/crc: remove unnecessary prompt for CONFIG_CRC8 lib/crc: remove unnecessary prompt for CONFIG_CRC7 lib/crc: remove unnecessary prompt for CONFIG_CRC4 lib/crc7: unexport crc7_be_syndrome_table lib/crc_kunit.c: update comment in crc_benchmark() lib/crc_kunit.c: add test and benchmark for crc7_be() x86/crc32: optimize tail handling for crc32c short inputs riscv/crc64: add Zbc optimized CRC64 functions riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function riscv/crc32: reimplement the CRC32 functions using new template riscv/crc: add "template" for Zbc optimized CRC functions x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings x86/crc32: improve crc32c_arch() code generation with clang x86/crc64: implement crc64_be and crc64_nvme using new template x86/crc-t10dif: implement crc_t10dif using new template x86/crc32: implement crc32_le using new template x86/crc: add "template" for [V]PCLMULQDQ based CRC functions ...
3 daysMerge tag 'for-net-next-2025-03-25' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: core: - Add support for skb TX SND/COMPLETION timestamping - hci_core: Enable buffer flow control for SCO/eSCO - coredump: Log devcd dumps into the monitor drivers: - btusb: Add 2 HWIDs for MT7922 - btusb: Fix regression in the initialization of fake Bluetooth controllers - btusb: Add 14 USB device IDs for Qualcomm WCN785x - btintel: Add support for Intel Scorpius Peak - btintel: Add support to configure TX power - btintel: Add DSBR support for ScP - btintel_pcie: Add device id of Whale Peak - btintel_pcie: Setup buffers for firmware traces - btintel_pcie: Read hardware exception data - btintel_pcie: Add support for device coredump - btintel_pcie: Trigger device coredump on hardware exception - btnxpuart: Support for controller wakeup gpio config - btnxpuart: Add support to set BD address - btnxpuart: Add correct bootloader error codes - btnxpuart: Handle bootloader error during cmd5 and cmd7 - btnxpuart: Fix kernel panic during FW release - qca: add WCN3950 support - hci_qca: use the power sequencer for wcn6750 - btmtksdio: Prevent enabling interrupts after IRQ handler removal * tag 'for-net-next-2025-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (53 commits) Bluetooth: MGMT: Add LL Privacy Setting Bluetooth: hci_event: Fix handling of HCI_EV_LE_DIRECT_ADV_REPORT Bluetooth: btnxpuart: Fix kernel panic during FW release Bluetooth: btnxpuart: Handle bootloader error during cmd5 and cmd7 Bluetooth: btnxpuart: Add correct bootloader error codes t blameBluetooth: btintel: Fix leading white space Bluetooth: btintel: Add support to configure TX power Bluetooth: btmtksdio: Prevent enabling interrupts after IRQ handler removal Bluetooth: btmtk: Remove the resetting step before downloading the fw Bluetooth: SCO: add TX timestamping Bluetooth: L2CAP: add TX timestamping Bluetooth: ISO: add TX timestamping Bluetooth: add support for skb TX SND/COMPLETION timestamping net-timestamp: COMPLETION timestamp on packet tx completion HCI: coredump: Log devcd dumps into the monitor Bluetooth: HCI: Add definition of hci_rp_remote_name_req_cancel Bluetooth: hci_vhci: Mark Sync Flow Control as supported Bluetooth: hci_core: Enable buffer flow control for SCO/eSCO Bluetooth: btintel_pci: Fix build warning Bluetooth: btintel_pcie: Trigger device coredump on hardware exception ... ==================== Link: https://patch.msgid.link/20250325192925.2497890-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysBluetooth: MGMT: Add LL Privacy SettingLuiz Augusto von Dentz
This adds LL Privacy (bit 22) to Read Controller Information so the likes of bluetoothd(1) can detect when the controller supports it or not. Fixes: e209e5ccc5ac ("Bluetooth: MGMT: Mark LL Privacy as stable") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daystcp/dccp: remove icsk->icsk_ack.timeoutEric Dumazet
icsk->icsk_ack.timeout can be replaced by icsk->csk_delack_timer.expires This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250324203607.703850-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daystcp/dccp: remove icsk->icsk_timeoutEric Dumazet
icsk->icsk_timeout can be replaced by icsk->icsk_retransmit_timer.expires This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250324203607.703850-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysnet: designate queue -> napi linking as "ops protected"Jakub Kicinski
netdev netlink is the only reader of netdev_{,rx_}queue->napi, and it already holds netdev->lock. Switch protection of the writes to netdev->lock to "ops protected". The expectation will be now that accessing queue->napi will require netdev->lock for "ops locked" drivers, and rtnl_lock for all other drivers. Current "ops locked" drivers don't require any changes. gve and netdevsim use _locked() helpers right next to netif_queue_set_napi() so they must be holding the instance lock. iavf doesn't call it. bnxt is a bit messy but all paths seem locked. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysnet: explain "protection types" for the instance lockJakub Kicinski
Try to define some terminology for which fields are protected by which lock and how. Some fields are protected by both rtnl_lock and instance lock which is hard to talk about without having a "key phrase" to refer to a particular protection scheme. "ops protected" fields are defined later in the series, one by one. Add ASSERT_RTNL() to netdev_ops_assert_locked() for drivers not other instance protection of ops. Hopefully it's not too confusion that netdev_lock_ops() does not match the lock which netdev_ops_assert_locked() will assert, exactly. The noun "ops" is in a different place in the name, so I think it's acceptable... Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysnet: constify dev pointer in misc instance lock helpersJakub Kicinski
lockdep asserts and predicates can operate on const pointers. In the future this will let us add asserts in functions which operate on const pointers like dev_get_min_mp_channel_count(). Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysBluetooth: L2CAP: add TX timestampingPauli Virtanen
Support TX timestamping in L2CAP sockets. Support MSG_ERRQUEUE recvmsg. For other than SOCK_STREAM L2CAP sockets, if a packet from sendmsg() is fragmented, only the first ACL fragment is timestamped. For SOCK_STREAM L2CAP sockets, use the bytestream convention and timestamp the last fragment and count bytes in tskey. Timestamps are not generated in the Enhanced Retransmission mode, as meaning of COMPLETION stamp is unclear if L2CAP layer retransmits. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysBluetooth: ISO: add TX timestampingPauli Virtanen
Add BT_SCM_ERROR socket CMSG type. Support TX timestamping in ISO sockets. Support MSG_ERRQUEUE in ISO recvmsg. If a packet from sendmsg() is fragmented, only the first ACL fragment is timestamped. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysBluetooth: add support for skb TX SND/COMPLETION timestampingPauli Virtanen
Support enabling TX timestamping for some skbs, and track them until packet completion. Generate software SCM_TSTAMP_COMPLETION when getting completion report from hardware. Generate software SCM_TSTAMP_SND before sending to driver. Sending from driver requires changes in the driver API, and drivers mostly are going to send the skb immediately. Make the default situation with no COMPLETION TX timestamping more efficient by only counting packets in the queue when there is nothing to track. When there is something to track, we need to make clones, since the driver may modify sent skbs. The tx_q queue length is bounded by the hdev flow control, which will not send new packets before it has got completion reports for old ones. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysBluetooth: HCI: Add definition of hci_rp_remote_name_req_cancelWentao Guan
Return Parameters is not only status, also bdaddr: BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E page 1870: BLUETOOTH CORE SPECIFICATION Version 5.0 | Vol 2, Part E page 802: Return parameters: Status: Size: 1 octet BD_ADDR: Size: 6 octets Note that it also fixes the warning: "Bluetooth: hci0: unexpected cc 0x041a length: 7 > 1" Fixes: c8992cffbe741 ("Bluetooth: hci_event: Use of a function table to handle Command Complete") Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysBluetooth: hci_core: Enable buffer flow control for SCO/eSCOLuiz Augusto von Dentz
This enables buffer flow control for SCO/eSCO (see: Bluetooth Core 6.0 spec: 6.22. Synchronous Flow Control Enable), recently this has caused the following problem and is actually a nice addition for the likes of Socket TX complete: < HCI Command: Read Buffer Size (0x04|0x0005) plen 0 > HCI Event: Command Complete (0x0e) plen 11 Read Buffer Size (0x04|0x0005) ncmd 1 Status: Success (0x00) ACL MTU: 1021 ACL max packet: 5 SCO MTU: 240 SCO max packet: 8 ... < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 > HCI Event: Hardware Error (0x10) plen 1 Code: 0x0a To fix the code will now attempt to enable buffer flow control when HCI_QUIRK_SYNC_FLOWCTL_SUPPORTED is set by the driver: < HCI Command: Write Sync Fl.. (0x03|0x002f) plen 1 Flow control: Enabled (0x01) > HCI Event: Command Complete (0x0e) plen 4 Write Sync Flow Control Enable (0x03|0x002f) ncmd 1 Status: Success (0x00) On success then HCI_SCO_FLOWCTL would be set which indicates sco_cnt shall be used for flow contro. Fixes: 7fedd3bb6b77 ("Bluetooth: Prioritize SCO traffic") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Pauli Virtanen <pav@iki.fi>
3 daysBluetooth: Add quirk for broken READ_PAGE_SCAN_TYPEPedro Nishiyama
Some fake controllers cannot be initialized because they return a smaller report than expected for READ_PAGE_SCAN_TYPE. Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysBluetooth: Add quirk for broken READ_VOICE_SETTINGPedro Nishiyama
Some fake controllers cannot be initialized because they return a smaller report than expected for READ_VOICE_SETTING. Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysBluetooth: L2CAP: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies() for readability. Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysBluetooth: MGMT: Remove unused mgmt_*_discovery_completeDr. David Alan Gilbert
mgmt_start_discovery_complete() and mgmt_stop_discovery_complete() last uses were removed in 2022 by commit ec2904c259c5 ("Bluetooth: Remove dead code from hci_request.c") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
3 daysRevert "udp_tunnel: GRO optimizations"Jakub Kicinski
Revert "udp_tunnel: use static call for GRO hooks when possible" This reverts commit 311b36574ceaccfa3f91b74054a09cd4bb877702. Revert "udp_tunnel: create a fastpath GRO lookup." This reverts commit 8d4880db378350f8ed8969feea13bdc164564fc1. There are multiple small issues with the series. In the interest of unblocking the merge window let's opt for a revert. Link: https://lore.kernel.org/cover.1742557254.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysMerge tag 'ipsec-next-2025-03-24' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2025-03-24 1) Prevent setting high order sequence number bits input in non-ESN mode. From Leon Romanovsky. 2) Support PMTU handling in tunnel mode for packet offload. From Leon Romanovsky. 3) Make xfrm_state_lookup_byaddr lockless. From Florian Westphal. 4) Remove unnecessary NULL check in xfrm_lookup_with_ifid(). From Dan Carpenter. * tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: Remove unnecessary NULL check in xfrm_lookup_with_ifid() xfrm: state: make xfrm_state_lookup_byaddr lockless xfrm: check for PMTU in tunnel mode for packet offload xfrm: provide common xdo_dev_offload_ok callback implementation xfrm: rely on XFRM offload xfrm: simplify SA initialization routine xfrm: delay initialization of offload path till its actually requested xfrm: prevent high SEQ input in non-ESN mode ==================== Link: https://patch.msgid.link/20250324061855.4116819-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysMerge tag 'nf-next-25-03-23' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: 1) Use kvmalloc in xt_hashlimit, from Denis Kirjanov. 2) Tighten nf_conntrack sysctl accepted values for nf_conntrack_max and nf_ct_expect_max, from Nicolas Bouchinet. 3) Avoid lookup in nft_fib if socket is available, from Florian Westphal. 4) Initialize struct lsm_context in nfnetlink_queue to avoid hypothetical ENOMEM errors, Chenyuan Yang. 5) Use strscpy() instead of _pad when initializing xtables table name, kzalloc is already used to initialized the table memory area. From Thorsten Blum. 6) Missing socket lookup by conntrack information for IPv6 traffic in nft_socket, there is a similar chunk in IPv4, this was never added when IPv6 NAT was introduced. From Maxim Mikityanskiy. 7) Fix clang issues with nf_tables CONFIG_MITIGATION_RETPOLINE, from WangYuli. * tag 'nf-next-25-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: Only use nf_skip_indirect_calls() when MITIGATION_RETPOLINE netfilter: socket: Lookup orig tuple for IPv6 SNAT netfilter: xtables: Use strscpy() instead of strscpy_pad() netfilter: nfnetlink_queue: Initialize ctx to avoid memory allocation error netfilter: fib: avoid lookup if socket is available netfilter: conntrack: Bound nf_conntrack sysctl writes netfilter: xt_hashlimit: replace vmalloc calls with kvmalloc ==================== Link: https://patch.msgid.link/20250323100922.59983-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysnet: rfs: hash function changeEric Dumazet
RFS is using two kinds of hash tables. First one is controlled by /proc/sys/net/core/rps_sock_flow_entries = 2^N and using the N low order bits of the l4 hash is good enough. Then each RX queue has its own hash table, controlled by /sys/class/net/eth1/queues/rx-$q/rps_flow_cnt = 2^X Current hash function, using the X low order bits is suboptimal, because RSS is usually using Func(hash) = (hash % power_of_two); For example, with 32 RX queues, 6 low order bits have no entropy for a given queue. Switch this hash function to hash_32(hash, log) to increase chances to use all possible slots and reduce collisions. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250321171309.634100-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysMerge tag 'wireless-next-2025-03-20' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== More features for 6.15, major changes: * cfg80211/mac80211: fix and enable link reconfiguration * rtw88: support RTL8814AE/RTL8814AU * mt7996: preparations for MLO * ath12k: continued work on MLO * iwlwifi: add new iwlmld sub-driver/op-mode for some current and future devices * wfx: wowlan support * tag 'wireless-next-2025-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (311 commits) wifi: mt76: mt7996: fix locking in mt7996_mac_sta_rc_work() wifi: mt76: mt76x2u: add TP-Link TL-WDN6200 ID to device table wifi: mt76: mt792x: re-register CHANCTX_STA_CSA only for the mt7921 series wifi: mt76: mt7996: Update mt7996_tx to MLO support wifi: mt76: mt7996: rework mt7996_ampdu_action to support MLO wifi: mt76: mt7996: rework set/get_tsf callabcks to support MLO wifi: mt76: mt7996: set vif default link_id adding/removing vif links wifi: mt76: mt7996: rework mt7996_mcu_beacon_inband_discov to support MLO wifi: mt76: mt7996: rework mt7996_mcu_add_obss_spr to support MLO wifi: mt76: mt7996: rework mt7996_net_fill_forward_path to support MLO wifi: mt76: mt7996: rework mt7996_update_mu_group to support MLO wifi: mt76: mt7996: rework mt7996_mac_sta_poll to support MLO wifi: mt76: mt7996: rework mt7996_mac_sta_rc_work to support MLO wifi: mt76: mt7996: remove mt7996_mac_enable_rtscts() wifi: mt76: mt7996: rework mt7996_sta_hw_queue_read to support MLO wifi: mt76: mt7996: rework mt7996_set_hw_key to support MLO wifi: mt76: mt7996: Add mt7996_sta_link to mt7996_mcu_add_bss_info signature wifi: mt76: mt7996: rework mt7996_sta_set_4addr and mt7996_sta_set_decap_offload to support MLO wifi: mt76: mt7996: rework mt7996_rx_get_wcid to support MLO wifi: mt76: mt7996: Rely on wcid_to_sta in mt7996_mac_add_txs_skb() ... ==================== Link: https://patch.msgid.link/20250320131106.33266-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysbonding: check xdp prog when set bond modeWang Liang
Following operations can trigger a warning[1]: ip netns add ns1 ip netns exec ns1 ip link add bond0 type bond mode balance-rr ip netns exec ns1 ip link set dev bond0 xdp obj af_xdp_kern.o sec xdp ip netns exec ns1 ip link set bond0 type bond mode broadcast ip netns del ns1 When delete the namespace, dev_xdp_uninstall() is called to remove xdp program on bond dev, and bond_xdp_set() will check the bond mode. If bond mode is changed after attaching xdp program, the warning may occur. Some bond modes (broadcast, etc.) do not support native xdp. Set bond mode with xdp program attached is not good. Add check for xdp program when set bond mode. [1] ------------[ cut here ]------------ WARNING: CPU: 0 PID: 11 at net/core/dev.c:9912 unregister_netdevice_many_notify+0x8d9/0x930 Modules linked in: CPU: 0 UID: 0 PID: 11 Comm: kworker/u4:0 Not tainted 6.14.0-rc4 #107 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:unregister_netdevice_many_notify+0x8d9/0x930 Code: 00 00 48 c7 c6 6f e3 a2 82 48 c7 c7 d0 b3 96 82 e8 9c 10 3e ... RSP: 0018:ffffc90000063d80 EFLAGS: 00000282 RAX: 00000000ffffffa1 RBX: ffff888004959000 RCX: 00000000ffffdfff RDX: 0000000000000000 RSI: 00000000ffffffea RDI: ffffc90000063b48 RBP: ffffc90000063e28 R08: ffffffff82d39b28 R09: 0000000000009ffb R10: 0000000000000175 R11: ffffffff82d09b40 R12: ffff8880049598e8 R13: 0000000000000001 R14: dead000000000100 R15: ffffc90000045000 FS: 0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000d406b60 CR3: 000000000483e000 CR4: 00000000000006f0 Call Trace: <TASK> ? __warn+0x83/0x130 ? unregister_netdevice_many_notify+0x8d9/0x930 ? report_bug+0x18e/0x1a0 ? handle_bug+0x54/0x90 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? unregister_netdevice_many_notify+0x8d9/0x930 ? bond_net_exit_batch_rtnl+0x5c/0x90 cleanup_net+0x237/0x3d0 process_one_work+0x163/0x390 worker_thread+0x293/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xec/0x1e0 ? __pfx_kthread+0x10/0x10 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 9e2ee5c7e7c3 ("net, bonding: Add XDP support to the bonding driver") Signed-off-by: Wang Liang <wangliang74@huawei.com> Acked-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250321044852.1086551-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daystcp: avoid atomic operations on sk->sk_rmem_allocEric Dumazet
TCP uses generic skb_set_owner_r() and sock_rfree() for received packets, with socket lock being owned. Switch to private versions, avoiding two atomic operations per packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250320121604.3342831-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysipv6: fix _DEVADD() and _DEVUPD() macrosEric Dumazet
ip6_rcv_core() is using: __IP6_ADD_STATS(net, idev, IPSTATS_MIB_NOECTPKTS + (ipv6_get_dsfield(hdr) & INET_ECN_MASK), max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs)); This is currently evaluating both expressions twice. Fix _DEVADD() and _DEVUPD() macros to evaluate their arguments once. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250319212516.2385451-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysaf_unix: Explicitly include headers for non-pointer struct fields.Kuniyuki Iwashima
include/net/af_unix.h indirectly includes some definitions for structs. Let's include such headers explicitly. linux/atomic.h : scm_stat.nr_fds linux/net.h : unix_sock.peer_wq linux/path.h : unix_sock.path linux/spinlock.h : unix_sock.lock linux/wait.h : unix_sock.peer_wake uapi/linux/un.h : unix_address.name[] linux/socket.h is removed as the structs there are not used directly, and linux/un.h is clarified with uapi as un.h only exists under include/uapi. While at it, duplicate headers are removed from .c files. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250318034934.86708-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysaf_unix: Move internal definitions to net/unix/.Kuniyuki Iwashima
net/af_unix.h is included by core and some LSMs, but most definitions need not be. Let's move struct unix_{vertex,edge} to net/unix/garbage.c and other definitions to net/unix/af_unix.h. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250318034934.86708-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysaf_unix: Sort headers.Kuniyuki Iwashima
This is a prep patch to make the following changes cleaner. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250318034934.86708-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daystcp: support TCP_RTO_MIN_US for set/getsockopt useJason Xing
Support adjusting/reading RTO MIN for socket level by using set/getsockopt(). This new option has the same effect as TCP_BPF_RTO_MIN, which means it doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. When the socket is created, its icsk_rto_min is set to the default value that is controlled by sysctl_tcp_rto_min_us. Then if application calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then icsk_rto_min will be overridden in jiffies unit. This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around icsk_rto_min. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 daysnet: introduce per netns packet chainsPaolo Abeni
Currently network taps unbound to any interface are linked in the global ptype_all list, affecting the performance in all the network namespaces. Add per netns ptypes chains, so that in the mentioned case only the netns owning the packet socket(s) is affected. While at that drop the global ptype_all list: no in kernel user registers a tap on "any" type without specifying either the target device or the target namespace (and IMHO doing that would not make any sense). Note that this adds a conditional in the fast path (to check for per netns ptype_specific list) and increases the dataset size by a cacheline (owing the per netns lists). Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumaze@google.com> Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 daysMerge tag 'vfs-6.15-rc1.afs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs afs updates from Christian Brauner: "This contains the work for afs for this cycle: - Fix an occasional hang that's only really encountered when rmmod'ing the kafs module - Remove the "-o autocell" mount option. This is obsolete with the dynamic root and removing it makes the next patch slightly easier - Change how the dynamic root mount is constructed. Currently, the root directory is (de)populated when it is (un)mounted if there are cells already configured and, further, pairs of automount points have to be created/removed each time a cell is added/deleted This is changed so that readdir on the root dir lists all the known cell automount pairs plus the @cell symlinks and the inodes and dentries are constructed by lookup on demand. This simplifies the cell management code - A few improvements to the afs_volume and afs_server tracepoints - Pass trace info into the afs_lookup_cell() function to allow the trace log to indicate the purpose of the lookup - Remove the 'net' parameter from afs_unuse_cell() as it's superfluous - In rxrpc, allow a kernel app (such as kafs) to store a word of information on rxrpc_peer records - Use the information stored on the rxrpc_peer record to point to the afs_server record. This allows the server address lookup to be done away with - Simplify the afs_server ref/activity accounting to make each one self-contained and not garbage collected from the cell management work item - Simplify the afs_cell ref/activity accounting to make each one of these also self-contained and not driven by a central management work item The current code was intended to make it such that a single timer for the namespace and one work item per cell could do all the work required to maintain these records. This, however, made for some sequencing problems when cleaning up these records. Further, the attempt to pass refs along with timers and work items made getting it right rather tricky when the timer or work item already had a ref attached and now a ref had to be got rid of" * tag 'vfs-6.15-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Simplify cell record handling afs: Fix afs_server ref accounting afs: Use the per-peer app data provided by rxrpc rxrpc: Allow the app to store private data on peer structs afs: Drop the net parameter from afs_unuse_cell() afs: Make afs_lookup_cell() take a trace note afs: Improve server refcount/active count tracing afs: Improve afs_volume tracing to display a debug ID afs: Change dynroot to create contents on demand afs: Remove the "autocell" mount option
4 daystcp/dccp: Remove inet_connection_sock_af_ops.addr2sockaddr().Kuniyuki Iwashima
inet_connection_sock_af_ops.addr2sockaddr() hasn't been used at all in the git era. $ git grep addr2sockaddr $(git rev-list HEAD | tail -n 1) Let's remove it. Note that there was a 4 bytes hole after sockaddr_len and now it's 6 bytes, so the binary layout is not changed. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250318060112.3729-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 daystcp: move icsk_clean_acked to a better locationEric Dumazet
As a followup of my presentation in Zagreb for netdev 0x19: icsk_clean_acked is only used by TCP when/if CONFIG_TLS_DEVICE is enabled from tcp_ack(). Rename it to tcp_clean_acked, move it to tcp_sock structure in the tcp_sock_read_rx for better cache locality in TCP fast path. Define this field only when CONFIG_TLS_DEVICE is enabled saving 8 bytes on configs not using it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250317085313.2023214-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 daysax25: Remove broken autobindMurad Masimov
Binding AX25 socket by using the autobind feature leads to memory leaks in ax25_connect() and also refcount leaks in ax25_release(). Memory leak was detected with kmemleak: ================================================================ unreferenced object 0xffff8880253cd680 (size 96): backtrace: __kmalloc_node_track_caller_noprof (./include/linux/kmemleak.h:43) kmemdup_noprof (mm/util.c:136) ax25_rt_autobind (net/ax25/ax25_route.c:428) ax25_connect (net/ax25/af_ax25.c:1282) __sys_connect_file (net/socket.c:2045) __sys_connect (net/socket.c:2064) __x64_sys_connect (net/socket.c:2067) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) ================================================================ When socket is bound, refcounts must be incremented the way it is done in ax25_bind() and ax25_setsockopt() (SO_BINDTODEVICE). In case of autobind, the refcounts are not incremented. This bug leads to the following issue reported by Syzkaller: ================================================================ ax25_connect(): syz-executor318 uses autobind, please contact jreuter@yaina.de ------------[ cut here ]------------ refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 5317 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 Modules linked in: CPU: 0 UID: 0 PID: 5317 Comm: syz-executor318 Not tainted 6.14.0-rc4-syzkaller-00278-gece144f151ac #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 ... Call Trace: <TASK> __refcount_dec include/linux/refcount.h:336 [inline] refcount_dec include/linux/refcount.h:351 [inline] ref_tracker_free+0x6af/0x7e0 lib/ref_tracker.c:236 netdev_tracker_free include/linux/netdevice.h:4302 [inline] netdev_put include/linux/netdevice.h:4319 [inline] ax25_release+0x368/0x960 net/ax25/af_ax25.c:1080 __sock_release net/socket.c:647 [inline] sock_close+0xbc/0x240 net/socket.c:1398 __fput+0x3e9/0x9f0 fs/file_table.c:464 __do_sys_close fs/open.c:1580 [inline] __se_sys_close fs/open.c:1565 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1565 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... </TASK> ================================================================ Considering the issues above and the comments left in the code that say: "check if we can remove this feature. It is broken."; "autobinding in this may or may not work"; - it is better to completely remove this feature than to fix it because it is broken and leads to various kinds of memory bugs. Now calling connect() without first binding socket will result in an error (-EINVAL). Userspace software that relies on the autobind feature might get broken. However, this feature does not seem widely used with this specific driver as it was not reliable at any point of time, and it is already broken anyway. E.g. ax25-tools and ax25-apps packages for popular distributions do not use the autobind feature for AF_AX25. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+33841dc6aa3e1d86b78a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33841dc6aa3e1d86b78a Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
7 daysnet: mctp: Remove unnecessary cast in mctp_cbHerbert Xu
The void * cast in mctp_cb is unnecessary as it's already been done at the start of the function. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/Z9PwOQeBSYlgZlHq@gondor.apana.org.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 daysnetfilter: fib: avoid lookup if socket is availableFlorian Westphal
In case the fib match is used from the input hook we can avoid the fib lookup if early demux assigned a socket for us: check that the input interface matches sk-cached one. Rework the existing 'lo bypass' logic to first check sk, then for loopback interface type to elide the fib lookup. This speeds up fib matching a little, before: 93.08 GBit/s (no rules at all) 75.1 GBit/s ("fib saddr . iif oif missing drop" in prerouting) 75.62 GBit/s ("fib saddr . iif oif missing drop" in input) After: 92.48 GBit/s (no rules at all) 75.62 GBit/s (fib rule in prerouting) 90.37 GBit/s (fib rule in input). Numbers for the 'no rules' and 'prerouting' are expected to closely match in-between runs, the 3rd/input test case exercises the the 'avoid lookup if cached ifindex in sk matches' case. Test used iperf3 via veth interface, lo can't be used due to existing loopback test. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.14-rc8). Conflict: tools/testing/selftests/net/Makefile 03544faad761 ("selftest: net: add proc_net_pktgen") 3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops") tools/testing/selftests/net/config: 85cb3711acb8 ("selftests: net: Add test cases for link and peer netns") 3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops") Adjacent commits: tools/testing/selftests/net/Makefile c935af429ec2 ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY") 355d940f4d5a ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 daysmptcp: sysctl: add available_path_managersGeliang Tang
Similarly to net.mptcp.available_schedulers, this patch adds a new one net.mptcp.available_path_managers to list the available path managers. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-11-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 daysmptcp: pm: define struct mptcp_pm_opsGeliang Tang
In order to allow users to develop their own BPF-based path manager, this patch defines a struct ops "mptcp_pm_ops" for an MPTCP path manager, which contains a set of interfaces. Currently only init() and release() interfaces are included, subsequent patches will add others step by step. Add a set of functions to register, unregister, find and validate a given path manager struct ops. "list" is used to add this path manager to mptcp_pm_list list when it is registered. "name" is used to identify this path manager. mptcp_pm_find() uses "name" to find a path manager on the list. mptcp_pm_unregister is not used in this set, but will be invoked in .unreg of struct bpf_struct_ops. mptcp_pm_validate() will be invoked in .validate of struct bpf_struct_ops. That's why they are exported. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-6-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
9 daysMerge tag 'for-net-2025-03-14' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_event: Fix connection regression between LE and non-LE adapters - Fix error code in chan_alloc_skb_cb() * tag 'for-net-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_event: Fix connection regression between LE and non-LE adapters Bluetooth: Fix error code in chan_alloc_skb_cb() ==================== Link: https://patch.msgid.link/20250314163847.110069-1-luiz.dentz@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 daystcp: cache RTAX_QUICKACK metric in a hot cache lineEric Dumazet
tcp_in_quickack_mode() is called from input path for small packets. It calls __sk_dst_get() which reads sk->sk_dst_cache which has been put in sock_read_tx group (for good reasons). Then dst_metric(dst, RTAX_QUICKACK) also needs extra cache line misses. Cache RTAX_QUICKACK in icsk->icsk_ack.dst_quick_ack to no longer pull these cache lines for the cases a delayed ACK is scheduled. After this patch TCP receive path does not longer access sock_read_tx group. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250312083907.1931644-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 daysinet: frags: change inet_frag_kill() to defer refcount updatesEric Dumazet
In the following patch, we no longer assume inet_frag_kill() callers own a reference. Consuming two refcounts from inet_frag_kill() would lead in UAF. Propagate the pointer to the refs that will be consumed later by the final inet_frag_putn() call. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250312082250.1803501-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 daysinet: frags: add inet_frag_putn() helperEric Dumazet
inet_frag_putn() can release multiple references in one step. Use it in inet_frags_free_cb(). Replace inet_frag_put(X) with inet_frag_putn(X, 1) Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250312082250.1803501-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 daysudp_tunnel: use static call for GRO hooks when possiblePaolo Abeni
It's quite common to have a single UDP tunnel type active in the whole system. In such a case we can replace the indirect call for the UDP tunnel GRO callback with a static call. Add the related accounting in the control path and switch to static call when possible. To keep the code simple use a static array for the registered tunnel types, and size such array based on the kernel config. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/6fd1f9c7651151493ecab174e7b8386a1534170d.1741718157.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 daysudp_tunnel: create a fastpath GRO lookup.Paolo Abeni
Most UDP tunnels bind a socket to a local port, with ANY address, no peer and no interface index specified. Additionally it's quite common to have a single tunnel device per namespace. Track in each namespace the UDP tunnel socket respecting the above. When only a single one is present, store a reference in the netns. When such reference is not NULL, UDP tunnel GRO lookup just need to match the incoming packet destination port vs the socket local port. The tunnel socket never sets the reuse[port] flag[s]. When bound to no address and interface, no other socket can exist in the same netns matching the specified local port. Matching packets with non-local destination addresses will be aggregated, and eventually segmented as needed - no behavior changes intended. Note that the UDP tunnel socket reference is stored into struct netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep all the fastpath-related netns fields in the same struct and allow cacheline-based optimization. Currently both the IPv4 and IPv6 socket pointer share the same cacheline as the `udp_table` field. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/4d5c319c4471161829f50cb8436841de81a5edae.1741718157.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 daysnet: mana: Support holes in device list reply msgHaiyang Zhang
According to GDMA protocol, holes (zeros) are allowed at the beginning or middle of the gdma_list_devices_resp message. The existing code cannot properly handle this, and may miss some devices in the list. To fix, scan the entire list until the num_of_devs are found, or until the end of the list. Cc: stable@vger.kernel.org Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Long Li <longli@microsoft.com> Reviewed-by: Shradha Gupta <shradhagupta@microsoft.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1741723974-1534-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 daysMerge net-next/main to resolve conflictsJohannes Berg
There are a few conflicts between the work that went into wireless and that's here now, resolve them. Signed-off-by: Johannes Berg <johannes.berg@intel.com>