diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2022-05-25 12:22:58 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2022-05-25 12:22:58 -0700 |
commit | 7e062cda7d90543ac8c7700fc7c5527d0c0f22ad (patch) | |
tree | 2f1602595d9416be41cc2e88a659ba4c145734b9 /net/core/sock.c | |
parent | 5d1772b1739b085721431eef0c0400f3aff01abf (diff) | |
parent | 57d7becda9c9e612e6b00676f2eecfac3e719e88 (diff) | |
download | lwn-7e062cda7d90543ac8c7700fc7c5527d0c0f22ad.tar.gz lwn-7e062cda7d90543ac8c7700fc7c5527d0c0f22ad.zip |
Merge tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Support TCPv6 segmentation offload with super-segments larger than
64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
- Generalize skb freeing deferral to per-cpu lists, instead of
per-socket lists.
- Add a netdev statistic for packets dropped due to L2 address
mismatch (rx_otherhost_dropped).
- Continue work annotating skb drop reasons.
- Accept alternative netdev names (ALT_IFNAME) in more netlink
requests.
- Add VLAN support for AF_PACKET SOCK_RAW GSO.
- Allow receiving skb mark from the socket as a cmsg.
- Enable memcg accounting for veth queues, sysctl tables and IPv6.
BPF
---
- Add libbpf support for User Statically-Defined Tracing (USDTs).
- Speed up symbol resolution for kprobes multi-link attachments.
- Support storing typed pointers to referenced and unreferenced
objects in BPF maps.
- Add support for BPF link iterator.
- Introduce access to remote CPU map elements in BPF per-cpu map.
- Allow middle-of-the-road settings for the
kernel.unprivileged_bpf_disabled sysctl.
- Implement basic types of dynamic pointers e.g. to allow for
dynamically sized ringbuf reservations without extra memory copies.
Protocols
---------
- Retire port only listening_hash table, add a second bind table
hashed by port and address. Avoid linear list walk when binding to
very popular ports (e.g. 443).
- Add bridge FDB bulk flush filtering support allowing user space to
remove all FDB entries matching a condition.
- Introduce accept_unsolicited_na sysctl for IPv6 to implement
router-side changes for RFC9131.
- Support for MPTCP path manager in user space.
- Add MPTCP support for fallback to regular TCP for connections that
have never connected additional subflows or transmitted
out-of-sequence data (partial support for RFC8684 fallback).
- Avoid races in MPTCP-level window tracking, stabilize and improve
throughput.
- Support lockless operation of GRE tunnels with seq numbers enabled.
- WiFi support for host based BSS color collision detection.
- Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
- Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
- Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
- Allow matching on the number of VLAN tags via tc-flower.
- Add tracepoint for tcp_set_ca_state().
Driver API
----------
- Improve error reporting from classifier and action offload.
- Add support for listing line cards in switches (devlink).
- Add helpers for reporting page pool statistics with ethtool -S.
- Add support for reading clock cycles when using PTP virtual clocks,
instead of having the driver convert to time before reporting. This
makes it possible to report time from different vclocks.
- Support configuring low-latency Tx descriptor push via ethtool.
- Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
New hardware / drivers
----------------------
- Ethernet:
- Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
- Sunplus SP7021 SoC (sp7021_emac)
- Add support for Renesas RZ/V2M (in ravb)
- Add support for MediaTek mt7986 switches (in mtk_eth_soc)
- Ethernet PHYs:
- ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
- TI DP83TD510 PHY
- Microchip LAN8742/LAN88xx PHYs
- WiFi:
- Driver for pureLiFi X, XL, XC devices (plfxlc)
- Driver for Silicon Labs devices (wfx)
- Support for WCN6750 (in ath11k)
- Support Realtek 8852ce devices (in rtw89)
- Mobile:
- MediaTek T700 modems (Intel 5G 5000 M.2 cards)
- CAN:
- ctucanfd: add support for CTU CAN FD open-source IP core from
Czech Technical University in Prague
Drivers
-------
- Delete a number of old drivers still using virt_to_bus().
- Ethernet NICs:
- intel: support TSO on tunnels MPLS
- broadcom: support multi-buffer XDP
- nfp: support VF rate limiting
- sfc: use hardware tx timestamps for more than PTP
- mlx5: multi-port eswitch support
- hyper-v: add support for XDP_REDIRECT
- atlantic: XDP support (including multi-buffer)
- macb: improve real-time perf by deferring Tx processing to NAPI
- High-speed Ethernet switches:
- mlxsw: implement basic line card information querying
- prestera: add support for traffic policing on ingress and egress
- Embedded Ethernet switches:
- lan966x: add support for packet DMA (FDMA)
- lan966x: add support for PTP programmable pins
- ti: cpsw_new: enable bc/mc storm prevention
- Qualcomm 802.11ax WiFi (ath11k):
- Wake-on-WLAN support for QCA6390 and WCN6855
- device recovery (firmware restart) support
- support setting Specific Absorption Rate (SAR) for WCN6855
- read country code from SMBIOS for WCN6855/QCA6390
- enable keep-alive during WoWLAN suspend
- implement remain-on-channel support
- MediaTek WiFi (mt76):
- support Wireless Ethernet Dispatch offloading packet movement
between the Ethernet switch and WiFi interfaces
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 IPv6 NS offload support
- Ethernet PHYs:
- micrel: ksz9031/ksz9131: cabletest support
- lan87xx: SQI support for T1 PHYs
- lan937x: add interrupt support for link detection"
* tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits)
ptp: ocp: Add firmware header checks
ptp: ocp: fix PPS source selector debugfs reporting
ptp: ocp: add .init function for sma_op vector
ptp: ocp: vectorize the sma accessor functions
ptp: ocp: constify selectors
ptp: ocp: parameterize input/output sma selectors
ptp: ocp: revise firmware display
ptp: ocp: add Celestica timecard PCI ids
ptp: ocp: Remove #ifdefs around PCI IDs
ptp: ocp: 32-bit fixups for pci start address
Revert "net/smc: fix listen processing for SMC-Rv2"
ath6kl: Use cc-disable-warning to disable -Wdangling-pointer
selftests/bpf: Dynptr tests
bpf: Add dynptr data slices
bpf: Add bpf_dynptr_read and bpf_dynptr_write
bpf: Dynptr support for ring buffers
bpf: Add bpf_dynptr_from_mem for local dynptrs
bpf: Add verifier support for dynptrs
bpf: Suppress 'passing zero to PTR_ERR' warning
bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack
...
Diffstat (limited to 'net/core/sock.c')
-rw-r--r-- | net/core/sock.c | 126 |
1 files changed, 101 insertions, 25 deletions
diff --git a/net/core/sock.c b/net/core/sock.c index 1180a0cb0110..2ff40dd0a7a6 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -141,9 +141,14 @@ #include <linux/ethtool.h> +#include "dev.h" + static DEFINE_MUTEX(proto_list_mutex); static LIST_HEAD(proto_list); +static void sock_def_write_space_wfree(struct sock *sk); +static void sock_def_write_space(struct sock *sk); + /** * sk_ns_capable - General socket capability test * @sk: Socket to use a capability on or through @@ -503,17 +508,35 @@ int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) } EXPORT_SYMBOL(__sock_queue_rcv_skb); -int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) +int sock_queue_rcv_skb_reason(struct sock *sk, struct sk_buff *skb, + enum skb_drop_reason *reason) { + enum skb_drop_reason drop_reason; int err; err = sk_filter(sk, skb); - if (err) - return err; - - return __sock_queue_rcv_skb(sk, skb); + if (err) { + drop_reason = SKB_DROP_REASON_SOCKET_FILTER; + goto out; + } + err = __sock_queue_rcv_skb(sk, skb); + switch (err) { + case -ENOMEM: + drop_reason = SKB_DROP_REASON_SOCKET_RCVBUFF; + break; + case -ENOBUFS: + drop_reason = SKB_DROP_REASON_PROTO_MEM; + break; + default: + drop_reason = SKB_NOT_DROPPED_YET; + break; + } +out: + if (reason) + *reason = drop_reason; + return err; } -EXPORT_SYMBOL(sock_queue_rcv_skb); +EXPORT_SYMBOL(sock_queue_rcv_skb_reason); int __sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested, unsigned int trim_cap, bool refcounted) @@ -612,7 +635,9 @@ static int sock_bindtoindex_locked(struct sock *sk, int ifindex) if (ifindex < 0) goto out; - sk->sk_bound_dev_if = ifindex; + /* Paired with all READ_ONCE() done locklessly. */ + WRITE_ONCE(sk->sk_bound_dev_if, ifindex); + if (sk->sk_prot->rehash) sk->sk_prot->rehash(sk); sk_dst_reset(sk); @@ -690,10 +715,11 @@ static int sock_getbindtodevice(struct sock *sk, char __user *optval, { int ret = -ENOPROTOOPT; #ifdef CONFIG_NETDEVICES + int bound_dev_if = READ_ONCE(sk->sk_bound_dev_if); struct net *net = sock_net(sk); char devname[IFNAMSIZ]; - if (sk->sk_bound_dev_if == 0) { + if (bound_dev_if == 0) { len = 0; goto zero; } @@ -702,7 +728,7 @@ static int sock_getbindtodevice(struct sock *sk, char __user *optval, if (len < IFNAMSIZ) goto out; - ret = netdev_get_name(net, devname, sk->sk_bound_dev_if); + ret = netdev_get_name(net, devname, bound_dev_if); if (ret) goto out; @@ -1291,6 +1317,15 @@ set_sndbuf: __sock_set_mark(sk, val); break; + case SO_RCVMARK: + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) && + !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) { + ret = -EPERM; + break; + } + + sock_valbool_flag(sk, SOCK_RCVMARK, valbool); + break; case SO_RXQ_OVFL: sock_valbool_flag(sk, SOCK_RXQ_OVFL, valbool); @@ -1717,6 +1752,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname, v.val = sk->sk_mark; break; + case SO_RCVMARK: + v.val = sock_flag(sk, SOCK_RCVMARK); + break; + case SO_RXQ_OVFL: v.val = sock_flag(sk, SOCK_RXQ_OVFL); break; @@ -1825,7 +1864,7 @@ int sock_getsockopt(struct socket *sock, int level, int optname, break; case SO_BINDTOIFINDEX: - v.val = sk->sk_bound_dev_if; + v.val = READ_ONCE(sk->sk_bound_dev_if); break; case SO_NETNS_COOKIE: @@ -2062,9 +2101,6 @@ void sk_destruct(struct sock *sk) { bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE); - WARN_ON_ONCE(!llist_empty(&sk->defer_list)); - sk_defer_free_flush(sk); - if (rcu_access_pointer(sk->sk_reuseport_cb)) { reuseport_detach_sock(sk); use_call_rcu = true; @@ -2260,6 +2296,19 @@ void sk_free_unlock_clone(struct sock *sk) } EXPORT_SYMBOL_GPL(sk_free_unlock_clone); +static void sk_trim_gso_size(struct sock *sk) +{ + if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE) + return; +#if IS_ENABLED(CONFIG_IPV6) + if (sk->sk_family == AF_INET6 && + sk_is_tcp(sk) && + !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)) + return; +#endif + sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE; +} + void sk_setup_caps(struct sock *sk, struct dst_entry *dst) { u32 max_segs = 1; @@ -2279,6 +2328,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst) sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM; /* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */ sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size); + sk_trim_gso_size(sk); sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1); /* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */ max_segs = max_t(u32, READ_ONCE(dst->dev->gso_max_segs), 1); @@ -2300,8 +2350,20 @@ void sock_wfree(struct sk_buff *skb) { struct sock *sk = skb->sk; unsigned int len = skb->truesize; + bool free; if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) { + if (sock_flag(sk, SOCK_RCU_FREE) && + sk->sk_write_space == sock_def_write_space) { + rcu_read_lock(); + free = refcount_sub_and_test(len, &sk->sk_wmem_alloc); + sock_def_write_space_wfree(sk); + rcu_read_unlock(); + if (unlikely(free)) + __sk_free(sk); + return; + } + /* * Keep a reference on sk_wmem_alloc, this will be released * after sk_write_space() call @@ -2611,13 +2673,6 @@ failure: } EXPORT_SYMBOL(sock_alloc_send_pskb); -struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size, - int noblock, int *errcode) -{ - return sock_alloc_send_pskb(sk, size, 0, noblock, errcode, 0); -} -EXPORT_SYMBOL(sock_alloc_send_skb); - int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, struct sockcm_cookie *sockc) { @@ -3174,20 +3229,42 @@ static void sock_def_write_space(struct sock *sk) /* Do not wake up a writer until he can make "significant" * progress. --DaveM */ - if ((refcount_read(&sk->sk_wmem_alloc) << 1) <= READ_ONCE(sk->sk_sndbuf)) { + if (sock_writeable(sk)) { wq = rcu_dereference(sk->sk_wq); if (skwq_has_sleeper(wq)) wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND); /* Should agree with poll, otherwise some programs break */ - if (sock_writeable(sk)) - sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT); + sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT); } rcu_read_unlock(); } +/* An optimised version of sock_def_write_space(), should only be called + * for SOCK_RCU_FREE sockets under RCU read section and after putting + * ->sk_wmem_alloc. + */ +static void sock_def_write_space_wfree(struct sock *sk) +{ + /* Do not wake up a writer until he can make "significant" + * progress. --DaveM + */ + if (sock_writeable(sk)) { + struct socket_wq *wq = rcu_dereference(sk->sk_wq); + + /* rely on refcount_sub from sock_wfree() */ + smp_mb__after_atomic(); + if (wq && waitqueue_active(&wq->wait)) + wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT | + EPOLLWRNORM | EPOLLWRBAND); + + /* Should agree with poll, otherwise some programs break */ + sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT); + } +} + static void sock_def_destruct(struct sock *sk) { } @@ -3486,8 +3563,7 @@ int sock_common_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int addr_len = 0; int err; - err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT, - flags & ~MSG_DONTWAIT, &addr_len); + err = sk->sk_prot->recvmsg(sk, msg, size, flags, &addr_len); if (err >= 0) msg->msg_namelen = addr_len; return err; |