Age | Commit message (Collapse) | Author |
|
During reload-reinit, all entities except for params, resources, regions
and health reporter should be removed and re-added. Add a warning to
be triggered in case the driver behaves differently.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will support arbitrary SYN Cookie with BPF, and then kfunc at
TC will preallocate reqsk and initialise some fields that should
not be overwritten later by cookie_v[46]_check().
To simplify the flow in cookie_v[46]_check(), we move such fields'
initialisation to cookie_tcp_reqsk_alloc() and factorise non-BPF
SYN Cookie handling into cookie_tcp_check(), where we validate the
cookie and allocate reqsk, as done by kfunc later.
Note that we set ireq->ecn_ok in two steps, the latter of which will
be shared by the BPF case. As cookie_ecn_ok() is one-liner, now
it's inlined.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will support arbitrary SYN Cookie with BPF, and then some reqsk fields
are initialised in kfunc, and others are done in cookie_v[46]_check().
This patch factorises the common part as cookie_tcp_reqsk_init() and
calls it in cookie_tcp_reqsk_alloc() to minimise the discrepancy between
cookie_v[46]_check().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We initialise treq->af_specific in cookie_tcp_reqsk_alloc() so that
we can look up a key later in tcp_create_openreq_child().
Initially, that change was added for MD5 by commit ba5a4fdd63ae ("tcp:
make sure treq->af_specific is initialized"), but it has not been used
since commit d0f2b7a9ca0a ("tcp: Disable header prediction for MD5
flow.").
Now, treq->af_specific is used only by TCP-AO, so, we can move that
initialisation into tcp_ao_syncookie().
In addition to that, l3index in cookie_v[46]_check() is only used for
tcp_ao_syncookie(), so let's move it as well.
While at it, we move down tcp_ao_syncookie() in cookie_v4_check() so
that it will be called after security_inet_conn_request() to make
functions order consistent with cookie_v6_check().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we create a full socket from SYN Cookie, we initialise
tcp_sk(sk)->tsoffset redundantly in tcp_get_cookie_sock() as
the field is inherited from tcp_rsk(req)->ts_off.
cookie_v[46]_check
|- treq->ts_off = 0
`- tcp_get_cookie_sock
|- tcp_v[46]_syn_recv_sock
| `- tcp_create_openreq_child
| `- newtp->tsoffset = treq->ts_off
`- tcp_sk(child)->tsoffset = tsoff
Let's initialise tcp_rsk(req)->ts_off with the correct offset
and remove the second initialisation of tcp_sk(sk)->tsoffset.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_hdr(skb) and SYN Cookie are passed to __cookie_v[46]_check(), but
none of the callers passes cookie other than ntohl(th->ack_seq) - 1.
Let's fetch it in __cookie_v[46]_check() instead of passing the cookie
over and over.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will support arbitrary SYN Cookie with BPF, and then reqsk
will be preallocated before cookie_v[46]_check().
Depending on how validation fails, we send RST or just drop skb.
To make the error handling easier, let's clean up goto labels.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sock_net(sk) is used repeatedly in cookie_v[46]_check().
Let's cache it in a variable.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will grow and cut the xmas tree in cookie_v[46]_check().
This patch cleans it up to make later patches tidy.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231129022924.96156-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The value of 'err' will not be only '-EINVAL', but can be '0' in some
cases.
So it's better to rename the label 'remove_err' to 'out' to avoid
confusions.
Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-6-8d6b94150f6b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If the initial subflow has been removed, we cannot know without checking
other counters, e.g. ss -ti <filter> | grep -c tcp-ulp-mptcp or
getsockopt(SOL_MPTCP, MPTCP_FULL_INFO, ...) (or others except MPTCP_INFO
of course) and then check mptcp_subflow_data->num_subflows to get the
total amount of subflows.
This patch adds a new counter mptcpi_subflows_total in mptcpi_flags to
store the total amount of subflows, including the initial one. A new
helper __mptcp_has_initial_subflow() is added to check whether the
initial subflow has been removed or not. With this helper, we can then
compute the total amount of subflows from mptcp_info by doing something
like:
mptcpi_subflows_total = mptcpi_subflows +
__mptcp_has_initial_subflow(msk).
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/428
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-1-8d6b94150f6b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Mute the periodic "stalled pool shutdown" warning if the page pool
is visible to user space. Rolling out a driver using page pools
to just a few hundred hosts at Meta surfaces applications which
fail to reap their broken sockets. Obviously it's best if the
applications are fixed, but we don't generally print warnings
for application resource leaks. Admins can now depend on the
netlink interface for getting page pool info to detect buggy
apps.
While at it throw in the ID of the pool into the message,
in rare cases (pools from destroyed netns) this will make
finding the pool with a debugger easier.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Dump the stats into netlink. More clever approaches
like dumping the stats per-CPU for each CPU individually
to see where the packets get consumed can be implemented
in the future.
A trimmed example from a real (but recently booted system):
$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-stats-get
[{'info': {'id': 19, 'ifindex': 2},
'alloc-empty': 48,
'alloc-fast': 3024,
'alloc-refill': 0,
'alloc-slow': 48,
'alloc-slow-high-order': 0,
'alloc-waive': 0,
'recycle-cache-full': 0,
'recycle-cached': 0,
'recycle-released-refcnt': 0,
'recycle-ring': 0,
'recycle-ring-full': 0},
{'info': {'id': 18, 'ifindex': 2},
'alloc-empty': 66,
'alloc-fast': 11811,
'alloc-refill': 35,
'alloc-slow': 66,
'alloc-slow-high-order': 0,
'alloc-waive': 0,
'recycle-cache-full': 1145,
'recycle-cached': 6541,
'recycle-released-refcnt': 0,
'recycle-ring': 1275,
'recycle-ring-full': 0},
{'info': {'id': 17, 'ifindex': 2},
'alloc-empty': 73,
'alloc-fast': 62099,
'alloc-refill': 413,
...
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Report when page pool was destroyed. Together with the inflight
/ memory use reporting this can serve as a replacement for the
warning about leaked page pools we currently print to dmesg.
Example output for a fake leaked page pool using some hacks
in netdevsim (one "live" pool, and one "leaked" on the same dev):
$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-get
[{'id': 2, 'ifindex': 3},
{'id': 1, 'ifindex': 3, 'destroyed': 133, 'inflight': 1}]
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Advanced deployments need the ability to check memory use
of various system components. It makes it possible to make informed
decisions about memory allocation and to find regressions and leaks.
Report memory use of page pools. Report both number of references
and bytes held.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Generate netlink notifications about page pool state changes.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Expose the very basic page pool information via netlink.
Example using ynl-py for a system with 9 queues:
$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-get
[{'id': 19, 'ifindex': 2, 'napi-id': 147},
{'id': 18, 'ifindex': 2, 'napi-id': 146},
{'id': 17, 'ifindex': 2, 'napi-id': 145},
{'id': 16, 'ifindex': 2, 'napi-id': 144},
{'id': 15, 'ifindex': 2, 'napi-id': 143},
{'id': 14, 'ifindex': 2, 'napi-id': 142},
{'id': 13, 'ifindex': 2, 'napi-id': 141},
{'id': 12, 'ifindex': 2, 'napi-id': 140},
{'id': 11, 'ifindex': 2, 'napi-id': 139},
{'id': 10, 'ifindex': 2, 'napi-id': 138}]
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
To avoid any issues with race conditions on accessing napi
and having to think about the lifetime of NAPI objects
in netlink GET - stash the napi_id to which page pool
was linked at creation time.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Link the page pools with netdevs. This needs to be netns compatible
so we have two options. Either we record the pools per netns and
have to worry about moving them as the netdev gets moved.
Or we record them directly on the netdev so they move with the netdev
without any extra work.
Implement the latter option. Since pools may outlast netdev we need
a place to store orphans. In time honored tradition use loopback
for this purpose.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
To give ourselves the flexibility of creating netlink commands
and ability to refer to page pool instances in uAPIs create
IDs for page pools.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We'll soon (next change in the series) need a fuller unwind path
in page_pool_create() so create the inverse of page_pool_init().
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.8
The first features pull request for v6.8. Not so big in number of
commits but we removed quite a few ancient drivers: libertas 16-bit
PCMCIA support, atmel, hostap, zd1201, orinoco, ray_cs, wl3501 and
rndis_wlan.
Major changes:
cfg80211/mac80211
- extend support for scanning while Multi-Link Operation (MLO) connected
* tag 'wireless-next-2023-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (68 commits)
wifi: nl80211: Documentation update for NL80211_CMD_PORT_AUTHORIZED event
wifi: mac80211: Extend support for scanning while MLO connected
wifi: cfg80211: Extend support for scanning while MLO connected
wifi: ieee80211: fix PV1 frame control field name
rfkill: return ENOTTY on invalid ioctl
MAINTAINERS: update iwlwifi maintainers
wifi: rtw89: 8922a: read efuse content from physical map
wifi: rtw89: 8922a: read efuse content via efuse map struct from logic map
wifi: rtw89: 8852c: read RX gain offset from efuse for 6GHz channels
wifi: rtw89: mac: add to access efuse for WiFi 7 chips
wifi: rtw89: mac: use mac_gen pointer to access about efuse
wifi: rtw89: 8922a: add 8922A basic chip info
wifi: rtlwifi: drop unused const_amdpci_aspm
wifi: mwifiex: mwifiex_process_sleep_confirm_resp(): remove unused priv variable
wifi: rtw89: regd: update regulatory map to R65-R44
wifi: rtw89: regd: handle policy of 6 GHz according to BIOS
wifi: rtw89: acpi: process 6 GHz band policy from DSM
wifi: rtlwifi: simplify rtl_action_proc() and rtl_tx_agg_start()
wifi: rtw89: pci: update interrupt mitigation register for 8922AE
wifi: rtw89: pci: correct interrupt mitigation register for 8852CE
...
====================
Link: https://lore.kernel.org/r/20231127180056.0B48DC433C8@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
- If the scan request includes a link ID, validate that it is
one of the active links. Otherwise, if the scan request doesn't
include a valid link ID, select one of the active links.
- When reporting the TSF for a BSS entry, use the link ID information
from the Rx status or the scan request to set the parent BSSID.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231113112844.68564692c404.Iae9605cbb7f9d52e00ce98260b3559a34cf18341@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
To extend the support of TSF accounting in scan results for MLO
connections, allow to indicate in the scan request the link ID
corresponding to the BSS whose TSF should be used for the TSF
accounting.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231113112844.d4490bcdefb1.I8fcd158b810adddef4963727e9153096416b30ce@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
For unknown ioctls the correct error is
ENOTTY "Inappropriate ioctl for device".
ENOSYS as returned before should only be used to
indicate that a syscall is not available at all.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20231101-rfkill-ioctl-enosys-v1-1-5bf374fabffe@weissschuh.net
[in theory this breaks userspace API, but it was discussed and
researched, and nothing found relying on the current behaviour]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The commit dcd2cf5f2fc0 ("net/smc: add autocorking support") adds an
atomic variable tx_pushing in smc_connection to make sure only one can
send to let it cork more and save CDC slot. since smc_tx_pending can be
called in the soft IRQ without checking sock_owned_by_user() at that
time, which would cause a race condition because bh_lock_sock() did
not honor sock_lock()
After commit 6b88af839d20 ("net/smc: don't send in the BH context if
sock_owned_by_user"), the transmission is deferred to when sock_lock()
is held by the user. Therefore, we no longer need tx_pending to hold
message.
So remove atomic variable tx_pushing and its operation, and
smc_tx_sndbuf_nonempty becomes a wrapper of __smc_tx_sndbuf_nonempty,
so rename __smc_tx_sndbuf_nonempty back to smc_tx_sndbuf_nonempty
Suggested-by: Alexandra Winter <wintera@linux.ibm.com>
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
diff v4: remove atomic variable tx_pushing
diff v3: improvements in the commit body and comments
diff v2: fix a typo in commit body and add net-next subject-prefix
net/smc/smc.h | 1 -
net/smc/smc_tx.c | 30 +-----------------------------
2 files changed, 1 insertion(+), 30 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new sysctl: net.smc.smcr_max_conns_per_lgr, which is
used to control the preferred max connections per lgr for
SMC-R v2.1. The default value of this sysctl is 255, and
the acceptable value ranges from 16 to 255.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new sysctl: net.smc.smcr_max_links_per_lgr, which is
used to control the preferred max links per lgr for SMC-R
v2.1. The default value of this sysctl is 2, and the acceptable
value ranges from 1 to 2.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/intel/ice/ice_main.c
c9663f79cd82 ("ice: adjust switchdev rebuild path")
7758017911a4 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/
Adjacent changes:
kernel/bpf/verifier.c
bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations")
5f99f312bd3b ("bpf: add register bounds sanity checks and sanitization")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzkaller discovered that if tls_sw_splice_eof() is executed as part of
sendfile() when the plaintext/ciphertext sk_msg are empty, the send path
gets confused because the empty ciphertext buffer does not have enough
space for the encryption overhead. This causes tls_push_record() to go on
the `split = true` path (which is only supposed to be used when interacting
with an attached BPF program), and then get further confused and hit the
tls_merge_open_record() path, which then assumes that there must be at
least one populated buffer element, leading to a NULL deref.
It is possible to have empty plaintext/ciphertext buffers if we previously
bailed from tls_sw_sendmsg_locked() via the tls_trim_both_msgs() path.
tls_sw_push_pending_record() already handles this case correctly; let's do
the same check in tls_sw_splice_eof().
Fixes: df720d288dbb ("tls/sw: Use splice_eof() to flush")
Cc: stable@vger.kernel.org
Reported-by: syzbot+40d43509a099ea756317@syzkaller.appspotmail.com
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20231122214447.675768-1-jannh@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We found a data corruption issue during testing of SMC-R on Redis
applications.
The benchmark has a low probability of reporting a strange error as
shown below.
"Error: Protocol error, got "\xe2" as reply type byte"
Finally, we found that the retrieved error data was as follows:
0xE2 0xD4 0xC3 0xD9 0x04 0x00 0x2C 0x20 0xA6 0x56 0x00 0x16 0x3E 0x0C
0xCB 0x04 0x02 0x01 0x00 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xE2
It is quite obvious that this is a SMC DECLINE message, which means that
the applications received SMC protocol message.
We found that this was caused by the following situations:
client server
¦ clc proposal
------------->
¦ clc accept
<-------------
¦ clc confirm
------------->
wait llc confirm
send llc confirm
¦failed llc confirm
¦ x------
(after 2s)timeout
wait llc confirm rsp
wait decline
(after 1s) timeout
(after 2s) timeout
¦ decline
-------------->
¦ decline
<--------------
As a result, a decline message was sent in the implementation, and this
message was read from TCP by the already-fallback connection.
This patch double the client timeout as 2x of the server value,
With this simple change, the Decline messages should never cross or
collide (during Confirm link timeout).
This issue requires an immediate solution, since the protocol updates
involve a more long-term solution.
Fixes: 0fb0b02bd6fd ("net/smc: adapt SMC client code to use the LLC flow")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When MC (multicast) list is updated by the networking layer due to a
user command and as well as when allmulti flag is set, it needs to be
passed to the enslaved Ethernet devices. This patch allows this
to happen by implementing ndo_change_rx_flags() and ndo_set_rx_mode()
API calls that in turns pass it to the slave devices using
existing API calls.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To fully benefit from previous commit add one byte of state
in the first cache line recording if we need to look at
the slow part.
The packing isn't all that impressive right now, we create
a 7B hole. I'm expecting Olek's rework will reshuffle this,
anyway.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20231121000048.789613-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
struct page_pool is rather performance critical and we use
16B of the first cache line to store 2 pointers used only
by test code. Future patches will add more informational
(non-fast path) attributes.
It's convenient for the user of the API to not have to worry
which fields are fast and which are slow path. Use struct
groups to split the params into the two categories internally.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20231121000048.789613-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-11-21
We've added 19 non-merge commits during the last 4 day(s) which contain
a total of 18 files changed, 1043 insertions(+), 416 deletions(-).
The main changes are:
1) Fix BPF verifier to validate callbacks as if they are called an unknown
number of times in order to fix not detecting some unsafe programs,
from Eduard Zingerman.
2) Fix bpf_redirect_peer() handling which missed proper stats accounting
for veth and netkit and also generally fix missing stats for the latter,
from Peilin Ye, Daniel Borkmann et al.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: check if max number of bpf_loop iterations is tracked
bpf: keep track of max number of bpf_loop callback iterations
selftests/bpf: test widening for iterating callbacks
bpf: widening for callback iterators
selftests/bpf: tests for iterating callbacks
bpf: verify callbacks as if they are called unknown number of times
bpf: extract setup_func_entry() utility function
bpf: extract __check_reg_arg() utility function
selftests/bpf: fix bpf_loop_bench for new callback verification scheme
selftests/bpf: track string payload offset as scalar in strobemeta
selftests/bpf: track tcp payload offset as scalar in xdp_synproxy
selftests/bpf: Add netkit to tc_redirect selftest
selftests/bpf: De-veth-ize the tc_redirect test case
bpf, netkit: Add indirect call wrapper for fetching peer dev
bpf: Fix dev's rx stats for bpf_redirect_peer traffic
veth: Use tstats per-CPU traffic counters
netkit: Add tstats per-CPU traffic counters
net: Move {l,t,d}stats allocation to core and convert veth & vrf
net, vrf: Move dstats structure to core
====================
Link: https://lore.kernel.org/r/20231121193113.11796-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Networking supports changing netdevice's netns and name
at the same time. This allows avoiding name conflicts
and having to rename the interface in multiple steps.
E.g. netns1={eth0, eth1}, netns2={eth1} - we want
to move netns1:eth1 to netns2 and call it eth0 there.
If we can't rename "in flight" we'd need to (1) rename
eth1 -> $tmp, (2) change netns, (3) rename $tmp -> eth0.
To rename the underlying struct device we have to call
device_rename(). The rename()'s MOVE event, however, doesn't
"belong" to either the old or the new namespace.
If there are conflicts on both sides it's actually impossible
to issue a real MOVE (old name -> new name) without confusing
user space. And Daniel reports that such confusions do in fact
happen for systemd, in real life.
Since we already issue explicit REMOVE and ADD events
manually - suppress the MOVE event completely. Move
the ADD after the rename, so that the REMOVE uses
the old name, and the ADD the new one.
If there is no rename this changes the picture as follows:
Before:
old ns | KERNEL[213.399289] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401302] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401397] move /devices/virtual/net/eth0 (net)
After:
old ns | KERNEL[266.774257] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[266.774509] add /devices/virtual/net/eth0 (net)
If there is a rename and a conflict (using the exact eth0/eth1
example explained above) we get this:
Before:
old ns | KERNEL[224.316833] remove /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.318551] add /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.319662] move /devices/virtual/net/eth0 (net)
After:
old ns | KERNEL[333.033166] remove /devices/virtual/net/eth1 (net)
new ns | KERNEL[333.035098] add /devices/virtual/net/eth0 (net)
Note that "in flight" rename is only performed when needed.
If there is no conflict for old name in the target netns -
the rename will be performed separately by dev_change_name(),
as if the rename was a different command, and there will still
be a MOVE event for the rename:
Before:
old ns | KERNEL[194.416429] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418809] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418869] move /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.420866] move /devices/virtual/net/eth1 (net)
After:
old ns | KERNEL[71.917520] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.919155] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.920729] move /devices/virtual/net/eth1 (net)
If deleting the MOVE event breaks some user space we should insert
an explicit kobject_uevent(MOVE) after the ADD, like this:
@@ -11192,6 +11192,12 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
netdev_adjacent_add_links(dev);
+ /* User space wants an explicit MOVE event, issue one unless
+ * dev_change_name() will get called later and issue one.
+ */
+ if (!pat || new_name[0])
+ kobject_uevent(&dev->dev.kobj, KOBJ_MOVE);
+
/* Adapt owner in case owning user namespace of target network
* namespace is different from the original one.
*/
Reported-by: Daniel Gröber <dxld@darkboxed.org>
Link: https://lore.kernel.org/all/20231010121003.x3yi6fihecewjy4e@House.clients.dxld.at/
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20231120184140.578375-1-kuba@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net/ipv4/route.c:783:46: warning: incorrect type in argument 2 (different base types)
net/ipv4/route.c:783:46: expected unsigned int [usertype] key
net/ipv4/route.c:783:46: got restricted __be32 [usertype] new_gw
Fixes: 969447f226b4 ("ipv4: use new_gw for redirect neigh lookup")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Link: https://lore.kernel.org/r/20231119141759.420477-1-chentao@kylinos.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of
indirect call wrapper and therefore optimize the bpf_redirect_peer()
internal handling a bit. Add a small skb_get_peer_dev() wrapper which
utilizes the INDIRECT_CALL_1() macro instead of open coding.
Future work could potentially add a peer pointer directly into struct
net_device in future and convert veth and netkit over to use it so
that eventually ndo_get_peer_dev can be removed.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
is not accounted for in the RX stats of supported devices (that is, veth
and netkit), confusing user space metrics collectors such as cAdvisor [0],
as reported by Youlun.
Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
@tstats per-CPU counters (instead of @lstats, or @dstats).
To make this more fool-proof, error out when ndo_get_peer_dev is set but
@tstats are not selected.
[0] Specifically, the "container_network_receive_{byte,packet}s_total"
counters are affected.
Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper")
Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc) - all happening in the core.
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to all the sock diag modules in one fell swoop.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
if a PF has 256 or more VFs, ip link command will allocate an order 3
memory or more, and maybe trigger OOM due to memory fragment,
the VFs needed memory size is computed in rtnl_vfinfo_size.
so introduce nlmsg_new_large which calls netlink_alloc_large_skb in
which vmalloc is used for large memory, to avoid the failure of
allocating memory
ip invoked oom-killer: gfp_mask=0xc2cc0(GFP_KERNEL|__GFP_NOWARN|\
__GFP_COMP|__GFP_NOMEMALLOC), order=3, oom_score_adj=0
CPU: 74 PID: 204414 Comm: ip Kdump: loaded Tainted: P OE
Call Trace:
dump_stack+0x57/0x6a
dump_header+0x4a/0x210
oom_kill_process+0xe4/0x140
out_of_memory+0x3e8/0x790
__alloc_pages_slowpath.constprop.116+0x953/0xc50
__alloc_pages_nodemask+0x2af/0x310
kmalloc_large_node+0x38/0xf0
__kmalloc_node_track_caller+0x417/0x4d0
__kmalloc_reserve.isra.61+0x2e/0x80
__alloc_skb+0x82/0x1c0
rtnl_getlink+0x24f/0x370
rtnetlink_rcv_msg+0x12c/0x350
netlink_rcv_skb+0x50/0x100
netlink_unicast+0x1b2/0x280
netlink_sendmsg+0x355/0x4a0
sock_sendmsg+0x5b/0x60
____sys_sendmsg+0x1ea/0x250
___sys_sendmsg+0x88/0xd0
__sys_sendmsg+0x5e/0xa0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f95a65a5b70
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20231115120108.3711-1-lirongqing@baidu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Proper refcounts will always warn splat when something goes wrong,
be it underflow, saturation or object resurrection. As these are always
a source of bugs, use it in cls_u32 as a safeguard to prevent/catch issues.
Another benefit is that the refcount API self documents the code, making
clear when transitions to dead are expected.
For such an update we had to make minor adaptations on u32 to fit the refcount
API. First we set explicitly to '1' when objects are created, then the
objects are alive until a 1 -> 0 happens, which is then released appropriately.
The above made clear some redundant operations in the u32 code
around the root_ht handling that were removed. The root_ht is created
with a refcnt set to 1. Then when it's associated with tcf_proto it increments the refcnt to 2.
Throughout the entire code the root_ht is an exceptional case and can never be referenced,
therefore the refcnt never incremented/decremented.
Its lifetime is always bound to tcf_proto, meaning if you delete tcf_proto
the root_ht is deleted as well. The code made up for the fact that root_ht refcnt is 2 and did
a double decrement to free it, which is not a fit for the refcount API.
Even though refcount_t is implemented using atomics, we should observe
a negligible control plane impact.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231114141856.974326-2-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Revert following commits:
commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer")
commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp")
commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers")
commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers")
commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer")
commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC")
commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable")
commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command")
They need more time for reviews.
Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.open-mesh.org/linux-merge
This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- Implement new multicast packet type, including its transmission,
forwarding and parsing, by Linus Lüssing (3 patches)
- Switch to new headers for sprintf and array size,
by Sven Eckelmann (2 patches)
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add an assert to verify that the device lock is always held throughout
reload operations.
Tested the following flows with netdevsim and mlxsw while lockdep is
enabled:
netdevsim:
# echo "10 1" > /sys/bus/netdevsim/new_device
# devlink dev reload netdevsim/netdevsim10
# ip netns add bla
# devlink dev reload netdevsim/netdevsim10 netns bla
# ip netns del bla
# echo 10 > /sys/bus/netdevsim/del_device
mlxsw:
# devlink dev reload pci/0000:01:00.0
# ip netns add bla
# devlink dev reload pci/0000:01:00.0 netns bla
# ip netns del bla
# echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
# echo 1 > /sys/bus/pci/rescan
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Device drivers register with devlink from their probe routines (under
the device lock) by acquiring the devlink instance lock and calling
devl_register().
Drivers that support a devlink reload usually implement the
reload_{down, up}() operations in a similar fashion to their remove and
probe routines, respectively.
However, while the remove and probe routines are invoked with the device
lock held, the reload operations are only invoked with the devlink
instance lock held. It is therefore impossible for drivers to acquire
the device lock from their reload operations, as this would result in
lock inversion.
The motivating use case for invoking the reload operations with the
device lock held is in mlxsw which needs to trigger a PCI reset as part
of the reload. The driver cannot call pci_reset_function() as this
function acquires the device lock. Instead, it needs to call
__pci_reset_function_locked which expects the device lock to be held.
To that end, adjust devlink to always acquire the device lock before the
devlink instance lock when performing a reload.
Do that when reload is explicitly triggered by user space by specifying
the 'DEVLINK_NL_FLAG_NEED_DEV_LOCK' flag in the pre_doit and post_doit
operations of the reload command.
A previous patch already handled the case where reload is invoked as
part of netns dismantle.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce a new private flag ('DEVLINK_NL_FLAG_NEED_DEV_LOCK') to allow
netlink commands to specify that they need to acquire the device lock in
their pre_doit operation and release it in their post_doit operation.
The reload command will use this flag in the subsequent patch.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, private flags (e.g., 'DEVLINK_NL_FLAG_NEED_PORT') are only
used in pre_doit operations, but a subsequent patch will need to
conditionally lock and unlock the device lock in pre and post doit
operations, respectively.
As a preparation, enable the use of private flags in post_doit
operations in a similar fashion to how it is done for pre_doit
operations.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Device drivers register with devlink from their probe routines (under
the device lock) by acquiring the devlink instance lock and calling
devl_register().
Drivers that support a devlink reload usually implement the
reload_{down, up}() operations in a similar fashion to their remove and
probe routines, respectively.
However, while the remove and probe routines are invoked with the device
lock held, the reload operations are only invoked with the devlink
instance lock held. It is therefore impossible for drivers to acquire
the device lock from their reload operations, as this would result in
lock inversion.
The motivating use case for invoking the reload operations with the
device lock held is in mlxsw which needs to trigger a PCI reset as part
of the reload. The driver cannot call pci_reset_function() as this
function acquires the device lock. Instead, it needs to call
__pci_reset_function_locked which expects the device lock to be held.
To that end, adjust devlink to always acquire the device lock before the
devlink instance lock when performing a reload.
For now, only do that when reload is triggered as part of netns
dismantle. Subsequent patches will handle the case where reload is
explicitly triggered by user space.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|